Adding a text string to the start of each line, line by line, in Python to avoid a MemoryError: is there a faster alternative?


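I have large files named, for example, XXX_USR.txt. I iterate over a folder in which some of the .txt files are larger than 500 MB. To avoid a MemoryError, I need to process these files line by line. However, my current approach is too slow. The first line gets the prefix |SYS, and every other line gets '| ' + amendtext. The prefix is taken from the name of the .txt file: it is the first string before the first underscore, e.g. "XXX" (nametxt in the code below).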

File: XXX_USR.txt
INPUT: 
| name | car |
--------------
| Paul |Buick|
|Ringo |WV   |
|George|MG   |
| John |BMW  |
DESIRED OUTPUT:
|SYS  | name | car |
--------------------
| XXX | Paul |Buick|
| XXX |Ringo |WV   |
| XXX |George|MG   |
| XXX | John |BMW  |
My code, which is too slow but at least beats the memory error:
import os
import glob
from pathlib import Path
cwd = 'C:\\Users\\EricClapton\\'
directory = cwd
txt_files = os.path.join(directory, '*.txt')
for txt_file in glob.glob(txt_files):
    cpath =(Path(txt_file).resolve().stem)
    nametxt = "-".join(cpath.split('_')[0:1])
    amendtext = "|  " + nametxt
    systext = "|   SYS"
    with open(txt_file,'r', errors='ignore') as f:
        get_all=f.readlines()
    with open(txt_file,'w') as f:
        for i,line in enumerate(get_all,1):        
            if i == 1:                              
                f.writelines(systext + line)
            else:
                f.writelines(amendtext + line)


           
            
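sahasrara62: Possible duplicate of "Process very large (>20GB) text file line by line".

That is a good suggestion, but it does not directly solve my problem, which is writing two different strings to selected lines.

sahasrara62: This solution may help. You will need to adapt that code to your case; you cannot always expect a tailor-made solution.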


         
Tags: python

Accepted answer


          
           
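What exactly do you mean by too slow? Does it run for seconds or for minutes? For reference, I ran a similar case on my laptop: a file of over 1 GB with 35,946,689 lines took about 29 seconds.

I used the in_place module to open the file in an edit-type mode rather than in read and/or write mode. This removes the need to store a second copy of the data while working.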
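import in_place  # third-party package

with in_place.InPlace(txt_file) as f:
    for line in f:
        # Each line is read from the original and written back with the prefix.
        f.write(amendtext + line)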
Also, don't run it from an IDE. That may slow the process down and limit what you can do.
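If installing a third-party module is not an option, roughly the same effect can be sketched with the standard library alone: stream the original into a temporary file in the same folder, then swap it into place. The helper name `prefix_lines` and its parameters are assumptions made for illustration, and this is not necessarily how in_place works internally.

```python
import os
import tempfile


def prefix_lines(path, first_prefix, other_prefix):
    """Rewrite `path`, prepending a prefix to every line, one line at a time.

    Hypothetical helper: only a single line is held in memory at any moment.
    """
    folder = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so that os.replace
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as dst, open(path, "r", errors="ignore") as src:
            prefix = first_prefix
            for line in src:
                dst.write(prefix + line)
                prefix = other_prefix  # used for every line after the first
        os.replace(tmp_path, path)  # swap the rewritten file into place
    except BaseException:
        os.remove(tmp_path)  # do not leave a stray temporary file behind
        raise
```

Called as `prefix_lines(txt_file, systext, amendtext)` inside the loop over the folder, this keeps memory use flat regardless of file size, at the cost of temporarily needing disk space for a second copy.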
Update:
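I think I understand what is causing the delay in your execution time. In your original code, you perform a conditional check on every iteration while looping over the file contents.

In your updated code, you now open the file four times for reading and writing and store all of its contents. Below is updated code that handles your need to modify the first line without a conditional check.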
with in_place.InPlace(txt_file) as f:
    f.write(systext + f.readline())
    for line in f:
        f.write(amendtext + line)
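The first line inside the with block reads the first line from your text file, modifies it, and writes it back.

At that point the iterator has moved on to the next line, and you can process the remaining data however you like.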
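No conditional is needed because a file object is its own iterator: once `readline()` has consumed the first line, a `for` loop over the same object resumes at the second. A small sketch with in-memory streams shows the behavior; the function name `prefix_stream` and the sample strings are assumptions made for illustration.

```python
import io


def prefix_stream(src, dst, first_prefix, other_prefix):
    """Copy src to dst, prefixing the first line differently from the rest."""
    # readline() consumes line 1 and advances the stream position.
    dst.write(first_prefix + src.readline())
    # The loop therefore starts at line 2; no per-line check is required.
    for line in src:
        dst.write(other_prefix + line)


src = io.StringIO("| name | car |\n| Paul |Buick|\n|Ringo |WV   |\n")
dst = io.StringIO()
prefix_stream(src, dst, "|SYS  ", "| XXX ")
result = dst.getvalue()
```

The same pattern applies to a text file opened with `open()` in Python 3. Note that for an empty input, `readline()` returns an empty string, so only the first prefix would be written.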


           
            
             
              
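@Kokokoko Please see my answer, which I updated based on your comment.

I did it my own way, but I also tried your method and it works well. Thanks for introducing me to in_place.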


          
           
            
             
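In the end, the enumerate approach was not suitable for reading such large files line by line while numbering the lines. I used readlines instead. Now I read the file in separate pieces, then write and append to the file with the prefix strings.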
             
import os
import glob
from pathlib import Path
cwd = 'C:\\Users\\EricClapton\\'
directory = cwd
txt_files = os.path.join(directory, '*.txt')
for txt_file in glob.glob(txt_files):
    cpath =(Path(txt_file).resolve().stem)
    nametxt = "-".join(cpath.split('_')[0:1])
    amendtext = "|  " + nametxt
    systext = "|   SYS"
with open(txt_file,'r', errors='ignore') as f:
    get_all=f.readlines()[:1]
with open(txt_file,'r', errors='ignore') as s:
    get_itdone=s.readlines()[1:]
with open(txt_file, 'w') as k:
    for line in get_all:
        k.write(systext + line)