
I need to get a line count of a large file (hundreds of thousands of lines) in python. What is the most efficient way both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

is it possible to do any better?

I would add i=-1 before the for loop, since this code doesn't work for empty files. – Maciek Sawicki Dec 27 '11 at 16:13
@Legend: I bet pico is thinking, get the file size (with seek(0,2) or equiv), divide by approximate line length. You could read a few lines at the beginning to guess the average line length. – Anne Feb 7 '12 at 17:02
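A rough sketch of that estimation idea (the function name, sample size, and short-file handling are my own choices; this is an approximation only, not code from the thread):

import os

def estimate_line_count(fname, sample_bytes=65536):
    # Estimate only: sample the start of the file, compute the average
    # line length in the sample, and divide the total size by it.
    size = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        sample = f.read(sample_bytes)
    newlines = sample.count(b'\n')
    if newlines == 0:
        # Short or single-line file: nothing to extrapolate from.
        return 1 if size else 0
    avg_line_len = len(sample) / float(newlines)
    return int(round(size / avg_line_len))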

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.
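To illustrate the constant-memory point, a minimal sketch (the function name and 1 MB chunk size are arbitrary choices of mine, not from the answer):

def count_newlines(fname):
    # Read fixed-size binary chunks and count newlines; memory use stays
    # flat no matter how large the file is.
    lines = 0
    with open(fname, 'rb') as f:
        chunk = f.read(1024 * 1024)
        while chunk:
            lines += chunk.count(b'\n')
            chunk = f.read(1024 * 1024)
    return lines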

Exactly, even wc is reading through the file, but in C and it's probably pretty optimized. – Ólafur Waage May 10 '09 at 10:38
As far as I understand, the Python file IO is done through C as well. docs.python.org/library/stdtypes.html#file-objects – Tomalak May 10 '09 at 10:41
@Tomalak That's a red herring. While python and wc might be issuing the same syscalls, python has opcode dispatch overhead that wc doesn't have. – bobpoekert Jan 11 '13 at 22:53
You can approximate a line count by sampling. It can be thousands of times faster. See: documentroot.com/2011/02/… – Erik Aronesty Jun 14 '16 at 20:30
Other answers seem to indicate this categorical answer is wrong, and should therefore be deleted rather than kept as accepted. – Skippy le Grand Gourou Jan 25 '17 at 13:59

The following comments refer to the sum(1 for line in open('myfile.txt')) one-liner used elsewhere in the thread:

It's similar to sum(sequence of 1), every line counting as 1: >>> [ 1 for line in range(10) ] gives [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], and >>> sum( 1 for line in range(10) ) gives 10. – James Sapam Dec 13 '13 at 5:22
num_lines = sum(1 for line in open('myfile.txt') if line.rstrip()) to filter empty lines – Honghe.Wu Mar 3 '14 at 9:26
As we open a file, will this be closed automatically once we iterate over all the elements? Is it required to call 'close()'? I think we cannot use 'with open()' in this short statement, right? – Mannaggia Mar 18 '14 at 15:31
@Mannaggia you're correct, it would be better to use 'with open(filename)' to be sure the file closes when done, and even better to do this within a try-except block, where an IOError exception is thrown if the file cannot be opened. – BoltzmannBrain May 20 '15 at 22:58
Another thing to note: this is ~0.04-0.05 seconds slower than the one the original problem gave, on a 300-thousand-line text file. – andrew Dec 3 '15 at 14:05
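A short sketch combining the suggestions from those comments, i.e. a with-block so the file is closed deterministically, plus the optional blank-line filter (the function name and keyword argument are mine, not from the thread):

def count_lines(filename, skip_blank=False):
    # The with-block closes the file even if iteration raises; skip_blank
    # reproduces the variant from the comments that ignores empty lines.
    with open(filename) as f:
        if skip_blank:
            return sum(1 for line in f if line.rstrip())
        return sum(1 for line in f)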

I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Edit: numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict
def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines
def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines
def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    return lines
def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
counts = defaultdict(list)
for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)
for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
        
            
                The entire memory-mapped file isn't loaded into memory. You get a virtual memory space, which the OS swaps into and out of RAM as needed. Here's how they're handled on Windows: msdn.microsoft.com/en-us/library/ms810613.aspx
                    – Ryan Ginstrom
                May 12 '09 at 14:38
                Sorry, here's a more general reference on memory-mapped files: en.wikipedia.org/wiki/Memory-mapped_file And thanks for the vote. :)
                    – Ryan Ginstrom
                May 12 '09 at 14:45
Even though it's just virtual memory, it is precisely what limits this approach, and therefore it won't work for huge files. I've tried it with a ~1.2 GB file with over 10 million lines (as obtained with wc -l) and just got a WindowsError: [Error 8] Not enough storage is available to process this command. Of course, this is an edge case.
                    – SilentGhost
                May 12 '09 at 16:24
                +1 for real timing data. Do we know if the buffer size of 1024*1024 is optimal, or is there a better one?
                    – Kiv
                Jun 19 '09 at 20:07
                    

I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)
    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)
def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

This can be done completely with generator expressions inline using itertools, but it gets pretty weird looking:

from itertools import (takewhile,repeat)
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46
        
            
                I am working with 100Gb+ files, and your rawgencounts is the only feasible solution I have seen so far. Thanks!
                    – soungalo
                Nov 10 '15 at 11:47
                found this in another comment, I guess it is then gist.github.com/zed/0ac760859e614cd03652
                    – Anentropic
                Nov 11 '15 at 18:33
                Thanks @michael-bacon, it's a really nice solution. You can make the rawincount solution less weird looking by using bufgen = iter(partial(f.raw.read, 1024*1024), b'') instead of combining takewhile and repeat.
                    – Peter H.
                Aug 6 at 6:32
                Oh, partial function, yeah, that's a nice little tweak.  Also, I assumed that the 1024*1024 would get merged by the interpreter and treated as a constant but that was on hunch not documentation.
                    – Michael Bacon
                Aug 8 at 16:20
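For reference, the tidier variant suggested in that comment might look like this (a sketch; the function name is mine):

from functools import partial

def rawpartialcount(filename):
    # iter() with a b'' sentinel replaces the takewhile/repeat pair;
    # otherwise identical to rawincount above.
    f = open(filename, 'rb')
    bufgen = iter(partial(f.raw.read, 1024 * 1024), b'')
    return sum(buf.count(b'\n') for buf in bufgen)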
Another answer shells out to wc -l in a subprocess:

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
        
            
                You can refer to this SO question regarding that. stackoverflow.com/questions/247234/…
                    – Ólafur Waage
                May 10 '09 at 10:32




    

                Indeed, in my case (Mac OS X) this takes 0.13s versus 0.5s for counting the number of lines "for x in file(...)" produces, versus 1.0s counting repeated calls to str.find or mmap.find. (The file I used to test this has 1.3 million lines.)
                    – bendin
                May 10 '09 at 12:06
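For comparison, the "repeated calls to mmap.find" approach mentioned in that comment would look roughly like this (my reconstruction, not code from the thread):

import mmap

def mmap_find_count(fname):
    # Count newlines by repeatedly searching the memory-mapped file;
    # in the comment's timing this was slower than iterating over lines.
    with open(fname, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lines = 0
        pos = mm.find(b'\n')
        while pos != -1:
            lines += 1
            pos = mm.find(b'\n', pos + 1)
        mm.close()
        return lines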
                    

Here is a Python program that uses the multiprocessing library to distribute the line counting across cores. My test improves counting of a 20-million-line file from 26 seconds to 7 seconds using an 8-core Windows 64-bit server. Note: not using memory mapping makes things much slower.

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers
def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )
def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid)
    logging.info( 'start' )
    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]
    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )
    #work out file size to divide up line counting
    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1
    lines = 0
    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)
    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1
    if _seedStart < int(seekStart + 1):
        seekStart += 1
    if seekEnd > fSize:
        seekEnd = fSize
    #find where to start
    if pid > 0:
        m1.seek( seekStart )
        #read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()
    #tell previous rank my seek start to make their seek end
    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get()
    m1.seek( seekStart )
    l1 = m1.readline()
    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break
    logging.info( 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted
    else:
        queues[0].put(lines)
    m1.close()
    physical_file.close()
if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal( 'parameters required: file-name [processes]' )
        exit()
    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    jobs=[]
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
        p.start()
        jobs.append(p)
    jobs[0].join() #wait for counting to finish
    lines = queues[0].get()
    logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
        
            
                How does this work with files much bigger than main memory?  for instance a 20GB file on a system with  4GB RAM and 2 cores
                    – Brian Minton
                Sep 23 '14 at 21:18
                This is pretty neat code. I was surprised to find that it is faster to use multiple processors. I figured that the IO would be the bottleneck. In older Python versions, line 21 needs int() like    chunk = int((fSize / processes)) + 1
                    – Karl Henselin
                Dec 30 '14 at 19:45
Does it load the whole file into memory? What about a bigger file where the size is bigger than the RAM on the computer?
                    – pelos
                Dec 21 '18 at 21:30
                The files are mapped into virtual memory, so the size of the file and the amount of actual memory is usually not a restriction.
                    – Martlark
                Dec 23 '18 at 22:51
                    

I would use Python's file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())

This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.

While this is one of the first ways that comes to mind, it probably isn't very memory efficient, especially when counting lines in files up to 10 GB (like I do), which is a noteworthy disadvantage. – Steen Schütt Apr 17 '14 at 15:36
@TimeSheep Is this an issue for files with many (say, billions of) small lines, or files which have extremely long lines (say, gigabytes per line)? – robert Jun 3 '18 at 17:40
The reason I ask is, it would seem that the compiler should be able to optimize this away by not creating an intermediate list. – robert Jun 3 '18 at 17:41
@dmityugov Per the Python docs, xreadlines has been deprecated since 2.3, as it just returns an iterator. for line in file is the stated replacement. See: docs.python.org/2/library/stdtypes.html#file.xreadlines – Kumba Aug 5 '18 at 22:53

Another answer counts the lines with a generator expression:

""" Count number of lines in a file."""
f = open(full_path)
nr_of_lines = sum(1 for line in f)
f.close()
return nr_of_lines

And another one shells out to the wc utility:

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

UPDATE: This is marginally faster than using pure Python, but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

Coreutils apparently provides "wc" for Windows: stackoverflow.com/questions/247234/…. You can also use a Linux VM in your Windows box if your code will end up running in Linux in prod. – radtek Feb 25 at 16:45
Or WSL, highly advised over any VM if stuff like this is the only thing you do. :-) – Bram Vanroy Feb 25 at 16:59
Yeah, that works. I'm not a Windows guy, but from googling I learned WSL = Windows Subsystem for Linux =) – radtek Feb 25 at 21:39
If you want to be a surfer of Python, say goodbye to Windows. Believe me, you will thank me one day. – TheExorcist Jan 22 '17 at 10:38
I just considered it noteworthy that this will only work on Windows. I prefer working on a Linux/Unix stack myself, but when writing software, IMHO one should consider the side effects a program could have when run under different OSes. As the OP did not mention his platform, and in case anyone pops onto this solution via Google and copies it (unaware of the limitations a Windows system might have), I wanted to add the note. – Kim Jan 22 '17 at 12:42

I got a small (4-8%) improvement with this version which re-uses a constant buffer so it should avoid any memory or GC overhead:

lines = 0
buffer = bytearray(2048)
with open(filename, 'rb') as f:
    n = f.readinto(buffer)
    while n > 0:
        # count only within the bytes actually read, so leftover data
        # from the previous chunk isn't counted twice on the last read
        lines += buffer.count(b'\n', 0, n)
        n = f.readinto(buffer)

You can play around with the buffer size and maybe see a little improvement.
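To experiment with buffer sizes, a small timing harness along these lines could be used (big_file.txt and the sizes tried are placeholders of mine, not from the answer):

import time

def readinto_count(filename, buf_size):
    # Same readinto-based counter as above, parameterized by buffer size.
    lines = 0
    buf = bytearray(buf_size)
    with open(filename, 'rb') as f:
        n = f.readinto(buf)
        while n > 0:
            lines += buf.count(b'\n', 0, n)
            n = f.readinto(buf)
    return lines

for size in (2048, 64 * 1024, 1024 * 1024):
    start = time.time()
    readinto_count('big_file.txt', size)
    print(size, time.time() - start)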

Nice. To account for files that don't end in \n, add 1 outside the loop if buffer and buffer[-1] != '\n'. – ryuusenshi Nov 14 '13 at 18:37
What if, in between buffers, one portion ends with \ and the other portion starts with n? That will miss one newline in there; I would suggest two variables to store the end and the start of each chunk, but that might add more time to the script =( – pelos Dec 19 '18 at 15:47

num_lines = sum(1 for line in open('my_file.txt')) is probably best; an alternative for this is

num_lines =  len(open('my_file.txt').read().splitlines())

Here is a comparison of the performance of both:

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop
In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
        
            
                    

A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

def line_count(file):
    return int(subprocess.check_output('wc -l {}'.format(file), shell=True).split()[0])
        
            
                This answer should be voted up to a higher spot in this thread for Linux/Unix users. Despite the majority preferences in a cross-platform solution, this is a superb way on Linux/Unix. For a 184-million-line csv file I have to sample data from, it provides the best runtime. Other pure python solutions take on average 100+ seconds whereas subprocess call of wc -l takes ~ 5 seconds.
                    – Shan Dou
                Jun 27 '18 at 16:06
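If the file name can contain spaces or shell metacharacters, a variant that passes an argument list instead of a shell string might be safer (a sketch, not from the thread):

import subprocess

def line_count(file):
    # Same wc -l call, but without shell=True, so the filename is passed
    # through unmodified and needs no quoting.
    out = subprocess.check_output(['wc', '-l', file])
    return int(out.split()[0])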
                    

This is the fastest thing I have found using pure python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial

buffer = 2**16
with open(myfile) as f:
    print sum(x.count('\n') for x in iter(partial(f.read, buffer), ''))

I found the answer here: Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It's a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.


Just to complete the above methods I tried a variant with the fileinput module:

import fileinput as fi

def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()

And passed a 60-million-line file to all the above-stated methods:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It's a bit of a surprise to me that fileinput is that bad and scales far worse than all the other methods...

Assorted comments from other answers in the thread:

But is it? At least on OS X/Python 2.5 the OP's version is still about 10% faster according to timeit.py. – dF. May 10 '09 at 11:47
I don't know how you tested it, dF, but on my machine it's ~2.5 times slower than any other option. – SilentGhost May 11 '09 at 16:25
You state that it will be the fastest and then state that you haven't tested it. Not very scientific, eh? :) – Ólafur Waage May 11 '09 at 18:37
Maybe also explain (or add a comment in the code) what you changed and what for ;). It might give people more insight into your code much more easily (rather than "parsing" the code in the brain). – Styxxy Nov 6 '12 at 0:50
The loop optimization, I think, allows Python to do a local variable lookup at read_f: python.org/doc/essays/list2str – The Red Pea Apr 3 '15 at 15:39
Optional second argument for enumerate() is the start count, according to docs.python.org/2/library/functions.html#enumerate – MarkHu Jul 24 '17 at 22:59
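Building on the comment about enumerate()'s start argument, a sketch that also handles empty files (the function name is mine, not from the thread):

def enumerate_count(fname):
    # enumerate(f, 1) numbers lines starting at 1; initializing count to 0
    # means an empty file returns 0 instead of raising NameError.
    count = 0
    with open(fname) as f:
        for count, _ in enumerate(f, 1):
            pass
    return count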

If one wants to get the line count cheaply in Python in Linux, I recommend this method:

import os
print os.popen("wc -l %s" % file_path).readline().split()[0]

Here file_path can be either an absolute or a relative path. Hope this helps.


the result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

this is more concise than your explicit loop, and avoids the enumerate.

Yep, good point, although I wonder about the speed (as opposed to memory) difference. It's probably possible to create an iterator that does this, but I think it would be equivalent to your solution. – Andrew Jaffe May 10 '09 at 11:53
import os
import subprocess
Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )

, where Filename is the absolute path of the file.

def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
        
            
                Could you please explain what is wrong with it if you think it is wrong? It worked for me. Thanks!
                    – jciloa
                Dec 20 '17 at 17:04
                I would be interested in why this answer was downvoted, too. It iterates over the file by lines and sums them up. I like it, it is short and to the point, what's wrong with it?
                    – cessor
                Mar 16 '18 at 11:23