def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is it possible to do any better?
You can't get any better than that.
After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.
Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.
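For reference, a minimal sketch of that memory-frugal, I/O-bound approach (the helper name count_lines is mine, not from the answer): it streams the file and only ever holds one line in memory.

def count_lines(fname):
    # Stream the file line by line; at most one line is held in memory at a time.
    with open(fname) as f:
        return sum(1 for _ in f)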
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/845081#845081
I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer-read solution offered by Mykola Kharechko (bufcount).
I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.
Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor
Here are my results:
mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714
Edit: numbers for Python 2.6:
mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297
So the buffer read strategy seems to be the fastest for Windows/Python 2.6.
Here is the code:
from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read  # loop optimization
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/850962#850962
I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).
All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)
Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:
def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)
    return lines
Using a separate generator function, this runs a smidge faster:
def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum(buf.count(b'\n') for buf in f_gen)
This can be done completely with generator expressions in-line using itertools, but it gets pretty weird-looking:
from itertools import (takewhile, repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
    return sum(buf.count(b'\n') for buf in bufgen)
Here are my timings:
function       average, s   min, s   ratio
rawincount     0.0043       0.0041   1.00
rawgencount    0.0044       0.0042   1.01
rawcount       0.0048       0.0045   1.09
bufcount       0.008        0.0068   1.64
wccount        0.01         0.0097   2.35
itercount      0.014        0.014    3.41
opcount        0.02         0.02     4.83
kylecount      0.021        0.021    5.05
simplecount    0.022        0.022    5.25
mapcount       0.037        0.031    7.46
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/27518377#27518377
import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/845069#845069
Here is a Python program that uses the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20-million-line file from 26 seconds to 7 seconds using an 8-core Windows 64-bit server. Note: not using memory mapping makes things much slower.
import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel(logging.INFO)
    logger.handlers.append(logging.StreamHandler())
    logger.handlers[0].setFormatter(logging.Formatter(console_format, '%d/%m/%y %H:%M:%S'))

def getFileLineCount(queues, pid, processes, file1):
    init_logger(pid)
    logging.info('start')

    physical_file = open(file1, "r")
    # mmap.mmap(fileno, length[, tagname[, access[, offset]]]
    m1 = mmap.mmap(physical_file.fileno(), 0, access=mmap.ACCESS_READ)

    # work out file size to divide up line counting
    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    # get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid + 1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1
    if _seedStart < int(seekStart + 1):
        seekStart += 1
    if seekEnd > fSize:
        seekEnd = fSize

    # find where to start
    if pid > 0:
        m1.seek(seekStart)
        # read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()

    # tell previous rank my seek start to make their seek end
    if pid > 0:
        queues[pid - 1].put(seekStart)
    if pid < processes - 1:
        seekEnd = queues[pid].get()

    m1.seek(seekStart)
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info('done')

    # add up the results
    if pid == 0:
        for p in range(1, processes):
            lines += queues[0].get()
        queues[0].put(lines)  # the total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger('main')
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal('parameters required: file-name [processes]')
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])

    queues = []  # a queue for each process
    for pid in range(processes):
        queues.append(multiprocessing.Queue())

    jobs = []
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process(target=getFileLineCount, args=(queues, pid, processes, file_name,))
        p.start()
        jobs.append(p)

    jobs[0].join()  # wait for counting to finish
    lines = queues[0].get()
    logging.info('finished {} Lines:{}'.format(time.time() - t, lines))
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/6826326#6826326
I would use Python's file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())
This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/19248109#19248109
""" Count number of lines in a file."""
f = open(full_path)
nr_of_lines = sum(1 for line in f)
f.close()
return nr_of_lines
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/845075#845075
import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using the wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])
UPDATE: This is marginally faster than using pure Python, but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/45334571#45334571
I got a small (4-8%) improvement with this version, which re-uses a constant buffer so it should avoid any memory or GC overhead:
lines = 0
buffer = bytearray(2048)
with open(filename) as f:
    while f.readinto(buffer) > 0:
        lines += buffer.count('\n')
You can play around with the buffer size and maybe see a little improvement.
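If you do want to experiment, here is a sketch of the same readinto idea with the buffer size exposed as a parameter (the function name and the binary-mode/b'\n' adjustments for Python 3 are my own assumptions, not part of the answer):

def readinto_count(filename, buf_size=2**16):
    # Re-use a single bytearray as the read buffer; only buf_size bytes live in memory.
    buf = bytearray(buf_size)
    lines = 0
    with open(filename, 'rb') as f:  # binary mode so readinto and b'\n' work on Python 3
        while True:
            n = f.readinto(buf)
            if n == 0:
                break
            lines += buf[:n].count(b'\n')  # count only within the bytes actually read
    return lines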
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/15074925#15074925
num_lines = sum(1 for line in open('my_file.txt'))
is probably best; an alternative for this is
num_lines = len(open('my_file.txt').read().splitlines())
Here is a comparison of the performance of both:
In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop
In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/26375032#26375032
A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

import subprocess

def line_count(file):
    return int(subprocess.check_output('wc -l {}'.format(file), shell=True).split()[0])
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/43179213#43179213
This is the fastest thing I have found using pure Python.
You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.
from functools import partial

buffer = 2**16
with open(myfile) as f:
    print sum(x.count('\n') for x in iter(partial(f.read, buffer), ''))
I found the answer here: Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It's a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/41068461#41068461
Just to complete the above methods, I tried a variant with the fileinput module:

import fileinput as fi

def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()
And passed a 60-million-line file to all the above-stated methods:
mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974
It's a little surprising to me that fileinput performs that badly and scales far worse than all the other methods...
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/2772853#2772853
If one wants to get the line count cheaply in Python in Linux, I recommend this method:
import os
print os.popen("wc -l file_path").readline().split()[0]
file_path can be either an absolute or a relative file path. Hope this may help.
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/25544973#25544973
The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
    return len(list(f))

This is more concise than your explicit loop, and avoids the enumerate.
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/845157#845157
import os
import subprocess
Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )
where Filename is the absolute path of the file.
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/26695897#26695897
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/47856283#47856283