添加链接
link之家
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接
Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I'm making a Python script which parses data files. The parsed data is then sent to an Excel file. The data can be rather huge. I'm looking at 10 to 20 columns, but the number of rows can be well over 100.000.

Writing this amount of data to Excel with win32com takes a considerable amount of time. My first step was to iterate over every cell in the Excel file, which is very time consuming. After some digging I found out how I can write a row with a single call, reducing the time needed drastically.

However, when I need to send 100.000 rows of data to Excel, it will still take a lot of time. I'm pretty sure that I can further speed this up, by sending the complete array with a single call. However I'm not able to achieve this so far.

See the code below, which demonstrates the issue. The code shows the differences in time. However, the third step, sending the complete array with a single call to a range is not resulting in the correct data in Excel. What am I doing wrong?

import win32com.client
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 1000:
    data_array.append(row)
    i += 1
#write the array to an excel file
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
excel.ScreenUpdating = False
book = excel.Workbooks.Add()
sh1 = book.Worksheets(1)
sh2 = book.Worksheets(2)
sh3 = book.Worksheets(3)
#double loop, writing individual cells
print "Writing with double loop to inidividual cells."
start = time.time()
row = 0
for line in data_array:
    row += 1
    col = 0
    for field in line:
        col += 1
        sh1.Cells(row, col).Value = field
print "Processing time: " + str(time.time() - start) + " seconds."
#single loop, writing a row to a range
print "Writing with double loop to inidividual cells."
start = time.time()
row = 0
for line in data_array:
    row += 1
    sh2.Range(sh2.Cells(row,1), sh2.Cells(row, len(line))).Value = line
print "Processing time: " + str(time.time() - start) + " seconds."
#no loop, write array to range
print "Writing with double loop to inidividual cells."
start = time.time()
    sh3.Range(sh3.Cells(row,1), sh3.Cells(len(data_array), len(data_array[0]))).Value = data_array
    print "Processing time: " + str(time.time() - start) + " seconds."
except:
    print "Failed to write array to excel file."
excel.ScreenUpdating = True
sheet = None
book = None
excel.Quit()
excel = None

I have been doing some research on this. And have been coming to some interesting conclusions.

There are indeed multiple solutions around to write data to excel from python. I ended up focussing on three modules.

win32com.client

Works slow. However the document can be opened in excel. So the end result is available for the user in excel to start working on. Not great for large amounts of data.

Writing 10,000 rows with 10 columns takes on my computer (core i5) app. 70 seconds.

import win32com.client
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
    data_array.append(row)
    i += 1
#write the array to an excel file
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
excel.ScreenUpdating = False
book = excel.Workbooks.Add()
sheet = book.Worksheets(1)
#single loop, writing a row to a range
print "Writing using win32com.client"
start = time.time()
row = 0
for line in data_array:
    row += 1
    sheet.Range(sheet.Cells(row,1), sheet.Cells(row, len(line))).Value = line
print "Processing time: " + str(time.time() - start) + " seconds."
print "Completed: " + str(time.time() - start) + " seconds."
excel.ScreenUpdating = True
sheet = None
book = None
excel.Quit()
excel = None

openpyxl

Is a bit faster, but still not great. This module is slow on transfering the data to the excel memory object, but save it really fast. It creates the 10,000 lines with 10 columns in 22.3 seconds, and saves the file in another 0.5 seconds. When I test this with 100,000 rows and 10 columns. The data is created in 228.3 seconds, and saving the file is done in another 2.9 seconds. Fairly slow, but the file saving is fast. Hence openpyxl is probably good for making changes (formatting) to existing data, although I have not testing this yet. Another advantage is that coding with openpyxl is easier then using win32com.client.

import openpyxl
import sys
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
    data_array.append(row)
    i += 1
#create an excel workbook and sheet object
book = openpyxl.Workbook(optimized_write = True)
#single loop, writing rows
start = time.time()
print "Writing with single loop using openpyxl"
sheet = book.create_sheet()
for line in data_array:
    sheet.append(line)
print "Processing time: " + str(time.time() - start) + " seconds."
#close and save the file.
book.save('openpyxl.xlsx')
print "Total time: " + str(time.time() - start) + " seconds."

I am facing another issue with openpyxl. In my real tool, openpyxl is having problems with saving big amounts of data (> 10,000 rows). This I have not figured out yet, and maybe I won't bother to look further in to it.

PyExcelerate

This one is first of all fast. It creates the 10,000 rows and 10 columns in just 0.17 seconds. Saving the file however takes 2.2 seconds. By far the fastest option of the three. When I try to save 100,000 rows and 10 columns with this module, the excel data is created in just 1.8 seconds. However saving the file now takes 21.7 seconds. Hence, this module is really fast, but there is a penalty in writing the file. Overall still the fastest of them all. Another advantage of PyExcelerate is that coding is very easy, easier again as openpyxl.

import pyexcelerate
import sys
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
    data_array.append(row)
    i += 1
print sys.version
#create an excel workbook and sheet object
book = pyexcelerate.Workbook()
#single loop, writing rows
start = time.time()
print "Writing with single loop using PyExcelerate"
book.new_sheet("data", data = data_array)
print "Processing time: " + str(time.time() - start) + " seconds."
#close and save the file.
book.save('pyexcelerate.xlsx')
print "Total time: " + str(time.time() - start) + " seconds."

So my conclusion is that PyExcelerate is by far the fastest. The advantage of win32com.client is that the created excel file can be openend in excel to make the created data available for the user to start working with it. Openpyxl could be interesting to make style changes after creating it. However, I have not tested this yet. Hence, it could be beneficial to combine win32com.client, openpyxl and PyExcelerate in one application.

Very interesting, I did not know about pyexcelerate. It seems it also supports cell formatting and so on. The only thing I'd like to add is that the win32com's advantage of opening the file for the user, is a very poor advantage. Opening a file from python is just typing one line -and would work faster than via win32com-. Cheers. – bgusach Feb 14, 2014 at 7:38 How would you open the excel file in excel then? I would be very interested to know that. – Eric Feb 14, 2014 at 20:10 As easy as os.system('start ' + filepath). It opens the excel file with the default programme, which is normally excel :) – bgusach Feb 15, 2014 at 22:29 Sometimes the world is just so easy. :-) Thanks for pointing this out. Works really great. According to the python docs it is: os.startfile(path), which works great. – Eric Feb 25, 2014 at 16:16

I know the post is a bit old.
However, being stuck with win32com for some other reason (macro executions) and needing a similar solution to your 3rd sheet way (complete array at once), I tried the code of your initial post and found a small mistake that prevented it to work.
So to answer your original question "What am I doing wrong?":
You forgot to re-initialize the row variable to 1 before using it in the '#no loop, write array to range' part.
Here is the win32com part of your comparison post updated then:

print "Writing using win32com.client"
start = time.time()
row = 1
sheet.Range(sheet.Cells(row,1), sheet.Cells(row+len(data_array)-1, len(data_array[0]))).Value = data_array
print "Processing time: " + str(time.time() - start) + " seconds."
book.SaveAs(Filename="C:\Temp\Temp.xls", FileFormat=56)
print "Completed: " + str(time.time() - start) + " seconds."

Then, timing is no so bad:
Writing using win32com.client
Processing time: 0.322000026703 seconds.
Completed: 1.73699998856 seconds.
Pretty fast then. Maybe your comparison can be updated since the PyExcelerate numbers are not so different anymore (and my computer is slower).

@ThomasF I get the following ValueError when I use above method: ndarray is not C-contiguous. Any idea why? Reference: stackoverflow.com/questions/61743292/… – Aroosh Rana May 12, 2020 at 13:51

Using the COM for reading files from a excel file is a extreme waste of time. It's like killing flyes with a tank. Take into account that the win32com does complicated calls with the windows API, that talk to excel, retrieves the data and sends it back to python. Why do that when the information is already there as a file?

There are libraries that parse directly the excel file, and as you can imagine they can be x100 times faster, since there is no over-complex calls to the win API.

I've worked a lot successfully with openpyxl, but there are others libraries out there that can be as good or even better.

Just an example for huge data (uses generators instead of loading everything into memory):

from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name(name='big_data') # ws is now an IterableWorksheet
for row in ws.iter_rows(): # it brings a new method: iter_rows()
     for cell in row:
         print cell.internal_value

Equivalent methods are available to write into cells. You can even format them, although it is not (or used to be) very complete.

Example on how to write a big amount of info to a xlsx file:

from openpyxl import Workbook
from openpyxl.cell import get_column_letter
wb = Workbook()
dest_filename = r'empty_book.xlsx'
ws = wb.active
ws.title = "range names"
for col_idx in xrange(1, 40):
    col = get_column_letter(col_idx)
    for row in xrange(1, 600):
        ws.cell('%s%s'%(col, row)).value = '%s%s' % (col, row)
ws = wb.create_sheet()
ws.title = 'Pi'
ws['F5'] = 3.14
wb.save(filename=dest_filename)
                I'm not trying to read anything from excel. I'm trying to write data to an excel file. The nice thing about win32com is, that the data will be send to excel, and the user has the file open and can start working with it.
– Eric
                Jan 31, 2014 at 14:09
                Nothing you can't do with openpyxl x100 times faster. Using excel and the win32com is a LOT of overhead. Apart from huge speed, working directly with the .xlsx file does not even need that excel is installed.
– bgusach
                Jan 31, 2014 at 14:15
        

Thanks for contributing an answer to Stack Overflow!

  • Please be sure to answer the question. Provide details and share your research!

But avoid

  • Asking for help, clarification, or responding to other answers.
  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.