Thanks for your reply. The openpyxl part (reading the workbook) works fine. I printed some debug information and found that when execution reaches the np.array call, after about ten seconds, the memory usage goes high. So I think numpy is unable to manage the memory.
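One way around the spike might be to preallocate the array and fill it row by row, so the big intermediate list of lists that np.array builds never exists. A minimal sketch, assuming every cell holds a number (an empty cell would make the float assignment fail) and that the sheet's reported dimensions are accurate:

import numpy as np
from openpyxl import load_workbook

wb = load_workbook(fname, read_only=True)  # fname as in my original code
ws = wb.worksheets[0]

# allocate the full array up front instead of collecting Python lists
a = np.empty((ws.max_row, ws.max_column), dtype=float)
for i, row in enumerate(ws.rows):
    for j, cell in enumerate(row):
        a[i, j] = cell.value  # assumes purely numeric cells, no None

I have not verified yet whether this helps on the 100K-row file.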
Regards,
Mahmood


On Wednesday, May 10, 2017 7:25 PM, Peter Otten <__pete...@web.de> wrote:

Mahmood Naderan via Python-list wrote:

> Hello,
>
> The following code, which uses openpyxl and numpy, fails to read large
> Excel (xlsx) files. The file is 20 MB and contains 100K rows and 50
> columns.
>
> W = load_workbook(fname, read_only=True)
> p = W.worksheets[0]
> a = []
> m = p.max_row
> n = p.max_column
> np.array([[i.value for i in j] for j in p.rows])
>
> How can I fix that? I am stuck at this problem. For medium-sized files
> (16K rows and 50 columns) it is fine.

The docs at
https://openpyxl.readthedocs.io/en/default/optimized.html#read-only-mode
promise "(near) constant memory consumption" for the sample script below:

from openpyxl import load_workbook

wb = load_workbook(filename='large_file.xlsx', read_only=True)
ws = wb['big_data']
for row in ws.rows:
    for cell in row:
        print(cell.value)

If you change only the file and worksheet name to your needs -- does the
script run to completion in reasonable time (redirect stdout to /dev/null)
and with reasonable memory usage?

If it does you may be wasting memory elsewhere; otherwise you might need to
convert the xlsx file to csv using your spreadsheet application before
processing the data in Python.
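Should the csv route turn out to be necessary, the Python side can stay
simple. A minimal sketch, assuming a comma-separated file of purely numeric
values with no header row (the filename is hypothetical; np.loadtxt is one
option, the csv module is another):

import numpy as np

# hypothetical filename; loadtxt parses the whole csv into one float array
data = np.loadtxt('large_file.csv', delimiter=',')
print(data.shape)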