On Thursday, 21 January 2016 at 19:08:38 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear
wrote:
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
It's interesting that the first array in the output is not the
same as the input.
byLine reuses a buffer (for speed) and the subsequent split
operation just returns slices into that buffer. So when
byLine progresses to the next line, the strings (slices)
returned previously now point into a buffer with different
contents. You should either use byLineCopy or .idup to create
copies of the relevant strings. If your use-case allows for
streaming and doesn't require having all the data present at
once, you could continue to use byLine and just be careful not
to refer to previous rows.
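To make the aliasing concrete, here is a rough analogy in Python rather than D. It is only a sketch: by_line and split_fields are hypothetical helpers written for this illustration, not part of Phobos or any library, and the sketch assumes every line fits in the fixed-size buffer. A bytearray stands in for the reused byLine buffer, and memoryview slices stand in for the D slices returned by split.

```python
def by_line(lines, size=64):
    # Hypothetical stand-in for a buffer-reusing reader: every yielded
    # memoryview is a slice of the same bytearray.
    # Assumes each line is at most `size` bytes long.
    buf = bytearray(size)
    for line in lines:
        n = len(line)
        buf[:n] = line              # overwrite the shared buffer in place
        yield memoryview(buf)[:n]   # a view into the buffer, not a copy


def split_fields(view, sep=b"|"):
    # Return sub-slices of `view`, one per field, without copying the fields.
    data = view.tobytes()  # temporary copy, used only to locate separators
    fields, start = [], 0
    for part in data.split(sep):
        fields.append(view[start:start + len(part)])
        start += len(part) + len(sep)
    return fields


rows = [b"a|b|c", b"x|y|z"]

# Keeping the slices: every stored row aliases the one shared buffer.
aliased = [split_fields(v) for v in by_line(rows)]
aliased_as_bytes = [[f.tobytes() for f in row] for row in aliased]

# Copying each field as it is read (the role .idup or byLineCopy plays in D).
copied = [[f.tobytes() for f in split_fields(v)] for v in by_line(rows)]

print(aliased_as_bytes)  # both rows show the contents of the last line
print(copied)            # each row keeps its own contents
```

With the slices kept, every stored row ends up showing the last line read, because they all point into the same buffer. Copying each field as it is read avoids that, at the cost of one allocation per field.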
Thanks. It now works with byLineCopy()
Time (s): 1.128
The timing is now similar to pandas in Python:
# Script (Python 2.7.6)
import pandas as pd
import time

# Read all 22 columns (col1 .. col22) as strings.
col_types = {'col%d' % i: str for i in range(1, 23)}

# Time only the read_csv call.
begin = time.time()
x = pd.read_csv('Acquisition_2009Q2.txt', sep='|', dtype=col_types)
end = time.time()
print(end - begin)
$ python file_read.py
1.19544792175