taotao li created ARROW-7043:
--------------------------------
Summary: pyarrow.csv.read_csv consumes much more memory than raw
pandas.read_csv
Key: ARROW-7043
URL: https://issues.apache.org/jira/browse/ARROW-7043
Project: Apache Arrow
Issue Type: Test
Components: Python
Affects Versions: 0.15.0
Reporter: taotao li
Hi, first of all, thanks a lot for building Arrow. I found this project through Wes's
post: [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]
His ambition to build Arrow to fix the problems in pandas really caught my
eye.
Below is my problem.
Background:
* Our team's analytic work relies heavily on pandas; we often read large csv
files into memory and do various kinds of analytic work.
* We have faced the problems mentioned in Wes's post, especially the `pandas
rule of thumb: have 5 to 10 times as much RAM as the size of your dataset`.
* We are looking for techniques that can help us load our csv (or other
formats, like msgpack, parquet, or something else) using as little memory as
possible.
Experiment:
* Luckily I found Arrow, and I did a simple test.
* input file: a 1.5GB csv file, around 6 million records, 15 columns;
* using pandas as below:
{code:python}
import pandas as pd
{code}
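For reference, here is a minimal sketch of how such a side-by-side memory comparison might look. This is not the original script: the file path, the psutil-based RSS measurement, and the variable names are assumptions for illustration only.
{code:python}
# Sketch of a pandas vs. pyarrow.csv memory comparison (not the original script).
# Assumes psutil, pandas, and pyarrow are installed, and "data.csv" is the ~1.5 GB file.
import os
import psutil

import pandas as pd
import pyarrow.csv as pv


def rss_mb():
    # Resident set size of the current process, in MB.
    # Note: RSS is a rough proxy; the allocator may not return freed memory to the OS.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2


path = "data.csv"  # hypothetical path to the input csv

before = rss_mb()
df = pd.read_csv(path)
print("pandas.read_csv RSS growth: %.0f MB" % (rss_mb() - before))

del df
before = rss_mb()
table = pv.read_csv(path)
print("pyarrow.csv.read_csv RSS growth: %.0f MB" % (rss_mb() - before))
{code}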