[ https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924468#comment-15924468 ]
Wes McKinney commented on ARROW-539:
------------------------------------

Making partitioned tables with Hive or Impala is pretty difficult; here is the code I used to make one:

{code:language=python}
import ibis
import pandas as pd

# Connect to HDFS and to Impala (Impala needs the HDFS client to write data)
hdfs = ibis.hdfs_connect('localhost', port=5070)
con = ibis.impala.connect('localhost', port=21050, hdfs_client=hdfs)

# Small test frame: 2 years x 3 months, repeated 10 times
df = pd.DataFrame({'year': [2009, 2009, 2009, 2010, 2010, 2010],
                   'month': ['1', '2', '3', '1', '2', '3'],
                   'value': [1, 2, 3, 4, 5, 6]})
df = pd.concat([df] * 10, ignore_index=True)

# Load the data into an unpartitioned table first
con.create_database('temp_partition', path='/tmp/my_db')
con.create_table('unpartitioned', df, database='temp_partition')
db = con.database('temp_partition')
unpart_t = db.table('unpartitioned')

# Create an empty table partitioned by (year, month)
part_keys = ['year', 'month']
unique_keys = df[part_keys].drop_duplicates()
con.create_table('partitioned', schema=unpart_t.schema(),
                 database='temp_partition', partition=part_keys)
part_t = db.table('partitioned')

# Insert the rows for each (year, month) pair into its own partition
for year, month in unique_keys.itertuples(index=False):
    select_stmt = unpart_t[(unpart_t.year == year) & (unpart_t.month == month)]
    part = {'year': year, 'month': month}
    part_t.insert(select_stmt, partition=part)
{code}

Now we have

{code}
>>> hdfs.ls('/tmp/my_db/partitioned')
['_impala_insert_staging', 'year=2009', 'year=2010']
>>> hdfs.ls('/tmp/my_db/partitioned/year=2009')
['month=1', 'month=2', 'month=3']
{code}

Finally I ran

{code}
hdfs.get('/tmp/my_db/partitioned', 'partitioned_parquet')
{code}

to download from HDFS.
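For illustration only, here is a minimal sketch of how a reader could recover the partition keys from a directory layout like the one listed above. The helper name `parse_partition_path` is hypothetical and is not part of Arrow, ibis, or the hdfs client; it only assumes that partition directories are named `key=value` and that anything else (such as `_impala_insert_staging` or a data file) should be ignored.

```python
from pathlib import PurePosixPath


def parse_partition_path(path, root):
    """Return [(key, value), ...] for the key=value components of path below root.

    Hypothetical helper for illustration; values are returned as strings,
    exactly as they appear in the directory names.
    """
    relative = PurePosixPath(path).relative_to(PurePosixPath(root))
    pairs = []
    for part in relative.parts:
        key, sep, value = part.partition('=')
        if not sep:
            # Not a key=value component (e.g. _impala_insert_staging); skip it.
            continue
        pairs.append((key, value))
    return pairs


print(parse_partition_path('/tmp/my_db/partitioned/year=2009/month=1',
                           '/tmp/my_db/partitioned'))
```

Note that the values come back as strings ('2009', '1'), so a real reader would still have to decide how to infer or cast the partition column types.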
see attached tarball

> [Python] Support reading Parquet datasets with standard partition directory schemes
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-539
>                 URL: https://issues.apache.org/jira/browse/ARROW-539
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>         Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure (non-partitioned).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)