[ https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rok Mihevc updated ARROW-5430:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21883

> [Python] Can read but not write parquet partitioned on large ints
> -----------------------------------------------------------------
>
>                  Key: ARROW-5430
>                  URL: https://issues.apache.org/jira/browse/ARROW-5430
>              Project: Apache Arrow
>           Issue Type: Bug
>           Components: Python
>     Affects Versions: 0.13.0
>          Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>             Reporter: Robin Kåveland
>             Priority: Minor
>               Labels: parquet, pull-request-available
>              Fix For: 0.14.0
>
>           Time Spent: 50m
>   Remaining Estimate: 0h
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
>
> # hash_array produces unsigned 64-bit integers
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
>
> # The write succeeds
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But the read fails
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
>
> Actual behaviour: the read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
>     **kwargs).to_pandas()
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long{code}
> I set the priority to minor here because it's easy enough to work around this in user code unless you really need the 64-bit hash (and you probably shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
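
For context on the error at the bottom of the traceback above: the partition keys produced by pd.util.hash_array are unsigned 64-bit values, and the failing call (lib.array(integer_keys)) tried to fit them into a signed C long during type inference, so anything above 2**63 - 1 overflowed. A minimal sketch of that boundary; the commented-out call reflects the pyarrow 0.13 behaviour from the traceback, and the explicit uint64 type is shown only as an illustration, not necessarily how the 0.14.0 fix is implemented:

{code:java}
import pyarrow as pa

key = 2 ** 63  # one past the int64 maximum; hash_array can emit values this large

# Under pyarrow 0.13 type inference this raised the error from the traceback:
# "Python int too large to convert to C long"
# pa.array([key])

# Spelling out the unsigned 64-bit type sidesteps the overflow:
arr = pa.array([key], type=pa.uint64())
print(arr)  # -> [9223372036854775808]
{code}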
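
And a minimal sketch of the kind of user-code workaround the report alludes to, assuming a stable partition key is all that's needed rather than the numeric hash itself: render the uint64 values as strings before partitioning, so the partition directories round-trip without any integer conversion:

{code:java}
import numpy as np
import pandas as pd

real_usernames = np.array(['anonymize', 'me'])
usernames = pd.util.hash_array(real_usernames)  # uint64 values

df = pd.DataFrame({'user': usernames, 'logins': [13, 9]})
# Workaround: partition on a string rendering of the hash, not the raw uint64
df['user'] = df['user'].astype(str)
df.to_parquet('can_write.parq', partition_cols=['user'])

print(pd.read_parquet('can_write.parq'))  # reads back without the overflow
{code}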