Rauli Ruohonen created ARROW-8816:
-------------------------------------

             Summary: [Python] Year 2263 or later datetimes get mangled when written using pandas
                 Key: ARROW-8816
                 URL: https://issues.apache.org/jira/browse/ARROW-8816
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.17.0, 0.16.0
         Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, ubuntu 20.04 (linux).
            Reporter: Rauli Ruohonen

Using pyarrow 0.17.0, this
{code:java}
import datetime
import pandas as pd

def try_with_year(year):
    print(f'Year {year:_}:')
    df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
    df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
    try:
        print(pd.read_parquet('foo.parquet', engine='pyarrow'))
    except Exception as exc:
        print(repr(exc))
    print()

try_with_year(2_263)
try_with_year(2_262)
{code}
prints
{noformat}
Year 2_263:
ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 9246182400000')

Year 2_262:
           x
0 2262-01-01
{noformat}
and using pyarrow 0.16.0, it prints
{noformat}
Year 2_263:
                              x
0 1678-06-12 00:25:26.290448384

Year 2_262:
           x
0 2262-01-01
{noformat}
The issue is that 2263-01-01 is out of bounds for a timestamp stored using epoch nanoseconds, but not out of bounds for a Python datetime. While pyarrow 0.17.0 refuses to read the erroneous output, it is still possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or fastparquet), yielding the same result as with 0.16.0 above (i.e. only reading has changed in 0.17.0, not writing). It would be better if an error was raised when attempting to write the file instead of silently producing erroneous output.

The reason I suspect this is a pyarrow issue instead of a pandas issue is this modified example:
{code:java}
import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
table = pa.Table.from_pandas(df)
print(table[0])
try:
    print(table.to_pandas())
except Exception as exc:
    print(repr(exc))
{code}
which prints
{noformat}
[
  [
    2263-01-01 00:00:00.000000
  ]
]
ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 9246182400000000')
{noformat}
on pyarrow 0.17.0 and
{noformat}
[
  [
    2263-01-01 00:00:00.000000
  ]
]
                              x
0 1678-06-12 00:25:26.290448384
{noformat}
on pyarrow 0.16.0.

Both from_pandas() and to_pandas() are pyarrow methods, and pyarrow prints the correct timestamp when asked to produce it as a string (so it was not lost inside pandas), yet the pa.Table.from_pandas(df).to_pandas() round-trip fails.
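
For reference, the mangled 1678 value seen with pyarrow 0.16.0 is consistent with the epoch-nanosecond count overflowing a signed 64-bit integer. The sketch below reproduces that arithmetic using only the standard library. The helper names (to_epoch_ns, wrap_int64, from_epoch_ns) are made up for illustration and are not pyarrow or pandas APIs, and the assumption that the conversion wraps as plain two's complement has not been checked against pyarrow's source.

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)

def to_epoch_ns(dt):
    """Exact epoch nanoseconds for a naive datetime, as an unbounded Python int."""
    delta = dt - EPOCH
    seconds = delta.days * 86_400 + delta.seconds
    return (seconds * 1_000_000 + delta.microseconds) * 1_000

def wrap_int64(n):
    """Reinterpret an integer as signed 64-bit (assumed two's-complement wraparound)."""
    return (n + 2**63) % 2**64 - 2**63

def from_epoch_ns(ns):
    """Return (datetime at microsecond precision, leftover nanoseconds)."""
    micros, leftover_ns = divmod(ns, 1_000)
    return EPOCH + datetime.timedelta(microseconds=micros), leftover_ns

exact_ns = to_epoch_ns(datetime.datetime(2263, 1, 1))
wrapped_ns = wrap_int64(exact_ns)

print(exact_ns > INT64_MAX)        # True: 2263-01-01 does not fit in int64 nanoseconds
print(from_epoch_ns(wrapped_ns))   # 1678-06-12 00:25:26.290448, plus 384 ns
print(from_epoch_ns(INT64_MAX))    # 2262-04-11 23:47:16.854775, plus 807 ns
```

The wrapped value matches the 1678-06-12 00:25:26.290448384 timestamp in the output above, and the last line shows the largest instant representable as int64 epoch nanoseconds, which is why year 2262 (January) round-trips while year 2263 does not. A write-time check along the lines of `to_epoch_ns(dt) <= INT64_MAX` is one possible way to raise an error early, though where such a check belongs in pyarrow is a separate question.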