[jira] [Created] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas

Rauli Ruohonen (Jira) Fri, 15 May 2020 06:59:27 -0700

Rauli Ruohonen created ARROW-8816:
-------------------------------------

             Summary: [Python] Year 2263 or later datetimes get mangled when 
written using pandas
                 Key: ARROW-8816
                 URL: https://issues.apache.org/jira/browse/ARROW-8816
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.17.0, 0.16.0
         Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, 
python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, 
python 3.8.2, ubuntu 20.04 (linux).
            Reporter: Rauli Ruohonen



Using pyarrow 0.17.0, this

 
{code:python}
import datetime
import pandas as pd

def try_with_year(year):
    print(f'Year {year:_}:')
    df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
    df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
    try:
        print(pd.read_parquet('foo.parquet', engine='pyarrow'))
    except Exception as exc:
        print(repr(exc))
    print()

try_with_year(2_263)
try_with_year(2_262)
{code}
 

prints

 
{noformat}
Year 2_263:
ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out 
of bounds timestamp: 9246182400000')

Year 2_262:
           x
0 2262-01-01{noformat}
and using pyarrow 0.16.0, it prints

 

 
{noformat}
Year 2_263:
                              x
0 1678-06-12 00:25:26.290448384

Year 2_262:
           x
0 2262-01-01{noformat}
The issue is that 2263-01-01 is out of bounds for a timestamp stored using 
epoch nanoseconds, but not out of bounds for a Python datetime.
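
The arithmetic behind both symptoms can be checked with the standard library alone. The sketch below is only an illustration, not pyarrow's actual conversion code: it assumes the mangled 0.16.0 value comes from a plain two's-complement int64 wraparound, and the helper names (epoch_ns, wrap_int64, from_epoch_ns) are made up for this example.

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)


def epoch_ns(dt):
    """Exact nanoseconds since the Unix epoch for a naive datetime (unbounded int)."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000


def wrap_int64(n):
    """Reinterpret an arbitrary integer as a two's-complement int64 (assumed overflow model)."""
    return (n + 2**63) % 2**64 - 2**63


def from_epoch_ns(ns):
    """Return (datetime truncated to microseconds, leftover nanoseconds)."""
    us, rem_ns = divmod(ns, 1_000)
    return EPOCH + datetime.timedelta(microseconds=us), rem_ns


ns_2262 = epoch_ns(datetime.datetime(2262, 1, 1))
ns_2263 = epoch_ns(datetime.datetime(2263, 1, 1))

print(ns_2262 <= INT64_MAX)  # True: 2262-01-01 fits in int64 nanoseconds
print(ns_2263 <= INT64_MAX)  # False: 2263-01-01 does not

wrapped = wrap_int64(ns_2263)
print(from_epoch_ns(wrapped))
# (datetime.datetime(1678, 6, 12, 0, 25, 26, 290448), 384)
```

The wrapped value corresponds to 1678-06-12 00:25:26.290448384, the same timestamp that pyarrow 0.16.0 prints above.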

While pyarrow 0.17.0 refuses to read the erroneous output, other parquet 
readers (e.g. pyarrow 0.16.0 or fastparquet) can still read it, yielding the 
same result as with 0.16.0 above. In other words, only reading has changed in 
0.17.0, not writing. It would be better if an error were raised when attempting 
to write the file instead of silently producing erroneous output.
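
Until the writer raises such an error itself, a caller-side guard along the following lines could reject out-of-range values before writing. This is a minimal sketch under stated assumptions: assert_fits_epoch_ns is a hypothetical helper, not part of pandas or pyarrow, and it only handles naive datetime.datetime values.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def assert_fits_epoch_ns(values):
    """Raise ValueError if a naive datetime cannot be stored as int64 epoch nanoseconds.

    Hypothetical helper for illustration only.
    """
    for value in values:
        delta = value - EPOCH
        ns = (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000
        # INT64_MIN itself is excluded because NumPy uses it to represent NaT.
        if not INT64_MIN < ns <= INT64_MAX:
            raise ValueError(f"{value!r} does not fit in timestamp[ns]")


assert_fits_epoch_ns([datetime.datetime(2262, 1, 1)])  # passes silently

try:
    assert_fits_epoch_ns([datetime.datetime(2263, 1, 1)])
except ValueError as exc:
    print(repr(exc))
```

Running a check like this on the datetime column before calling df.to_parquet() would turn the silent corruption into an explicit error at write time.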

The reason I suspect this is a pyarrow issue instead of a pandas issue is this 
modified example:

 
{code:python}
import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
table = pa.Table.from_pandas(df)
print(table[0])
try:
    print(table.to_pandas())
except Exception as exc:
    print(repr(exc))
{code}
which prints

 

 
{noformat}
[
  [
    2263-01-01 00:00:00.000000
  ]
]
ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out 
of bounds timestamp: 9246182400000000'){noformat}
on pyarrow 0.17.0 and

 

 
{noformat}
[
  [
    2263-01-01 00:00:00.000000
  ]
]
                              x
0 1678-06-12 00:25:26.290448384{noformat}
on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, and 
pyarrow prints the correct timestamp when asked to produce it as a string (so 
it was not lost inside pandas), yet the pa.Table.from_pandas(df).to_pandas() 
round-trip fails.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
