[ 
https://issues.apache.org/jira/browse/ARROW-11480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278744#comment-17278744
 ] 

Joris Van den Bossche commented on ARROW-11480:
-----------------------------------------------

Indeed, and the example of ARROW-11379 (where we filter on a partition column,
not a normal column) also worked in 2.0.0, so it is likewise a regression.

To make a reproducer with python:

{code}
import datetime

import pandas as pd
import pyarrow.parquet as pq

# Writing the timestamps with the deprecated int96 type is what triggers the crash
df = pd.DataFrame({"dates": pd.date_range("2020-01-01", periods=10, freq="D"),
                   "col": range(10)})
df.to_parquet("timestamps.parquet", use_deprecated_int96_timestamps=True)

# Reading back with a filter on the timestamp column segfaults
pq.read_table("timestamps.parquet",
              filters=[("dates", "<=", datetime.datetime(2020, 1, 5))]).to_pandas()
{code}

It seems I specifically need {{use_deprecated_int96_timestamps=True}} to
trigger the segfault. Without it (i.e. using the normal timestamp Parquet type),
it works as expected, so it is in any case a somewhat different issue from
ARROW-11379.
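
For comparison, a minimal sketch of the non-int96 case (same data as above; the file name {{timestamps_normal.parquet}} is just for illustration), which filters without crashing:

{code}
import datetime

import pandas as pd
import pyarrow.parquet as pq

# Same data, but written with the standard timestamp Parquet type
# (no use_deprecated_int96_timestamps), which does not trigger the segfault.
df = pd.DataFrame({"dates": pd.date_range("2020-01-01", periods=10, freq="D"),
                   "col": range(10)})
df.to_parquet("timestamps_normal.parquet")

table = pq.read_table("timestamps_normal.parquet",
                      filters=[("dates", "<=", datetime.datetime(2020, 1, 5))])
print(table.num_rows)  # 5 rows: 2020-01-01 through 2020-01-05
{code}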

> [Python] Segmentation fault with date filter
> --------------------------------------------
>
>                 Key: ARROW-11480
>                 URL: https://issues.apache.org/jira/browse/ARROW-11480
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Henrik Anker Rasmussen
>            Priority: Major
>         Attachments: timestamp.parquet
>
>
> If I read a Parquet file (see attachment) with timestamps generated in Spark 
> and apply a filter on a date column, I get a segmentation fault:
>  
> {code:python}
> import datetime
>
> import pyarrow.parquet as pq
>
> now = datetime.datetime.now()
> table = pq.read_table("timestamp.parquet", filters=[("date", "<=", now)])
> {code}
>  
> The attached Parquet file is generated with this code in Spark:
> {code:python}
> import datetime
>
> import pandas as pd
> from pyspark.sql.types import StructType
>
> now = datetime.datetime.now()
> data = {"date": [now - datetime.timedelta(days=i) for i in range(100)]}
> schema = {"type": "struct",
>           "fields": [{"name": "date", "type": "timestamp",
>                       "nullable": True, "metadata": {}}]}
> # `spark` is the active SparkSession
> spf = spark.createDataFrame(pd.DataFrame(data),
>                             schema=StructType.fromJson(schema))
> spf.write.format("parquet").mode("overwrite").save("timestamp.parquet")
> {code}
> If I downgrade pyarrow to 2.0.0, it works fine.
> Python version 3.7.7
> pyarrow version 3.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
