[ https://issues.apache.org/jira/browse/ARROW-11480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278744#comment-17278744 ]
Joris Van den Bossche commented on ARROW-11480:
-----------------------------------------------

Indeed, the example of ARROW-11379 (where we filter on a partition column, not a normal column) also worked in 2.0.0 and is thus a regression.

To make a reproducer with Python:

{code}
import datetime

import pandas as pd
import pyarrow.parquet as pq

# Write ten daily timestamps using the deprecated INT96 physical type
df = pd.DataFrame({"dates": pd.date_range("2020-01-01", periods=10, freq="D"), "col": range(10)})
df.to_parquet("timestamps.parquet", use_deprecated_int96_timestamps=True)

# Filtering on the INT96 timestamp column triggers the segfault
pq.read_table("timestamps.parquet", filters=[("dates", "<=", datetime.datetime(2020, 1, 5))]).to_pandas()
{code}

It seems I specifically need {{use_deprecated_int96_timestamps=True}} to trigger the segfault. Without it (so using the normal timestamp parquet type), it works as expected (so it is in any case a somewhat different issue from ARROW-11379).

> [Python]Segmentation fault with date filter
> -------------------------------------------
>
>                 Key: ARROW-11480
>                 URL: https://issues.apache.org/jira/browse/ARROW-11480
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Henrik Anker Rasmussen
>            Priority: Major
>         Attachments: timestamp.parquet
>
>
> If I read a parquet file (see attachment) with timestamps generated in Spark
> and apply a filter on a date column, I get a segmentation fault
>
> {code:python}
> import datetime
> import pyarrow.parquet as pq
> now = datetime.datetime.now()
> table = pq.read_table("timestamp.parquet", filters=[("date", "<=", now)])
> {code}
>
> The attached parquet file is generated with this code in Spark:
> {code:python}
> import datetime
> import pandas as pd
> from pyspark.sql.types import StructType
> # assumes an existing SparkSession named `spark`
> now = datetime.datetime.now()
> data = {"date": [ now - datetime.timedelta(days=i) for i in range(100)]}
> schema = { "type": "struct", "fields": [{"name": "date", "type": "timestamp",
> "nullable": True, "metadata": {}}, ], }
> spf = spark.createDataFrame(pd.DataFrame(data),
> schema=StructType.fromJson(schema))
> spf.write.format("parquet").mode("overwrite").save("timestamp.parquet")
> {code}
> If I downgrade pyarrow to 2.0.0 it works fine.
> Python version 3.7.7
> pyarrow version 3.0.0

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
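
For context on why {{use_deprecated_int96_timestamps=True}} matters here: as far as I understand, the deprecated INT96 encoding (the one Spark and Impala historically write) stores each timestamp as 12 bytes, 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian, whereas the normal timestamp type is a single INT64 offset from the Unix epoch. The sketch below only illustrates that layout in plain Python; the helper names are hypothetical and it is not taken from Arrow's reader or filter code.

```python
import datetime
import struct

# Julian day number that corresponds to the Unix epoch (1970-01-01).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
UNIX_EPOCH = datetime.datetime(1970, 1, 1)


def encode_int96_timestamp(ts: datetime.datetime) -> bytes:
    """Pack a naive datetime into the 12-byte INT96 layout (hypothetical helper)."""
    delta = ts - UNIX_EPOCH
    # timedelta normalizes so that seconds and microseconds are never negative.
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1000
    julian_day = delta.days + JULIAN_DAY_OF_UNIX_EPOCH
    # "<qi": little-endian int64 (nanoseconds of day), then int32 (Julian day).
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96_timestamp(raw: bytes) -> datetime.datetime:
    """Unpack a 12-byte INT96 value into a naive datetime (hypothetical helper)."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime only has microsecond resolution, so sub-microsecond digits are dropped.
    return UNIX_EPOCH + datetime.timedelta(days=days, microseconds=nanos_of_day // 1000)


# Round-trip the filter value used in the reproducer.
filter_value = datetime.datetime(2020, 1, 5)
assert decode_int96_timestamp(encode_int96_timestamp(filter_value)) == filter_value
```

Because a 12-byte value cannot be compared directly against an INT64 timestamp, a filter on such a column needs a conversion along these lines somewhere in the read path, which would be consistent with the segfault appearing only for INT96 files.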