Hi,

I've hit an issue in Python (3.9.12) where creating a PyArrow dataset over a 
remote filesystem (such as a GCS filesystem), then opening a batch iterator 
over the dataset and having the program exit / clean up immediately 
afterwards, triggers a fatal PyGILState_Release error. This is with pyarrow 
7.0.0.

The error looks like:
Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380 must be 
current when releasing
Python runtime state: finalizing (tstate=0x55a079959380)

Thread 0x00007fbfff5ee400 (most recent call first):
<no Python frame>


Example reproduction code:

import pandas as pd
import pyarrow.dataset as ds

# Get GCS fsspec filesystem
fs = get_gcs_fs()

dummy_df = pd.DataFrame({"a": [1, 2, 3]})

# Write out some dummy data for us to load a dataset from
data_path = "test-bucket/debug-arrow-datasets/data.parquet"
with fs.open(data_path, "wb") as f:
    dummy_df.to_parquet(f)

dummy_ds = ds.dataset([data_path], filesystem=fs)

batch_iter = dummy_ds.to_batches()
# Program finish

# Putting some buffer time after the iterator is opened makes the issue go away:
# import time
# time.sleep(1)

Using local parquet files for the dataset, adding some buffer time between 
opening the iterator and program exit (via time.sleep or similar), or 
consuming the entire iterator all seem to make the issue go away. Is this 
reproducible if you swap in your own GCS filesystem?
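
For example, draining the iterator before the program exits avoids the crash
for me (same setup as the reproduction code above):

batch_iter = dummy_ds.to_batches()
# Consuming all batches before the interpreter starts finalizing
# avoids the PyGILState_Release error
for batch in batch_iter:
    pass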

Thanks,
Alex
