Hi Li Jin,

I'm not sure yet what changed, but I believe you can fix that error simply by omitting the scheme prefix from the URI and just using the path when loading the dataset. Here's my repro:

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

s3fs = S3FileSystem(
    endpoint_override="https://storage.googleapis.com",
    anonymous=True
)

uri = "gs://voltrondata-labs-datasets/nyc-taxi"

# This works
ds.dataset(uri[5:], filesystem=s3fs)

# With prefix causes error
ds.dataset(uri, filesystem=s3fs)
# ArrowInvalid: Expected an S3 object path of the form 'bucket/key...', got a URI: 'gs://voltrondata-labs-datasets/nyc-taxi'

Best,

Will Jones

On Mon, Aug 1, 2022 at 9:00 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hello!
>
> We recently updated Arrow to 7.0.0 and hit some error with our old code
> (Details below). I wonder if there is a new way to do this with the current
> version?
>
> > import pyarrow
> > import pyarrow.parquet as pq
> >
> > df = pd.DataFrame({"aa": [1, 2, 3], "bb": [1, 2, 3]})
> > uri = "gs://amp_bucket_liao/try"
> > s3fs = # ...
> >
> > pq.write_to_dataset(
> >     table=pyarrow.Table.from_pandas(df=df, preserve_index=True),
> >     root_path=uri, filesystem=s3fs, partition_cols=["aa"]
> > )
> > # so far it works fine.
> >
> > # The following gives an error, error message in the thread
> > test_df = pq.read_table(
> >     source=uri, filesystem=s3fs
> > )
> >
> > Error:
> >
> /home/tsdist/vats_deployments/modeling.env.interactive-bc9b04a0-708b-45b8-90bc-14b9ca6ee9ba/ext/public/python/pyarrow/7/0/x/dist/lib/python3.9/pyarrow/error.pxi
> in pyarrow.lib.check_status()
> >      97
> >      98 if status.IsInvalid():
> > ---> 99 raise ArrowInvalid(message)
> >     100 elif status.IsIOError():
> >     101 # Note: OSError constructor is
> >
> > ArrowInvalid: GetFileInfo() yielded path
> 'amp_bucket_liao/try/aa=3/235add6629d44a2f8fa4ec772340b73d.parquet',
> which is outside base dir 'gs://amp_bucket_liao/try'
>
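
P.S. The `uri[5:]` slice in my repro only works because the prefix happens to be exactly `gs://`. If you would rather not hard-code the length, something like the sketch below might be a bit more robust. Note that `strip_scheme` is just a hypothetical helper name I made up for illustration, not a pyarrow function, and I am assuming the filesystem wants a plain 'bucket/key...' path as the error message suggests.

```python
def strip_scheme(uri: str) -> str:
    """Drop a leading 'scheme://' from a URI, if present.

    Hypothetical helper (not part of pyarrow): returns the 'bucket/key...'
    portion, and passes through strings that have no scheme unchanged.
    """
    _scheme, sep, rest = uri.partition("://")
    return rest if sep else uri


path = strip_scheme("gs://voltrondata-labs-datasets/nyc-taxi")
print(path)  # voltrondata-labs-datasets/nyc-taxi

# Assumed usage with the filesystem object from the repro:
# ds.dataset(path, filesystem=s3fs)
```

The same call should apply to the `pq.read_table(source=..., filesystem=s3fs)` case in your message, though I have only tried it with `ds.dataset`.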