paleolimbot opened a new pull request, #70: URL: https://github.com/apache/parquet-testing/pull/70
As discussed on the mailing list, it's best to get example files early! These aren't quite finished yet (I haven't figured out how to get the PROJJSON CRS to/from the global metadata from the logical type-to-arrow converter, and geography statistics aren't quite there yet), but I thought I'd open the PR to facilitate testing early.

The code used to generate the files is in the details below (requires https://github.com/apache/arrow/pull/45459 ).

<details>

```python
import json
import urllib.request

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga

manifest_url = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0-rc4/manifest.json"

# Collect the URL of every arrows/wkb file listed in the manifest,
# keyed by "<group name>_<file name>"
files = {}
with urllib.request.urlopen(manifest_url) as f:
    manifest = json.load(f)

for group in manifest["groups"]:
    for file in group["files"]:
        if file["format"] == "arrows/wkb":
            files[group["name"] + "_" + file["name"]] = file["url"]

out_dir = "/Users/dewey/gh/parquet-testing/data/geospatial"

for name, url in files.items():
    # Skip big files + one CRS example that includes a non-PROJJSON value
    # on purpose (allowed in GeoArrow), which is rightly rejected
    # by Parquet
    if (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    ):
        print(f"Skipping {name}")
        continue

    out = f"{out_dir}/{name}.parquet"
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader,
        parquet.ParquetWriter(out, reader.schema, store_schema=False, compression="none") as writer,
    ):
        print(f"Reading {url}")
        # Copy each record batch from the IPC stream into the Parquet file
        for batch in reader:
            writer.write_batch(batch)

    print(f"Wrote {out}")
```

</details>

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org