paleolimbot opened a new pull request, #70:
URL: https://github.com/apache/parquet-testing/pull/70

   As discussed on the mailing list, it's best to get example files early! 
These aren't quite finished yet (I haven't figured out how to get the PROJJSON 
CRS to/from the global metadata from the logical type-to-arrow converter, and 
geography statistics aren't quite there yet), but I thought I'd open the PR to 
facilitate early testing.
   
   The code used to generate the files is in the details block below (requires 
https://github.com/apache/arrow/pull/45459 ):
   
   <details>
   
   ```python
   import urllib.request
   import json
   
   import pyarrow as pa
   from pyarrow import parquet
   import geoarrow.pyarrow as ga
   
   manifest_url = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0-rc4/manifest.json"
   files = {}
   with urllib.request.urlopen(manifest_url) as f:
       manifest = json.load(f)
       for group in manifest["groups"]:
           for file in group["files"]:
               if file["format"] == "arrows/wkb":
                   files[group["name"] + "_" + file["name"]] = file["url"]
   
   out_dir = "/Users/dewey/gh/parquet-testing/data/geospatial"
   ones_that_didnt_work = []
   for name, url in files.items():
       # Skip big files + one CRS example that includes a non-PROJJSON value
       # on purpose (allowed in GeoArrow), which is rightly rejected
       # by Parquet
       if (
           "microsoft-buildings" in name
           or ("ns-water" in name and name != "ns-water_water-point")
           or "wkt2" in name
       ):
           print(f"Skipping {name}")
           continue
   
       out = f"{out_dir}/{name}.parquet"
       with (
           urllib.request.urlopen(url) as f,
           pa.ipc.open_stream(f) as reader,
           parquet.ParquetWriter(
               out, reader.schema, store_schema=False, compression="none"
           ) as writer,
       ):
           print(f"Reading {url}")
           # Copy the stream into the Parquet file one record batch at a time
           for batch in reader:
               writer.write_batch(batch)
           print(f"Wrote {out}")
   
   ```
   
   </details>
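
As a minimal sketch, the skip rule from the script can be pulled out into a small predicate so it can be checked without downloading anything. The function name `should_skip` and the sample names in `names` are hypothetical, chosen only for illustration; they are not part of the PR or taken from the actual manifest.

```python
def should_skip(name: str) -> bool:
    """Return True for files the generation script would leave out.

    Mirrors the condition in the script: big files and the CRS example
    with a non-PROJJSON value are skipped. (Hypothetical helper.)
    """
    return (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    )


# Illustrative names only; the real ones come from the manifest.
names = [
    "ns-water_water-point",
    "ns-water_water-line",
    "example-crs_vermont-wkt2",
    "natural-earth_countries",
]
kept = [n for n in names if not should_skip(n)]
print(kept)
```

Keeping the rule in one function would also make it easier to adjust which files are excluded as the example set changes.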


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

