prantogg commented on issue #67:
URL: https://github.com/apache/sedona-spatialbench/issues/67#issuecomment-3613729774

   Thanks @MrPowers!
   
   I agree we should start by adding the uncompressed table sizes to the 
SpatialBench docs so users have clear expectations about data volume at each 
scale factor.
   
   On the calculation side: pandas' `memory_usage(deep=True)` reports the size of the DataFrame's in-memory representation, which can differ significantly from the on-disk Parquet footprint. A more accurate way to get the uncompressed Parquet size is to read the column-chunk metadata via PyArrow:
   
   ```python
   import pyarrow.parquet as pq

   def parquet_uncompressed_bytes(path):
       """Return (uncompressed, compressed) byte totals from the file's footer metadata."""
       pf = pq.ParquetFile(path)
       uncomp = 0
       comp = 0

       # Walk every column chunk in every row group; the sizes come straight
       # from the metadata, so no data pages are read or decoded.
       for rg in range(pf.num_row_groups):
           rg_meta = pf.metadata.row_group(rg)
           for c in range(rg_meta.num_columns):
               col = rg_meta.column(c)
               uncomp += col.total_uncompressed_size
               comp += col.total_compressed_size

       return uncomp, comp

   uncompressed, compressed = parquet_uncompressed_bytes("building.parquet")
   print("Uncompressed bytes:", uncompressed)
   print("Compressed bytes:", compressed)
   print("Compression ratio:", uncompressed / compressed)
   ```
   
   This sums the `total_uncompressed_size` and `total_compressed_size` values the writer records for each column chunk, so it reflects the actual encoded data rather than pandas' in-memory representation.
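   
   For the docs, the same helper could be looped over each generated table to build the per-scale-factor size table. A minimal sketch, assuming one Parquet file per table in the working directory (the names other than `building.parquet` are just placeholders):
   
   ```python
   # Hypothetical layout: one Parquet file per SpatialBench table at a given
   # scale factor; adjust the names/paths to however the data is actually generated.
   tables = ["building.parquet", "trip.parquet", "customer.parquet"]

   for name in tables:
       uncomp, comp = parquet_uncompressed_bytes(name)
       print(f"{name}: {uncomp / 1e9:.2f} GB uncompressed, "
             f"{comp / 1e9:.2f} GB compressed ({uncomp / comp:.1f}x)")
   ```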
   
   I’ll follow up with docs updates soon. Thanks again for raising this!

