prantogg commented on issue #67:
URL: https://github.com/apache/sedona-spatialbench/issues/67#issuecomment-3613729774
Thanks @MrPowers!
I agree we should start by adding the uncompressed table sizes to the
SpatialBench docs so users have clear expectations about data volume at each
scale factor.
On the calculation side: pandas' `memory_usage(deep=True)` reports the DataFrame's in-memory footprint, which can differ significantly from the actual Parquet footprint (with `deep=True`, object-dtype columns include per-value Python object overhead). A more accurate way to compute the true uncompressed Parquet size is to read the column-chunk metadata via PyArrow:
```python
import pyarrow.parquet as pq


def parquet_uncompressed_bytes(path):
    """Sum uncompressed and compressed column-chunk sizes from the Parquet metadata."""
    pf = pq.ParquetFile(path)
    uncomp = 0
    comp = 0
    for rg in range(pf.num_row_groups):
        rg_meta = pf.metadata.row_group(rg)
        for c in range(rg_meta.num_columns):
            col = rg_meta.column(c)
            uncomp += col.total_uncompressed_size
            comp += col.total_compressed_size
    return uncomp, comp


uncompressed, compressed = parquet_uncompressed_bytes("building.parquet")
print("Uncompressed bytes:", uncompressed)
print("Compressed bytes:", compressed)
print("Compression ratio:", uncompressed / compressed)
```
These are the sizes recorded by the Parquet writer in the file metadata, so they reflect what is actually encoded on disk rather than the in-memory representation.
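For a side-by-side comparison, here is a minimal sketch of the pandas measurement mentioned above; it assumes the same `building.parquet` file and that pandas is installed with a Parquet engine (e.g. pyarrow):

```python
import pandas as pd

# Rough in-memory figure for comparison (assumes the same building.parquet file).
# With deep=True, object-dtype columns include per-value Python object overhead,
# which is why this number usually diverges from the Parquet metadata sizes above.
df = pd.read_parquet("building.parquet")
pandas_bytes = int(df.memory_usage(deep=True).sum())
print("pandas in-memory bytes:", pandas_bytes)
```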
I’ll follow up with docs updates soon. Thanks again for raising this!