Hi Robert,

> I think that solving the problems around using a dictionary is going
> to be really hard. Can we see some evidence that the results will be
> worth it?
With the latest patch I've shared, using a Kaggle dataset of Nintendo-related
tweets [1], we leveraged PostgreSQL's acquire_sample_rows() function to quickly
gather just 1,000 sample rows for a specific attribute out of 104,695 rows.
These raw samples were passed into Zstd's sampling buffer to generate a custom
dictionary, which was then used directly to compress the documents, resulting
in about 62% space savings after compression:

```
test=# \dt+
                                       List of tables
 Schema |      Name      | Type  |  Owner   | Persistence | Access method |  Size  | Description
--------+----------------+-------+----------+-------------+---------------+--------+-------------
 public | lz4            | table | nikhilkv | permanent   | heap          | 297 MB |
 public | pglz           | table | nikhilkv | permanent   | heap          | 259 MB |
 public | zstd_with_dict | table | nikhilkv | permanent   | heap          | 114 MB |
 public | zstd_wo_dict   | table | nikhilkv | permanent   | heap          | 210 MB |
(4 rows)
```

We've observed similarly strong results with dictionaries on other datasets as
well. A minimal standalone sketch of the sampling/training/compression flow is
included at the end of this mail.

[1] https://www.kaggle.com/code/dcalambas/nintendo-tweets-analysis/data

---
Nikhil Veldanda
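
For reference, the sketch below reproduces the flow described above using the
public Zstd API directly (ZDICT_trainFromBuffer() to build the dictionary,
ZSTD_compress_usingDict() to apply it). It is illustrative only, not the patch
code: the synthetic JSON samples, the dictionary capacity, and the compression
level are stand-ins, whereas the patch trains on the datums gathered via
acquire_sample_rows().

```
/*
 * Illustrative sketch: train a Zstd dictionary from a set of small,
 * structurally similar documents and compress a new document with it.
 * Build with: cc sketch.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <zstd.h>
#include <zdict.h>

#define NSAMPLES   1000     /* mirrors the ~1,000 sampled rows */
#define SAMPLE_MAX 128

int
main(void)
{
    /*
     * Stand-in for the sampled attribute values: synthesize JSON-ish strings
     * that share structure, which is what dictionary training exploits.
     */
    char       *sample_buf = malloc((size_t) NSAMPLES * SAMPLE_MAX);
    static size_t sample_sizes[NSAMPLES];
    size_t      off = 0;

    for (int i = 0; i < NSAMPLES; i++)
    {
        char    tmp[SAMPLE_MAX];
        int     len = snprintf(tmp, sizeof(tmp),
                               "{\"id\": %d, \"user\": \"user%d\", \"lang\": \"en\", "
                               "\"text\": \"tweet %d about Nintendo Switch\"}",
                               i, i % 50, i);

        memcpy(sample_buf + off, tmp, (size_t) len);
        sample_sizes[i] = (size_t) len;
        off += (size_t) len;
    }

    /* Train a dictionary from the concatenated samples. */
    size_t      dict_capacity = 2048;   /* illustrative target size */
    void       *dict = malloc(dict_capacity);
    size_t      dict_size = ZDICT_trainFromBuffer(dict, dict_capacity,
                                                  sample_buf, sample_sizes,
                                                  NSAMPLES);

    if (ZDICT_isError(dict_size))
    {
        fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dict_size));
        return 1;
    }

    /* Compress one new document with and without the dictionary. */
    const char *doc = "{\"id\": 4242, \"user\": \"user7\", \"lang\": \"en\", "
                      "\"text\": \"tweet 4242 about Nintendo Switch\"}";
    size_t      bound = ZSTD_compressBound(strlen(doc));
    void       *dst = malloc(bound);
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();

    size_t      with_dict = ZSTD_compress_usingDict(cctx, dst, bound,
                                                    doc, strlen(doc),
                                                    dict, dict_size,
                                                    ZSTD_CLEVEL_DEFAULT);
    /* Reuses dst; we only care about the reported sizes here. */
    size_t      without_dict = ZSTD_compress(dst, bound, doc, strlen(doc),
                                             ZSTD_CLEVEL_DEFAULT);

    if (ZSTD_isError(with_dict) || ZSTD_isError(without_dict))
    {
        fprintf(stderr, "compression failed\n");
        return 1;
    }

    printf("dict=%zu bytes, doc=%zu bytes, zstd=%zu bytes, zstd+dict=%zu bytes\n",
           dict_size, strlen(doc), without_dict, with_dict);

    ZSTD_freeCCtx(cctx);
    free(dst);
    free(dict);
    free(sample_buf);
    return 0;
}
```

On small, structurally similar documents like the tweet JSON above, the
dictionary-assisted output should come out noticeably smaller than plain
per-datum Zstd, which is the effect the table sizes reflect.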