samuelcolvin commented on issue #7845:
URL: https://github.com/apache/datafusion/issues/7845#issuecomment-2366813122

   I've done some further digging into fast querying of semi-structured data; in particular, I have a prototype library "batson" (a binary alternative to (J)SON), https://github.com/pydantic/jiter/pull/136.
   
   Batson is heavily inspired by Postgres's JSONB type and Snowflake's Variant type (see https://github.com/datafusion-contrib/datafusion-functions-variant/issues/11), and the Apache initiative [Open Variant](https://github.com/apache/spark/blob/master/common/variant/README.md). The main advantages of batson are that it maintains the order of items in objects and supports ints > `i64::MAX` via Rust's `BigInt`.
   
   Performance:
   * batson is 14x faster than jiter (used in the current 
[datafusion-functions-json](https://github.com/datafusion-contrib/datafusion-functions-json))
 when a key exists
   * around 126x faster than jiter when the key does not exist
   
   (for reference, and for those wondering why datafusion-functions-json 
doesn't use serde-json, batson is 106x and 588x faster than serde-json in the 
same cases respectively)
   
   So far so good.
   
   ---
   
   The problem is that batson isn't actually that much faster than `datafusion-functions-json` using jiter (DFJ) in real-world use cases:
   * with all data in memory, batson is around 8x faster than DFJ
   * with the data being read from a local parquet file, however, batson is only around 2x faster than DFJ
   
   The improvement would be even smaller in cases (like ours) where the data is actually being read from an object store.
   
   The main reason is that decompression takes up the lion's share of the time when reading parquet files locally.
   
   Running a query like `select count(*) from records where 
json_contains(attributes, 'needle')` over 1m rows:
   
   In memory:
   ```
   running queries...
   +-----------------+
   | count(Int64(1)) |
   +-----------------+
   | 105461          |
   +-----------------+
   mode: FilterJson zstd(1), query took 600.49625ms
   +-----------------+
   | count(Int64(1)) |
   +-----------------+
   | 105461          |
   +-----------------+
   mode: FilterBatson zstd(1), query took 75.818625ms
   ```
   
   From parquet:
   
   ```
   running queries...
   +-----------------+
   | count(Int64(1)) |
   +-----------------+
   | 105461          |
   +-----------------+
   mode: FilterJson zstd(1), query took 1.143363958s
   +-----------------+
   | count(Int64(1)) |
   +-----------------+
   | 105461          |
   +-----------------+
   mode: FilterBatson zstd(1), query took 604.522375ms
   ```
   
   (happy to share the code I used, if it helps anyone)
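   
   For anyone wanting to reproduce roughly what the DFJ side of this looks like, here is a minimal sketch (not the actual benchmark code): it assumes a tokio runtime, the `register_all` helper from datafusion-functions-json, and an illustrative `records.parquet` path and table name.
   
   ```rust
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let mut ctx = SessionContext::new();
       // register the json_* UDFs from datafusion-functions-json
       datafusion_functions_json::register_all(&mut ctx)?;
       // file name and table name are illustrative
       ctx.register_parquet("records", "records.parquet", ParquetReadOptions::default())
           .await?;
       ctx.sql("select count(*) from records where json_contains(attributes, 'needle')")
           .await?
           .show()
           .await?;
       Ok(())
   }
   ```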
   
   We can see why that is by profiling with samply:
   
![image](https://github.com/user-attachments/assets/90b67902-7e60-4e93-8764-02a9e99fe52d)
   
   
![image](https://github.com/user-attachments/assets/44f0b9c7-97a9-4e3e-9f00-c4f5333aff15)
   
   * decompression is taking 71% of the time
   * batson `contains` is taking 6.2% of the time
   
   So making batson faster, or implementing it at all, isn't immediately that useful.
   
   Simple data shredding (as described in the Open Variant spec [here](https://github.com/apache/spark/blob/master/common/variant/shredding.md)) won't help, I believe, since DataFusion will fetch and decompress the body column if it's included in the query, even if no rows in the body column are actually accessed.
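   
   To make that concrete, here is a rough, simplified sketch of what a shredded `attributes` column might look like as an Arrow schema, loosely following the Open Variant shredding doc (the shredded key name is illustrative). Even with this layout, any query that references `attributes` still has to decompress the binary residual sub-column:
   
   ```rust
   use arrow_schema::{DataType, Field, Fields};
   
   /// Simplified shredded layout for the `attributes` column (illustrative only):
   /// common keys get typed sub-columns, everything else stays in a binary blob
   /// that must still be decompressed whenever the column is read.
   fn shredded_attributes() -> Field {
       Field::new(
           "attributes",
           DataType::Struct(Fields::from(vec![
               // residual: un-shredded keys, still an opaque (compressed) binary value
               Field::new("value", DataType::Binary, true),
               // example shredded key, pulled out into a typed sub-column
               Field::new(
                   "typed_value",
                   DataType::Struct(Fields::from(vec![Field::new("needle", DataType::Utf8, true)])),
                   true,
               ),
           ])),
           true,
       )
   }
   ```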
   
   Ways forward:
   * use row group stats to enable pruning of row groups; this will help a lot with needle-in-haystack search, but not with aggregations on common columns
   * somehow get filter pushdown / late materialization to work based on the result of a UDF, so some columns aren't decompressed (or even fetched) unless they're needed
   * disable parquet compression on these columns, then do our own compression on the body data (but not the headers) and only decompress the body when necessary (a rough sketch follows this list); sounds hard
   * something else I haven't thought of ???
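   
   For the compression option, a minimal sketch of per-column writer settings with the Rust `parquet` crate, assuming the column is named `attributes` as in the query above and keeping zstd(1) for everything else to match the benchmark settings:
   
   ```rust
   use parquet::basic::{Compression, ZstdLevel};
   use parquet::errors::Result;
   use parquet::file::properties::WriterProperties;
   use parquet::schema::types::ColumnPath;
   
   /// zstd(1) for most columns, but leave the big JSON/batson column uncompressed
   /// so any compression applied to it is under our control and can be skipped
   /// until the body is actually needed.
   fn writer_props() -> Result<WriterProperties> {
       Ok(WriterProperties::builder()
           .set_compression(Compression::ZSTD(ZstdLevel::try_new(1)?))
           .set_column_compression(ColumnPath::from("attributes"), Compression::UNCOMPRESSED)
           .build())
   }
   ```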
   
   Any pointers from those who understand DataFusion better than me (👀 @alamb @andygrove @jayzhan211) on the best way forward would be much appreciated.

