adriangb commented on issue #14874:
URL: https://github.com/apache/datafusion/issues/14874#issuecomment-2695756085

   One thing we've found from using JSON quite extensively is that query times 
are often dominated by downloading the large JSON column, not by parsing it or 
extracting data from it. I don't see any way to avoid this unless we split the 
data up into multiple columns or teach DataFusion how to read only parts of a 
column.
   
   I opened https://github.com/apache/datafusion/issues/14993 today, which I 
realized is a duplicate of a question I asked before in 
https://github.com/apache/datafusion/issues/7845#issuecomment-2463360160. My 
understanding is that ClickHouse handles JSON by creating specialized 
"hidden" columns for each key (linked above, but see 
https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). I 
*think* if DataFusion supported something like what I'm proposing in those 
comments (pushing down an expression into a file) we could:
   - At write time, for each file being written out, take the first X JSON keys 
encountered and write each one out as a Union type in its own column (the Union 
type serving a similar purpose to CH's variant type). We still retain the 
original column.
   - At query time push the selection / filter down into each file and check if 
there is a specialized column for that key, otherwise fall back to reading from 
the original column.
   - The Union type can be a single level union, i.e. `Union(int, float, bool, 
string, array(stored as utf8 + metadata), object(stored as utf8 + metadata))`.  
I'm not sure if it should be dense or sparse.
   - If the data is truly heterogeneous, it's important that the splitting out 
of keys into pre-computed columns be done on a per-file basis (you can't go more 
granular without completely abandoning Parquet or something). Otherwise the 
cardinality of columns grows with the cardinality of keys, or, if you only pick 
the first 128 keys you ever see, performance will suffer.
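
To make the write-time / query-time split in the list above concrete, here is a minimal Python sketch. It is not DataFusion, Arrow, or Parquet code: a "file" is just a dict, the Union type is mimicked with `(type_tag, value)` tuples, and the names `write_file`, `read_key`, `tag_value`, and `MAX_SHREDDED_KEYS` are hypothetical. It assumes every row is a top-level JSON object, and it treats a missing key and a JSON `null` the same way.

```python
import json

MAX_SHREDDED_KEYS = 2  # the "X" from the list above; the value is arbitrary


def tag_value(value):
    """Return a (type_tag, stored_value) pair mimicking a single-level union."""
    if value is None:
        return None  # this sketch does not distinguish missing keys from null
    if isinstance(value, bool):  # bool must be checked before int in Python
        return ("bool", value)
    if isinstance(value, int):
        return ("int", value)
    if isinstance(value, float):
        return ("float", value)
    if isinstance(value, str):
        return ("string", value)
    if isinstance(value, list):
        return ("array", json.dumps(value))  # arrays stored as text
    return ("object", json.dumps(value))  # nested objects stored as text


def write_file(json_rows, max_keys=MAX_SHREDDED_KEYS):
    """Build one in-memory 'file': the original column plus per-key columns."""
    docs = [json.loads(row) for row in json_rows]
    keys = []
    for doc in docs:  # take the first max_keys keys encountered in this file
        for key in doc:
            if key not in keys and len(keys) < max_keys:
                keys.append(key)
    shredded = {
        key: [tag_value(doc[key]) if key in doc else None for doc in docs]
        for key in keys
    }
    return {"raw": list(json_rows), "shredded": shredded}


def read_key(file, key):
    """Use the specialized column if this file has one, else parse the raw column."""
    column = file["shredded"].get(key)
    if column is not None:
        return column, "shredded"
    values = [tag_value(json.loads(row).get(key)) for row in file["raw"]]
    return values, "raw"


rows = ['{"a": 1, "b": "x"}', '{"a": 2.5, "c": [1, 2]}']
example_file = write_file(rows)
print(read_key(example_file, "a"))  # served from the specialized column
print(read_key(example_file, "c"))  # falls back to the original column
```

Because the set of specialized keys is chosen per file, two files written from different data can expose different columns, and the reader only needs the fallback path to stay correct.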


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

