adriangb commented on issue #14874: URL: https://github.com/apache/datafusion/issues/14874#issuecomment-2695756085
I'll share one thing we've found from using JSON quite extensively: query times are often dominated by downloading the large JSON column, not by parsing it or extracting data from it. I don't see any way to avoid this unless we split the data up into multiple columns or teach DataFusion how to read only parts of a column.

I opened https://github.com/apache/datafusion/issues/14993 today, which I then realized is a duplicate of a question I asked before in https://github.com/apache/datafusion/issues/7845#issuecomment-2463360160.

My understanding is that ClickHouse handles JSON by creating specialized "hidden" columns for each key (linked above, but see https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse).

I *think* if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:

- At write time, for each file being written out, take the first X JSON keys encountered and write each of them out as a Union type in its own column (the Union type serving a similar purpose to CH's variant type). We still retain the original column.
- At query time, push the selection / filter down into each file and check whether there is a specialized column for that key; otherwise fall back to reading from the original column.
- The Union type can be a single-level union, i.e. `Union(int, float, bool, string, array(stored as utf8 + metadata), object(stored as utf8 + metadata))`. I'm not sure if it should be dense or sparse.
- If the data is truly heterogeneous, it's important that splitting keys out into pre-computed columns be done on a per-file basis (you can't go more granular without completely abandoning Parquet or something). Otherwise either the number of columns grows with the number of distinct keys, or, if you only ever pick the first 128 keys you see globally, performance will suffer.
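To make the write-time / query-time split above concrete, here is a minimal sketch in plain Python. It is not DataFusion, Arrow, or Parquet code and uses none of their APIs: a "file" is just a dict, the Union type is emulated as a `(type_tag, value)` tuple, and the names `tag`, `untag`, `shred_file`, `read_key`, and `max_keys` are all hypothetical. It also assumes every row is a JSON object and only looks at top-level keys.

```python
import json


def tag(value):
    """Emulate a single-level Union cell as (type_tag, value).

    Assumption for this sketch: a missing key and a JSON null are both
    stored as a plain None cell.
    """
    if value is None:
        return None
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return ("bool", value)
    if isinstance(value, int):
        return ("int", value)
    if isinstance(value, float):
        return ("float", value)
    if isinstance(value, str):
        return ("string", value)
    if isinstance(value, list):
        return ("array", json.dumps(value))  # nested values kept as text
    return ("object", json.dumps(value))


def untag(cell):
    """Turn a tagged cell back into a Python value."""
    if cell is None:
        return None
    kind, value = cell
    return json.loads(value) if kind in ("array", "object") else value


def shred_file(rows, max_keys=128):
    """Write time: promote the first `max_keys` keys seen in THIS file."""
    promoted = []
    for raw in rows:
        for key in json.loads(raw):
            if key not in promoted and len(promoted) < max_keys:
                promoted.append(key)
    columns = {key: [] for key in promoted}
    for raw in rows:
        obj = json.loads(raw)
        for key in promoted:
            columns[key].append(tag(obj.get(key)))
    # The original JSON column is retained next to the promoted columns.
    return {"json": list(rows), "shredded": columns}


def read_key(file, key):
    """Query time: prefer the specialized column, else fall back."""
    column = file["shredded"].get(key)
    if column is not None:
        return [untag(cell) for cell in column]
    return [json.loads(raw).get(key) for raw in file["json"]]
```

Because `shred_file` chooses the promoted keys per file, two files with different key distributions end up with different specialized columns, and `read_key` returns the same values whether or not a key happened to be promoted in a given file. In a real system the fast path would only need to fetch the small promoted column, which is where the download savings would come from.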