ZENOTME commented on issue #14874: URL: https://github.com/apache/datafusion/issues/14874#issuecomment-2685969063
> Thanks for checking it out [@ZENOTME](https://github.com/ZENOTME) > > > JSON bench will have a different schema for row and looks like datafusion(arrow-json) can't support this? E.g. type of subject for these two row are different. > > I think this may be related to what [@TheBuilderJR](https://github.com/TheBuilderJR) is seeing / discussing here: > > * [Datafusion can't seem to cast evolving structs #14757](https://github.com/apache/datafusion/issues/14757) > > Allowing different (yet compatible) schemas Seems it's not the case about different (yet compatible) schema. E.g. ``` first row: "subject":{"cid":"bafyreidjvrcmckkm765mct5fph36x7kupkfo35rjklbf2k76xkzwyiauge","uri":"at://did:plc:azrv4rcbws6kmcga4fsbphg2/app.bsky.feed.post/3lbgjdpbiec2l"} second row: "subject":"did:plc:r7cdh4sgzqbfdc6wcdxxti7c" ``` The same key(column) can have incompatible type for json, I think this is more about https://github.com/apache/datafusion/issues/7845. According to the following query, looks basically what they do is treat the field of json as column and do some aggregate for them. I think for this case choice 1 of https://github.com/apache/datafusion/issues/7845#issuecomment-1767459328 (add Json/Jsonb type to Arrow) may be the appropriate one. I guess this maybe also the reason clickhouse new json type design(https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse) It also can be a good reference if we want to choose this solution.🤔 ``` SELECT data.commit.collection AS event, count() AS count FROM bluesky GROUP BY event ORDER BY count DESC; SELECT data.commit.collection AS event, count() AS count, uniqExact(data.did) AS users FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' GROUP BY event ORDER BY count DESC; SELECT data.commit.collection AS event, toHour(fromUnixTimestamp64Micro(data.time_us)) as hour_of_day, count() AS count FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection in ['app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like'] GROUP BY event, hour_of_day ORDER BY hour_of_day, event; SELECT data.did::String as user_id, min(fromUnixTimestamp64Micro(data.time_us)) as first_post_ts FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY first_post_ts ASC LIMIT 3; SELECT data.did::String as user_id, date_diff( 'milliseconds', min(fromUnixTimestamp64Micro(data.time_us)), max(fromUnixTimestamp64Micro(data.time_us))) AS activity_span FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY activity_span DESC LIMIT 3; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org