ZENOTME commented on issue #14874:
URL: https://github.com/apache/datafusion/issues/14874#issuecomment-2685969063

   > Thanks for checking it out [@ZENOTME](https://github.com/ZENOTME)
   > 
   > > JSON bench will have a different schema for row and looks like 
datafusion(arrow-json) can't support this? E.g. type of subject for these two 
row are different.
   > 
   > I think this may be related to what 
[@TheBuilderJR](https://github.com/TheBuilderJR) is seeing / discussing here:
   > 
   > * [Datafusion can't seem to cast evolving structs 
#14757](https://github.com/apache/datafusion/issues/14757)
   > 
   > Allowing different (yet compatible) schemas
   
   Seems it's not the case about different (yet compatible) schema. E.g.
   ```
   first row: 
"subject":{"cid":"bafyreidjvrcmckkm765mct5fph36x7kupkfo35rjklbf2k76xkzwyiauge","uri":"at://did:plc:azrv4rcbws6kmcga4fsbphg2/app.bsky.feed.post/3lbgjdpbiec2l"}
   second row: "subject":"did:plc:r7cdh4sgzqbfdc6wcdxxti7c"
   ```
   The same key(column) can have incompatible type for json, I think this is 
more about https://github.com/apache/datafusion/issues/7845. According to the 
following query, looks basically what they do is treat the field of json as 
column and do some aggregate for them. I think for this case choice 1 of 
https://github.com/apache/datafusion/issues/7845#issuecomment-1767459328 (add 
Json/Jsonb type to Arrow) may be the appropriate one. I guess this maybe also 
the reason clickhouse new json type 
design(https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse)
 It also can be a good reference if we want to choose this solution.🤔
   ```
   SELECT data.commit.collection AS event, count() AS count FROM bluesky GROUP 
BY event ORDER BY count DESC;
   SELECT data.commit.collection AS event, count() AS count, 
uniqExact(data.did) AS users FROM bluesky WHERE data.kind = 'commit' AND 
data.commit.operation = 'create' GROUP BY event ORDER BY count DESC;
   SELECT data.commit.collection AS event, 
toHour(fromUnixTimestamp64Micro(data.time_us)) as hour_of_day, count() AS count 
FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' 
AND data.commit.collection in ['app.bsky.feed.post', 'app.bsky.feed.repost', 
'app.bsky.feed.like'] GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
   SELECT data.did::String as user_id, 
min(fromUnixTimestamp64Micro(data.time_us)) as first_post_ts FROM bluesky WHERE 
data.kind = 'commit' AND data.commit.operation = 'create' AND 
data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY 
first_post_ts ASC LIMIT 3;
   SELECT data.did::String as user_id, date_diff( 'milliseconds', 
min(fromUnixTimestamp64Micro(data.time_us)), 
max(fromUnixTimestamp64Micro(data.time_us))) AS activity_span FROM bluesky 
WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND 
data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY 
activity_span DESC LIMIT 3;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to