amrishlal commented on issue #7229: URL: https://github.com/apache/pinot/issues/7229#issuecomment-893667004
> I haven't really given much thought to how this design would behave when there are too many segments to load. At 2T/day with 500M segments, we have around 4000 segments per day. With a retention of 30 days, we are looking at 4000 * 30 = 120,000 segments. If someone issues a query that literally scans the last 30 days of data, we might have to load all of them (yikes!). Maybe this can be controlled with a max-segment config, as has been done in this PR?
>
> Two things to consider here:
>
> - I am wondering whether data in S3 needs to have the same granularity as the data in Pinot, or whether we can aggregate it to a coarser granularity while aging it out to S3. For example, if data in the latest segment has a granularity of 1 second, then data in a segment 10 days old could have a granularity of 10 seconds (reducing the data size by a factor of 10), and data 30 days old could have a granularity of 1 hour (reducing the data size by a factor of ~3600).
> - Also, would adding a segment cache between Pinot and S3 help with latency? Usually I would expect some locality of reference when we pull in data from S3.
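The granularity rollup in the first bullet can be sketched as below. This is a hypothetical standalone helper, not Pinot code: the `(timestamp, value)` record layout, the `rollup` function name, and the choice of sum as the aggregation are all assumptions for illustration. It shows how coarsening 1-second data to 10-second and 1-hour buckets shrinks the row count by the corresponding factors while preserving the aggregate.

```python
from collections import defaultdict

def rollup(records, bucket_seconds):
    """Aggregate (epoch_seconds, value) records into coarser time buckets.

    Each bucket keeps the sum of its values; min/max/count could be kept
    alongside, depending on which queries the aged data must still serve.
    """
    buckets = defaultdict(float)
    for ts, value in records:
        # Truncate the timestamp down to the start of its bucket.
        buckets[ts - ts % bucket_seconds] += value
    return sorted(buckets.items())

# Two hours of 1-second-granularity records, value 1.0 each.
records = [(t, 1.0) for t in range(7200)]

recent = rollup(records, 10)    # 10-second buckets (e.g. ~10-day-old data)
old = rollup(records, 3600)     # 1-hour buckets (e.g. ~30-day-old data)

print(len(records), len(recent), len(old))  # → 7200 720 2
print(sum(v for _, v in old))               # → 7200.0 (totals preserved)
```

The row count drops 10x and 3600x respectively, which is where the size-reduction factors in the bullet come from; the trade-off is that sub-bucket detail is irrecoverable once the raw segment is aged out.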
