amrishlal commented on issue #7229: URL: https://github.com/apache/pinot/issues/7229#issuecomment-893667004
> I haven't really given much thought to how this design would behave when there are too many segments to load. At 2T/day with 500M segments, we have around 4000 segments per day. With a retention of 30 days, we are looking at 4000 * 30 = 120,000 segments. If someone issues a query that literally scans the last 30 days of data, we might have to load all of them (yikes!). Maybe this can be controlled with a max-segment config, as has been done in this PR?
>
> Two things to consider here:
>
> - I am wondering whether data in S3 needs to have the same granularity as the data in Pinot, or whether we can aggregate it to a coarser granularity while aging it out to S3. For example, if data in the latest segment has a granularity of 1 second, then data in a segment 10 days old could have a granularity of 10 seconds (reducing the data size by a factor of 10), and data 30 days old could have a granularity of 1 hour (reducing the data size by a factor of ~3600).
> - Also, would adding a segment cache between Pinot and S3 help with latency? Usually I would expect some locality of reference when we pull in data from S3.
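The granularity rollup in the first bullet can be sketched as below. This is a hypothetical standalone helper, not Pinot code: the `(timestamp, value)` record layout, the `rollup` function name, and the choice of sum as the aggregation are all assumptions for illustration. It shows how coarsening 1-second data to 10-second and 1-hour buckets shrinks the row count by the corresponding factors while preserving the aggregate.

```python
from collections import defaultdict

def rollup(records, bucket_seconds):
    """Aggregate (epoch_seconds, value) records into coarser time buckets.

    Each bucket keeps the sum of its values; min/max/count could be kept
    alongside, depending on which queries the aged data must still serve.
    """
    buckets = defaultdict(float)
    for ts, value in records:
        # Truncate the timestamp down to the start of its bucket.
        buckets[ts - ts % bucket_seconds] += value
    return sorted(buckets.items())

# Two hours of 1-second-granularity records, value 1.0 each.
records = [(t, 1.0) for t in range(7200)]

recent = rollup(records, 10)    # 10-second buckets (e.g. ~10-day-old data)
old = rollup(records, 3600)     # 1-hour buckets (e.g. ~30-day-old data)

print(len(records), len(recent), len(old))  # → 7200 720 2
print(sum(v for _, v in old))               # → 7200.0 (totals preserved)
```

The row count drops 10x and 3600x respectively, which is where the size-reduction factors in the bullet come from; the trade-off is that sub-bucket detail is irrecoverable once the raw segment is aged out.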
