We are in product development, and batch size depends on the customers buying our product. Large customers may produce huge batches while small customers may produce much smaller ones. So we don't know upfront how many buckets per batch will be required, and we don't want to ask our customers for additional configuration such as an average batch size. So we are planning to use dynamic bucketing. Every row in the primary table is associated with only one batch.
Comments required on the following:
1. Any suggestions on the proposed design?
2. What is the best approach for updating/deleting from the index table? When a row is manually purged from the primary table, we don't know in which of the x buckets created for its batch id that row key exists.

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "sean_r_dur...@homedepot.com" <sean_r_dur...@homedepot.com>
Date: Fri, 24 Jul, 2015 at 5:39 pm
Subject: RE: Manual Indexing With Buckets

It is a bit hard to follow. Perhaps you could include your proposed schema (annotated with your size predictions) to spur more discussion. To me, it sounds a bit convoluted. Why is a “batch” so big (up to 100 million rows)? Is a row in the primary only associated with one batch?

Sean Durity – Cassandra Admin, Big Data Team
To engage the team, create a request

From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in]
Sent: Friday, July 24, 2015 3:57 AM
To: user@cassandra.apache.org
Subject: Re: Manual Indexing With Buckets

Can anyone take this one?

Thanks
Anuj

From: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
Date: Thu, 23 Jul, 2015 at 10:57 pm
Subject: Manual Indexing With Buckets

We have a primary table and we need search capability by the batchid column, so we are creating a manual index for search by batch id. We are using buckets to restrict each row in the batch id index table to 50 MB. As batch size may vary drastically (i.e., one batch id may be associated with 100k row keys in the primary table while another may be associated with 100 million row keys), we are creating a metadata table to track the approximate data inserted for a batch in the primary table, so that the batch id index table has a dynamic number of buckets/rows. As more data is inserted for a batch in the primary table, a new set of 10 buckets is added. At any point in time, clients write to the latest 10 buckets created for a batch id index in round robin to avoid hotspots.

Comments required on the following:
1.
Any suggestions on the above design?
2. What is the best approach for updating/deleting from the index table? When a row is manually purged from the primary table, we don't know in which of the x buckets created for its batch id that row key exists.

Thanks
Anuj
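For discussion, the write path described in the thread (a metadata table tracking approximate batch size, buckets added 10 at a time, clients writing to the latest 10 buckets in round robin) can be sketched as follows. This is only a minimal illustration: plain Python dicts stand in for the Cassandra index and metadata tables, and the growth trigger, entry sizes, and structure names are assumptions, not the actual schema.

```python
# Sketch of the dynamic bucketing scheme from the thread. In-memory dicts
# stand in for the Cassandra tables; sizes and the growth rule are assumed.

BUCKET_GROUP_SIZE = 10                 # buckets are added 10 at a time
MAX_BUCKET_BYTES = 50 * 1024 * 1024    # ~50 MB target per index row


class BatchIndex:
    def __init__(self):
        self.metadata = {}  # batch_id -> {"buckets": n, "group_bytes": b}
        self.index = {}     # (batch_id, bucket_no) -> [row_key, ...]
        self._rr = {}       # batch_id -> round-robin write counter

    def insert(self, batch_id, row_key, entry_size=100):
        meta = self.metadata.setdefault(
            batch_id, {"buckets": BUCKET_GROUP_SIZE, "group_bytes": 0})
        # Assumed growth rule: once the current group of 10 buckets is
        # approximately full, add another group of 10, so each batch gets
        # a dynamic number of buckets.
        if meta["group_bytes"] + entry_size > BUCKET_GROUP_SIZE * MAX_BUCKET_BYTES:
            meta["buckets"] += BUCKET_GROUP_SIZE
            meta["group_bytes"] = 0
        meta["group_bytes"] += entry_size

        # Clients write to the *latest* 10 buckets in round robin
        # to avoid hotspots.
        group_start = meta["buckets"] - BUCKET_GROUP_SIZE
        rr = self._rr.get(batch_id, 0)
        self._rr[batch_id] = rr + 1
        bucket = group_start + (rr % BUCKET_GROUP_SIZE)
        self.index.setdefault((batch_id, bucket), []).append(row_key)
        return bucket


idx = BatchIndex()
buckets_used = {idx.insert("batch-1", f"row-{i}") for i in range(20)}
assert buckets_used == set(range(10))  # writes cycle over the first 10 buckets
```

The sketch also makes the purge problem from question 2 visible: the bucket a row key lands in depends on a per-client counter, so on delete there is no way to recompute it, and a purge would have to either scan all buckets for the batch or rely on some reverse lookup maintained at write time.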