13MB seems to be fine in our experience. We have keys that could take more than 100MB.
Sent from my iPhone

> On 17-Nov-2015, at 7:47 PM, Yuri Shkuro <y...@uber.com> wrote:
>
> You can also subdivide the hourly partition further by adding an artificial
> "bucket" field to the partition key, which you populate with a random number,
> say between 0 and 10. When you query, you fan out 10 queries, one for each
> bucket, and you do a manual merge of the results. This way you pay the cost
> at read time, but you spread the writes across more nodes, avoiding a single
> hotspot for each hour.
>
>> On Nov 17, 2015 8:22 AM, "Jack Krupansky" <jack.krupan...@gmail.com> wrote:
>>
>> I'd be more comfortable keeping partition size below 10MB, but the more
>> critical factor is the write rate. In a technical sense, a single node (and
>> its replicas) and a single partition will be a hotspot, since all writes
>> for an extended period of time (one hour) will go to that single node and
>> partition, but the real issue is whether the rate of writes to that single
>> node/partition is excessive. In your specific case, 416K rows written per
>> hour works out to only about 115 rows written per second, which should be
>> a fairly light load.
>>
>> Your query spans two hours, which means two partitions that won't
>> necessarily be on the same node. Cassandra may or may not optimize that
>> case. If it works for you, great, but if it doesn't, you can split it into
>> two queries that you submit in parallel, each scanning rows in only a
>> single partition.
>>
>> -- Jack Krupansky
>>
>>> On Tue, Nov 17, 2015 at 5:29 AM, Chandra Sekar KR
>>> <chandraseka...@hotmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have a time-series table with the structure and partition
>>> size/volumetrics below. The purpose of this table is to enable
>>> range-based scans on log_ts and retrieve the matching log_id values,
>>> which are then used to look up the actual data in the main table
>>> (EVENT_LOG). EVENT_LOG_BY_DATE acts as a lookup (index) into the main
>>> table.
>>>
>>> CREATE TABLE EVENT_LOG_BY_DATE (
>>>     YEAR INT,
>>>     MONTH INT,
>>>     DAY INT,
>>>     HOUR INT,
>>>     LOG_TS TIMESTAMP,
>>>     LOG_ID VARINT,
>>>     PRIMARY KEY ((YEAR, MONTH, DAY, HOUR), LOG_TS))
>>> WITH CLUSTERING ORDER BY (LOG_TS DESC);
>>>
>>> SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
>>> WHERE YEAR = 2015 AND
>>>       MONTH = 11 AND
>>>       DAY = 15 AND
>>>       HOUR IN (10, 11) AND
>>>       LOG_TS >= '2015-11-15 10:00:00+0000' AND
>>>       LOG_TS <= '2015-11-15 11:00:00+0000';
>>>
>>> The average daily volume of records for this table is ~10 million and the
>>> average row size is ~40 bytes. The partition size for an hour comes to
>>> about 13MB, with each partition spanning 416K rows. Will partitioning on
>>> PRIMARY KEY ((YEAR, MONTH, DAY, HOUR)) cause any hotspot issues on a
>>> node, given the hourly data size is ~13MB?
>>>
>>> Is there an alternate way to model this time-series table that still
>>> enables range scans?
>>>
>>> Regards,
>>> Chandra KR
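
For reference, the two-query split Jack suggests would look something like
this, with the time bounds copied from the original query. Each statement
reads exactly one partition, and the two can be submitted in parallel from
the client:

SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 10
  AND LOG_TS >= '2015-11-15 10:00:00+0000';

SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 11
  AND LOG_TS <= '2015-11-15 11:00:00+0000';

Each partition comes back already sorted descending on LOG_TS, so the client
only has to concatenate the hour-11 result set ahead of the hour-10 one to
preserve the overall ordering.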
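And a minimal sketch of the bucketed variant Yuri describes. The table name,
the BUCKET column, and the choice of exactly 10 buckets (values 0-9) are
illustrative, not from the thread:

CREATE TABLE EVENT_LOG_BY_DATE_BUCKETED (
    YEAR INT,
    MONTH INT,
    DAY INT,
    HOUR INT,
    BUCKET INT,          -- writer picks a random value in 0-9
    LOG_TS TIMESTAMP,
    LOG_ID VARINT,
    PRIMARY KEY ((YEAR, MONTH, DAY, HOUR, BUCKET), LOG_TS))
WITH CLUSTERING ORDER BY (LOG_TS DESC);

At the volumes quoted above, each hourly partition would shrink to roughly
1.3MB and ~42K rows, and the writes for a given hour spread across up to 10
nodes instead of one.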
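Reading it back then means one query per bucket, with a client-side merge on
LOG_TS, since each bucket only returns its own rows in timestamp order, e.g.:

-- repeat for BUCKET = 0 through 9 and merge the sorted result sets
SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE_BUCKETED
WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 10
  AND BUCKET = 0
  AND LOG_TS >= '2015-11-15 10:00:00+0000';

Since BUCKET is the last partition-key column, BUCKET IN (0,1,2,3,4,5,6,7,8,9)
is also valid CQL and moves the fan-out onto the coordinator node, but the
manual fan-out Yuri describes keeps that read-time cost on the client instead.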