You can also subdivide the hourly partition further by adding an artificial "bucket" field to the partition key, which you populate with a random number, say between 0 and 9. When you query, you fan out 10 queries, one for each bucket, and do a manual merge of the results. This way you pay the cost at read time, but you spread the writes across more nodes, avoiding a single hotspot for each hour.
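As a sketch, a bucketed variant of the schema quoted below might look like this (the table name EVENT_LOG_BY_DATE_BUCKETED, the BUCKET column, and the choice of 10 buckets are illustrative assumptions, not from the thread):

    CREATE TABLE EVENT_LOG_BY_DATE_BUCKETED (
        YEAR INT,
        MONTH INT,
        DAY INT,
        HOUR INT,
        BUCKET INT,  -- random value 0-9, chosen at write time
        LOG_TS TIMESTAMP,
        LOG_ID VARINT,
        PRIMARY KEY ((YEAR, MONTH, DAY, HOUR, BUCKET), LOG_TS))
    WITH CLUSTERING ORDER BY (LOG_TS DESC);

    -- one query per bucket, issued in parallel; the client merges the
    -- 10 result sets by LOG_TS
    SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE_BUCKETED
    WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 10
        AND BUCKET = 0  -- ...repeat for BUCKET = 1 through 9
        AND LOG_TS >= '2015-11-15 10:00:00+0000';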
On Nov 17, 2015 8:22 AM, "Jack Krupansky" <jack.krupan...@gmail.com> wrote:

> I'd be more comfortable keeping partition size below 10MB, but the more
> critical factor is the write rate. In a technical sense a single node (and
> its replicas) and a single partition will be a hotspot, since all writes
> for an extended period of time (one hour) will go to that single node and
> partition, but the real issue is whether the rate of writes to that single
> node/partition is excessive. In your specific case, 416K rows written per
> hour works out to only about 115 rows written per second, which should be
> a fairly light load.
>
> Your query spans two hours, which means two partitions that won't
> necessarily be on the same node. Cassandra may or may not optimize that
> case. If it works for you, great, but if it doesn't, you can split it into
> two queries that you submit in parallel, each scanning rows in only a
> single partition.
>
> -- Jack Krupansky
>
> On Tue, Nov 17, 2015 at 5:29 AM, Chandra Sekar KR <
> chandraseka...@hotmail.com> wrote:
>
>> Hi,
>>
>> I have a time-series table with the below structure and partition
>> size/volumetrics. The purpose of this table is to enable range-based
>> scans on LOG_TS and filter out the LOG_ID values, which can then be used
>> against the main table (EVENT_LOG) for checking the actual data.
>> EVENT_LOG_BY_DATE acts as a lookup (index) to the main table.
>>
>> CREATE TABLE EVENT_LOG_BY_DATE (
>>     YEAR INT,
>>     MONTH INT,
>>     DAY INT,
>>     HOUR INT,
>>     LOG_TS TIMESTAMP,
>>     LOG_ID VARINT,
>>     PRIMARY KEY ((YEAR, MONTH, DAY, HOUR), LOG_TS))
>> WITH CLUSTERING ORDER BY (LOG_TS DESC);
>>
>> SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
>> WHERE YEAR = 2015 AND
>>     MONTH = 11 AND
>>     DAY = 15 AND
>>     HOUR IN (10,11) AND
>>     LOG_TS >= '2015-11-15 10:00:00+0000' AND
>>     LOG_TS <= '2015-11-15 11:00:00+0000';
>>
>> The average daily volume of records for this table is ~10 million and
>> the average row size is ~40B. The partition size for an hour comes to
>> about 13MB, with each partition spanning 416K rows. Will partitioning on
>> ((YEAR, MONTH, DAY, HOUR)) cause any hotspot issues on a node, given the
>> hourly data size is ~13MB?
>>
>> Is there an alternate way to model the above time-series table that
>> still enables range scans?
>>
>> Regards, Chandra KR
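For reference, the split Jack suggests would look something like the following two single-partition queries, submitted in parallel from the client and merged there (a sketch; the bounds mirror the original query, and it assumes HOUR is always derived from LOG_TS):

    -- partition for hour 10; the upper bound is implicit, since every
    -- row in this partition falls within hour 10
    SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
    WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 10
        AND LOG_TS >= '2015-11-15 10:00:00+0000';

    -- partition for hour 11; given the original upper bound, this
    -- contributes only rows stamped exactly 11:00:00
    SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
    WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 11
        AND LOG_TS <= '2015-11-15 11:00:00+0000';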