13MB seems to be fine in our experience. We have keys that could take more than 100MB.
Sent from my iPhone

> On 17-Nov-2015, at 7:47 PM, Yuri Shkuro <y...@uber.com> wrote:
>
> You can also subdivide the hourly partition further by adding an artificial
> "bucket" field to the partition key, which you populate with a random number,
> say between 0 and 10. When you query, you fan out 10 queries, one for each
> bucket, and you do a manual merge of the results. This way you pay the cost
> at read time, but you spread the writes across more nodes, avoiding a single
> hotspot for each hour.
>
>> On Nov 17, 2015 8:22 AM, "Jack Krupansky" <jack.krupan...@gmail.com> wrote:
>>
>> I'd be more comfortable keeping partition size below 10MB, but the more
>> critical factor is the write rate. In a technical sense, a single node (and
>> its replicas) and a single partition will be a hotspot, since all writes
>> for an extended period of time (one hour) will go to that single node and
>> partition, but the real issue is whether the rate of writes to that single
>> node/partition is excessive. In your specific case, 416K rows written per
>> hour works out to only about 115 rows written per second, which should be
>> a fairly light load.
>>
>> Your query spans two hours, which means two partitions that won't
>> necessarily be on the same node. Cassandra may or may not optimize that
>> case. If it works for you, great, but if it doesn't, you can split it into
>> two queries that you submit in parallel, each scanning rows in only a
>> single partition.
>>
>> -- Jack Krupansky
>>
>>> On Tue, Nov 17, 2015 at 5:29 AM, Chandra Sekar KR
>>> <chandraseka...@hotmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have a time-series table with the structure and partition
>>> size/volumetrics below. The purpose of this table is to enable
>>> range-based scans on log_ts and retrieve the matching log_id values,
>>> which are then used to look up the actual data in the main table
>>> (EVENT_LOG). EVENT_LOG_BY_DATE acts as a lookup (index) into the main
>>> table.
>>>
>>> CREATE TABLE EVENT_LOG_BY_DATE (
>>>     YEAR INT,
>>>     MONTH INT,
>>>     DAY INT,
>>>     HOUR INT,
>>>     LOG_TS TIMESTAMP,
>>>     LOG_ID VARINT,
>>>     PRIMARY KEY ((YEAR, MONTH, DAY, HOUR), LOG_TS))
>>> WITH CLUSTERING ORDER BY (LOG_TS DESC);
>>>
>>> SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
>>> WHERE YEAR = 2015 AND
>>>       MONTH = 11 AND
>>>       DAY = 15 AND
>>>       HOUR IN (10, 11) AND
>>>       LOG_TS >= '2015-11-15 10:00:00+0000' AND
>>>       LOG_TS <= '2015-11-15 11:00:00+0000';
>>>
>>> The average daily volume of records for this table is ~10 million and the
>>> average row size is ~40 bytes. The partition size for an hour comes to
>>> about 13MB, with each partition spanning 416K rows. Will partitioning on
>>> PRIMARY KEY ((YEAR, MONTH, DAY, HOUR)) cause any hotspot issues on a
>>> node, given the hourly data size is ~13MB?
>>>
>>> Is there an alternate way to model this time-series table that still
>>> enables range scans?
>>>
>>> Regards,
>>> Chandra KR
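
For reference, the two-query split Jack suggests would look something like
this, with the time bounds copied from the original query. Each statement
reads exactly one partition, and the two can be submitted in parallel from
the client:

SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 10
  AND LOG_TS >= '2015-11-15 10:00:00+0000';

SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 11
  AND LOG_TS <= '2015-11-15 11:00:00+0000';

Each partition comes back already sorted descending on LOG_TS, so the client
only has to concatenate the hour-11 result set ahead of the hour-10 one to
preserve the overall ordering.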
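And a minimal sketch of the bucketed variant Yuri describes. The table name,
the BUCKET column, and the choice of exactly 10 buckets (values 0-9) are
illustrative, not from the thread:

CREATE TABLE EVENT_LOG_BY_DATE_BUCKETED (
    YEAR INT,
    MONTH INT,
    DAY INT,
    HOUR INT,
    BUCKET INT,          -- writer picks a random value in 0-9
    LOG_TS TIMESTAMP,
    LOG_ID VARINT,
    PRIMARY KEY ((YEAR, MONTH, DAY, HOUR, BUCKET), LOG_TS))
WITH CLUSTERING ORDER BY (LOG_TS DESC);

At the volumes quoted above, each hourly partition would shrink to roughly
1.3MB and ~42K rows, and the writes for a given hour spread across up to 10
nodes instead of one.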
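Reading it back then means one query per bucket, with a client-side merge on
LOG_TS, since each bucket only returns its own rows in timestamp order, e.g.:

-- repeat for BUCKET = 0 through 9 and merge the sorted result sets
SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE_BUCKETED
WHERE YEAR = 2015 AND MONTH = 11 AND DAY = 15 AND HOUR = 10
  AND BUCKET = 0
  AND LOG_TS >= '2015-11-15 10:00:00+0000';

Since BUCKET is the last partition-key column, BUCKET IN (0,1,2,3,4,5,6,7,8,9)
is also valid CQL and moves the fan-out onto the coordinator node, but the
manual fan-out Yuri describes keeps that read-time cost on the client instead.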