Re: Good partition key doubt

José Guilherme Vanz Mon, 15 Dec 2014 12:59:35 -0800

Nice, I got it. =]
If I have more questions, I'll send another email. xD
Thank you


On Thu, Dec 11, 2014 at 12:17 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
> "what is a good partition key? Is the partition key directly related to my
> query performance? What are the best practices?"
>
> A good partition key is one that will scale with your data. For
> example, if your business involves individuals, your data will likely
> grow as the number of users grows. In this case user_id is a good
> partition key because the users will be distributed uniformly across
> all the Cassandra nodes.
>
> For your log example, using only server_id as the partition key is clearly
> not enough, because what will scale is the number of log lines, not the
> number of servers.
>
> From the point of view of scalability (not talking about query-ability),
> adding the log_type will not scale either, because the number of different
> log types is likely to be small. For great scalability (again, not talking
> about query-ability), the pair (server_id, log_timestamp) is likely a good
> combination.
>
> Now for querying: as you may know, it is not possible to run a range query
> (using <, ≤, ≥, >) over the partition key. You must always use equality (=),
> so you won't be able to leverage the log_timestamp component in the
> partition key for your query.
>
> Bucketing by date is a good idea though, and the date resolution will
> depend on the log generation rate. If logs are generated very often, maybe
> bucket by hour. If the generation rate is lower, maybe a day or a week
> bucket is fine.
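
A minimal sketch of how a client could compute such a date bucket before
inserting a row, assuming Python on the client side and timezone-aware
timestamps that are normalized to UTC. The helper name `date_bucket` and its
`resolution` parameter are hypothetical, chosen only for illustration:

```python
from datetime import datetime, timezone

# Hypothetical mapping from a bucket resolution to a strftime pattern.
_FORMATS = {"day": "%Y%m%d", "hour": "%Y%m%d%H"}


def date_bucket(ts: datetime, resolution: str = "day") -> int:
    """Return an integer bucket in YYYYMMDD (day) or YYYYMMDDHH (hour) form."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    if resolution not in _FORMATS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    # Normalize to UTC so every writer agrees on the bucket boundaries.
    return int(ts.astimezone(timezone.utc).strftime(_FORMATS[resolution]))


ts = datetime(2014, 12, 11, 12, 17, tzinfo=timezone.utc)
print(date_bucket(ts))          # 20141211
print(date_bucket(ts, "hour"))  # 2014121112
```

The resulting integer is what would be stored in a date_bucket column; a
coarser or finer resolution only changes the format pattern.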
>
> As for log_type, putting it into the partition key will help partition
> the data further, in addition to the date bucket. However, be aware that
> it forces you to always provide a log_type whenever you query.
>
> An example data model for your logs could be:
>
> CREATE TABLE logs_by_server_and_type_and_date(
>    server_id int,
>    log_type text,
>    date_bucket int, // Date bucket using format YYYYMMDD or YYYYMMDDHH or ...
>    log_timestamp timeuuid,
>    log_info text,
>    PRIMARY KEY((server_id,log_type,date_bucket),log_timestamp)
> );
>
>
> "And if I want to query all logs in a period of time, how can I select a
> range of rows?" --> New query path = new table
>
> CREATE TABLE logs_by_date(
>    date_bucket int, // Date bucket using format YYYYMMDD or YYYYMMDDHH or ...
>    log_timestamp timeuuid,
>    server_id int,
>    log_type text,
>    log_info text,
>    // you may optionally add server_id or log_type as clustering columns
>    PRIMARY KEY((date_bucket),log_timestamp)
> );
>
> For this table, the date_bucket resolution should be chosen very carefully,
> because each bucket is going to store the logs of ALL servers and all
> types ...
>
> For the query, you should provide the date bucket as the partition key,
> and then use (<, ≤, ≥, >) on the log_timestamp column.
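
Because the partition key only accepts equality, reading a period that spans
several buckets means issuing one query per bucket. A rough sketch of
enumerating those buckets, assuming day-resolution buckets (YYYYMMDD), UTC
boundaries, and Python on the client; `day_buckets` is a hypothetical helper,
not part of any driver:

```python
from datetime import datetime, timedelta, timezone
from typing import List


def day_buckets(start: datetime, end: datetime) -> List[int]:
    """List the YYYYMMDD buckets covering [start, end], both ends inclusive."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("expected timezone-aware datetimes")
    if end < start:
        raise ValueError("end must not be earlier than start")
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(int(day.strftime("%Y%m%d")))
        day += timedelta(days=1)
    return buckets


start = datetime(2014, 12, 30, 22, 0, tzinfo=timezone.utc)
end = datetime(2015, 1, 2, 3, 0, tzinfo=timezone.utc)
print(day_buckets(start, end))  # [20141230, 20141231, 20150101, 20150102]
```

Each returned value would then be used as `date_bucket = <value>` in its own
query, with the range condition on log_timestamp narrowing the first and last
buckets to the requested period.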
>
>
> On Thu, Dec 11, 2014 at 12:00 PM, José Guilherme Vanz <
> guilherme....@gmail.com> wrote:
>
>> Hello folks
>>
>> I have been studying Cassandra for a short period of time, and now I am
>> modeling a database for study purposes. During my modeling I ran into a
>> question: what is a good partition key? Is the partition key directly
>> related to my query performance? What are the best practices?
>>
>> As a case study, let's suppose I have a column family that stores all
>> kinds of log data (http server, application server, application logs,
>> etc.) from different servers. In this column family I have a server_id
>> column (unique identifier for each server), a log_type column (http
>> server, application server, application log) and a log_info column.
>> Is it a good idea to create a partition key using the server_id and
>> log_type columns, to store all log data of a specific type and server in
>> one physical row? And what if I want a physical row for each day? Is it a
>> good idea to add a third column with the date to the partition key? And if
>> I want to query all logs in a period of time, how can I select a range of
>> rows? Do I have to duplicate the date column (considering I have to use
>> the = operator with the partition key)?
>>
>> All the best
>> --
>> Regards, José Guilherme Vanz
>> br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
>> <http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/>
>> "Suffering is temporary, giving up is forever" - Bernardo Fonseca,
>> Antarctic Ice Marathon record holder.
>>
>
>

-- 
Regards, José Guilherme Vanz
br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
<http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/>
"Suffering is temporary, giving up is forever" - Bernardo Fonseca,
Antarctic Ice Marathon record holder.
