Nice, I got it. =] If I have more questions I'll send other emails. xD Thank you
On Thu, Dec 11, 2014 at 12:17 PM, DuyHai Doan <doanduy...@gmail.com> wrote: > > "what is a good partition key? Is partition key direct related with my > query performance? What is the best practices?" > > A good partition key is a partition key that will scale with your data. An > example: if you have a business involving individuals, it is likely that > your business will scale as soon as the number of users will grow. In this > case user_id is a good partition key because all the users will > be uniformly distributed over all the Cassandra nodes. > > For your log example, using only server_id for partition key is clearly > not enough because what will scale is the log lines, not the number of > server. > > From the point of view of scalability (not taking about query-ability), > adding the log_type will not scale either, because the number of different > log types is likely to be a small set. For great scalability (not taking > about query-ability), the couple (server_id,log_timestamp) is likely a good > combination. > > Now for query, as you should know, it is not possible to have range query > (using <, ≤, ≥, >) over partition key, you must always use equality (=) so > you won't be able to leverage the log_timestamp component in the partition > key for your query. > > Bucketing by date is a good idea though, and the date resolution will > depends on the log generation rate. If logs are generated very often, maybe > a bucket by hour. If the generation rate is smaller, maybe a day or a week > bucket is fine. > > Talking about log_type, putting it into the partition key will help > partitioning further, in addition of the date bucket. However it forces you > to always provide a log_type whenever you want to query, be aware of this. > > An example of data model for your logs could be > > CREATE TABLE logs_by_server_and_type_and_date( > server_id int, > log_type text, > date_bucket int, //Date bucket using format YYYYMMDD or YYYYMMDDHH or > ... > log_timestamp timeuuid, > log_info text, > PRIMARY KEY((server_id,log_type,date_bucket),log_timestamp) > ); > > > "And if I want to query all logs in a period of time how can I select I > range o rows?" --> New query path = new table > > CREATE TABLE logs_by_date( > date_bucket int, //Date bucket using format YYYYMMDD or YYYYMMDDHH or > ... > log_timestamp timeuuid, > server_id int, > log_type text, > log_info text, > PRIMARY KEY((date_bucket),log_timestamp) // you may add server_id or > log_type as clustering column optionally > ); > > For this table, the date_bucket should be chosen very carefully because > for the same bucket, we're going to store logs of ALL servers and all types > ... > > For the query, you should provide the date bucket as partition key, and > then use (<, ≤, ≥, >) on the log_timestamp column > > > On Thu, Dec 11, 2014 at 12:00 PM, José Guilherme Vanz < > guilherme....@gmail.com> wrote: > >> Hello folks >> >> I am studying Cassandra for a short a period of time and now I am >> modeling a database for study purposes. During my modeling I have faced a >> doubt, what is a good partition key? Is partition key direct related with >> my query performance? What is the best practices? >> >> Just to study case, let's suppose I have a column family where is >> inserted all kind of logs ( http server, application server, application >> logs, etc ) data from different servers. In this column family I have >> server_id ( unique identifier for each server ) column, log_type ( http >> server, application server, application log ) column and log_info column. >> Is a good ideia create a partition key using server_id and log_type columns >> to store all logs data from a specific type and server in a physical row? >> And if do I want a physical row for each day? Is a good idea add a third >> column with the date in the partition key? And if I want to query all logs >> in a period of time how can I select I range o rows? Do I have to duplicate >> date column ( considering I have to use = operator with partition key ) ? >> >> All the best >> -- >> Att. José Guilherme Vanz >> br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/ >> <http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/> >> "O sofrimento é passageiro, desistir é para sempre" - Bernardo Fonseca, >> recordista da Antarctic Ice Marathon. >> > > -- Att. José Guilherme Vanz br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/ <http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/> "O sofrimento é passageiro, desistir é para sempre" - Bernardo Fonseca, recordista da Antarctic Ice Marathon.