Hello all

 I've read some material on the net about Cassandra anti-patterns, among
which the very wide row is mentioned.

 The main reasons to avoid overly wide rows are:

 1) fragmentation of data across multiple SSTables when the row is very wide,
leading to very slow reads by slice query

 2) inefficient repair. During repair C* exchanges hashes of row data.
Even if only one column differs, C* still exchanges the whole row. Having
very wide rows makes repair very expensive

 3) bad scaling. Having wide rows localized on some nodes of your cluster
will create hotspots

 4) hard limit of 2*10⁹ columns per "physical" row

All those recommendations are quite sensible. Now my customer has a quite
specific use case:

 a. no repair and no durability requirement. C* is used to dump massive data
(heavy write + read) for temporary processing. The tables are truncated at the
end of a long-running job. So point 2) does not apply here

 b. the maximum number of items to be processed is 24*10⁶, far below the hard
limit of 2*10⁹ columns, so point 4) does not apply either

 c. small cluster of only 2 nodes, so load balancing is quite
straightforward (50% roughly on each node). Therefore point 3) does not
apply either

 The only drawback of an ultra-wide row that I can see is point 1). But if I use
leveled compaction with a sufficiently large value for "sstable_size_in_mb"
(let's say 200 MB), will my read performance be impacted as the row grows?
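
 For scale, here is a rough back-of-the-envelope estimate of the raw payload of
one such row. It is only a sketch under loud assumptions: worst case where all
24*10⁶ items sit under a single status, 16 bytes per timeuuid and 8 bytes per
64-bit userid, and no per-cell storage overhead counted (the real on-disk size
will be larger). The constant and function names are mine.

```python
# Rough lower bound on the raw payload of one wide row.
# Assumptions (not measured): all items share one status value,
# and per-cell storage overhead is ignored.
ITEMS = 24 * 10**6      # maximum number of items, from point b.
TIMEUUID_BYTES = 16     # a timeuuid is a 128-bit UUID
USERID_BYTES = 8        # assuming a 64-bit integer userid


def raw_payload_bytes(items: int) -> int:
    """Bytes of clustering key + value only, without any storage overhead."""
    return items * (TIMEUUID_BYTES + USERID_BYTES)


if __name__ == "__main__":
    total = raw_payload_bytes(ITEMS)
    print(f"{total / 10**6:.0f} MB of raw payload")
```

 Even this lower bound is larger than the 200 MB SSTable size mentioned above.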

 Of course, splitting the wide row into several rows using a bucketing technique
is one solution, but it forces us to keep track of the bucket number, which is
not convenient. We have one process (JVM) that inserts data and another
process (JVM) that reads data. With bucketing, we would need to synchronize the
bucket number between the 2 processes.
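
 One possible way around the synchronization, sketched below, would be to derive
the bucket from the timeuuid itself, so that writer and reader each compute it
independently. This is only an illustration, not something we have in place: the
one-hour bucket width, the function names, and the idea of a (status, bucket)
partition key are all hypothetical, and it assumes both processes have roughly
synchronized clocks.

```python
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
BUCKET_SECONDS = 3600  # hypothetical bucket width: one hour


def unix_seconds(timeuuid: uuid.UUID) -> int:
    """Whole seconds since the Unix epoch encoded in a version 1 (time) UUID."""
    return (timeuuid.time - UUID_EPOCH_OFFSET) // 10_000_000


def bucket_of(timeuuid: uuid.UUID) -> int:
    """Bucket number the writer would store alongside the status."""
    return unix_seconds(timeuuid) // BUCKET_SECONDS


def buckets_to_read(last_processed: uuid.UUID, now_seconds: int) -> list[int]:
    """Buckets the reader must scan, from the last processed item up to now."""
    first = bucket_of(last_processed)
    last = now_seconds // BUCKET_SECONDS
    return list(range(first, last + 1))


if __name__ == "__main__":
    inserted = uuid.uuid1()  # stands in for the insertiondate timeuuid
    print(bucket_of(inserted))
    print(buckets_to_read(inserted, unix_seconds(inserted) + 2 * BUCKET_SECONDS))
```

 The reader would then issue one slice query per bucket, applying the
insertiondate lower bound only to the first bucket. The cost is several queries
instead of one.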

 For information, below is the wide row table definition:

 CREATE TABLE widerow(
  status text,
  insertiondate timeuuid,
  userid bigint,
  PRIMARY KEY (status, insertiondate));

 The widerow table tracks user insertion status (status: {TODO,
IMPORTED, CHECKED}).

 The read pattern is always:

 SELECT userid FROM widerow WHERE status = 'xxx' AND
insertiondate>{last_processed_user_insertion_date}




I'd be interested in your insights and remarks about this data model.

 Regards

 Duy Hai DOAN
