Generally you need to use a wide row because row keys in Cassandra are ordered by their MD5/Murmur3 hash. As a result you have no way of locating "new" rows, but if the row key is predictable, the columns inside the row are ordered.
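A minimal CQL 3 sketch of that idea (the table and column names here are illustrative, not from the thread): the partition key is predictable, so a reader can compute it, while the clustering column keeps entries inside the partition ordered, which a scan over hashed row keys cannot give you.

    CREATE TABLE todo_queue (
        day   text,       -- predictable partition key, e.g. '2014-02-04'
        added timeuuid,   -- clustering column, ordered within the partition
        login text,
        PRIMARY KEY (day, added)
    );

    -- new entries in a known, predictable partition are easy to find:
    SELECT added, login FROM todo_queue WHERE day = '2014-02-04' LIMIT 100;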
On Tue, Feb 4, 2014 at 12:02 PM, Yogi Nerella <ynerella...@gmail.com> wrote:
> Sorry, I am not understanding the problem. I am new to Cassandra and want to understand this issue.
>
> Why do we need to use a wide row for this situation? Why not a simple table in Cassandra?
>
> todolist (user, state) ==> is there any other information in this table that is needed for processing a todo?
> processedlist (user, state)
>
>
> On Tue, Feb 4, 2014 at 7:50 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> I have actually been building something similar in my spare time. You can hang around and wait for it or build your own. Here are the basics. Not perfect, but it will work.
>>
>> Create a column family "queue" with gc_grace_period = [1 day]
>>
>> set queue[timeuuid()]["z" + timeuuid()] = [work to do]
>>
>> The producer can decide how it wants to roll over the row key and the column key; it does not matter.
>>
>> Suppose there are N consumers. We need a way for the consumers not to do the same work. We can use something like the bakery algorithm. Remember that at QUORUM a reader sees writes.
>>
>> A consumer needs an identifier (it could be another uuid or an IP address).
>> A consumer calls get_range_slice on the queue; the slice is from new byte[] to byte[], limit 100.
>>
>> The consumer sees data like this:
>>
>> [1234] [z-$timeuuid] = data
>>
>> Now we register that this consumer wants to consume this queue:
>>
>> set [1234] [a-$ip] at QUORUM
>>
>> Now we do a slice:
>>
>> get_slice [1234] from new byte[] to 'b'
>>
>> There are a few possible returns.
>>
>> 1) 1 bidder:
>> [1234] [a-$myip]
>> You won; start consuming.
>>
>> 2) 2 bidders:
>> [1234] [a-$myip]
>> [1234] [a-$otherip]
>> Compare $myip vs $otherip; the higher one wins.
>>
>> Whoever wins can then start consuming the columns in the queue and delete them when done.
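A rough CQL 3 translation of the scheme Edward sketches above, for readers more familiar with CQL than with the Thrift API. This is only a sketch: the 'a-'/'z-' prefixes, the QUORUM bids and the one-day gc_grace come from his description, while the table layout and names are assumptions. The Thrift row key becomes the partition key and the column name becomes a text clustering column, so bid columns ('a-...') sort before work columns ('z-...').

    CREATE TABLE queue (
        shard   timeuuid,   -- the row key the producer rolls over as it likes
        name    text,       -- 'a-<consumer id>' for bids, 'z-<timeuuid>' for work
        payload text,
        PRIMARY KEY (shard, name)
    ) WITH gc_grace_seconds = 86400;

    -- producer appends work; the client builds name = 'z-' + timeuuid:
    INSERT INTO queue (shard, name, payload) VALUES (?, ?, ?);

    -- consumer, both statements executed at QUORUM through the driver:
    -- 1) register a bid for this shard; the client builds name = 'a-' + my_id
    INSERT INTO queue (shard, name, payload) VALUES (?, ?, '');
    -- 2) read back every bid (everything sorting before 'b')
    SELECT name FROM queue WHERE shard = ? AND name < 'b';
    -- one bid  -> this consumer won: consume the 'z-' columns, delete them when done
    -- two bids -> compare the consumer ids; the higher one wins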
>> On Friday, January 31, 2014, DuyHai Doan <doanduy...@gmail.com> wrote:
>> > Thanks Nat for your ideas.
>> >
>> >> This could be as simple as adding year and month to the primary key (in the form 'yyyymm'). Alternatively, you could add this in the partition in the definition. Either way, it then becomes pretty easy to re-generate these based on the query parameters.
>> >
>> > The thing is that it's not that simple. My customer has a very BAD idea: using Cassandra as a queue (the perfect anti-pattern ever).
>> >
>> > Before trying to tell them to redesign their entire architecture and put in some queueing system like ActiveMQ or something similar, I would like to see how I can use wide rows to meet the requirements.
>> >
>> > The functional need is quite simple:
>> >
>> > 1) A process A loads users into Cassandra and sets the status on each user to 'TODO'. When using the bucketing technique, we can limit a row width to, let's say, 100 000 columns. So at the end of the current row, process A knows that it should move to the next bucket. The bucket is coded into a composite partition key; in our example it would be 'TODO:1', 'TODO:2', etc.
>> >
>> > 2) A process B reads the wide row for the 'TODO' status. It starts at bucket 1, so it reads the row with partition key 'TODO:1'. The users are processed and inserted into a new row, 'PROCESSED:1' for example, to keep track of the status. After retrieving 100 000 columns, it switches automatically to the next bucket. Simple. Fair enough.
>> >
>> > 3) Now, what sucks is that sometimes process B does not have enough data to perform the functional logic on a user it fetched from the wide row, so it has to put some users back into the 'TODO' status rather than transitioning them to 'PROCESSED'. That's exactly queue behavior.
>> >
>> > A simplistic idea would be to insert those m users again under 'TODO:n', with n higher than the current bucket number, so they can be processed later. But that breaks the whole counting scheme. Process A, which inserts data, will not know that there are already m users in row n, so it will happily add 100 000 columns, making the row grow to 100 000 + m. When process B reads this row back, it will stop at the first 100 000 columns and skip the trailing m elements.
>> >
>> > That's the main reason I dropped the idea of bucketing (which is quite smart in the normal case) and traded it for an ultra wide row.
>> >
>> > Anyway, I'll follow your advice and play around with the parameters of SizeTiered.
>> >
>> > Regards
>> >
>> > Duy Hai DOAN
>> >
>> > On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall <n...@thelastpickle.com> wrote:
>> >>
>> >>> The only drawback for an ultra wide row I can see is point 1). But if I use leveled compaction with a sufficiently large value for "sstable_size_in_mb" (let's say 200 MB), will my read performance be impacted as the row grows?
>> >>
>> >> For this use case, you would want to use SizeTieredCompaction and play around with the configuration a bit to keep a small number of large SSTables. Specifically: keep min|max_threshold really low, set bucket_low and bucket_high closer together, maybe even both to 1.0, and maybe use a larger min_sstable_size.
>> >>
>> >> YMMV though - per Rob's suggestion, take the time to run some tests tweaking these options.
>> >>
>> >>> Of course, splitting the wide row into several rows using the bucketing technique is one solution, but it forces us to keep track of the bucket number and it's not convenient. We have one process (JVM) that inserts data and another process (JVM) that reads data. Using bucketing, we need to synchronize the bucket number between the two processes.
>> >>
>> >> This could be as simple as adding year and month to the primary key (in the form 'yyyymm'). Alternatively, you could add this in the partition in the definition. Either way, it then becomes pretty easy to re-generate these based on the query parameters.
>> >>
>> >> --
>> >> -----------------
>> >> Nate McCall
>> >> Austin, TX
>> >> @zznate
>> >>
>> >> Co-Founder & Sr. Technical Consultant
>> >> Apache Cassandra Consulting
>> >> http://www.thelastpickle.com
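Putting the two suggestions together, here is a rough CQL 3 sketch of the bucketed table DuyHai describes, with SizeTiered settings along the lines Nate suggests. All table and column names are illustrative, and the option values are only a starting point to test against (bucket_low/bucket_high are simply pulled closer together than their 0.5/1.5 defaults; min_sstable_size is given in bytes):

    CREATE TABLE user_status (
        status text,       -- 'TODO' or 'PROCESSED'
        bucket int,        -- 1, 2, ... or a 'yyyymm' style value derived from the date
        added  timeuuid,
        login  text,
        PRIMARY KEY ((status, bucket), added)
    ) WITH compaction = {
        'class'            : 'SizeTieredCompactionStrategy',
        'min_threshold'    : '2',
        'max_threshold'    : '4',
        'bucket_low'       : '0.9',
        'bucket_high'      : '1.1',
        'min_sstable_size' : '209715200'   -- roughly 200 MB
    };

    -- process B drains one bucket at a time:
    SELECT added, login FROM user_status WHERE status = 'TODO' AND bucket = 1 LIMIT 100000;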