No problem. IMHO you should develop a sizable bruise banging your head against using standard CFs and the RandomPartitioner before trying something else.
Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 6:29 AM, Bryce Allen wrote:

> Thanks, that definitely has advantages over using a super column. We
> ran into Thrift timeouts when the super column got large, and with the
> super column range query there is no way (AFAIK) to batch the request
> at the subcolumn level.
>
> -Bryce
>
> On Thu, 22 Dec 2011 10:06:58 +1300
> aaron morton <aa...@thelastpickle.com> wrote:
>> AFAIK there are no plans to kill the BOP, but I would still try to
>> make your life easier by using the RP.
>>
>> My understanding of the problem is that at certain times you
>> snapshot the files in a dir, and the main query you want to handle
>> is "At what points between time t0 and time t1 did files x, y and z
>> exist?".
>>
>> You could consider:
>>
>> 1) Partitioning the time series data across rows, making the row key
>> the timestamp for the start of the partition. If you have rollup
>> partitions, consider making the row key <timestamp :
>> partition_size>, e.g. <123456789:"1d"> for a 1 day partition that
>> starts at 123456789.
>> 2) In each row, using column names of the form <timestamp :
>> file_name>, where the timestamp is the time of the snapshot.
>>
>> To query between two times (t0 and t1):
>>
>> 1) Determine which partitions the time span covers; this gives you a
>> list of rows.
>> 2) Execute a multi-get slice for all the rows using <t0:*> and
>> <t1:*>. (I'm using * here as a null; check with your client to see
>> how to use composite columns.)
>>
>> Hope that helps.
>> Aaron
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
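A minimal sketch of the layout Aaron describes, assuming the Thrift-era
pycassa client and a CF whose comparator is CompositeType(LongType,
UTF8Type); the keyspace, CF name, and row key format here are
illustrative assumptions, not from the thread, and the code is
untested:

    import pycassa

    PARTITION = 86400  # 1-day row partitions, timestamps in seconds

    pool = pycassa.ConnectionPool('FileHistory', ['localhost:9160'])
    # Column name = (snapshot_timestamp, file_name)
    snapshots = pycassa.ColumnFamily(pool, 'Snapshots')

    def partition_keys(t0, t1):
        # Step 1: find the partition rows the span [t0, t1] covers.
        start = t0 - (t0 % PARTITION)
        return ['%d:1d' % t for t in range(start, t1 + 1, PARTITION)]

    def files_between(t0, t1):
        # Step 2: multi-get slice over those rows. A 1-tuple slices on
        # the timestamp component only, leaving file_name open; check
        # your client's docs for exact composite bound semantics.
        return snapshots.multiget(partition_keys(t0, t1),
                                  column_start=(t0,),
                                  column_finish=(t1,))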
>> On 21/12/2011, at 9:03 AM, Bryce Allen wrote:
>>
>>> I wasn't aware of CompositeColumns, thanks for the tip. However, I
>>> think it still doesn't allow me to do the query I need: basically I
>>> need to do a timestamp range query, limited to certain file names
>>> at each timestamp. With BOP, a separate row for each timestamp
>>> prefixed by a random UUID, and file names as column names, I can do
>>> this query. With CompositeColumns I can only query one contiguous
>>> range, so I'd have to know the timestamps beforehand to limit the
>>> file names. I can resolve this using indexes, but on paper it looks
>>> like this would be significantly slower (it would take me 5 round
>>> trips instead of 3 to complete each query, and the query is made
>>> multiple times on every single client request).
>>>
>>> The two downsides I've seen listed for BOP are balancing issues and
>>> hotspots. I can understand why RP is recommended, from the
>>> balancing issues alone. However, these aren't problems for my
>>> application. Is there anything else I am missing? Does the
>>> Cassandra team plan on continuing to support BOP? I haven't
>>> completely ruled out RP, but I like having BOP as an option; it
>>> opens up interesting modeling alternatives that I think have real
>>> advantages for some (if uncommon) applications.
>>>
>>> Thanks,
>>> Bryce
>>>
>>> On Wed, 21 Dec 2011 08:08:16 +1300
>>> aaron morton <aa...@thelastpickle.com> wrote:
>>>> Bryce,
>>>> Have you considered using CompositeColumns and a standard CF? Row
>>>> key is the UUID, column name is (timestamp : dir_entry); you can
>>>> then slice all columns with a particular timestamp.
>>>>
>>>> Even if you have a random key, I would use the RP unless you have
>>>> an extreme use case.
>>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 21/12/2011, at 3:06 AM, Bryce Allen wrote:
>>>>
>>>>> I think it comes down to how much you benefit from row range
>>>>> scans, and how confident you are that going forward all data will
>>>>> continue to use random row keys.
>>>>>
>>>>> I'm considering using BOP as a way of working around the
>>>>> non-indexed super column limitation. In my current schema, row
>>>>> keys are random UUIDs, super column names are timestamps, and
>>>>> columns contain a snapshot in time of directory contents, which
>>>>> could be quite large. If instead I use row keys that are
>>>>> (uuid)-(timestamp) and use a standard column family, I can do a
>>>>> row range query and select only specific columns. I'm still
>>>>> evaluating whether I can do this with BOP - ideally the token
>>>>> would just use the first 128 bits of the key, and I haven't found
>>>>> any documentation on how it compares keys of different lengths.
>>>>>
>>>>> Another trick with BOP is to use MD5(rowkey)-rowkey for data that
>>>>> has non-uniform row keys. I think it's reasonable to use if most
>>>>> data is uniform and benefits from range scans, and only a few
>>>>> added things aren't uniform or don't benefit. This trick does
>>>>> make the keys larger, which increases storage cost and IO load,
>>>>> so it's probably a bad idea if a significant subset of the data
>>>>> requires it.
>>>>>
>>>>> Disclaimer - I wrote that wiki article to fill in a documentation
>>>>> gap, since there were no examples of BOP and I wasted a lot of
>>>>> time before I noticed the hex byte array vs decimal distinction
>>>>> for specifying the initial tokens (which, to be fair, is
>>>>> documented, just easy to miss on a skim). I'm also new to
>>>>> Cassandra; I'm just describing what makes sense to me "on paper".
>>>>> FWIW I confirmed that random UUID (type 4) row keys really do
>>>>> distribute evenly when using BOP.
>>>>>
>>>>> -Bryce
>>>>>
>>>>> On Mon, 19 Dec 2011 19:01:00 -0800
>>>>> Drew Kutcharian <d...@venarc.com> wrote:
>>>>>> Hey Guys,
>>>>>>
>>>>>> I just came across
>>>>>> http://wiki.apache.org/cassandra/ByteOrderedPartitioner and it
>>>>>> got me thinking. If the row keys are java.util.UUID, which are
>>>>>> generated randomly (and securely), then what type of partitioner
>>>>>> would be the best? Since the key values are already random,
>>>>>> would it make a difference to use RandomPartitioner, or could
>>>>>> one use ByteOrderedPartitioner or OrderPreservingPartitioner as
>>>>>> well and get the same result?
>>>>>>
>>>>>> -- Drew
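A quick sketch of the two BOP key tricks Bryce describes above: the
(uuid)-(timestamp) row key, and the MD5(rowkey)-rowkey prefix for
non-uniform keys. The function names and the 64-bit timestamp width
are illustrative assumptions, not from the thread:

    import hashlib
    import struct
    import uuid

    def snapshot_row_key(dir_uuid, ts):
        # BOP row key: 16 random UUID bytes, then a big-endian 64-bit
        # timestamp. The random prefix keeps the ring balanced; the
        # timestamp suffix keeps one directory's snapshots contiguous,
        # so a row range scan walks them in time order.
        return dir_uuid.bytes + struct.pack('>Q', ts)

    def md5_prefixed_key(raw_key):
        # For a minority of non-uniform keys under BOP: prefixing with
        # MD5(raw_key) spreads rows evenly, at the cost of 16 extra
        # bytes per key.
        return hashlib.md5(raw_key).digest() + raw_key

    # Example: snapshot_row_key(uuid.uuid4(), 1324468170)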