My experience is the same as Philip's. My point was simply that there is no way to get a range more restrictive than "all" if you use random partitioning.
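To make that point concrete, here is a minimal Python sketch (not Cassandra code). It assumes the random partitioner orders rows by an MD5-derived token; `token`, `by_token` and `token_range` are names made up for this illustration, and the keys are invented sample data.

```python
import hashlib


def token(key: str) -> int:
    """Hypothetical stand-in for a random partitioner's token: the MD5 digest as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


# Invented sample keys, for illustration only.
keys = ["CATEGORY.a", "CATEGORY.b", "CATEGORY.c", "OTHER.x", "OTHER.y", "ZEBRA"]

by_key = sorted(keys)               # scan order under an order-preserving partitioner
by_token = sorted(keys, key=token)  # scan order under a hash-based partitioner


def token_range(start: str, finish: str) -> list:
    """Keys whose token lies between token(start) and token(finish), wrapping around if needed."""
    lo, hi = token(start), token(finish)
    if lo <= hi:
        return [k for k in by_token if lo <= token(k) <= hi]
    return [k for k in by_token if token(k) >= lo or token(k) <= hi]


print("key order:  ", by_key)
print("token order:", by_token)
# The range is defined on tokens, not on key text, so it can include keys
# without the prefix and miss keys that have it.
print("range:      ", token_range("CATEGORY.", "CATEGORY/"))
```

Because the range is evaluated on tokens, the only bounds with a predictable meaning are "" and "", i.e. everything.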
2010/6/9 Philip Stanhope <pstanh...@wimba.com>
> If you are using random partitioner, and you want to do an EXPENSIVE row
> scan ... I found that I could iterate using start_key="" end_key="" for
> the first call ... and then for all other calls you'd provide
> start_key="LAST_KEY" from the previous iteration. If you set count to 1000,
> then you'll get 1000 keys the first time ... 999 new keys for each
> additional iteration (the first key repeats the previous LAST_KEY) ...
> until you receive a result that is < count and then you are done. In another
> century this is a crude pager or cursor approach with no server-side
> knowledge of the state.
>
> Caveat: This assumes there are no changes occurring in the column family
> while you are doing this type of scan through the keys of a CF. Depending on
> what you are trying to do ... this may not be acceptable.
>
> Another caveat: Once you have a sufficient number of random keys in a CF ...
> there are practical limits that you'll soon reach on the amount of data
> you can receive in a Thrift response and/or the cost of building the
> response (timeouts may occur or you may exhaust memory at the node servicing
> the request).
>
> The same concerns apply to columns accessed via get_slice ... the # of
> columns and the values of those columns run the risk of causing a
> timeout on the request or too much data to satisfy the request.
>
> Once you have a sufficiently large keyspace (10M, 100M?) this approach is
> not sufficient or scalable. If you want to perform analysis it may very well
> be better to get the data into another format that is more appropriate for
> analytics (Hadoop?). My production environment will have 4+ different
> distributed data stores: file system, relational (clustered on distributed
> file system), distributed key store (Cassandra) and analytics (tbd, could be
> multiple). They each serve different purposes for historical and
> performance/operational considerations.
>
> Why you would want to iterate over every single key in a random partitioned
> CF is another thing altogether. I had my own reasons (to validate a
> batch_mutate that was inserting 5K - 10K rows at a shot). NOTE: I was
> getting under 1000ms per 5K batch_mutate call ... or more than 5K inserts
> per second per thrift client, per node. When this was parallelized using
> multiple thrift clients and hitting multiple nodes in the cluster, I was
> seeing more than 25K inserts per second (write consistency, read consistency
> and replication factor are other considerations). Other caveats apply to
> batch_mutate: it is not atomic, but when it works it is much much faster
> than batching single insert calls.
>
> -phil
>
> On Jun 9, 2010, at 12:07 PM, David Boxenhorn wrote:
>
> I don't get what you're saying. If you want to loop over your entire range
> of keys, you can do it with a range query, and start and finish will both be
> "". Is there any scenario where you would want to do a range query where
> start and/or finish do not equal "", if you use random partitioning?
>
> 2010/6/9 Philip Stanhope <pstanh...@wimba.com>
>
>> I feel that there is a significant bit of confusion here.
>>
>> You CAN use start/finish when using get_range_slices with random
>> partitioner. But you can't make any assumptions about what key will be next
>> in the range, which is the whole point of "random". If you do know a
>> specific key that you care about, you can use that as a start, but again,
>> you don't know what will come next.
>>
>> If you have a CF with 1M keys ... you can effectively do a full row scan
>> ... it is expensive and you'd have to ask yourself why you'd be wanting to
>> do this in the first place.
>>
>> Ordering of columns for a particular key is completely dependent on the
>> CompareWith choice you made when you defined the column family. For example,
>> you can make assumptions about the sequencing of columns returned from
>> get_slice (NOT get_range_slices).
>>
>> -phil
>>
>> On Jun 9, 2010, at 7:29 AM, David Boxenhorn wrote:
>>
>> To use start and finish parameters at all, you need to use OPP. Start and
>> finish parameters don't work if you don't use OPP, i.e. the result set won't
>> be: start <= resultSet < finish
>>
>> 2010/6/9 Ben Browning <ben...@gmail.com>
>>
>>> OPP stands for Order-Preserving Partitioner. For more information on
>>> partitioners, look here:
>>>
>>> http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner
>>>
>>> To do key range slices that use both start and finish parameters and
>>> retrieve keys in order, you need to use an ordered partitioner -
>>> either the built-in OPP or your own custom one.
>>>
>>> Ben
>>>
>>> On Tue, Jun 8, 2010 at 10:26 PM, sina <ywf2...@sina.com> wrote:
>>> > What does OPP mean? And how can I make the "start" and "finish"
>>> > useful and make sense?
>>> >
>>> >
>>> > 2010-06-09
>>> > ________________________________
>>> > 9527
>>> > ________________________________
>>> > From: Ben Browning
>>> > Sent: 2010-06-02 21:08:57
>>> > To: user
>>> > Cc:
>>> > Subject: Re: Range search on keys not working?
>>> > They exist because when using OPP they are useful and make sense.
>>> > On Wed, Jun 2, 2010 at 8:59 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>> >> So why do the "start" and "finish" range parameters exist?
>>> >>
>>> >> On Wed, Jun 2, 2010 at 3:53 PM, Ben Browning <ben...@gmail.com> wrote:
>>> >>>
>>> >>> Martin,
>>> >>>
>>> >>> On Wed, Jun 2, 2010 at 8:34 AM, Dr. Martin Grabmüller
>>> >>> <martin.grabmuel...@eleven.de> wrote:
>>> >>> > I think you can specify an end key, but it should be a key which
>>> >>> > does exist in your column family.
>>> >>>
>>> >>>
>>> >>> Logically, it doesn't make sense to ever specify an end key with
>>> >>> random partitioner. If you specified a start key of "aaa" and an end
>>> >>> key of "aac" you might get back as results "aaa", "zfc", "hik", etc.
>>> >>> And, even if you have a key of "aab" it might not show up. Key ranges
>>> >>> only make sense with order-preserving partitioner. The only time to
>>> >>> ever use a key range with random partitioner is when you want to
>>> >>> iterate over all keys in the CF.
>>> >>>
>>> >>> Ben
>>> >>>
>>> >>>
>>> >>> > But maybe I'm off track here and someone else here knows more
>>> >>> > about this key range stuff.
>>> >>> >
>>> >>> > Martin
>>> >>> >
>>> >>> > ________________________________
>>> >>> > From: David Boxenhorn [mailto:da...@lookin2.com]
>>> >>> > Sent: Wednesday, June 02, 2010 2:30 PM
>>> >>> > To: user@cassandra.apache.org
>>> >>> > Subject: Re: Range search on keys not working?
>>> >>> >
>>> >>> > In other words, I should check the values as I iterate, and stop
>>> >>> > iterating when I get out of range?
>>> >>> >
>>> >>> > I'll try that!
>>> >>> >
>>> >>> > On Wed, Jun 2, 2010 at 3:15 PM, Dr. Martin Grabmüller
>>> >>> > <martin.grabmuel...@eleven.de> wrote:
>>> >>> >>
>>> >>> >> When not using OPP, you should not use something like 'CATEGORY/'
>>> >>> >> as the end key. Use the empty string as the end key and limit the
>>> >>> >> number of returned keys, as you did with the 'max' value.
>>> >>> >>
>>> >>> >> If I understand correctly, the end key is used to generate an end
>>> >>> >> token by hashing it, and the ordering relationship between
>>> >>> >> 'CATEGORY' and 'CATEGORY/' does not carry over to
>>> >>> >> hash('CATEGORY') and hash('CATEGORY/').
>>> >>> >>
>>> >>> >> At least, this was the explanation I gave myself when I had the
>>> >>> >> same problem.
>>> >>> >>
>>> >>> >> The solution is to iterate through the keys by always using the
>>> >>> >> last key returned as the start key for the next call to
>>> >>> >> get_range_slices, and then to drop the first element from
>>> >>> >> the result.
>>> >>> >>
>>> >>> >> HTH,
>>> >>> >> Martin
>>> >>> >>
>>> >>> >> ________________________________
>>> >>> >> From: David Boxenhorn [mailto:da...@lookin2.com]
>>> >>> >> Sent: Wednesday, June 02, 2010 2:01 PM
>>> >>> >> To: user@cassandra.apache.org
>>> >>> >> Subject: Re: Range search on keys not working?
>>> >>> >>
>>> >>> >> The previous thread where we discussed this is called, "key is
>>> >>> >> sorted?"
>>> >>> >>
>>> >>> >>
>>> >>> >> On Wed, Jun 2, 2010 at 2:56 PM, David Boxenhorn <da...@lookin2.com>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> I'm not using OPP. But I was assured on earlier threads (I asked
>>> >>> >>> several times to be sure) that it would work as stated below: the
>>> >>> >>> results would not be ordered, but they would be correct.
>>> >>> >>>
>>> >>> >>> On Wed, Jun 2, 2010 at 2:51 PM, Torsten Curdt <tcu...@vafer.org>
>>> >>> >>> wrote:
>>> >>> >>>>
>>> >>> >>>> Sounds like you are not using an order preserving partitioner?
>>> >>> >>>>
>>> >>> >>>> On Wed, Jun 2, 2010 at 13:48, David Boxenhorn <da...@lookin2.com>
>>> >>> >>>> wrote:
>>> >>> >>>> > Range search on keys is not working for me. I was assured in
>>> >>> >>>> > earlier threads that range search would work, but the results
>>> >>> >>>> > would not be ordered.
>>> >>> >>>> >
>>> >>> >>>> > I'm trying to get all the rows that start with "CATEGORY."
>>> >>> >>>> >
>>> >>> >>>> > I'm doing:
>>> >>> >>>> >
>>> >>> >>>> > String start = "CATEGORY.";
>>> >>> >>>> > .
>>> >>> >>>> > .
>>> >>> >>>> > .
>>> >>> >>>> > keyspace.getSuperRangeSlice(columnParent, slicePredicate, start, "CATEGORY/", max)
>>> >>> >>>> > .
>>> >>> >>>> > .
>>> >>> >>>> > .
>>> >>> >>>> >
>>> >>> >>>> > in a loop, setting start to the last key each time - but I'm
>>> >>> >>>> > getting rows that don't start with "CATEGORY."!!
>>> >>> >>>> >
>>> >>> >>>> > How do I get all rows that start with "CATEGORY."?
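
For anyone finding this thread later, the approach discussed above (start from "", reuse the last key as the next start, drop the repeated first element, stop on a short page, then filter by prefix on the client) can be sketched roughly as follows. This is a minimal Python sketch against an in-memory stand-in, not a Cassandra client: `FakeColumnFamily`, `get_range` and `scan_all_keys` are hypothetical names, the MD5-based `token` only imitates hash ordering, and the keys are invented. Like the real scan, it assumes nothing changes in the column family while iterating.

```python
import hashlib


def token(key):
    """Hypothetical stand-in for a random partitioner's token (MD5 digest as an integer)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class FakeColumnFamily:
    """In-memory stand-in for a column family whose rows are ordered by token."""

    def __init__(self, keys):
        self._keys = sorted(set(keys), key=token)

    def get_range(self, start_key, count):
        """Return up to `count` keys in token order, starting at start_key (inclusive).

        An empty start_key means "from the beginning"; the end key is always open.
        """
        begin = 0 if start_key == "" else self._keys.index(start_key)
        return self._keys[begin:begin + count]


def scan_all_keys(cf, count=1000):
    """Yield every key exactly once by paging with the last key as the next start."""
    if count < 2:
        raise ValueError("count must be at least 2, or the scan cannot advance")
    start, first_page = "", True
    while True:
        page = cf.get_range(start, count)
        # After the first page, element 0 repeats the previous page's last key.
        yield from (page if first_page else page[1:])
        if len(page) < count:
            return  # a short page means the end was reached
        start, first_page = page[-1], False


# Invented sample data, for illustration only.
keys = [f"CATEGORY.{i}" for i in range(5)] + [f"OTHER.{i}" for i in range(6)]
cf = FakeColumnFamily(keys)

# The prefix test happens on the client; the range itself cannot express it.
matches = sorted(k for k in scan_all_keys(cf, count=4) if k.startswith("CATEGORY."))
print(matches)
```

The cost is a full scan of the column family, with all the caveats Philip lists above, so this only makes sense when the number of keys is small enough to page through.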