Re: Inconsistent result in super range slice query (reversed order)

Shotaro Kamio Thu, 17 Feb 2011 21:10:01 -0800

Hi Aaron,

By "range slice" I mean get_range_slices() in the Thrift API,
createSuperSliceQuery in Hector, and get_range() in pycassa. The pycassa
example code is attached below.


The problem is a little complicated to explain, so I'll describe it
with examples.
Here are the 8 super column names that exist on the key in question,
listed in forward order.

#0: "20031210020333/190209-20031210-4476807-s/"
#1: "20031210020333/190209-20031210-4476807-s/0"
#2: "20031210021940/190209-20031210-4476883-s/"
#3: "20031210021940/190209-20031210-4476883-s/0"
#4: "20031210022059/190209-20031210-4476885-s/"
#5: "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#6: "20031210022154/190209-20031210-4476888-s/"
#7: "20031210022154/190209-20031210-4476888-s/0"

There is no problem if I use super column names that exist on the key.

* Range from #0 to #3 in forward order -> OK
* Range from #0 to #5 in forward order -> OK
* Range from #0 to #7 in forward order -> OK

* Range from #7 to #0 in reverse order -> OK
* Range from #5 to #0 in reverse order -> OK
* Range from #3 to #0 in reverse order -> OK


However, because I want to scan orders in a certain range, I use
column names with the character "z" appended (it sorts higher than
anything in order_id). Those column names are listed below as #1z, #3z,
#5z and #7z. Note that these super column names don't actually exist on
the key. (#4+ is a column name chosen to fall between #4 and #5.)

#0 : "20031210020333/190209-20031210-4476807-s/"
#1 : "20031210020333/190209-20031210-4476807-s/0"
#1z: "20031210020333/190209-20031210-4476807-s/z" (doesn't exist)
#2 : "20031210021940/190209-20031210-4476883-s/"
#3 : "20031210021940/190209-20031210-4476883-s/0"
#3z: "20031210021940/190209-20031210-4476883-s/z" (doesn't exist)
#4 : "20031210022059/190209-20031210-4476885-s/"
#4+: "20031210022059/190209-20031210-4476885-s/+" (doesn't exist)
#5 : "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#5z: "20031210022059/190209-20031210-4476885-s/z" (doesn't exist)
#6 : "20031210022154/190209-20031210-4476888-s/"
#7 : "20031210022154/190209-20031210-4476888-s/0"
#7z: "20031210022154/190209-20031210-4476888-s/z" (doesn't exist)
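
The placement of these synthetic bounds can be checked locally. The sketch
below assumes the super column comparator orders names byte-wise, which for
plain ASCII names matches Python's built-in string ordering; the comparator
is not stated in this mail, so that is an assumption. The helper name
position() is made up for illustration.

```python
import bisect

# Names that exist on the key (#0..#7), copied from the list above.
existing = [
    "20031210020333/190209-20031210-4476807-s/",
    "20031210020333/190209-20031210-4476807-s/0",
    "20031210021940/190209-20031210-4476883-s/",
    "20031210021940/190209-20031210-4476883-s/0",
    "20031210022059/190209-20031210-4476885-s/",
    "20031210022059/190209-20031210-4476885-s/0",
    "20031210022154/190209-20031210-4476888-s/",
    "20031210022154/190209-20031210-4476888-s/0",
]

# Synthetic bounds that do not exist on the key.
synthetic = {
    "#1z": existing[0] + "z",
    "#3z": existing[2] + "z",
    "#4+": existing[4] + "+",
    "#5z": existing[4] + "z",
    "#7z": existing[6] + "z",
}


def position(name):
    """Index at which `name` would be inserted into the sorted existing names."""
    return bisect.bisect_left(existing, name)


# The existing names are already in ascending (forward) order.
assert existing == sorted(existing)

for label, name in sorted(synthetic.items()):
    print("%s would sit just before index %d" % (label, position(name)))
```

Under that assumption each "z" bound falls directly after the "/0" name with
the same prefix, and #4+ falls between #4 and #5 because "+" sorts below "0".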

Then I ran range slices using these names as bounds.

* Range from #0 to #3z in forward order -> OK
* Range from #0 to #4+ in forward order -> OK
* Range from #0 to #5z in forward order -> OK
* Range from #0 to #7z in forward order -> OK

* Range from #7z to #0 in reverse order -> OK
* Range from #5z to #0 in reverse order -> FAIL (no result)
* Range from #4+ to #0 in reverse order -> OK
* Range from #3z to #0 in reverse order -> OK

Of these, only the reversed range from #5z fails. No error or warning is
shown in the Cassandra log.
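
For comparison, here is a minimal local model of what each query would be
expected to return. It assumes byte-wise name ordering, inclusive bounds on
both ends, and that in reversed order `start` is the upper bound and `finish`
the lower one. expected_slice() is a hypothetical helper written for this
sketch, not part of pycassa or Cassandra.

```python
import bisect

EXISTING = [
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
]


def expected_slice(names, start, finish, reversed_order=False, count=10000):
    """Model of a column slice over the sorted list `names`.

    Bounds are inclusive and need not exist in `names`; an empty string
    means "unbounded" on that side.
    """
    low, high = (finish, start) if reversed_order else (start, finish)
    lo = bisect.bisect_left(names, low) if low else 0
    hi = bisect.bisect_right(names, high) if high else len(names)
    selected = names[lo:hi]
    if reversed_order:
        selected = selected[::-1]
    return selected[:count]


# The failing case: from #5z down to #0 in reversed order.
result = expected_slice(EXISTING, EXISTING[4] + "z", EXISTING[0],
                        reversed_order=True)
for name in result:
    print("col='%s', len = %d" % (name, len(name)))
```

Under this model the reversed range from #5z to #0 yields the six names #5
down to #0, the same set the forward range from #0 to #5z returns.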

I also tried dumping the data to JSON with sstable2json and restoring it
with json2sstable, but the same problem occurs.


The code I used for the test is something like this.
----------------------
import sys

import pycassa

# KEYSPACE, CASSANDRA_HOST, COLUMN_FAMILY and A_KEY are placeholders for
# the real keyspace, host, column family and row key.
client = pycassa.connect(KEYSPACE, [ CASSANDRA_HOST ])
cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)

columns = [
"20031210020333/190209-20031210-4476807-s/"  , #0
"20031210020333/190209-20031210-4476807-s/0" , #1
"20031210021940/190209-20031210-4476883-s/"  , #2
"20031210021940/190209-20031210-4476883-s/0" , #3
"20031210022059/190209-20031210-4476885-s/"  , #4
"20031210022059/190209-20031210-4476885-s/0" , #5
# <-- Problem around here.
"20031210022154/190209-20031210-4476888-s/"  , #6
"20031210022154/190209-20031210-4476888-s/0"   #7
]

reversed = False
if len(sys.argv) > 1:
    # Use reversed order if the "-r" option is given; "-f" or anything
    # else means forward order. With no option, all column names are listed.
    reversed = (sys.argv[1] == '-r')

    start_date = columns[0]
    # Add "z" so that the bound does not exist on the key.
    # (columns[5] + "z" is the failing #5z case.)
    end_date  = columns[7] + "z"

    if reversed:
        # A reversed slice starts from the higher bound.
        start_date, end_date = end_date, start_date
else:
    start_date = end_date = ''

print "start_date =", start_date, "end_date =", end_date, "reversed =", reversed

for it in cf.get_range(start = A_KEY, finish = A_KEY,
                       column_reversed=reversed, column_count=10000,
                       column_start=start_date, column_finish=end_date):
    # it[0] is the row key, it[1] maps super column names to their columns.
    for d in it[1].iteritems():
        print "col='%s', len = %d" % (d[0], len(d[0]))

-------------------------
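
The manual test matrix above could be automated along the following lines.
This is only a sketch: `fetch` is a hypothetical callable that wraps whatever
client call is in use (for example the cf.get_range() loop above) and returns
the list of super column names, and fake_fetch() merely imitates the behaviour
reported in this mail so that the sketch runs without a cluster. The model
assumes byte-wise ordering and inclusive bounds.

```python
EXISTING = [
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
]

# Upper bounds that do not exist on the key.
BOUNDS = [
    ("#3z", EXISTING[2] + "z"),
    ("#4+", EXISTING[4] + "+"),
    ("#5z", EXISTING[4] + "z"),
    ("#7z", EXISTING[6] + "z"),
]


def model(names, low, high, reversed_order):
    """Expected names for the inclusive range low..high."""
    hit = [n for n in sorted(names) if low <= n <= high]
    return hit[::-1] if reversed_order else hit


def find_mismatches(names, bounds, fetch):
    """Query every bound forward and reversed; return the cases that differ."""
    low = min(names)
    failures = []
    for label, high in bounds:
        for reversed_order in (False, True):
            # A reversed slice starts at the upper bound.
            start, finish = (high, low) if reversed_order else (low, high)
            got = list(fetch(start, finish, reversed_order))
            if got != model(names, low, high, reversed_order):
                failures.append((label, reversed_order))
    return failures


def fake_fetch(start, finish, reversed_order):
    # Stand-in for a real client call; imitates the reported empty result.
    if reversed_order and start == EXISTING[4] + "z":
        return []
    low, high = (finish, start) if reversed_order else (start, finish)
    return model(EXISTING, low, high, reversed_order)


failures = find_mismatches(EXISTING, BOUNDS, fake_fetch)
for label, reversed_order in failures:
    print("%s reversed=%s -> FAIL" % (label, reversed_order))
```

Replacing fake_fetch with a function that calls the real cluster would list
every bound for which the server disagrees with the model.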


Regards,
Shotaro




On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <aa...@thelastpickle.com> wrote:
> First some terminology: when you say range slice, do you mean getting multiple
> rows? Or do you mean get_slice, where you return multiple super columns from
> one row?
>
> Your examples look like you want to get multiple super columns from one row,
> in which case the choice of partitioner is not important. The comparator and
> sub comparator as specified in the CF definition control the ordering of
> columns. If possible I would suggest using the random partitioner.
>
> Could you provide examples of how you are doing the queries using pycassa? We
> may be able to help.
>
> My initial guess is that the ranges you specify for the query are not correct
> when using ASCII ordering for column names, e.g.
>
> 20031210 < 20031210022059/190209-20031210-4476885-s/z is true
>
> 20031210022059/190209-20031210-4476885-s/z < 20031210 is not true
>
> Try appending the highest value ASCII character to the end of 20031210
>
> Cheers
> Aaron
>
> On 18/02/2011, at 4:35 AM, Shotaro Kamio <kamios...@gmail.com> wrote:
>
>> Hi,
>>
>> We are having trouble with strange behavior in Cassandra 0.7.2 (it also
>> happened in 0.7.0). Could someone help us?
>>
>> The problem happens on a column family of super column type named "Order".
>> The data structure is something like:
>>  Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] = value
>>
>> For example,
>> Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
>> is a super column.
>> Because we want to scan them in latest-first order, a range slice
>> query with reversed order is used. (The partitioner is
>> ByteOrderedPartitioner.)
>>
>> For some super columns in my Cassandra instance, the reversed query
>> returns no result when it should return results.
>> For instance,
>>
>> * Range slice in normal (lexical) order ( Order[ "100" ] [ from
>> "20031210" to "20031210022059/190209-20031210-4476885-s/z" ] )
>> returns results correctly.
>>
>> col='20031210014347/190209-20031210-4476668-s/'
>> col='20031210014347/190209-20031210-4476668-s/0'
>> col='20031210022059/190209-20031210-4476885-s/'
>> col='20031210022059/190209-20031210-4476885-s/0'
>>
>> * Range slice in reversed (latest-first) order ( Order[ "100" ] [ from
>> "20031210022059/190209-20031210-4476885-s/z" to  "20031210" ] )
>> returns NO result!
>>
>> Note that the super column name
>> "20031210022059/190209-20031210-4476885-s/z" doesn't exist. The query
>> should still work, and it succeeds for other super columns.
>>
>> * Range slice in reversed (latest-first) order starting from an existing
>> column name ( Order[ "100" ] [ from
>> "20031210022059/190209-20031210-4476885-s/0" to "20031210" ] )
>> returns the expected results.
>>
>> Both pycassa and Hector show the same behavior on the same column
>> name. I guess that Cassandra has some logical error.
>>
>>
>> I'd appreciate any help.
>>
>>
>> Best regards,
>> Shotaro
>



-- 
Shotaro Kamio
