Re: Inconsistent result in super range slice query (reversed order)

Shotaro Kamio Mon, 21 Feb 2011 18:53:22 -0800

Hi Tyler,

Your script doesn't trigger the problem, but the problem does occur
under certain conditions.
My colleague analyzed it and found out how to reproduce it.
Please look at the JIRA ticket: https://issues.apache.org/jira/browse/CASSANDRA-2212
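
For reference, here is a minimal sketch of what a slice is expected to
return, written in plain Python with no Cassandra involved. It assumes
column names compare in byte (ASCII) order and that both bounds are
inclusive and need not exist. The names COLUMNS, expected_slice and
bound_5z are hypothetical helpers for illustration only, and empty-string
(unbounded) limits are not handled.

```python
from bisect import bisect_left, bisect_right

# The eight super column names from the quoted thread, in forward order.
COLUMNS = [
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
]


def expected_slice(names, start, finish, reversed_order=False):
    """Return the names an inclusive slice should contain (hypothetical model)."""
    ordered = sorted(names)
    # A reversed slice starts at the upper bound and finishes at the lower one.
    low, high = (finish, start) if reversed_order else (start, finish)
    selected = ordered[bisect_left(ordered, low):bisect_right(ordered, high)]
    return selected[::-1] if reversed_order else selected


# "#5z": the name of #5 with its trailing "0" replaced by "z" (does not exist).
bound_5z = COLUMNS[5][:-1] + "z"

result = expected_slice(COLUMNS, bound_5z, COLUMNS[0], reversed_order=True)
for name in result:
    print(name)
```

Under these assumptions the reversed range from #5z to #0 should contain
columns #5 down to #0, which is what the failing query does not return.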


Best regards,
Shotaro


On Fri, Feb 18, 2011 at 3:59 PM, Tyler Hobbs <ty...@datastax.com> wrote:
> I'm unable to reproduce this in pycassa starting with a clean database.  Are
> you doing anything else to these rows besides inserting them?
>
> Here's the complete script I'm using below.  Could you confirm that this
> causes problems for you?
>
> - Tyler
>
> =========
>
> import sys
> import pycassa
>
> pool = pycassa.ConnectionPool('Keyspace1')
> cf = pycassa.ColumnFamily(pool, 'Super1')
>
> KEY = 'key'
>
> columns = [
>     "20031210020333/190209-20031210-4476807-s/"  , #0
>     "20031210020333/190209-20031210-4476807-s/0" , #1
>     "20031210021940/190209-20031210-4476883-s/"  , #2
>     "20031210021940/190209-20031210-4476883-s/0" , #3
>     "20031210022059/190209-20031210-4476885-s/"  , #4
>     "20031210022059/190209-20031210-4476885-s/0" , #5
>     # <--Problem_around_here.
>     "20031210022154/190209-20031210-4476888-s/"  , #6
>     "20031210022154/190209-20031210-4476888-s/0"   #7
> ]
>
> for supercolumn in columns:
>     cf.insert(KEY, {supercolumn: {'subcol': 'subval', 'subcol2': 'subval'}})
>
> def get_cols(start_date, end_date, reversed):
>     for key, cols in cf.get_range(start = KEY,
>                                   finish = KEY,
>                                   column_reversed=reversed,
>                                   column_count=10000,
>                                   column_start=start_date,
>                                   column_finish=end_date):
>         for supercol, subcols in cols.iteritems():
>             print "col='%s' \tlen = %d" % (supercol, len(subcols))
>
> start = 0
> for end in [0,3,5,7]:
>     print "\nstart %d, end %d + 'z'" % (start, end)
>     get_cols(columns[start], columns[end] + 'z', False)
>
> end = 0
> for start in [0, 3, 5, 7]:
>     print "\nstart %d + 'z', end %d (reversed)" % (start, end)
>     # a reversed slice starts at the upper bound and sets column_reversed=True
>     get_cols(columns[start] + 'z', columns[end], True)
>
>
> On Thu, Feb 17, 2011 at 11:09 PM, Shotaro Kamio <kamios...@gmail.com> wrote:
>>
>> Hi Aaron,
>>
>> By range slice I mean get_range_slices() in the Thrift API,
>> createSuperSliceQuery in hector, and get_range() in pycassa. The example
>> pycassa code is included below.
>>
>> The problem is a little complicated to explain, so I'll try to
>> describe it with examples.
>> Here are 8 super column names that exist under one specific key,
>> listed in forward order.
>>
>> #0: "20031210020333/190209-20031210-4476807-s/"
>> #1: "20031210020333/190209-20031210-4476807-s/0"
>> #2: "20031210021940/190209-20031210-4476883-s/"
>> #3: "20031210021940/190209-20031210-4476883-s/0"
>> #4: "20031210022059/190209-20031210-4476885-s/"
>> #5: "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
>> #6: "20031210022154/190209-20031210-4476888-s/"
>> #7: "20031210022154/190209-20031210-4476888-s/0"
>>
>> There is no problem if I use super column names that exist on the key.
>>
>> * Range from #0 to #3 in forward order -> OK
>> * Range from #0 to #5 in forward order -> OK
>> * Range from #0 to #7 in forward order -> OK
>>
>> * Range from #7 to #0 in reverse order -> OK
>> * Range from #5 to #0 in reverse order -> OK
>> * Range from #3 to #0 in reverse order -> OK
>>
>>
>> However, because I want to scan orders in a certain range, I use
>> column names with the character "z" as the suffix (higher than anything
>> in order_id). Those column names are listed below as #1z, #3z, #5z and
>> #7z. Note that these super column names don't actually exist on the key.
>> (#4+ is a column name that sorts between #4 and #5.)
>>
>> #0 : "20031210020333/190209-20031210-4476807-s/"
>> #1 : "20031210020333/190209-20031210-4476807-s/0"
>> #1z: "20031210020333/190209-20031210-4476807-s/z" (doesn't exist)
>> #2 : "20031210021940/190209-20031210-4476883-s/"
>> #3 : "20031210021940/190209-20031210-4476883-s/0"
>> #3z: "20031210021940/190209-20031210-4476883-s/z" (doesn't exist)
>> #4 : "20031210022059/190209-20031210-4476885-s/"
>> #4+: "20031210022059/190209-20031210-4476885-s/+" (doesn't exist)
>> #5 : "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
>> #5z: "20031210022059/190209-20031210-4476885-s/z" (doesn't exist)
>> #6 : "20031210022154/190209-20031210-4476888-s/"
>> #7 : "20031210022154/190209-20031210-4476888-s/0"
>> #7z: "20031210022154/190209-20031210-4476888-s/z" (doesn't exist)
>>
>> Then, try to range slice them.
>>
>> * Range from #0 to #3z in forward order -> OK
>> * Range from #0 to #4+ in forward order -> OK
>> * Range from #0 to #5z in forward order -> OK
>> * Range from #0 to #7z in forward order -> OK
>>
>> * Range from #7z to #0 in reverse order -> OK
>> * Range from #5z to #0 in reverse order -> FAIL (no result)
>> * Range from #4+ to #0 in reverse order -> OK
>> * Range from #3z to #0 in reverse order -> OK
>>
>> The problem happens only in the reversed range starting from #5z. No
>> error or warning is shown in the Cassandra log.
>>
>> Also, I tried dumping the data to JSON via sstable2json and restoring it
>> with json2sstable, but the same problem occurs.
>>
>>
>> The code I used for the test is something like this.
>> ----------------------
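>> import sys
>> import pycassa
>>
>> # KEYSPACE, CASSANDRA_HOST, COLUMN_FAMILY and A_KEY are placeholders.
>> client = pycassa.connect(KEYSPACE, [ CASSANDRA_HOST ])
>> cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)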
>>
>> columns = [
>> "20031210020333/190209-20031210-4476807-s/"  , #0
>> "20031210020333/190209-20031210-4476807-s/0" , #1
>> "20031210021940/190209-20031210-4476883-s/"  , #2
>> "20031210021940/190209-20031210-4476883-s/0" , #3
>> "20031210022059/190209-20031210-4476885-s/"  , #4
>> "20031210022059/190209-20031210-4476885-s/0" , #5
>> # <--Problem_around_here.
>> "20031210022154/190209-20031210-4476888-s/"  , #6
>> "20031210022154/190209-20031210-4476888-s/0"   #7
>> ]
>>
>> reversed = False
>> if len(sys.argv) > 1:
>>    # Use reversed order if the "-r" option is given; "-f" or anything else
>>    # means forward order. With no option, all column names are listed.
>>    reversed = (sys.argv[1] == '-r')
>>
>>    start_date = columns[0]
>>    end_date  = columns[7] + "z" # add "z" to make problem.
>>
>>    if reversed:
>>        # a reversed slice starts from the upper bound
>>        start_date, end_date = end_date, start_date
>> else:
>>    start_date = end_date = ''
>>
>> print "start_date =", start_date, "end_date =", end_date, "reversed =", reversed
>>
>> for it in cf.get_range(start = A_KEY, finish = A_KEY,
>>                        column_reversed=reversed, column_count=10000,
>>                        column_start=start_date, column_finish=end_date):
>>    for d in it[1].iteritems():
>>        print "col='%s', len = %d" % (d[0], len(d[0]))
>>
>> -------------------------
>>
>>
>> Regards,
>> Shotaro
>>
>>
>>
>>
>> On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <aa...@thelastpickle.com>
>> wrote:
>> > First some terminology: when you say range slice, do you mean getting
>> > multiple rows? Or do you mean get_slice where you return multiple super
>> > columns from one row?
>> >
>> > Your examples look like you want to get multiple super columns from one
>> > row, in which case the choice of partitioner is not important. The
>> > comparator and sub comparator specified in the CF definition control the
>> > ordering of columns. If possible I would suggest using the random
>> > partitioner.
>> >
>> > Could you provide examples of how you are doing the queries using
>> > pycassa? We may be able to help.
>> >
>> > My initial guess is that the ranges you specify for the query are not
>> > correct when using ASCII ordering for column names, e.g.
>> >
>> > 20031210 < 20031210022059/190209-20031210-4476885-s/z is true
>> >
>> > 20031210022059/190209-20031210-4476885-s/z < 20031210 is not true
>> >
>> > Try appending the highest value ASCII character to the end of
>> > 20031210
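
The two comparisons above can be checked with ordinary string comparison,
which for ASCII names matches byte ordering. This is only an illustration:
the variable names are hypothetical, and "\x7f" is an assumed choice for
the highest ASCII character.

```python
prefix = "20031210"
upper = "20031210022059/190209-20031210-4476885-s/z"

# A prefix sorts before any longer name that starts with it.
print(prefix < upper)
print(upper < prefix)

# Appending the highest ASCII character (assumed here to be 0x7f) gives a
# bound that sorts after the longer name instead of before it.
padded = prefix + "\x7f"
print(padded > upper)
```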
>> >
>> > Cheers
>> > Aaron
>> >
>> > On 18/02/2011, at 4:35 AM, Shotaro Kamio <kamios...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> We are having trouble with strange behavior in Cassandra 0.7.2 (it also
>> >> happened in 0.7.0). Could someone help us?
>> >>
>> >> The problem happens on a column family of super column type named
>> >> "Order".
>> >> Data structure is something like:
>> >>  Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] =
>> >> value
>> >>
>> >> For example,
>> >> Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
>> >> is a super column.
>> >> Because we want to scan them in latest-first order, a range slice
>> >> query with reversed order is used. (The partitioner is
>> >> ByteOrderedPartitioner.)
>> >>
>> >> For some super columns in my Cassandra instance, the reversed query
>> >> returns no result when it should return results.
>> >> For instance,
>> >>
>> >> * Range slice in normal (lexical)-order ( Order[ "100" ] [ from
>> >> "20031210" to "20031210022059/190209-20031210-4476885-s/z" ] ) will
>> >> return results correctly.
>> >>
>> >> col='20031210014347/190209-20031210-4476668-s/'
>> >> col='20031210014347/190209-20031210-4476668-s/0'
>> >> col='20031210022059/190209-20031210-4476885-s/'
>> >> col='20031210022059/190209-20031210-4476885-s/0'
>> >>
>> >> * Range slice in reversed (latest-first)-order ( Order[ "100" ] [ from
>> >> "20031210022059/190209-20031210-4476885-s/z" to  "20031210" ] ) will
>> >> return NO result!
>> >>
>> >> Note that the super column name
>> >> "20031210022059/190209-20031210-4476885-s/z" doesn't exist, but the query
>> >> should still work, and it does succeed for other super columns.
>> >>
>> >> * Range slice in reversed (latest-first)-order starting from an existing
>> >> column name ( Order[ "100" ] [ from
>> >> "20031210022059/190209-20031210-4476885-s/0" to "20031210" ] ) will
>> >> return the expected results.
>> >>
>> >> Both pycassa and hector show the same behavior on the same column
>> >> name, so I suspect a logic error in Cassandra.
>> >>
>> >>
>> >> I would appreciate any help.
>> >>
>> >>
>> >> Best regards,
>> >> Shotaro
>> >
>>
>>
>>
>> --
>> Shotaro Kamio
>
>
>
> --
> Tyler Hobbs
> Software Engineer, DataStax
> Maintainer of the pycassa Cassandra Python client library
>
>



-- 
Shotaro Kamio
