On 8/7/10 7:04 PM, Benjamin Black wrote:
certainly it matters: your previous version is not bounded in time, so
it will grow without bound.  ergo, it is not a good fit for cassandra.

On Sat, Aug 7, 2010 at 2:51 PM, Mark <static.void....@gmail.com> wrote:
On 8/7/10 2:33 PM, Benjamin Black wrote:
Right, this is an index row per time interval (your previous email was
not).

On Sat, Aug 7, 2010 at 11:43 AM, Mark <static.void....@gmail.com> wrote:

On 8/7/10 11:30 AM, Mark wrote:

On 8/7/10 4:22 AM, Thomas Heller wrote:

Ok, I think the part I was missing was the concatenation of the key and
partition to do the lookups. Is this the preferred way of accomplishing
needs such as this? Are there alternative ways?

Depending on your needs, you can concat the row key or use super columns.
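
For instance, the row-key concatenation option could look roughly like this
with the Ruby cassandra gem (the key format, keyspace, and server here are
made-up placeholders, not something prescribed in this thread):

  require 'cassandra'

  client = Cassandra.new('Logs', '127.0.0.1:9160')

  # One row per (address, day); TimeUUID column names keep entries ordered.
  client.insert(:SearchLog, '127.0.0.1:2010-08-07',
                { SimpleUUID::UUID.new => '' })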


How would one then "query" over multiple days? Same question for all days.
Should I use range_slice or multiget_slice? And if it's range_slice, does
that mean I need OrderPreservingPartitioner?

The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
'2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
and use multiget_slice.
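
A minimal sketch of that with the Ruby cassandra gem (the CF and keyspace
names are placeholders for whatever holds the day-keyed rows):

  require 'cassandra'
  require 'date'

  client = Cassandra.new('Logs', '127.0.0.1:9160')

  # Generate the day keys in the app...
  keys = (0...7).map { |n| (Date.today - n).strftime('%Y-%m-%d') }
  # ...then fetch them all in one round trip (multiget_slice underneath).
  rows = client.multi_get(:DailySearchLog, keys)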

If you want to get all days where a specific IP address had some requests,
you'll just need another CF where the row key is the addr and column names
are the days (values optional again). Pretty much the same all over again,
just add another CF and insert the data you need.
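
Sketched with the gem (the CF name 'DaysByIP' is invented here):

  # client as in the earlier sketches
  # Record that this address made requests on this day (value unused).
  client.insert(:DaysByIP, '127.0.0.1', { '2010-08-07' => '' })

  # Every day with requests from the address is just a column name.
  days = client.get(:DaysByIP, '127.0.0.1').keys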

get_range_slice in my experience is better used for "offline" tasks
where you really want to process every row there is.
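
(For completeness, that offline pattern might look like the sketch below.
In the 2010-era gem get_range returned row keys, which changed in later
versions, so treat the return shape as an assumption to verify:)

  # client as in the earlier sketches
  client.get_range(:SearchLog, :count => 100).each do |key|
    row = client.get(:SearchLog, key)
    # batch-process the row offline
  end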

/thomas

Ok... as an example, for looking up logs by IP for a certain
timeframe/range, would this work?

<ColumnFamily Name="SearchLog"/>

<ColumnFamily Name="IPSearchLog"
                           ColumnType="Super"
                           CompareWith="UTF8Type"
                           CompareSubcolumnsWith="TimeUUIDType"/>

Resulting in a structure like:

{
  "127.0.0.1" : {
       "2010080711" : {
            uuid1 : "",
            uuid2 : "",
            uuid3 : ""
       },
       "2010080712" : {
            uuid1 : "",
            uuid2 : "",
            uuid3 : ""
       }
  },
  "some.other.ip" : {
       "2010080711" : {
            uuid1 : ""
       }
  }
}

Whereas each uuid is the key used for SearchLog. Is there anything wrong
with this? I know there is a 2 billion column limit, but in this case that
would never be exceeded because each column represents an hour. However,
does the above "schema" imply that for any certain IP there can only be a
maximum of 2GB of data stored?
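
(For reference, the two-step lookup this layout implies might go roughly
like this with the gem; variable names are illustrative, and it assumes the
sub-column names round-trip in the same form SearchLog uses as row keys:)

  # client as in the earlier sketches
  # Step 1: hour buckets for one IP, bounded by super column name.
  hours = client.get(:IPSearchLog, '127.0.0.1',
                     :start => '2010080711', :finish => '2010080712')

  # Step 2: the sub-column names are the row keys into SearchLog.
  uuids = hours.values.map { |cols| cols.keys }.flatten
  logs  = client.multi_get(:SearchLog, uuids)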

Or should I invert the IP with the time slices? The limitation of this
seems to be that there can only be 2 billion unique IPs per hour, which is
more than enough for our application :)

{
  "2010080711" : {
       "127.0.0.1" : {
            uuid1 : "",
            uuid2 : "",
            uuid3 : ""
       },
       "some.other.ip" : {
            uuid1 : "",
            uuid2 : "",
            uuid3 : ""
       }
  },
  "2010080712" : {
       "127.0.0.1" : {
            uuid1 : ""
       }
  }
}
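
(Under this inverted layout an hour's worth of lookups becomes a single-row
read; the CF name 'HourSearchLog' is invented, since this version is not
named above:)

  # client as in the earlier sketches
  ips = client.get(:HourSearchLog, '2010080711')
  ips.each do |addr, columns|
    uuids = columns.keys   # keys back into SearchLog for this address
  end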



In the end, does it really matter which one to go with? I kind of like the
previous version so I don't have to build up all the keys for the
multi_get; instead I can just provide a start & finish for the columns
(time frames).

Is there any performance penalty for a multi_get that includes x keys versus a get on 1 key with a start/finish range of x?

Using your gem,

multi_get("SearchLog", ["20090101"..."20100807"], "127.0.0.1")
vs
get("SearchLog", "127.0.0.1", :start => "20090101", :finish => ""127.0.0.1")

Thanks
