Certainly. There is also a performance penalty to unbounded row sizes. That penalty is your nodes OOMing. I strongly recommend you abandon that direction.
On Sat, Aug 7, 2010 at 9:06 PM, Mark <static.void....@gmail.com> wrote:
> On 8/7/10 7:04 PM, Benjamin Black wrote:
>>
>> certainly it matters: your previous version is not bounded on time, so
>> will grow without bound. ergo, it is not a good fit for cassandra.
>>
>> On Sat, Aug 7, 2010 at 2:51 PM, Mark <static.void....@gmail.com> wrote:
>>>
>>> On 8/7/10 2:33 PM, Benjamin Black wrote:
>>>>
>>>> Right, this is an index row per time interval (your previous email was
>>>> not).
>>>>
>>>> On Sat, Aug 7, 2010 at 11:43 AM, Mark <static.void....@gmail.com> wrote:
>>>>>
>>>>> On 8/7/10 11:30 AM, Mark wrote:
>>>>>>
>>>>>> On 8/7/10 4:22 AM, Thomas Heller wrote:
>>>>>>>>
>>>>>>>> Ok, I think the part I was missing was the concatenation of the key
>>>>>>>> and partition to do the look ups. Is this the preferred way of
>>>>>>>> accomplishing needs such as this? Are there alternative ways?
>>>>>>>
>>>>>>> Depending on your needs you can concat the row key or use super
>>>>>>> columns.
>>>>>>>
>>>>>>>> How would one then "query" over multiple days? Same question for
>>>>>>>> all days. Should I use range_slice or multiget_slice? And if it's
>>>>>>>> range_slice, does that mean I need OrderPreservingPartitioner?
>>>>>>>
>>>>>>> The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
>>>>>>> '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
>>>>>>> and use multiget_slice.
>>>>>>>
>>>>>>> If you want to get all days where a specific ip address had some
>>>>>>> requests, you'll just need another CF where the row key is the addr
>>>>>>> and the column names are the days (values optional again). Pretty
>>>>>>> much the same all over again, just add another CF and insert the
>>>>>>> data you need.
>>>>>>>
>>>>>>> get_range_slice in my experience is better used for "offline" tasks
>>>>>>> where you really want to process every row there is.
>>>>>>>
>>>>>>> /thomas
>>>>>>
>>>>>> Ok... as an example, looking up logs by IP for a certain
>>>>>> timeframe/range, would this work?
>>>>>>
>>>>>> <ColumnFamily Name="SearchLog"/>
>>>>>>
>>>>>> <ColumnFamily Name="IPSearchLog"
>>>>>>               ColumnType="Super"
>>>>>>               CompareWith="UTF8Type"
>>>>>>               CompareSubcolumnsWith="TimeUUIDType"/>
>>>>>>
>>>>>> Resulting in a structure like:
>>>>>>
>>>>>> {
>>>>>>   "127.0.0.1" : {
>>>>>>     "2010080711" : {
>>>>>>       uuid1 : ""
>>>>>>       uuid2 : ""
>>>>>>       uuid3 : ""
>>>>>>     }
>>>>>>     "2010080712" : {
>>>>>>       uuid1 : ""
>>>>>>       uuid2 : ""
>>>>>>       uuid3 : ""
>>>>>>     }
>>>>>>   }
>>>>>>   "some.other.ip" : {
>>>>>>     "2010080711" : {
>>>>>>       uuid1 : ""
>>>>>>     }
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Where each uuid is the key used for SearchLog. Is there anything
>>>>>> wrong with this? I know there is a 2 billion column limit, but in
>>>>>> this case that would never be exceeded because each column represents
>>>>>> an hour. However, does the above "schema" imply that for any certain
>>>>>> IP there can only be a maximum of 2GB of data stored?
>>>>>
>>>>> Or should I invert the IP with the time slices?
>>>>> The limitation of this seems to be that there can only be 2 billion
>>>>> unique IPs per hour, which is more than enough for our application :)
>>>>>
>>>>> {
>>>>>   "2010080711" : {
>>>>>     "127.0.0.1" : {
>>>>>       uuid1 : ""
>>>>>       uuid2 : ""
>>>>>       uuid3 : ""
>>>>>     }
>>>>>     "some.other.ip" : {
>>>>>       uuid1 : ""
>>>>>       uuid2 : ""
>>>>>       uuid3 : ""
>>>>>     }
>>>>>   }
>>>>>   "2010080712" : {
>>>>>     "127.0.0.1" : {
>>>>>       uuid1 : ""
>>>>>     }
>>>>>   }
>>>>> }
>>>
>>> In the end, does it really matter which one to go with? I kind of like
>>> the previous version so I don't have to build up all the keys for the
>>> multi_get and instead can just provide a start & finish for the columns
>>> (time frames).
>
> Is there any performance penalty for a multi_get that includes x keys
> versus a get on 1 key with a start/finish range of x?
>
> Using your gem,
>
> multi_get("SearchLog", ["20090101"..."20100807"], "127.0.0.1")
> vs
> get("SearchLog", "127.0.0.1", :start => "20090101", :finish => "20100807")
>
> Thanks
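For concreteness, a minimal sketch of the write path the thread converges on: store each log entry once under a TimeUUID key, then index that TimeUUID under an hour-bucketed row with the IP as the super column (Mark's second, time-bounded layout). This is written against the Ruby cassandra gem the snippets above appear to use; the keyspace name, host, and field names are illustrative assumptions, and the exact API may vary by gem version:

  require 'cassandra'   # fauna/cassandra gem; pulls in the simple_uuid dependency (assumed)

  client = Cassandra.new('SearchLogs', '127.0.0.1:9160')  # hypothetical keyspace/host

  # Store the full log entry once, keyed by a TimeUUID...
  log_key = SimpleUUID::UUID.new
  client.insert(:SearchLog, log_key.to_guid, { 'query' => 'foo', 'ip' => '127.0.0.1' })

  # ...then index it under an hour-bucketed row, IP as the super column.
  # The column value can stay empty; only the TimeUUID column name matters.
  hour_bucket = Time.now.strftime('%Y%m%d%H')   # e.g. "2010080711"
  client.insert(:IPSearchLog, hour_bucket, { '127.0.0.1' => { log_key => '' } })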
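And the two read patterns Mark is weighing at the end, again as a hedged sketch against the same gem (the multi-key form maps to multiget_slice under the hood, the single-key form to a column slice on one row; the multi_get column-argument signature is assumed from the gem's docs):

  # Option A: time-bucketed rows. Generate the bucket keys client-side and
  # issue one multi_get, as Thomas suggests; each row stays bounded in size.
  day_keys = ('20100805'..'20100807').to_a       # Ruby String#succ fills in the range
  by_day = client.multi_get(:IPSearchLog, day_keys, '127.0.0.1')

  # Option B: assumes the earlier IP-keyed schema, one row per IP with
  # time-ordered columns, read with a single column slice. Simpler to query,
  # but the row grows for as long as the IP keeps appearing, which is the
  # unbounded growth Benjamin warns will OOM nodes.
  by_ip = client.get(:IPSearchLog, '127.0.0.1', :start => '20100805', :finish => '20100807')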