Certainly it matters: your previous version is not bounded in time, so it will grow without bound. Ergo, it is not a good fit for Cassandra.
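For what it's worth, here is a rough sketch of the bounded variant (one index row per day, as Thomas suggests below). This assumes a pycassa-style Python client; the keyspace, server list, and the 'SearchLogByDay' CF name are made up for illustration, and exact call signatures vary by client version.

from datetime import date, timedelta

import pycassa

# Hypothetical keyspace/CF names; adjust to your own schema.
pool = pycassa.ConnectionPool('Logs', ['localhost:9160'])
search_log_by_day = pycassa.ColumnFamily(pool, 'SearchLogByDay')

def last_n_day_keys(n, today=None):
    # Generate bounded, per-day row keys such as '2010-08-07'.
    today = today or date.today()
    return [(today - timedelta(days=i)).isoformat() for i in range(n)]

# Each row holds at most one day of entries, so no single row grows forever.
# multiget issues a multiget_slice under the hood, one result dict per key.
rows = search_log_by_day.multiget(last_n_day_keys(7))
# rows maps '2010-08-07' -> {TimeUUID column name: value, ...}

The point is just that the app generates the keys (3, 7, 31 days, whatever) and the rows themselves stay bounded by the interval you pick.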
On Sat, Aug 7, 2010 at 2:51 PM, Mark <static.void....@gmail.com> wrote:
> On 8/7/10 2:33 PM, Benjamin Black wrote:
>>
>> Right, this is an index row per time interval (your previous email was
>> not).
>>
>> On Sat, Aug 7, 2010 at 11:43 AM, Mark <static.void....@gmail.com> wrote:
>>>
>>> On 8/7/10 11:30 AM, Mark wrote:
>>>>
>>>> On 8/7/10 4:22 AM, Thomas Heller wrote:
>>>>>>
>>>>>> Ok, I think the part I was missing was the concatenation of the key
>>>>>> and partition to do the lookups. Is this the preferred way of
>>>>>> accomplishing needs such as this? Are there alternative ways?
>>>>>
>>>>> Depending on your needs you can concat the row key or use super
>>>>> columns.
>>>>>
>>>>>> How would one then "query" over multiple days? Same question for all
>>>>>> days. Should I use range_slice or multiget_slice? And if it's
>>>>>> range_slice, does that mean I need OrderPreservingPartitioner?
>>>>>
>>>>> The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
>>>>> '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
>>>>> and use multiget_slice.
>>>>>
>>>>> If you want to get all days where a specific ip address had some
>>>>> requests, you'll just need another CF where the row key is the addr
>>>>> and the column names are the days (values optional again). Pretty much
>>>>> the same all over again, just add another CF and insert the data you
>>>>> need.
>>>>>
>>>>> get_range_slice in my experience is better used for "offline" tasks
>>>>> where you really want to process every row there is.
>>>>>
>>>>> /thomas
>>>>
>>>> Ok... as an example, for looking up logs by ip for a certain
>>>> timeframe/range, would this work?
>>>>
>>>> <ColumnFamily Name="SearchLog"/>
>>>>
>>>> <ColumnFamily Name="IPSearchLog"
>>>>               ColumnType="Super"
>>>>               CompareWith="UTF8Type"
>>>>               CompareSubcolumnsWith="TimeUUIDType"/>
>>>>
>>>> Resulting in a structure like:
>>>>
>>>> {
>>>>   "127.0.0.1" : {
>>>>     "2010080711" : {
>>>>       uuid1 : ""
>>>>       uuid2 : ""
>>>>       uuid3 : ""
>>>>     }
>>>>     "2010080712" : {
>>>>       uuid1 : ""
>>>>       uuid2 : ""
>>>>       uuid3 : ""
>>>>     }
>>>>   }
>>>>   "some.other.ip" : {
>>>>     "2010080711" : {
>>>>       uuid1 : ""
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> Where each uuid is the key used for SearchLog. Is there anything wrong
>>>> with this? I know there is a 2 billion column limit, but in this case
>>>> that would never be exceeded because each column represents an hour.
>>>> However, does the above "schema" imply that for any certain IP there
>>>> can only be a maximum of 2GB of data stored?
>>>
>>> Or should I invert the ip with the time slices? The limitation of this
>>> seems like there can only be 2 billion unique ips per hour, which is
>>> more than enough for our application :)
>>>
>>> {
>>>   "2010080711" : {
>>>     "127.0.0.1" : {
>>>       uuid1 : ""
>>>       uuid2 : ""
>>>       uuid3 : ""
>>>     }
>>>     "some.other.ip" : {
>>>       uuid1 : ""
>>>       uuid2 : ""
>>>       uuid3 : ""
>>>     }
>>>   }
>>>   "2010080712" : {
>>>     "127.0.0.1" : {
>>>       uuid1 : ""
>>>     }
>>>   }
>>> }
>>>
>
> In the end, does it really matter which one to go with? I kind of like the
> previous version so I don't have to build up all the keys for the
> multi_get, and instead I can just provide a start & finish for the columns
> (time frames).
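For concreteness, here is a sketch of that "previous version" (row key = IP, super column = hour bucket, subcolumns = TimeUUID pointers into SearchLog), again assuming a pycassa-style client with made-up keyspace/CF names and version-dependent signatures. Reading a time range is then a single get with a start/finish over the super column names, which is the convenience Mark is describing:

import uuid

import pycassa

pool = pycassa.ConnectionPool('Logs', ['localhost:9160'])
ip_search_log = pycassa.ColumnFamily(pool, 'IPSearchLog')  # super CF

# uuid1() is a time-based UUID, which is what TimeUUIDType expects; older
# client versions may want the raw bytes (log_uuid.bytes) instead.
log_uuid = uuid.uuid1()
ip_search_log.insert('127.0.0.1', {'2010080711': {log_uuid: ''}})

# Every entry for this IP between the 11:00 and 12:00 buckets on 2010-08-07;
# no need to build one key per bucket as with multiget.
entries = ip_search_log.get('127.0.0.1',
                            column_start='2010080711',
                            column_finish='2010080712')

The trade-off is exactly the one raised at the top of this reply: the '127.0.0.1' row keeps accumulating hour buckets for as long as that IP shows up, so the row itself is not bounded in time.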