Re: Design questions/Schema help

Mark Mon, 26 Jul 2010 19:41:10 -0700

On 7/26/10 7:06 PM, Dave Viner wrote:

AFAIK, atomic increments are not available. There has recently been quite a bit of discussion about them, so you might search the archives.



Dave Viner

On Mon, Jul 26, 2010 at 7:02 PM, Mark <static.void....@gmail.com> wrote:


    On 7/26/10 6:06 PM, Dave Viner wrote:

    I'd love to hear others' opinions here... but here are my 2 cents.

    With Cassandra, you need to think of the queries - which you've
    pretty much done.

    For the most popular queries, you could do something like:

    <ColumnFamily Name="QueriesCounted"
                    ComparesWith="UTF8Type"
                    />
    And then access it as:
    key-space.QueriesCounted['query-foo-bar'] = $count;

    This makes it easy to get the count for any particular query.
    I'm not sure of the best way to store the "top counts" idea.
    Perhaps a secondary process which iterates over all the queries,
    sorts them by count, and then stores them into another
    ColumnFamily.

    You could use the same idea for the last query (session ids by query)

    <ColumnFamily Name="QueriesRecorded"
                    ComparesWith="UTF8Type"
                    ColumnType="super"
                    CompareSubcolumnsWith="TimeUUIDType"
                    />
    And then access it as:
    key-space.QueriesRecorded['query-foo-bar'][timeuuid] = session-id;

    Actually, if you used that idea (queries-recorded), you could
    generate the counts and aggregates from that directly in a hadoop
    post-processing...

    But perhaps others will have better ideas.  If you haven't read
    http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model, go
    read it now.  It won't answer your question directly, but will
    describe the process of modeling a blog in cassandra so you can
    get a sense of the process.

    Dave Viner




    On Mon, Jul 26, 2010 at 4:46 PM, Mark <static.void....@gmail.com> wrote:

        We are thinking about using Cassandra to store our search
        logs. Can someone point me in the right direction/lend some
        guidance on design? I am new to Cassandra and I am having
        trouble wrapping my head around some of these new concepts.
        My brain keeps wanting to go back to an RDBMS design.

        We will be storing the user query, # of hits returned and
        their session id. We would like to be able to answer the
        following questions.

        - What are the n most popular queries and their counts within
        the last x (mins/hours/days/etc). Basically the most popular
        searches within a given time range.
        - What is the most popular query within the last x where hits
        = 0. Same as above but with an extra "where" clause
        - For session id x give me all their other queries
        - What are all the session ids that searched for 'foos'

        We accomplish the above functionality w/ MySQL using 2
        tables. One for the raw search log information and the other
        to keep the aggregate/running counts of queries.

        Would this sort of ad-hoc querying be better implemented
        using Hadoop + Hive? If so, should I be storing all this
        information in Cassandra then using Hadoop to retrieve it?

        Thanks for your suggestions

    "Perhaps a secondary process which iterates over all the queries,
    sorts them by count, and then stores them into another
    ColumnFamily."

    - I was trying to avoid this. Is there some sort of atomic
    increment feature available? I guess I could do the same thing we
    are currently doing which is...

    a) store full query details into table A
    b) query table B for aggregate count of query 'foo' then store
    count + 1
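
A minimal sketch of steps (a) and (b) above, using plain Python containers as stand-ins for the two tables. No Cassandra or MySQL client is involved, and `raw_log`, `query_counts`, `record_search` and `top_n` are hypothetical names chosen for illustration. Note that the read-then-write in step (b) is not atomic, so concurrent writers could lose updates:

```python
from collections import Counter

raw_log = []              # stand-in for table A: one entry per search
query_counts = Counter()  # stand-in for table B: running count per query

def record_search(query, hits, session_id):
    # (a) store the full query details
    raw_log.append({"query": query, "hits": hits, "session": session_id})
    # (b) read the current count, then store count + 1 (two steps, not atomic)
    current = query_counts[query]
    query_counts[query] = current + 1

def top_n(n):
    # n most popular queries with their counts, highest count first
    return query_counts.most_common(n)

record_search("foo", 12, "s1")
record_search("bar", 0, "s2")
record_search("foo", 7, "s3")
print(top_n(1))
```

The sketch keeps a single running total per query; answering "within the last x" would need counts kept per time bucket, or a pass over the raw log.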

Thanks, I'll look into that.

Say I am trying to model something like this:

SearchLogs : {
    foo : {
        TimeUUID_1 : {
            session : af55e102b67c2de27bf12024ac0e7798
            user_id : mr jiggles
        }
        TimeUUID_2 : {
            ....
        }
    }
    bar : {
        .....
    }
}

Would this be my config?

<ColumnFamily Name="SearchLogs"
                    ColumnType="Super"
                    ComparesWith="BytesType"
                    CompareSubcolumnsWith="TimeUUIDType"/>

So basically MyCompany.SearchLogs['foo'] would return an array of hashes ordered by time. Is this correct?
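
To make the shape I have in mind concrete, here is a rough sketch in plain Python, with nested dicts standing in for the super column family. It only mirrors the intended layout (row key, then TimeUUID, then columns); it does not use any Cassandra client, and `search_logs`, `log_search` and `entries_for` are made-up names:

```python
import uuid

# Stand-in for SearchLogs: row key (query) -> {TimeUUID -> {column -> value}}
search_logs = {}

def log_search(query, session, user_id):
    ts = uuid.uuid1()  # version 1 UUID, which embeds a timestamp
    search_logs.setdefault(query, {})[ts] = {"session": session, "user_id": user_id}
    return ts

def entries_for(query):
    # Return the row's entries sorted by the timestamp embedded in each UUID key.
    row = search_logs.get(query, {})
    return [row[key] for key in sorted(row, key=lambda u: u.time)]

log_search("foo", "af55e102b67c2de27bf12024ac0e7798", "mr jiggles")
log_search("foo", "another-session", "someone else")
print(entries_for("foo"))
```

Whether a real client hands the row back in exactly this form is an assumption on my part.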
