Hey guys,

I'm curious about your experiences with a data modeling question we are facing. At the moment we see 2 different approaches to how to build the tables, but I've been googling around for days with no luck finding any useful material explaining how a Map (as a collection data type) works on the storage engine, and what could surprise us later if we go down that road. So I decided to ask this question... (If someone has some nice pointers, that would also be much appreciated!)

So, *to describe the problem* in a simplified form:

 * Imagine you have users (each identified by a UUID),
 * and we want to answer a simple question: "have we seen this user before?"
 * we "just" want to be able to answer this question for a limited time
   - let's say for 3 months
 * ....but... there are lots and lots of users we run into... many
   millions each day...
 * ....and only ~15-20% of them are returning users - so most users we
   might see just once

We are thinking about something like a big big Map, in the form of
    userId => lastSeenTimestamp

Obviously, if we had something like that, then answering the above question would simply be:
    if (map.get(userId) != null)  => TRUE - we have seen this user before

Regarding the 2 major modelling approaches I mentioned above:

*Approach 1*
Simply use a table, something like this:

CREATE TABLE IF NOT EXISTS users (
    user_id      varchar,
    last_seen    int,    -- a UNIX timestamp is enough, that's why int
    PRIMARY KEY (user_id)
) WITH ...
  AND default_time_to_live = <3 months in seconds>;
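
To make it concrete, here is the read/write path we have in mind for Approach 1 - just a sketch, with placeholder literal values; the expiry comes from the table's default_time_to_live:

-- WRITE: a plain upsert, no prior read needed
INSERT INTO users (user_id, last_seen) VALUES ('some-user-uuid', 1718000000);

-- READ: "have we seen this user before?" = does the row (still) exist?
SELECT last_seen FROM users WHERE user_id = 'some-user-uuid';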

*Approach 2*
To avoid producing that many rows, "cluster" the users a bit together (into 1 row): introduce a hashing function over the userId, producing a value in [0; 10000],
and go with a table like

CREATE TABLE IF NOT EXISTS users (
    user_id_hash    int,
    users_seen      map<text, int>,    -- this is a userId => last-seen-timestamp map
    PRIMARY KEY (user_id_hash)
) WITH ...
  AND default_time_to_live = <3 months in seconds>;    -- yes, it's clearly not a good enough way here ...
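
And the corresponding sketch for Approach 2 - assuming the application computes user_id_hash client-side (e.g. some murmur hash of the userId mod 10001; the exact function doesn't matter here), again with placeholder values:

-- WRITE: set one map entry, still no prior read needed;
-- USING TTL stamps just this entry, since each map entry is its own cell
-- (7776000 s = 90 days)
UPDATE users USING TTL 7776000
   SET users_seen['some-user-uuid'] = 1718000000
 WHERE user_id_hash = 1234;

-- READ: selecting a single map element like this needs Cassandra 4.0+;
-- on older versions we would have to fetch back the whole map
SELECT users_seen['some-user-uuid'] FROM users WHERE user_id_hash = 1234;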


In theory:

 * on the WRITE path both representations give us a way to do the write
   without the need for a read
 * even the READ path is pretty efficient in both cases
 * Approach 2 is definitely worse when it comes to the cleanup - "remove
   info if older than 3 months" (more on this right after the list)
 * Approach 2 might affect the balance of the cluster more - that's clear
   (however not that much, due to the law of large numbers and really
   enough random factors)
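
One more note on the cleanup point, as promised above (please correct me if I'm wrong): as far as we understand, TTL in Cassandra is applied per cell, and every map entry is its own cell - so entries expire individually in both approaches. The real difference is that in Approach 2 the expired entries turn into tombstones inside one big partition, which reads have to skip over until compaction finally drops them. For Approach 1 the remaining lifetime is at least easy to inspect:

-- remaining TTL (in seconds) of a given user's cell
SELECT user_id, TTL(last_seen) FROM users WHERE user_id = 'some-user-uuid';

-- note: TTL() cannot be applied to a non-frozen collection column,
-- so there is no equivalent per-entry check for the map in Approach 2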

And what we are struggling with is: what do you think,
*which approach would be better over time?* So: which one will slow down the cluster less, considering compaction etc.?

As far as we can see, the real question is: which hurts more?

 * many more rows, but very small rows (regarding data size), or
 * far fewer rows, but much bigger rows (regarding data size)?

Any thoughts, comments, pointers to related case studies, articles, etc. are highly appreciated!! :-)

thanks!

--
Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932

