Hey guys,
I'm curious about your experiences with a data modeling question we are facing.
At the moment we see 2 major approaches to how the tables could be built.
I've been googling around for days already, with no luck finding any useful
material explaining how a Map (as a collection datatype) works in the
storage engine, and what could surprise us later if we go with it. So I
decided to ask this question... (If someone has some nice pointers here,
that would also be much appreciated!)
So, *to describe the problem* in a simplified form:
* Imagine you have users (each identified by a UUID),
* and we want to answer a simple question: "have we seen this guy before?"
* we "just" want to be able to answer this question for a limited time
- let's say for 3 months
* ...but... we run into lots and lots of users... many millions each day...
* ...and only ~15-20% of them are returning users - so most of the guys we
see just once
We are thinking about something like one big Map, of the form
userId => lastSeenTimestamp
Obviously, if we had something like that, then answering the above
question is simply:
if (map.get(userId) != null) => TRUE - we have seen the guy before
Regarding the 2 major modeling approaches I mentioned above:
*Approach 1*
Just use a simple table, something like this:
CREATE TABLE IF NOT EXISTS users (
    user_id varchar,
    last_seen int,    -- a UNIX timestamp is enough, that's why int
    PRIMARY KEY (user_id)
) ....
AND default_time_to_live = <3 months in seconds>;
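
To make the write and read paths concrete, here is a minimal sketch
(the 7776000 is just ~90 days in seconds, and '<userId>' / the timestamp
stand for whatever the application supplies):

-- WRITE path: a plain upsert, no read needed; the TTL comes from the
-- table default, or can be set explicitly per write like this
INSERT INTO users (user_id, last_seen)
VALUES ('<userId>', 1700000000)
USING TTL 7776000;  -- ~3 months in seconds

-- READ path: a single-partition lookup by primary key
SELECT last_seen FROM users WHERE user_id = '<userId>';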
*Approach 2*
To avoid producing that many rows, "cluster" the guys a bit together
(into 1 row): introduce a hashing function over the userId, producing a
value in the [0; 10000] range, and go with a table like
CREATE TABLE IF NOT EXISTS users (
    user_id_hash int,
    users_seen map<text, int>,    -- a userId => last seen timestamp map
    PRIMARY KEY (user_id_hash)
) ....
AND default_time_to_live = <3 months in seconds>;    -- yes, on its own
this is clearly not a good enough way to expire the entries ...
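
Again, a sketch of the paths, to make it concrete (the bucket value 42 is
made up - it would be computed client-side as something like
hash(userId) mod 10001, covering the [0; 10000] range):

-- WRITE path: still no read needed; setting a single map entry only
-- writes that one cell, and USING TTL applies to just that entry
-- (each entry of a non-frozen map is stored as its own cell)
UPDATE users USING TTL 7776000
   SET users_seen['<userId>'] = 1700000000
 WHERE user_id_hash = 42;

-- READ path: fetches the whole map for the bucket, and the client
-- checks whether the userId is present in it
SELECT users_seen FROM users WHERE user_id_hash = 42;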
In theory:
* on the WRITE path, both representations give us a way to do the write
without needing a read first
* even the READ path is pretty efficient in both cases
* Approach 2 is definitely worse when it comes to the cleanup - "remove
the info if older than 3 months" (see the sketch right after this list)
* Approach 2 might affect the balance of the cluster more - that's clear
(however not that much, thanks to the law of large numbers and enough
genuinely random factors)
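
On the cleanup point: as far as we understand the storage engine
(corrections welcome!), every map entry that expires via TTL - or is
deleted explicitly, like below - leaves a tombstone cell behind in that
partition, and reads over the partition have to skip those tombstones
until compaction purges them after gc_grace_seconds (864000 seconds =
10 days by default). With millions of one-time users packed into only
~10000 partitions, those can pile up; Cassandra even logs a warning when
a single read scans too many of them (tombstone_warn_threshold).

-- explicit cleanup of one entry writes a tombstone cell;
-- an entry that expires via TTL leaves the same kind of marker behind
DELETE users_seen['<userId>'] FROM users WHERE user_id_hash = 42;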
And what we are struggling with is: what do you think, *which approach
would be better over time?* That is, which one will slow down the
cluster less, considering compaction etc.?
As far as we can see, the real question is: which hurts more?
* many more rows, but very small ones (in terms of data size), or
* far fewer rows, but much bigger ones (in terms of data size)?
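
Just to put rough numbers on "much bigger rows" (a back-of-envelope
sketch with made-up inputs - assuming 5 million new users/day and ~50
bytes per map entry; plug in the real figures):

5,000,000 users/day * 90 days         ~= 450,000,000 live entries
450,000,000 entries / 10,001 buckets  ~=      45,000 entries per partition
45,000 entries * ~50 bytes            ~=     ~2.2 MB per partition

So with those assumptions an Approach 2 partition stays well under the
often-quoted "keep partitions below ~100 MB" rule of thumb, while
Approach 1 carries the same data as ~450 million tiny rows.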
Any thoughts, comments, or pointers to related case studies, articles,
etc. are highly appreciated!! :-)
thanks!
--
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932