I'm a little late on this one, but I would choose Approach 1. It is much more comprehensible to anyone who comes along afterwards, and it should easily scale in Cassandra to whatever volume you have. I think I would call the table recent_users to make the purpose of the table very clear. It is also extensible if you want to add new features like last_page_visited or last_purchase_date – just add new columns. Data expiration is automatic with the default TTL on the table.
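For illustration, a minimal sketch of that table – the columns beyond last_seen are just the example features named above, and the 3-month TTL comes from the original question below:

    CREATE TABLE IF NOT EXISTS recent_users (
        user_id varchar,
        last_seen int,              -- UNIX timestamp in seconds
        last_page_visited text,     -- optional feature columns:
        last_purchase_date int,     -- just add them as needed
        PRIMARY KEY (user_id)
    ) WITH default_time_to_live = 7776000;  -- ~3 months (90 days) in seconds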
Approach 2 could have tombstone issues within the map. And it would be hard to extend with new features. I think clean-up would require a separate process, too – I don't think you can expire individual entries within a map column using TTL.

Sean Durity

From: Rahul Singh <rahul.xavier.si...@gmail.com>
Sent: Saturday, September 19, 2020 10:41 AM
To: user@cassandra.apache.org; Attila Wind <attila.wind@swf.technology>
Subject: Re: data modeling qu: use a Map datatype, or just simple rows... ?

Not necessarily. A deterministic hash randomizes a key that may be susceptible to "clustering", and that key may also need to be used in other, non-Cassandra systems. This way records can be accessed in both systems while still leveraging the partitioner in Cassandra without pitfalls. The same can be done with natural string keys like "email".

Best regards,
Rahul Singh

________________________________
From: Sagar Jambhulkar <sagar.jambhul...@gmail.com>
Sent: Saturday, September 19, 2020 6:45:25 AM
To: user@cassandra.apache.org; Attila Wind <attila.wind@swf.technology>
Subject: Re: data modeling qu: use a Map datatype, or just simple rows... ?

I don't really see a difference between the two options. Won't the partitioner run on the user id and create a hash for you? Unless your hash function is better than the partitioner's.

On Fri, 18 Sep 2020, 21:33 Attila Wind, <attilaw@swf.technology> wrote:

Hey guys,

I'm curious about your experiences regarding a data modeling question we are facing.

At the moment we see 2 major different approaches in terms of how to build the tables. But I have been googling around for days with no luck, trying to find any useful material explaining how a Map (as a collection datatype) works on the storage engine, and what could surprise us later if we go with it. So I decided to ask this question... (If someone has some nice pointers here, that is also much appreciated!)

So, to describe the problem in a simplified form:
• Imagine you have users (everyone is identified with a UUID),
• and we want to answer a simple question: "have we seen this guy before?"
• we "just" want to be able to answer this question for a limited time - let's say for 3 months
• ...but there are lots and lots of users we run into... many millions each day...
• ...and only ~15-20% of them are returning users - so many of them we might see just once

We are thinking about something like a big, big Map in the form userId => lastSeenTimestamp. Obviously, if we had something like that, then answering the above question is simply: if(map.get(userId) != null) => TRUE - we have seen the guy before.

Regarding the 2 major modelling approaches I mentioned above:

Approach 1

Simply use a table, something like this:

    CREATE TABLE IF NOT EXISTS users (
        user_id varchar,
        last_seen int,    -- a UNIX timestamp is enough, that's why int
        PRIMARY KEY (user_id)
    ) ... AND default_time_to_live = <3 months in seconds>;

Approach 2

To avoid producing that many rows, "cluster" the users a bit (into one row): introduce a hashing function over the userId, producing a value between [0; 10000], and go with a table like:

    CREATE TABLE IF NOT EXISTS users (
        user_id_hash int,
        users_seen map<text, int>,    -- this is a userId => last timestamp map
        PRIMARY KEY (user_id_hash)
    ) ... AND default_time_to_live = <3 months in seconds>;    -- yes, it's clearly not a good enough way ...
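To make Approach 2 concrete, a write would be a single map-entry upsert; a minimal sketch with hypothetical values (the bucket is computed client-side, e.g. hash(userId) % 10000):

    -- bucket 4217 and the UUID string are hypothetical example values
    UPDATE users
       SET users_seen['3fa85f64-5717-4562-b3fc-2c963f66afa6'] = 1600500000
     WHERE user_id_hash = 4217;

Each map entry is stored as its own cell inside the partition (so the table's default TTL attaches to it at write time), and entries that expire or get deleted linger as tombstones until compaction purges them - which is where tombstone concerns with wide map partitions typically come from.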
In theory:
• on the WRITE path, both representations give us a way to do the write without the need for a read
• even the READ path is pretty efficient in both cases (see the read sketch at the end of this mail)
• Approach 2 is definitely worse when we come to the cleanup - "remove info if older than 3 months"
• Approach 2 might affect the balance of the cluster more - that's clear (however not that much, due to the law of large numbers and enough random factors)

And what we are struggling with is: which approach would be better over time? So, which will slow down the cluster less, considering compaction etc.?

As far as we can see, the real question is: which hurts more?
• many more rows, but very small ones (regarding data size), or
• many fewer rows, but much bigger ones (regarding data size)?

Any thoughts, comments, pointers to related case studies, articles, etc. are highly appreciated!! :-)

thanks!

--
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932
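As referenced in the list above, the "have we seen this guy before?" check is a single-partition read in both approaches; a minimal sketch reusing the same hypothetical values as before:

    -- Approach 1: a row comes back => seen before;
    -- no row (never written, or TTL-expired) => new user
    SELECT last_seen FROM users WHERE user_id = '3fa85f64-5717-4562-b3fc-2c963f66afa6';

    -- Approach 2: the whole map for the bucket is returned;
    -- the client then looks up the userId in it
    SELECT users_seen FROM users WHERE user_id_hash = 4217;

The second query illustrates the size trade-off: the read is still one partition, but every lookup pulls back the full bucket map rather than a single small row.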