Hi, I've been watching these presentations about real-time user segmentation using HBase by RichRelevance: https://www.youtube.com/watch?v=dPnuOv3CPQ0 http://www.slideshare.net/Hadoop_Summit/doctor-nguyen-june27425pmroom230av2
It's a really detailed talk, highly recommended. They use HBase to calculate segments by evaluating and combining rules like "all users who did EventX with MetricY between dates D1 and D2 at least N times", and it seems to be working well for them. But there are one or two things I can't figure out. Would anyone be interested in talking about how they did it, or has anyone here implemented a similar scenario and would be willing to chat? I'd love to get together and swap ideas about implementing some variations of this approach, and I'm very happy to share my experience so far.

As for the details: it looks like they used cell versioning to store multiple clickstream events per day (they mention this briefly in another version of the video: http://vimeo.com/70500725). They must have increased the max versions on the column family to something quite high in order to do this. How did they use cell versions to store so many values? I was under the impression that raising max versions much above 100 would result in very large HFiles.

Then they somehow work out how many times the event happened between those dates. I'm assuming they calculate running totals for each user in memory while iterating over the scan results, but that could be expensive if it means holding a hashtable with millions of users in memory.

Thanks,
Meena
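To make the versioning question concrete, I assume the table setup was roughly something like this (hbase shell; the table and family names are my guesses, not from the talk), with VERSIONS raised far above the default so each write of the same cell keeps a new timestamped version:

```
create 'user_events', {NAME => 'e', VERSIONS => 2147483647}

# Read back all versions of a column within a date range
# (timestamps in epoch millis):
scan 'user_events', {COLUMNS => 'e:click', VERSIONS => 10000,
                     TIMERANGE => [1370044800000, 1372636800000]}
```

Is that effectively what they did, and does setting VERSIONS that high have the HFile-size consequences I'm worried about?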
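And for the counting question, here's the kind of streaming aggregation I'm imagining, as a plain-Java sketch (the Cell record and all names here are hypothetical stand-ins for scan results, not the HBase API). The point is that since an HBase scan returns rows in rowkey order, if the userId is the rowkey prefix you only ever need one user's running total in memory at a time, not a hashtable of millions of users:

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentCount {
    // Hypothetical stand-in for a scanned cell: rowkey prefix = userId,
    // cell timestamp = ts, column qualifier = event.
    record Cell(String userId, long ts, String event) {}

    // Returns userIds that did 'event' at least minCount times in [from, to).
    // Input must be sorted by userId, as rows from an HBase scan would be,
    // so only one running count is held at a time.
    static List<String> usersInSegment(List<Cell> sortedCells, String event,
                                       long from, long to, int minCount) {
        List<String> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (Cell c : sortedCells) {
            if (!c.userId().equals(current)) {
                // Crossed a user boundary: flush the previous user's total.
                if (current != null && count >= minCount) out.add(current);
                current = c.userId();
                count = 0;
            }
            if (c.event().equals(event) && c.ts() >= from && c.ts() < to) count++;
        }
        if (current != null && count >= minCount) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(
            new Cell("u1", 100, "click"), new Cell("u1", 150, "click"),
            new Cell("u2", 120, "click"),
            new Cell("u3", 90, "click"), new Cell("u3", 130, "view"));
        // Only u1 clicked at least twice within [100, 200).
        System.out.println(usersInSegment(cells, "click", 100, 200, 2)); // [u1]
    }
}
```

Is a per-row (or per-user-boundary) rollup like this roughly what they do, or do they precompute the totals at write time?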
