Hi,

Is there a good way to identify repeat visitors when analyzing web logs with Hive/Hadoop? One idea I came up with is to store the user id and session id of every visit (the session data) in a separate table and then join the logs against that table. However, the session data table would grow indefinitely (potentially to 1B+ records). Joining two large tables in Hive would result in a Common Join, and I cannot find any performance information on it. Is this even feasible and scalable?
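
The join-against-history idea can be sketched roughly as follows. This is plain in-memory Python, not Hive, and only illustrates the logic: the function name `label_repeat_visitors` and the `known_users` set are hypothetical, and the set merely stands in for the session data table. A visitor counts as a repeat only if the user id was already present before the current batch of logs, which is an assumption about how "repeat" is defined.

```python
from typing import Iterable, List, Set, Tuple


def label_repeat_visitors(
    known_users: Set[str],
    todays_visits: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, bool]]:
    """Label each (user_id, session_id) visit as repeat or new.

    known_users plays the role of the session data table: it holds every
    user id seen in earlier batches. It is updated in place only after the
    whole batch has been labeled, mirroring a "join, then append" job.
    """
    labeled = []
    seen_today = set()
    for user_id, session_id in todays_visits:
        # Equivalent of the join: does this user id exist in the history?
        is_repeat = user_id in known_users
        labeled.append((user_id, session_id, is_repeat))
        seen_today.add(user_id)
    # Append today's users to the history for the next batch.
    known_users |= seen_today
    return labeled
```

Note that the history only needs one entry per distinct user id, not one per session, so in this sketch it grows with the number of users rather than the number of visits.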

There was an older thread that was somewhat related to this issue: http://osdir.com/ml/hive-user-hadoop-apache/2009-07/msg00267.html and one of the suggestions there was to use HBase. However, I don't see any information on using the Hive/HBase integration to update fields. Are there any alternatives, or a better approach to solving this problem? Thanks for any pointers.

Thanks,
--wil