Determining New/Repeat Visitor

Wil - Thu, 10 Feb 2011 14:24:54 -0800

Hi,

Is there a good way to identify repeat visitors when analyzing web logs with 
Hive/Hadoop?  One idea I have come up with is to store the list of user IDs 
and session IDs (session data) in another table and then join against that 
table.  However, the session data table would grow indefinitely (potentially 
to 1B+ records).  Joining two large tables in Hive would result in a Common 
Join, and I cannot find any performance information on it.  Is this even 
feasible and scalable?
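
To make the idea above concrete, here is a rough sketch of the logic in plain 
Python rather than Hive.  The names label_visits and first_session and the 
(user_id, session_id) row layout are made up for illustration, and it assumes 
the log rows arrive in chronological order.  The dict stands in for the 
session data table that would keep growing.

```python
from typing import Dict, Iterable, List, Tuple


def label_visits(
    log_rows: Iterable[Tuple[str, str]],
    first_session: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Label each (user_id, session_id) log row as 'new' or 'repeat'.

    first_session maps user_id -> the first session_id seen for that user.
    It plays the role of the session data table and is updated in place.
    Rows are assumed to be in chronological order (an assumption of this
    sketch, not something the logs guarantee).
    """
    labels = []
    for user_id, session_id in log_rows:
        # Record the session for users we have never seen; otherwise
        # return the session already stored for them.
        known = first_session.setdefault(user_id, session_id)
        # Every row of a user's first session counts as a new visit;
        # any later session counts as a repeat visit.
        status = "new" if known == session_id else "repeat"
        labels.append((user_id, session_id, status))
    return labels


if __name__ == "__main__":
    table: Dict[str, str] = {}
    rows = [("u1", "s1"), ("u1", "s1"), ("u2", "s2"), ("u1", "s3")]
    for row in label_visits(rows, table):
        print(row)
```

In Hive the lookup would be the join against the session data table, which is 
the part I am worried about at this scale.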


There was an older thread that was somewhat related to this issue: 
http://osdir.com/ml/hive-user-hadoop-apache/2009-07/msg00267.html 
One of the suggestions there was to use HBase.  However, I don't see anything 
about using the Hive/HBase integration to update fields.

Are there any alternatives? Or a better approach to solve this problem?

Thanks for any pointers.

Thanks,
--wil
