Hello, In case anyone will be looking for a similar solution in future I put a short blog post on this subject: http://www.dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/
Best, Radek On 14 January 2011 12:50, Radek Maciaszek <ra...@maciaszek.co.uk> wrote: > Hi Itai, > > I did not think about sampling users instead of sampling records, but it > makes a much more sense indeed. > As it happens my ID is also hexadecimal and so I did exactly what you > suggested. In the results my error is less than 1% comparing to observed > values! > > Many thanks!! > Radek > > > On 14 January 2011 11:32, Itai Hochman <i...@outbrain.com> wrote: > >> We had a similar challenge and we dealt with it by sampling based on the >> user id. >> >> We have a unique id which is a random hexadecimal format- for instance >> A12890900. >> >> The query is running only on users that have an id that ends with 00. >> >> At the end we multiply by 256 and get a pretty close number to the real >> number. >> >> >> >> Itai >> >> >> >> On 01/14/2011 01:14 PM, Radek Maciaszek wrote: >> >> Hi, >>> >>> I am working on some large scale unique users analysis (think hundreds of >>> millions of records per day). Since number of all records per month goes >>> into many billions I am hoping that there may be some alternative to running >>> "SELECT DISTINCT user_unique_id..." such as sampling data or perhaps >>> deriving monthly numbers based on daily numbers of unique users with a use >>> of statistical linear regression or a similar technique. >>> >>> I tried to use sampling but that does not seem to give me the results I >>> would expect. That is, for example if I sample every 1/100 record I would >>> expect to see about 1% of the total number of unique users from a given >>> period of time but instead I am seeing bigger numbers, for example something >>> like 5%. I am not sure if the law of large number applies to unique users >>> analysis. >>> >>> Moreover it seems that "select distinct" does not scale linearly (which >>> is understandable) and so doing monthly unique users analysis takes way >>> longer than 30 x time to perform daily unique users analysis and requires a >>> lot of memory. >>> >>> I was wondering if anyone had a similar challenge? >>> >>> Many thanks, >>> Radek >>> >> >> > > >