Hi, I am working on large-scale unique-user analysis (think hundreds of millions of records per day). Since the number of records per month runs into many billions, I am hoping there is an alternative to running "SELECT DISTINCT user_unique_id..." over the whole month, such as sampling the data, or deriving monthly numbers from the daily unique-user counts using linear regression or a similar statistical technique.
I tried sampling, but it does not give the results I expected. For example, if I sample 1 in every 100 records, I would expect to see about 1% of the total number of unique users for a given period, but instead I see larger numbers, something like 5%. I am not sure whether the law of large numbers applies to unique-user analysis. Moreover, "SELECT DISTINCT" does not seem to scale linearly (which is understandable), so a monthly unique-user analysis takes far longer than 30 times the daily analysis and requires a lot of memory. Has anyone faced a similar challenge?

Many thanks,
Radek
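
For illustration, the sampling effect described above can be reproduced on synthetic data. This is only a minimal sketch under stated assumptions, not a description of the real dataset: it assumes every user has the same number of records, it uses independent random sampling of records rather than taking every 100th record, and the function names and parameters are hypothetical. Under those assumptions a user appears in the sample if at least one of their records is picked, so the share of users seen is 1 - (1 - rate)^k for k records per user, which is well above the sampling rate whenever k > 1.

```python
import random


def expected_unique_fraction(records_per_user, sample_rate):
    """Probability that at least one of a user's records lands in the sample."""
    return 1.0 - (1.0 - sample_rate) ** records_per_user


def sampled_unique_fraction(num_users, records_per_user, sample_rate, seed=0):
    """Simulate record-level sampling and return the share of users seen.

    Assumption for illustration: every user has exactly `records_per_user`
    records, and each record is kept independently with probability
    `sample_rate`.
    """
    rng = random.Random(seed)
    seen = set()
    for user_id in range(num_users):
        for _ in range(records_per_user):
            if rng.random() < sample_rate:
                # One sampled record is enough for the user to be counted.
                seen.add(user_id)
                break
    return len(seen) / num_users


if __name__ == "__main__":
    # Hypothetical numbers: 100,000 users, 5 records each, 1% record sample.
    simulated = sampled_unique_fraction(100_000, 5, 0.01)
    expected = expected_unique_fraction(5, 0.01)
    print(f"simulated share of users seen: {simulated:.4f}")
    print(f"expected share of users seen:  {expected:.4f}")
```

With five records per user and a 1% record sample, the formula gives roughly 4.9% of users, which is in the same range as the 5% figure mentioned above; with one record per user it falls back to 1%.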