Re: Consequences of sampling before analyzing data with DataSketches

Justin Thaler Wed, 18 Nov 2020 16:24:07 -0800

Lee's response is correct, but I'll elaborate slightly (hopefully this is
helpful instead of confusing).

There are some queries for which the following is true: if the sample is
drawn uniformly at random from the original (unsampled) data, then answers
that are accurate with respect to the sample are also accurate with respect
to the original (unsampled) data.

As one example, consider quantile queries:

If you have n original data points from an ordered domain and you sample at
least t ~= log(n)/epsilon^2 of the data points at random, it is known that,
with high probability over the sample, for each domain item i, the
fractional rank of i in the sample (i.e., the number of sampled points less
than or equal to i, divided by the sample size t) will match the fractional
rank of i in the original unsampled data (i.e., the number of data points
less than or equal to i, divided by n) up to additive error at most
epsilon.
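
As a rough illustration of the quantile claim above, here is a minimal
Python sketch. It does not use the library. The data distribution, the
seed, the helper names, and the choice of constant 1 with a natural log in
t = log(n)/epsilon^2 are all my own assumptions for the sake of the example.

```python
import bisect
import math
import random


def fractional_rank(sorted_values, x):
    """Fraction of values in sorted_values that are <= x."""
    return bisect.bisect_right(sorted_values, x) / len(sorted_values)


def max_rank_error(data, sample):
    """Largest gap between sample rank and true rank over all data points."""
    data_sorted = sorted(data)
    sample_sorted = sorted(sample)
    return max(
        abs(fractional_rank(sample_sorted, x) - fractional_rank(data_sorted, x))
        for x in data_sorted
    )


rng = random.Random(42)   # fixed seed so the run is repeatable
n = 100_000               # size of the original data (assumed)
epsilon = 0.05            # target additive rank error (assumed)

data = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Sample size from the rule of thumb above, with constant 1 and natural log.
t = math.ceil(math.log(n) / epsilon ** 2)
sample = rng.sample(data, t)   # uniform sample without replacement

err = max_rank_error(data, sample)
print(f"n={n}, t={t}, max fractional-rank error={err:.4f}, epsilon={epsilon}")
```

With these settings the sample is under 5% of the data, and the worst
fractional-rank error should come out well below epsilon.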

In fact, at a conceptual level, the KLL quantiles algorithm that's
implemented in the library is implicitly performing a type of downsampling
internally and then summarizing the sample (this is a little bit of a
simplification).

Something similar is true for frequent items. However, it is not true for
"non-additive" queries such as unique counts.
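
To see why unique counts behave differently, here is a small Python sketch
with two made-up data sets of the same size (the names, sizes, and seed are
assumptions for illustration only). Scaling the sample's distinct count by
n / sample_size happens to work when every value is distinct, but is off by
a factor of about 10 when values repeat heavily, so no single correction
works without knowing more about the data.

```python
import random

rng = random.Random(7)    # fixed seed so the run is repeatable
n = 100_000               # rows in each original data set (assumed)
sample_size = 10_000      # rows we are allowed to keep (assumed)
scale = n // sample_size  # naive scale-up factor for sample statistics

datasets = {
    # every row holds a different value
    "all_distinct": list(range(n)),
    # only 1,000 different values, each repeated 100 times
    "heavy_repeats": [i % 1_000 for i in range(n)],
}

results = {}
for name, data in datasets.items():
    sample = rng.sample(data, sample_size)   # uniform sample of rows
    true_distinct = len(set(data))
    sample_distinct = len(set(sample))
    results[name] = (true_distinct, sample_distinct)
    print(
        f"{name}: true distinct={true_distinct}, "
        f"distinct in sample={sample_distinct}, "
        f"naively scaled={sample_distinct * scale}"
    )
```

In the first data set the scaled-up count matches the truth; in the second
the unscaled sample count is already close to the truth and scaling it up
overshoots badly.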

All of that said, the library will not be able to say anything about what
errors the user should expect if the data is pre-sampled, because in such a
situation there are many factors that are out of the library's control.

On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:

> Sorry, if you presample your data all bets are off in terms of accuracy.
>
> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com>
> wrote:
>
>> Hi, I am new to DataSketches.
>>
>> I know DataSketches provides an *approximate* calculation of statistics
>> with *mathematically proven error bounds*.
>>
>> My question is:
>> Say that I am constrained to take a sample of the original data set
>> before handing it to DataSketches (for example, I cannot take more than
>> 10,000 random rows from a table).
>> What would be the consequence of this prior sampling for the
>> "mathematically proven error bounds" of the DataSketches statistics,
>> relative to the original data set?
>>
>> Best,
>>
>> Sergio
>>
>
