Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful instead of confusing).
There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data. As one example, consider quantile queries: If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to additive error at most epsilon. In fact, at a conceptual level, the KLL quantiles algorithm that's implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a little bit of a simplification). Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts. All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control. On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote: > Sorry, if you presample your data all bets are off in terms of accuracy. > > On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> > wrote: > >> Hi, I am new to DataSketches. >> >> I know Datasketches provides an *approximate* calculation of statistics >> with *mathematically proven error bounds*. >> >> My question is: >> Say that I am constrained to take a sampling of the original data set >> before handling it to Datasketches (for example, I cannot take more than >> 10.000 random rows from a table). >> What would be the consequence of this previous sampling in the >> "mathematically proven error bounds" of the Datasketches statistics, >> relative to the original data set? >> >> Best, >> >> Sergio >> >