Hi Lee, I guess you mean the link to the paper on sketching subsampled data? That's strange, it's working for me. Anyway, here is more information if anyone wants to access it.
The paper is entitled "Space-Efficient Estimation of Statistics over Sub-Sampled Streams" by McGregor, Pavan, Tirthapura, and Woodruff. The conference version was published in PODS 2012 and the journal version in Algorithmica in 2015 or 2016. Here are some other online versions of the paper:

https://link.springer.com/article/10.1007/s00453-015-9974-0 (journal version, not sure if it's paywalled)
http://www.cs.cmu.edu/afs/cs/user/dwoodruf/www/mptw12.pdf
https://dl.acm.org/doi/10.1145/2213556.2213594 (conference version, not sure if it's paywalled)

On Thu, Nov 19, 2020 at 10:58 AM leerho <lee...@gmail.com> wrote:

> Hi Justin, the site you referenced returns an error 500 (internal server error). It might be down, or out of service. You might also check to make sure it is the correct URL.
>
> Thanks!
> Lee.
>
> On Thu, Nov 19, 2020 at 6:05 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>
>> I think the way to think about this is the following. If you downsample and then sketch, there are two sources of error: sampling error and sketching error. The former refers to how much the answer to your query over the sample deviates from the answer over the original data, while the latter refers to how much the estimate returned by the sketch deviates from the exact answer on the sample.
>>
>> If the sampling error is very large, then no matter how accurate your sketch is, your total error will be large, so you won't gain anything by throwing resources into minimizing the sketching error.
>>
>> If the sampling error is very small, then there's not really a need to drive the sketching error any lower than you would otherwise choose it to be.
>>
>> So as a practical matter, my personal recommendation would be to make sure your sample is big enough that the sampling error is very small, and then set the sketching error as you normally would, ignoring the subsampling.
>>
>> In case it's helpful, I should mention that there's been (at least) one academic paper devoted to precisely the question of what the best approach to sketching is for various query classes if the data must first be subsampled, in case you'd like to check it out: https://core.ac.uk/download/pdf/212809966.pdf
>>
>> I should reiterate that there are certain types of queries that inherently don't play well with random sampling (i.e., it's basically impossible to give a meaningful bound on the sampling error, at least without making assumptions about the data, which is something the error guarantees provided by the library assiduously avoid).
>>
>> On Thu, Nov 19, 2020 at 7:20 AM Sergio Castro <sergio...@gmail.com> wrote:
>>
>>> Thanks a lot for your answers to my first question, Lee and Justin.
>>>
>>> Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"
>>> Trying to alleviate this problem: I know I can tune the DataSketches computation by trading off memory vs. accuracy.
>>> So is it correct that, in the scenario where I am constrained to pre-sample the data, I should optimize for accuracy even if this requires more memory, with the objective of alleviating the impact of my double-sampling problem (meaning the pre-sampling I am constrained to do beforehand plus the sampling performed by DataSketches itself)?
>>> Whereas in the scenarios where I am not constrained to pre-sample, could I stick with the default DataSketches configuration and its more balanced trade-off between accuracy and memory requirements?
>>>
>>> Would you say this is a good best-effort strategy? Or would you recommend the same configuration in both cases?
>>>
>>> Thanks for your time and feedback,
>>>
>>> Sergio
>>>
>>> On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>>
>>>> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful instead of confusing).
>>>>
>>>> There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>>>>
>>>> As one example, consider quantile queries:
>>>>
>>>> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to additive error at most epsilon.
>>>>
>>>> In fact, at a conceptual level, the KLL quantiles algorithm that's implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a bit of a simplification).
>>>>
>>>> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>>>>
>>>> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>>>>
>>>> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>>>>
>>>>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>>>>
>>>>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I am new to DataSketches.
>>>>>>
>>>>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>>>>
>>>>>> My question is:
>>>>>> Say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table).
>>>>>> What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Sergio
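
P.S. To make the error-budget reasoning in the thread above concrete, here is a rough, self-contained Java illustration of the "pre-sample, then sketch" setup: it draws t ~= log(n)/epsilon^2 points uniformly from synthetic data (the quantile sample-size rule of thumb discussed above), feeds only the sample to a KLL quantiles sketch, and compares the sketch's rank estimates against exact ranks computed on the full data. Treat it as a sketch of the idea, not a recommended recipe; the KllFloatsSketch calls follow the org.apache.datasketches Java API (construction and method names differ slightly between library versions, e.g., older releases used a public constructor rather than newHeapInstance), and the seed, data distribution, and choice of k = 400 are arbitrary.

import java.util.Random;

import org.apache.datasketches.kll.KllFloatsSketch;

public class PresampleThenSketch {

  public static void main(String[] args) {
    final int n = 1_000_000;   // size of the original (unsampled) data set
    final double eps = 0.01;   // target sampling error on fractional ranks

    // Sample-size rule of thumb from the thread: t ~= log(n) / eps^2.
    final int t = (int) Math.ceil(Math.log(n) / (eps * eps));

    // Synthetic "original" data; any distribution works for the illustration.
    final Random rnd = new Random(42);
    final float[] data = new float[n];
    for (int i = 0; i < n; i++) {
      data[i] = rnd.nextFloat();
    }

    // Pre-sample t points uniformly at random (with replacement, for simplicity)
    // and feed only the sample to the sketch.
    final KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance(400);
    for (int i = 0; i < t; i++) {
      sketch.update(data[rnd.nextInt(n)]);
    }

    // The sketch's a-priori rank error should sit comfortably below the
    // sampling-error budget eps; the total error is roughly their sum.
    System.out.printf("t = %d, sampling-error budget ~ %.4f, sketch rank error ~ %.4f%n",
        t, eps, sketch.getNormalizedRankError(false));

    // Compare estimated ranks (computed from the sample) with exact fractional
    // ranks on the full, unsampled data for a few query points.
    for (final float q : new float[] {0.1f, 0.5f, 0.9f}) {
      long le = 0;
      for (final float v : data) {
        if (v <= q) { le++; }
      }
      final double exactRank = (double) le / n;
      System.out.printf("item %.2f: exact rank %.4f, estimated rank %.4f%n",
          q, exactRank, sketch.getRank(q));
    }
  }
}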
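
And for Sergio's tuning question (whether to push the sketch's accuracy to the maximum when pre-sampling is unavoidable), a second tiny snippet under the same API assumptions: it just compares the a-priori KLL rank error for a few values of k against a fixed, illustrative sampling-error budget. Once the sketch error is well below the sampling term, a larger k (i.e., more memory) barely moves the rough total, which is the point of the "set the sketching error as you normally would" advice above.

import org.apache.datasketches.kll.KllFloatsSketch;

public class KllErrorVsK {

  public static void main(String[] args) {
    // Sampling-error budget implied by the size of the pre-sample (illustrative value).
    final double samplingErr = 0.01;

    for (final int k : new int[] {100, 200, 400, 800, 1600}) {
      // The normalized rank error depends only on k, so an empty sketch suffices here.
      final double sketchErr = KllFloatsSketch.newHeapInstance(k).getNormalizedRankError(false);
      System.out.printf("k = %4d  sketch rank error ~ %.4f  rough total ~ %.4f%n",
          k, sketchErr, samplingErr + sketchErr);
    }
  }
}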