Works for me now :)

On Thu, Nov 19, 2020 at 9:10 AM Will Lauer <wla...@verizonmedia.com> wrote:

> Lee, that link looks like it's working for me now. Must have been a temporary server error.
>
> Will
>
> Will Lauer
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
> On Thu, Nov 19, 2020 at 9:57 AM leerho <lee...@gmail.com> wrote:
>
>> Hi Justin, the site you referenced returns an error 500 (internal server error). It might be down or out of service. You might also check to make sure it is the correct URL.
>>
>> Thanks!
>> Lee.
>>
>> On Thu, Nov 19, 2020 at 6:05 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>
>>> I think the way to think about this is the following. If you downsample and then sketch, there are two sources of error: sampling error and sketching error. The former refers to how much the answer to your query over the sample deviates from the answer over the original data, while the latter refers to how much the estimate returned by the sketch deviates from the exact answer on the sample.
>>>
>>> If the sampling error is very large, then no matter how accurate your sketch is, your total error will be large, so you won't gain anything by throwing resources into minimizing the sketching error.
>>>
>>> If the sampling error is very small, then there's not really a need to drive the sketching error any lower than you would otherwise choose it to be.
>>>
>>> So as a practical matter, my personal recommendation would be to make sure your sample is big enough that the sampling error is very small, and then set the sketching error as you normally would, ignoring the subsampling.
>>>
>>> In case it's helpful, I should mention that there has been (at least) one academic paper devoted to precisely this question of what the best approach to sketching is for various query classes when the data must first be subsampled, if you'd like to check it out:
>>> https://core.ac.uk/download/pdf/212809966.pdf
>>>
>>> I should reiterate that there are certain types of queries that inherently don't play well with random sampling (i.e., it's basically impossible to give a meaningful bound on the sampling error, at least without making assumptions about the data, which is something the error guarantees provided by the library assiduously avoid).
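
[Aside from me, not part of the quoted thread: below is a rough, illustrative Java sketch of the "sample first, then configure the sketch as you normally would" flow Justin describes above. It assumes a recent datasketches-java release where KllFloatsSketch.newHeapInstance(k) is the factory method; the 10,000-row sample budget and the random placeholder data are just for illustration, not recommendations.]

    import java.util.Random;
    import org.apache.datasketches.kll.KllFloatsSketch;

    public class PresampleThenSketch {
      public static void main(String[] args) {
        final int sampleBudget = 10_000;   // pre-sampling cap imposed upstream (placeholder)
        final Random rnd = new Random(42);

        // Configure the sketch exactly as you would without pre-sampling.
        // k = 200 is the usual default and gives roughly 1.65% normalized rank error.
        final KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance(200);

        // Feed only the uniform random sample to the sketch.
        for (int i = 0; i < sampleBudget; i++) {
          final float sampledValue = rnd.nextFloat();  // stand-in for one sampled row
          sketch.update(sampledValue);
        }

        // The library can bound the sketching error (sketch estimate vs. the sample)...
        System.out.println("sketching rank error: " + sketch.getNormalizedRankError(false));
        // ...but the sampling error (sample vs. the full table) is outside its control.
        System.out.println("estimated median of the sample: " + sketch.getQuantile(0.5));
      }
    }

The only point of the example is that the sketch parameter (k) is chosen the same way whether or not the data was pre-sampled; the pre-sampling shows up as a separate, additional error term that the sketch cannot see.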
>>> On Thu, Nov 19, 2020 at 7:20 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>
>>>> Thanks a lot for your answers to my first question, Lee and Justin.
>>>>
>>>> Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"
>>>> Trying to alleviate this problem, I know I can tune the DataSketches computation by trading off memory vs. accuracy.
>>>> So is it correct that, in the scenario where I am constrained to pre-sample the data, I should aim for the best accuracy even if this requires more memory, with the objective of reducing the impact of my double-sampling problem (meaning the pre-sampling I am constrained to do beforehand plus the sampling performed by DataSketches itself)? Whereas in the scenarios where I am not constrained to pre-sample, I could still use the default DataSketches configuration, with its more balanced trade-off between accuracy and memory requirements?
>>>>
>>>> Would you say this is a good best-effort strategy? Or would you recommend that I use the same configuration in both cases?
>>>>
>>>> Thanks for your time and feedback,
>>>>
>>>> Sergio
>>>>
>>>> On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>>>
>>>>> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful rather than confusing).
>>>>>
>>>>> There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>>>>>
>>>>> As one example, consider quantile queries:
>>>>>
>>>>> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to an additive error of at most epsilon.
>>>>>
>>>>> In fact, at a conceptual level, the KLL quantiles algorithm implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a bit of a simplification).
>>>>>
>>>>> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>>>>>
>>>>> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>>>>>
>>>>> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>>>>>
>>>>>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I am new to DataSketches.
>>>>>>>
>>>>>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>>>>>
>>>>>>> My question is:
>>>>>>> Say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table).
>>>>>>> What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Sergio
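
[One more unquoted aside: a tiny numeric illustration of the t ~= log(n)/epsilon^2 rule of thumb from Justin's quantile discussion above. The rule hides constant factors and the base of the logarithm, so treat the output as an order-of-magnitude guide only; n, epsilon, and the class name are made up for the example.]

    public class RoughSampleSizeForQuantiles {
      // Rough sample size so that fractional ranks over the sample stay within
      // +/- epsilon of the fractional ranks over the full data set
      // (natural log used; constant factors ignored).
      static long roughSampleSize(final long n, final double epsilon) {
        return (long) Math.ceil(Math.log(n) / (epsilon * epsilon));
      }

      public static void main(String[] args) {
        final long n = 1_000_000_000L;   // size of the original table
        final double epsilon = 0.01;     // target additive rank error (1%)
        System.out.println(roughSampleSize(n, epsilon));  // ~207,000
        // A hard cap of 10,000 sampled rows is well below that, so for a 1% target
        // the sampling error, not the sketching error, would dominate.
      }
    }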