Works for me now :)

On Thu, Nov 19, 2020 at 9:10 AM Will Lauer <wla...@verizonmedia.com> wrote:

> Lee, that link looks like it's working for me now. Must have been a temporary server error.
>
> Will
>
> Will Lauer
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
> On Thu, Nov 19, 2020 at 9:57 AM leerho <lee...@gmail.com> wrote:
>
>> Hi Justin, the site you referenced returns an error 500 (internal server error). It might be down or out of service. You might also check to make sure it is the correct URL.
>>
>> Thanks!
>> Lee.
>>
>> On Thu, Nov 19, 2020 at 6:05 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>
>>> I think the way to think about this is the following. If you downsample and then sketch, there are two sources of error: sampling error and sketching error. The former refers to how much the answer to your query over the sample deviates from the answer over the original data, while the latter refers to how much the estimate returned by the sketch deviates from the exact answer on the sample.
>>>
>>> If the sampling error is very large, then no matter how accurate your sketch is, your total error will be large, so you won't gain anything by throwing resources into minimizing the sketching error.
>>>
>>> If the sampling error is very small, then there's not really a need to drive the sketching error any lower than you would otherwise choose it to be.
>>>
>>> So as a practical matter, my personal recommendation would be to make sure your sample is big enough that the sampling error is very small, and then set the sketching error as you normally would, ignoring the subsampling.
>>>
>>> In case it's helpful, I should mention that there has been (at least) one academic paper devoted to precisely this question of what the best approach to sketching is for various query classes when the data must first be subsampled, if you'd like to check it out:
>>> https://core.ac.uk/download/pdf/212809966.pdf
>>>
>>> I should reiterate that there are certain types of queries that inherently don't play well with random sampling (i.e., it's basically impossible to give a meaningful bound on the sampling error, at least without making assumptions about the data, which is something the error guarantees provided by the library assiduously avoid).
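
[Aside from me, not part of the quoted thread: below is a rough, illustrative Java sketch of the "sample first, then configure the sketch as you normally would" flow Justin describes above. It assumes a recent datasketches-java release where KllFloatsSketch.newHeapInstance(k) is the factory method; the 10,000-row sample budget and the random placeholder data are just for illustration, not recommendations.]

    import java.util.Random;
    import org.apache.datasketches.kll.KllFloatsSketch;

    public class PresampleThenSketch {
      public static void main(String[] args) {
        final int sampleBudget = 10_000;   // pre-sampling cap imposed upstream (placeholder)
        final Random rnd = new Random(42);

        // Configure the sketch exactly as you would without pre-sampling.
        // k = 200 is the usual default and gives roughly 1.65% normalized rank error.
        final KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance(200);

        // Feed only the uniform random sample to the sketch.
        for (int i = 0; i < sampleBudget; i++) {
          final float sampledValue = rnd.nextFloat();  // stand-in for one sampled row
          sketch.update(sampledValue);
        }

        // The library can bound the sketching error (sketch estimate vs. the sample)...
        System.out.println("sketching rank error: " + sketch.getNormalizedRankError(false));
        // ...but the sampling error (sample vs. the full table) is outside its control.
        System.out.println("estimated median of the sample: " + sketch.getQuantile(0.5));
      }
    }

The only point of the example is that the sketch parameter (k) is chosen the same way whether or not the data was pre-sampled; the pre-sampling shows up as a separate, additional error term that the sketch cannot see.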
>>> On Thu, Nov 19, 2020 at 7:20 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>
>>>> Thanks a lot for your answers to my first question, Lee and Justin.
>>>>
>>>> Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"
>>>> Trying to alleviate this problem, I know I can tune the DataSketches computation by trading off memory vs. accuracy.
>>>> So is it correct that, in the scenario where I am constrained to pre-sample the data, I should aim for the best accuracy even if this requires more memory, with the objective of reducing the impact of my double-sampling problem (meaning the pre-sampling I am constrained to do beforehand plus the sampling performed by DataSketches itself)? Whereas in the scenarios where I am not constrained to pre-sample, I could still use the default DataSketches configuration, with its more balanced trade-off between accuracy and memory requirements?
>>>>
>>>> Would you say this is a good best-effort strategy? Or would you recommend that I use the same configuration in both cases?
>>>>
>>>> Thanks for your time and feedback,
>>>>
>>>> Sergio
>>>>
>>>> On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>>>
>>>>> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful rather than confusing).
>>>>>
>>>>> There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>>>>>
>>>>> As one example, consider quantile queries:
>>>>>
>>>>> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to an additive error of at most epsilon.
>>>>>
>>>>> In fact, at a conceptual level, the KLL quantiles algorithm implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a bit of a simplification).
>>>>>
>>>>> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>>>>>
>>>>> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>>>>>
>>>>> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>>>>>
>>>>>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I am new to DataSketches.
>>>>>>>
>>>>>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>>>>>
>>>>>>> My question is:
>>>>>>> Say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table).
>>>>>>> What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Sergio
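
[One more unquoted aside: a tiny numeric illustration of the t ~= log(n)/epsilon^2 rule of thumb from Justin's quantile discussion above. The rule hides constant factors and the base of the logarithm, so treat the output as an order-of-magnitude guide only; n, epsilon, and the class name are made up for the example.]

    public class RoughSampleSizeForQuantiles {
      // Rough sample size so that fractional ranks over the sample stay within
      // +/- epsilon of the fractional ranks over the full data set
      // (natural log used; constant factors ignored).
      static long roughSampleSize(final long n, final double epsilon) {
        return (long) Math.ceil(Math.log(n) / (epsilon * epsilon));
      }

      public static void main(String[] args) {
        final long n = 1_000_000_000L;   // size of the original table
        final double epsilon = 0.01;     // target additive rank error (1%)
        System.out.println(roughSampleSize(n, epsilon));  // ~207,000
        // A hard cap of 10,000 sampled rows is well below that, so for a 1% target
        // the sampling error, not the sketching error, would dominate.
      }
    }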