Re: [E] Theta Serialize/Deserialize and then update?

Karl Matthias Thu, 26 Aug 2021 10:39:48 -0700

Thanks for that. I figured out how to manage it in the Java lib. You need
to use a WritableMemory to wrap the byte array and then explicitly
instantiate an UpdateSketch with the WritableMemory. This is now working
and I'm doing some prototyping. Ideally I could use this from the C++
library as well, but I will work with the Java lib for now while
investigating.
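
For anyone following along, the difference between the two deserialization
modes described in the quoted reply below (heapify copies the bytes into a
heap object, wrap operates on the bytes in place) can be illustrated with a
minimal, stdlib-only sketch. The classes here (CounterImage, HeapCounter,
WrappedCounter) are hypothetical stand-ins built on java.nio.ByteBuffer; they
are not the DataSketches API, and an 8-byte counter is far simpler than a
real Theta sketch image.

```java
import java.nio.ByteBuffer;

// Demo driver. All classes in this block are hypothetical stand-ins,
// NOT the DataSketches API.
final class WrapVsHeapify {
    public static void main(String[] args) {
        byte[] image = CounterImage.serialize(41);

        // Heapify: the update lands in the heap copy, the bytes are untouched.
        HeapCounter onHeap = HeapCounter.heapify(image);
        onHeap.increment();
        System.out.println("heap=" + onHeap.get()
            + " bytes=" + ByteBuffer.wrap(image).getLong(0));

        // Wrap: the update is written straight into the caller's byte array.
        WrappedCounter wrapped = WrappedCounter.wrap(image);
        wrapped.increment();
        System.out.println("wrapped=" + wrapped.get()
            + " bytes=" + ByteBuffer.wrap(image).getLong(0));
    }
}

// A trivially simple "serialized summary": one 8-byte counter.
final class CounterImage {
    static byte[] serialize(long count) {
        return ByteBuffer.allocate(Long.BYTES).putLong(0, count).array();
    }
}

// "Heapify": read the value out of the bytes into an ordinary heap object.
final class HeapCounter {
    private long count;

    private HeapCounter(long count) { this.count = count; }

    static HeapCounter heapify(byte[] image) {
        return new HeapCounter(ByteBuffer.wrap(image).getLong(0));
    }

    void increment() { count++; }

    long get() { return count; }
}

// "Wrap": keep a view over the caller's bytes and update them in place.
final class WrappedCounter {
    private final ByteBuffer view;

    private WrappedCounter(ByteBuffer view) { this.view = view; }

    static WrappedCounter wrap(byte[] image) {
        return new WrappedCounter(ByteBuffer.wrap(image));
    }

    void increment() { view.putLong(0, view.getLong(0) + 1); }

    long get() { return view.getLong(0); }
}
```

In this toy model only the wrapped variant writes through to the byte array;
the heapified copy would have to be serialized again to persist its update.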


I will spend some time seeing if I can simplify a time series model to do
what I want.
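
As a rough illustration of the series model suggested earlier in the thread
(one finalized summary per time bucket, merged at report time), here is a
minimal sketch. It assumes exact HashSets as stand-ins for Theta sketches, so
the counts are exact rather than approximate; the class and method names
(BucketedDistinct, record, distinct) are made up for illustration and are not
part of any library.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration: per-minute summaries merged at report time.
// A HashSet stands in for a Theta sketch; this is NOT the DataSketches API.
final class BucketedDistinct {
    // Minute index -> summary of the ids seen during that minute.
    private final TreeMap<Long, Set<String>> buckets = new TreeMap<>();

    // Add an id to the summary for the given minute, creating it if needed.
    void record(long minute, String id) {
        buckets.computeIfAbsent(minute, m -> new HashSet<>()).add(id);
    }

    // Distinct ids over the inclusive range [fromMinute, toMinute],
    // computed by unioning the per-minute summaries in that range.
    int distinct(long fromMinute, long toMinute) {
        Set<String> union = new HashSet<>();
        for (Set<String> bucket
                : buckets.subMap(fromMinute, true, toMinute, true).values()) {
            union.addAll(bucket);
        }
        return union.size();
    }

    public static void main(String[] args) {
        BucketedDistinct report = new BucketedDistinct();
        report.record(0, "a");
        report.record(0, "b");
        report.record(1, "b"); // same id again, in a later minute
        report.record(2, "c");
        System.out.println(report.distinct(0, 1)); // minutes 0..1
        System.out.println(report.distinct(0, 2)); // minutes 0..2
    }
}
```

With real sketches each bucket would presumably be a serialized sketch and
the addAll loop a sketch union, but the shape of the report query is the
same. It also shows the cost I mentioned: an all-time report has to touch
every bucket unless the buckets are periodically rolled up.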

On Thu, Aug 26, 2021 at 12:07 AM Alexander Saydakov <
sayda...@verizonmedia.com> wrote:

> I believe that Java code still has the functionality to serialize and
> deserialize updatable Theta sketches. You point to a "wrap" operation,
> which is one of two ways to deserialize: heapify (instantiate an object on
> heap from a given chunk of bytes, involves copying data) and wrap (directly
> operate on a given chunk of bytes, often off-heap).
>
> Perhaps you could explain your use case a little more? What would the life
> cycle of your sketches be? When would you serialize them? When deserialize?
> How many do you anticipate keeping overall? How many would you like to
> update? What is the reason for serializing? And so on.
>
> On Wed, Aug 25, 2021 at 2:26 PM Karl Matthias <k...@community.com> wrote:
>
>> Thank you, I will dig around the old source and see if I can find it.
>> AFAICT it was already removed from the Java implementation as well [1]. You
>> can serialize an UpdateSketch, but once deserialized it is read-only.
>>
>> I do deeply understand time series data (I was on the team that designed
>> the second generation metrics pipeline at New Relic) but the problem I'm
>> trying to solve is not nicely modeled as a time series. Of course that is
>> possible, but doing it that way will require much more data and many more
>> calculations than I want at reporting time. The reported data will always
>> be for all time. So modeling as a time series will require an increasingly
>> large number of sketches, and possibly also a periodic roll-up/compaction
>> phase. None of that is necessary if I can simply update the same sketch
>> (really a set of them representing various dimensions) until I rebuild
>> it/them from the source events on a periodic basis. It is also
>> too much cardinality across too many dimensions to use the sketches simply
>> as a roll-up tool for distinct counting on the original data.
>>
>> I was hoping a private fork wasn't necessary to do it, but I can
>> understand that you folks intentionally chose not to support it. I will
>> have a go at it and see what I can make work.
>>
>> Thanks for the replies!
>>
>> [1]
>> https://github.com/apache/datasketches-java/blob/27ecce938555d731f29df97f12f4744a0efb663d/src/main/java/org/apache/datasketches/theta/Sketch.java#L139
>>
>> On Wed, Aug 25, 2021 at 9:46 PM Alexander Saydakov <
>> sayda...@verizonmedia.com> wrote:
>>
>>> It is possible, and we used to have serialization and deserialization of
>>> updatable Theta sketches. At some point we decided that it is more
>>> confusing than useful and might encourage anti-patterns in big systems
>>> (such as deserialize-update-serialize sequences on every update). So we
>>> removed this functionality from the C++ code, but not from Java (yet).
>>> Again, I would suggest treating serialization as finalizing a sketch. If
>>> you want to update it, create a fresh one for this new time frame or
>>> whatever classifier makes sense (batch, session, transaction). Hopefully
>>> this new sketch can be kept for updating for a while (until some
>>> close-of-books for a period of time or until the whole batch is processed
>>> or something). Finalized sketches can be easily merged as needed. Say, you
>>> create a new sketch every minute and serialize the previous one. Later your
>>> report can show the last 60-minute rolling window or a calendar day or
>>> something like that by aggregating the appropriate set of sketches for that
>>> report.
>>>
>>>
>>> On Wed, Aug 25, 2021 at 1:20 PM Karl Matthias <k...@community.com>
>>> wrote:
>>>
>>>> Thanks for the reply. Yes, I could do time series sketches, but what I
>>>> actually want is a summary representation of the current set, which I
>>>> update over time and eventually replace entirely. It's an evented system
>>>> and I want to use Theta sketches as a sort of summary. I can rebuild them
>>>> entirely at any time, but if maintained live they would be a fast
>>>> approximation that is combinable with other Theta sketches. Ideally I would
>>>> not have to keep them all in memory to do that and could serialize and
>>>> deserialize at will.
>>>>
>>>> It sounds like it's not currently implemented. But if I can manage the
>>>> code to do it, is it possible?
>>>>
>>>> On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <
>>>> sayda...@verizonmedia.com> wrote:
>>>>
>>>>> Is there a good reason to necessarily update the same sketch you
>>>>> decided to serialize?
>>>>> I would suggest considering that sketch finalized. Perhaps, in your
>>>>> system these sketches would represent different time periods or different
>>>>> categories or something like that. Later on you may want to merge (union)
>>>>> some of them to obtain an estimate for a longer time frame or a total
>>>>> across categories and so on.
>>>>>
>>>>> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <k...@community.com>
>>>>> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> I am working with both the Java library and the C++ library and the
>>>>>> Theta sketch.
>>>>>>
>>>>>> What I would like to do is update a sketch, save it somewhere (e.g. to
>>>>>> disk), then reload it later and possibly update it then. The
>>>>>> CompactSketch doesn't support updates: when an UpdateSketch is serialized
>>>>>> and loaded, it is read-only.
>>>>>>
>>>>>> From looking at the Java code it seems like it would be possible to
>>>>>> create an UpdateSketch from the contents of a CompactSketch but there
>>>>>> doesn't appear to be an existing method that does this. Am I missing
>>>>>> something that already does this? Or is it not possible?
>>>>>>
>>>>>> Many thanks
>>>>>> Karl
>>>>>>
>>>>>>
