Re: [E] Theta Serialize/Deserialize and then update?

Alexander Saydakov Wed, 25 Aug 2021 16:07:43 -0700

I believe the Java code can still serialize and deserialize updatable
Theta sketches. You point to a "wrap" operation, which is one of two ways
to deserialize: heapify (instantiate an object on the heap from a given
chunk of bytes, which involves copying data) and wrap (operate directly on
a given chunk of bytes, often off-heap).
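
To make the heapify/wrap distinction concrete, here is a minimal Python
sketch that does not use the DataSketches API at all. The buffer layout and
the names `new_buffer`, `HeapStore`, and `WrappedStore` are hypothetical,
invented for illustration; the only point is that one object copies the
bytes and the other edits the caller's bytes in place.

```python
import struct

HEADER = struct.Struct("<q")   # 8-byte count of stored values
SLOT = struct.Struct("<q")     # each stored value is one 8-byte integer


def new_buffer(capacity):
    """Allocate a zeroed buffer with room for `capacity` values."""
    return bytearray(HEADER.size + capacity * SLOT.size)


class HeapStore:
    """'Heapify': copy the values out of the bytes into ordinary objects."""

    def __init__(self, buf):
        (count,) = HEADER.unpack_from(buf, 0)
        self.values = [SLOT.unpack_from(buf, HEADER.size + i * SLOT.size)[0]
                       for i in range(count)]

    def update(self, value):
        self.values.append(value)   # touches only the copy

    def count(self):
        return len(self.values)


class WrappedStore:
    """'Wrap': keep a reference to the caller's bytes and edit them in place."""

    def __init__(self, buf):
        self.buf = buf              # no copy is made

    def count(self):
        return HEADER.unpack_from(self.buf, 0)[0]

    def update(self, value):
        n = self.count()
        SLOT.pack_into(self.buf, HEADER.size + n * SLOT.size, value)
        HEADER.pack_into(self.buf, 0, n + 1)


buf = new_buffer(capacity=4)
wrapped = WrappedStore(buf)
wrapped.update(11)
wrapped.update(22)      # both updates land directly in buf

heap = HeapStore(buf)   # snapshot copy of the two stored values
heap.update(33)         # does not change buf

print(heap.count(), wrapped.count())   # prints "3 2"
```

The wrapped object never needs a separate serialization step because the
bytes are its state, while the heapified copy has to be written back out
before anyone else can see its updates.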


Perhaps you could explain your use case a little more? What would the life
cycle of your sketches be? When would you serialize them? When deserialize?
How many do you expect to keep overall? How many would you like to
update? What is the reason for serializing? And so on.

On Wed, Aug 25, 2021 at 2:26 PM Karl Matthias <k...@community.com> wrote:

> Thank you, I will dig around the old source and see if I can find it.
> AFAICT it was already removed from the Java implementation as well [1]. You
> can serialize an UpdateSketch, but once deserialized it is read-only.
>
> I do deeply understand time series data (I was on the team that designed
> the second generation metrics pipeline at New Relic) but the problem I'm
> trying to solve is not nicely modeled as a time series. Of course that is
> possible, but doing it that way will require much more data and many more
> calculations than I want at reporting time. The reported data will always
> be for all time. So modeling as a time series will require an increasingly
> large number of sketches, and possibly thus also a periodic
> roll-up/compaction phase. None of which is necessary if I can simply update
> the same sketch—really a set of them representing various dimensions—until
> I rebuild it/them from the source events on a periodic basis. It is also
> too much cardinality across too many dimensions to use the sketches simply
> as a roll-up tool for distinct counting on the original data.
>
> I was hoping a private fork wasn't necessary to do it, but I can
> understand that you folks intentionally chose not to support it. I will
> have a go at it and see what I can make work.
>
> Thanks for the replies!
>
> [1]
> https://github.com/apache/datasketches-java/blob/27ecce938555d731f29df97f12f4744a0efb663d/src/main/java/org/apache/datasketches/theta/Sketch.java#L139
>
> On Wed, Aug 25, 2021 at 9:46 PM Alexander Saydakov <
> sayda...@verizonmedia.com> wrote:
>
>> It is possible, and we used to have serialization and deserialization of
>> updatable Theta sketches. At some point we decided that it is more
>> confusing than useful and might encourage anti-patterns in big systems
>> (such as deserialize-update-serialize sequences on every update). So we
>> removed this functionality from the C++ code, but not from Java (yet).
>> Again, I would suggest treating serialization as finalizing a sketch. If
>> you want to update it, create a fresh one for this new time frame or
>> whatever classifier makes sense (batch, session, transaction). Hopefully
>> this new sketch can be kept for updating for a while (until some
>> close-of-books for a period of time or until the whole batch is processed
>> or something). Finalized sketches can be easily merged as needed. Say, you
>> create a new sketch every minute and serialize the previous one. Later you
>> can have your report to show the last 60-min rolling window or a calendar
>> day or something like that by aggregating the appropriate set of sketches
>> for that report.
>>
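The per-minute pattern described in the quoted message above can be
illustrated with a toy K-minimum-values sketch in plain Python. This is a
stand-in written for this example, not the DataSketches Theta
implementation or its serialized format; `K`, `hash64`, `build`,
`serialize`, `deserialize`, `union`, and `estimate` are all hypothetical
names, and SHA-256 is used only as a convenient stdlib hash.

```python
import hashlib
import struct

K = 256                 # number of smallest hashes each toy sketch keeps
MAX_HASH = 2 ** 64


def hash64(item):
    """Map an item to a uniform 64-bit integer."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build(items):
    """Build a finalized toy sketch: the K smallest distinct hashes, sorted."""
    return sorted({hash64(x) for x in items})[:K]


def serialize(sketch):
    """Pack the retained hashes as little-endian unsigned 64-bit integers."""
    return struct.pack(f"<{len(sketch)}Q", *sketch)


def deserialize(data):
    """Inverse of serialize: unpack the bytes back into a sorted hash list."""
    return list(struct.unpack(f"<{len(data) // 8}Q", data))


def union(sketches):
    """Merge finalized sketches by keeping the K smallest hashes overall."""
    merged = set()
    for s in sketches:
        merged.update(s)
    return sorted(merged)[:K]


def estimate(sketch):
    """Exact below K retained hashes, otherwise the KMV estimate (K-1)/theta."""
    if len(sketch) < K:
        return float(len(sketch))
    theta = sketch[K - 1] / MAX_HASH
    return (K - 1) / theta


# One finalized sketch per minute; users overlap between neighbouring minutes.
stored = [serialize(build(range(m * 500, m * 500 + 1000))) for m in range(60)]

# Report time: load the last 60 minutes and merge them into one window.
window = union(deserialize(b) for b in stored)
print(round(estimate(window)))   # approximate; the true distinct count is 30500
```

No stored sketch is ever updated after it is written. The report only reads
and merges, which is why the serialized form can be treated as final.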
>>
>> On Wed, Aug 25, 2021 at 1:20 PM Karl Matthias <k...@community.com> wrote:
>>
>>> Thanks for the reply. Yes I could do time series sketches, but what I
>>> want actually is a summary representation of the current set, which I
>>> update over time and eventually replace entirely. It's an evented system
>>> and I want to use Theta sketches as a sort of summary. I can rebuild them
>>> entirely at any time, but if maintained live they would be a fast
>>> approximation that is combinable with other Theta sketches. Ideally I would
>>> not have to keep them all in memory to do that and could serialize and
>>> deserialize at will.
>>>
>>> It sounds like it's not currently implemented. But if I can manage the
>>> code to do it, is it possible?
>>>
>>> On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <
>>> sayda...@verizonmedia.com> wrote:
>>>
>>>> Is there a good reason to necessarily update the same sketch you
>>>> decided to serialize?
>>>> I would suggest considering that sketch finalized. Perhaps, in your
>>>> system these sketches would represent different time periods or different
>>>> categories or something like that. Later on you may want to merge (union)
>>>> some of them to obtain an estimate for a longer time frame or a total
>>>> across categories and so on.
>>>>
>>>> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <k...@community.com>
>>>> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I am working with both the Java library and the C++ library and the
>>>>> Theta sketch.
>>>>>
>>>>> What I would like to do is update a sketch, save it somewhere (e.g.
>>>>> disk), then reload it later and possibly update it then. When an
>>>>> UpdateSketch is serialized and loaded, it is read-only; the
>>>>> CompactSketch doesn't support updates.
>>>>>
>>>>> From looking at the Java code it seems like it would be possible to
>>>>> create an UpdateSketch from the contents of a CompactSketch but there
>>>>> doesn't appear to be an existing method that does this. Am I missing
>>>>> something that already does this? Or is it not possible?
>>>>>
>>>>> Many thanks
>>>>> Karl
>>>>>
>>>>>
