Re: [E] Theta Serialize/Deserialize and then update?

Karl Matthias Wed, 25 Aug 2021 14:26:25 -0700

Thank you, I will dig around the old source and see if I can find it.
AFAICT it was already removed from the Java implementation as well [1]. You
can serialize an UpdateSketch, but the deserialized sketch is read-only.


I do deeply understand time series data (I was on the team that designed
the second generation metrics pipeline at New Relic) but the problem I'm
trying to solve is not nicely modeled as a time series. Of course that is
possible, but doing it that way will require much more data and many more
calculations than I want at reporting time. The reported data will always
be for all time. So modeling as a time series will require an ever-growing
number of sketches, and possibly a periodic roll-up/compaction phase as
well. None of that is necessary if I can simply update the same sketch
(really a set of them representing various dimensions) until I rebuild
it/them from the source events on a periodic basis. It is also
too much cardinality across too many dimensions to use the sketches simply
as a roll-up tool for distinct counting on the original data.
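
To make the pattern concrete (update, persist, reload, keep updating), here
is a toy KMV-style sketch. It is Python for brevity and is not the
DataSketches API: `ToyThetaSketch`, `to_bytes`, `from_bytes`, and the JSON
layout are hypothetical names made up for this illustration, and the hashing
and eviction are simplified. The point is only that the retained hashes plus
theta are enough state to resume updating after a round trip.

```python
import hashlib
import json

MAX_HASH = 2**64  # hashes are 64-bit integers in [0, MAX_HASH)


class ToyThetaSketch:
    """Toy KMV-style distinct-count sketch (illustrative, not DataSketches)."""

    def __init__(self, k=1024):
        self.k = k              # nominal number of retained hashes
        self.theta = MAX_HASH   # hashes >= theta are rejected
        self.hashes = set()     # retained hashes, all < theta

    @staticmethod
    def _hash(item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, item):
        h = self._hash(item)
        if h >= self.theta or h in self.hashes:
            return
        self.hashes.add(h)
        if len(self.hashes) > self.k:
            # Evict the largest retained hash and lower theta to it.
            self.theta = max(self.hashes)
            self.hashes.remove(self.theta)

    def estimate(self):
        if self.theta == MAX_HASH:
            return float(len(self.hashes))  # exact mode, nothing evicted yet
        return len(self.hashes) * MAX_HASH / self.theta

    def to_bytes(self):
        state = {"k": self.k, "theta": self.theta,
                 "hashes": sorted(self.hashes)}
        return json.dumps(state).encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        # Restores an *updatable* sketch: k, theta and the hashes are all
        # the state that update() needs.
        state = json.loads(data.decode("utf-8"))
        sketch = cls(state["k"])
        sketch.theta = state["theta"]
        sketch.hashes = set(state["hashes"])
        return sketch


sketch = ToyThetaSketch(k=256)
for user_id in range(5000):
    sketch.update(user_id)

blob = sketch.to_bytes()                    # e.g. written to disk
restored = ToyThetaSketch.from_bytes(blob)  # later, possibly another process
for user_id in range(2500, 10000):          # overlaps with earlier events
    restored.update(user_id)

print(round(restored.estimate()))  # approximate count of distinct ids
```

In this toy version the restored sketch ends up in the same state as one
that was fed all the events without ever being serialized, because the
retained set depends only on which distinct items were seen, not on order.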

I was hoping a private fork wasn't necessary to do it, but I can understand
that you folks intentionally chose not to support it. I will have a go at
it and see what I can make work.

Thanks for the replies!

[1]
https://github.com/apache/datasketches-java/blob/27ecce938555d731f29df97f12f4744a0efb663d/src/main/java/org/apache/datasketches/theta/Sketch.java#L139

On Wed, Aug 25, 2021 at 9:46 PM Alexander Saydakov <
sayda...@verizonmedia.com> wrote:

> It is possible, and we used to have serialization and deserialization of
> updatable Theta sketches. At some point we decided that it is more
> confusing than useful and might encourage anti-patterns in big systems
> (such as deserialize-update-serialize sequences on every update). So we
> removed this functionality from the C++ code, but not from Java (yet).
> Again, I would suggest treating serialization as finalizing a sketch. If
> you want to update it, create a fresh one for this new time frame or
> whatever classifier makes sense (batch, session, transaction). Hopefully
> this new sketch can be kept for updating for a while (until some
> close-of-books for a period of time or until the whole batch is processed
> or something). Finalized sketches can be easily merged as needed. Say, you
> create a new sketch every minute and serialize the previous one. Later you
> can have your report to show the last 60-min rolling window or a calendar
> day or something like that by aggregating the appropriate set of sketches
> for that report.
>
>
> On Wed, Aug 25, 2021 at 1:20 PM Karl Matthias <k...@community.com> wrote:
>
>> Thanks for the reply. Yes I could do time series sketches, but what I
>> want actually is a summary representation of the current set, which I
>> update over time and eventually replace entirely. It's an evented system
>> and I want to use Theta sketches as a sort of summary. I can rebuild them
>> entirely at any time, but if maintained live they would be a fast
>> approximation that is combinable with other Theta sketches. Ideally I would
>> not have to keep them all in memory to do that and could serialize and
>> deserialize at will.
>>
>> It sounds like it's not currently implemented. But if I can manage the
>> code to do it, it is possible?
>>
>> On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <
>> sayda...@verizonmedia.com> wrote:
>>
>>> Is there a good reason to necessarily update the same sketch you decided
>>> to serialize?
>>> I would suggest considering that sketch finalized. Perhaps, in your
>>> system these sketches would represent different time periods or different
>>> categories or something like that. Later on you may want to merge (union)
>>> some of them to obtain an estimate for a longer time frame or a total
>>> across categories and so on.
>>>
>>> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <k...@community.com>
>>> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I am working with both the Java library and the C++ library and the
>>>> Theta sketch.
>>>>
>>>> What I would like to do is update a sketch, save it somewhere (e.g.
>>>> disk), then reload it later and possibly update it then. The
>>>> CompactSketch doesn't support updates: when an UpdateSketch is
>>>> serialized and loaded, it is read-only.
>>>>
>>>> From looking at the Java code it seems like it would be possible to
>>>> create an UpdateSketch from the contents of a CompactSketch but there
>>>> doesn't appear to be an existing method that does this. Am I missing
>>>> something that already does this? Or is it not possible?
>>>>
>>>> Many thanks
>>>> Karl
>>>>
>>>>
