Just for completeness: the change itself is only a handful of lines of code.
The rest is added tests, and otherwise we'd lose the sstable format change
opportunity window.
Thanks again for the replies.
On 26/6/23 9:33, Benedict wrote:
I would prefer we not plan on two distinct changes to this,
particularly when neither change is particularly more complex than the
other. There is a modest cost to maintenance from changing this
multiple times.
But if others feel strongly otherwise I won’t stand in the way.
On 26 Jun 2023, at 05:49, Berenguer Blasi <berenguerbl...@gmail.com>
wrote:
Thanks for the replies.
I intend to javadoc the sstable format in detail someday, and more
improvements might come up then, along with the vint encoding mentioned
here. But unless somebody volunteers to do that in 5.0, is anybody
against my trying to get the original proposal (1-byte flags for sentinel
values) in?
Regards
Distant future people will not be happy about this, I can already
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a
background thread.
LOL
On 23/6/23 15:44, Josh McKenzie wrote:
If we’re doing this, why don’t we delta encode a vint from some
per-sstable minimum value? I’d expect that to commonly compress to
a single byte or so.
+1 to this approach.
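
For illustration, a minimal sketch of the delta-encoding idea quoted above: store each markedForDeleteAt as an unsigned varint offset from a per-sstable minimum. The class and method names are hypothetical, and the varint here is a generic LEB128-style layout chosen for brevity, not Cassandra's own vint encoding. How sentinel values such as LIVE would be represented is left out.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; names are illustrative assumptions, not Cassandra code.
final class DeltaVIntSketch {

    /** Writes value as an unsigned LEB128-style varint; returns bytes written. */
    static int writeUnsignedVarint(long value, ByteBuffer out) {
        int written = 1;
        while ((value & ~0x7FL) != 0) {
            // Low 7 bits plus a continuation bit.
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
            written++;
        }
        out.put((byte) value); // final byte, continuation bit clear
        return written;
    }

    /** Reads a varint written by writeUnsignedVarint. */
    static long readUnsignedVarint(ByteBuffer in) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = in.get();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
            shift += 7;
        }
    }

    /** Encodes a timestamp as its offset from the (assumed) per-sstable minimum. */
    static int writeDelta(long markedForDeleteAt, long sstableMin, ByteBuffer out) {
        if (markedForDeleteAt < sstableMin)
            throw new IllegalArgumentException("timestamp below sstable minimum");
        return writeUnsignedVarint(markedForDeleteAt - sstableMin, out);
    }

    /** Decodes a timestamp written by writeDelta with the same minimum. */
    static long readDelta(long sstableMin, ByteBuffer in) {
        return sstableMin + readUnsignedVarint(in);
    }
}
```

In this layout the encoded size depends only on how far a timestamp sits from the minimum, so the benefit would hinge on how tightly the timestamps in one sstable cluster together.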
Distant future people will not be happy about this, I can already
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a
background thread.
On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
It's a possibility, though I haven't coded and benchmarked such an
approach, and I don't think I would have the time before the freeze to
take advantage of the sstable format change opportunity.
Still, it's something that can be explored later. If we can shave off a few
extra % then that would always be great imo.
On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some
> per-sstable minimum value? I’d expect that to commonly compress to
> a single byte or so.
>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <alek...@apple.com> wrote:
>>
>> Distant future people will not be happy about this, I can already
>> tell you now.
>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long holding microseconds since the
>>> Unix epoch. But I noticed that with 7 bytes we can already encode ~2284
>>> years. We can either shed the 8th byte, for reduced IO and disk usage, or
>>> encode some sentinel values (such as LIVE) as flags there. For those
>>> values that would mean reading and writing 1 byte instead of 12 (8 for
>>> the markedForDeleteAt long + 4 for the localDeletionTime int). Yes, we
>>> already avoid serializing DeletionTime (DT) in sstables at _row_ level
>>> entirely, but not at _partition_ level, and it is also serialized in the
>>> index, metadata, etc.
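
For illustration, a minimal sketch of the flag-byte idea described in the paragraph above. It is not the code in the linked POC: the class name, the sentinel values standing in for LIVE, and the choice of the top bit as the flag are all assumptions made for this example. It relies on the fact that a non-negative timestamp below 2^56 leaves the most significant byte of the long at zero.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; names and sentinel values are illustrative assumptions.
final class DeletionTimeFlagSketch {
    // Assumed sentinel pair standing in for a LIVE deletion time.
    static final long LIVE_MARKED_FOR_DELETE_AT = Long.MIN_VALUE;
    static final int LIVE_LOCAL_DELETION_TIME = Integer.MAX_VALUE;

    // Top bit of the first serialized byte; unused when the timestamp fits in 7 bytes.
    static final int LIVE_FLAG = 0x80;

    /** Writes 1 byte for LIVE, otherwise 8 + 4 bytes. Assumes a big-endian buffer (the ByteBuffer default). */
    static int serialize(long markedForDeleteAt, int localDeletionTime, ByteBuffer out) {
        if (markedForDeleteAt == LIVE_MARKED_FOR_DELETE_AT
            && localDeletionTime == LIVE_LOCAL_DELETION_TIME) {
            out.put((byte) LIVE_FLAG);
            return 1;
        }
        // The most significant byte must be zero so the flag bit stays clear.
        if ((markedForDeleteAt >>> 56) != 0)
            throw new IllegalArgumentException("timestamp does not fit in 7 bytes: " + markedForDeleteAt);
        out.putLong(markedForDeleteAt); // big-endian: most significant byte first
        out.putInt(localDeletionTime);
        return 12;
    }

    /** Returns {markedForDeleteAt, localDeletionTime}. */
    static long[] deserialize(ByteBuffer in) {
        int first = in.get(in.position()) & 0xFF; // peek without consuming
        if ((first & LIVE_FLAG) != 0) {
            in.get(); // consume the flag byte
            return new long[] { LIVE_MARKED_FOR_DELETE_AT, LIVE_LOCAL_DELETION_TIME };
        }
        long markedForDeleteAt = in.getLong();
        int localDeletionTime = in.getInt();
        return new long[] { markedForDeleteAt, localDeletionTime };
    }

    /** Years of microsecond timestamps representable in 56 bits (365.25-day years). */
    static double yearsIn7Bytes() {
        return (double) (1L << 56) / 1_000_000d / 86_400d / 365.25d;
    }
}
```

The reader inspects only the first byte to decide between the 1-byte and 12-byte forms, so the non-LIVE layout is unchanged apart from the constraint on the top byte.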
>>>
>>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk
>>> and some JMH benchmarks (1) to evaluate the impact of the new algorithm (2).
>>> It's tested here against 70% and 30% LIVE DTs to see how we perform:
>>>
>>> [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         NC              avgt   15  0.331 ± 0.001  ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         OA              avgt   15  0.335 ± 0.004  ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         NC              avgt   15  0.334 ± 0.002  ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         OA              avgt   15  0.340 ± 0.008  ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         NC              avgt   15  0.337 ± 0.006  ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         OA              avgt   15  0.340 ± 0.004  ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         NC              avgt   15  0.339 ± 0.004  ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         OA              avgt   15  0.343 ± 0.016  ns/op
>>>
>>> That was ByteBuffer-backed, to test the impact of the extra bit-level
>>> operations. But what would be the impact of an end-to-end test against
>>> disk?
>>>
>>> [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         NC              avgt   15   605236.515 ±  19929.058  ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         OA              avgt   15   586477.039 ±   7384.632  ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         NC              avgt   15   937580.311 ±  30669.647  ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         OA              avgt   15   914097.770 ±   9865.070  ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         NC              avgt   15  1314417.207 ±  37879.012  ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         OA              avgt   15   805256.345 ±  15471.587  ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         NC              avgt   15  1583239.011 ±  50104.245  ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         OA              avgt   15  1439605.006 ±  64342.510  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         NC              avgt   15   295711.217 ±   5432.507  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         OA              avgt   15   305282.827 ±   1906.841  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         NC              avgt   15   446029.899 ±   4038.938  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         OA              avgt   15   479085.875 ±  10032.804  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         NC              avgt   15  1789434.838 ± 206455.771  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         OA              avgt   15   589752.861 ±  31676.265  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         NC              avgt   15  1754862.122 ± 164903.051  ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         OA              avgt   15  1252162.253 ± 121626.818  ns/op
>>>
>>> We can see big improvements when backed by disk and little impact
>>> from the new algorithm.
>>>
>>> Given we're already introducing a new sstable format (OA) in
>>> 5.0 I would like to try to get this in before the freeze. The point
>>> being that sstables with lots of small partitions would benefit
>>> from a smaller DT at partition level. My tests show a 3%-4% size
>>> reduction on disk.
>>>
>>> Before proceeding, though, I'd like to bounce the idea off the
>>> community to catch any corner cases and scenarios I might have
>>> missed where this could be a problem.
>>>
>>> Thx in advance!
>>>
>>>
>>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>>
>>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>>>