I’m confused why we see *any* increase in sstable size - TTLs and deletion times are already written as unsigned vints, as offsets from an sstable epoch, for each value.

I would dig in more carefully to explore why you’re seeing this increase; for the same data there should be no change to size on disk.
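(As a rough illustration of what that means for on-disk size, here is a minimal, self-contained sketch of a delta-plus-unsigned-vint encoding; the class name and the LEB128-style vint routine are invented for the example and are not Cassandra's actual serialization code.)

    import java.io.ByteArrayOutputStream;

    // Sketch: per value, only the (small, non-negative) offset from a per-sstable
    // epoch is written, so widening the in-memory representation does not by itself
    // change the number of bytes written.
    public class DeletionTimeEncodingSketch
    {
        // LEB128-style unsigned vint: 7 data bits per byte, high bit = "more bytes follow".
        static void writeUnsignedVInt(long value, ByteArrayOutputStream out)
        {
            while ((value & ~0x7FL) != 0)
            {
                out.write((int) ((value & 0x7F) | 0x80));
                value >>>= 7;
            }
            out.write((int) value);
        }

        public static void main(String[] args)
        {
            long sstableEpoch = 1_668_000_000L;   // e.g. minimum deletion time in this sstable
            long[] localDeletionTimes = { 1_668_000_050L, 1_668_003_600L, 1_668_086_400L };

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (long ldt : localDeletionTimes)
                writeUnsignedVInt(ldt - sstableEpoch, out);   // deltas of 50s, 1h, 1d

            System.out.println("bytes for 3 deletion times: " + out.size());   // 6, not 24
        }
    }

Whatever the width of the in-memory type, the bytes written per value are driven by the size of the delta, not by the Java type, which is why no change to size on disk is expected for the same data.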
> On 14 Nov 2022, at 06:36, C. Scott Andreas <sc...@paradoxica.net> wrote:
> 
> A 2-3% increase in storage volume is roughly equivalent to giving up the gain from LZ4 -> LZ4HC, or a one- to two-level bump in Zstandard compression levels. This regression could be very expensive for storage-bound use cases.
> 
> From the perspective of storage overhead, the unsigned int approach sounds preferable.
> 
>> On Nov 13, 2022, at 10:13 PM, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>> 
>> Hi all,
>> 
>> We have done some more research on c14227. The current patch for CASSANDRA-14227 solves the TTL limit issue by switching TTL to long instead of int. This approach does not have a negative impact on memtable memory usage, as C* controls the memory used by the memtable, but based on our testing it increases the bytes flushed by 4 to 7% and the bytes on disk by 2 to 3%.
>> 
>> As a mitigation to this problem it is possible to encode localDeletionTime as a vint. It results in a 1% improvement but might cause additional computations during compaction or some other operations.
>> 
>> Benedict's proposal to keep on using ints for TTL, but as a delta to nowInSecond, would work for memtables but not for SSTables, where nowInSecond does not exist. As a consequence we would still suffer the impact on bytes flushed and bytes on disk.
>> 
>> Another approach that was suggested is the use of unsigned integers. Java 8 has an unsigned integer API that would allow us to use unsigned ints for TTLs. Based on our computations, unsigned ints would give us a maximum time of 136 years since the Unix epoch and therefore a maximum expiration timestamp in 2106. We would have to keep TTL at 20y instead of 68y to give us enough breathing room though, otherwise in 2035 we'd hit the same problem again.
>> 
>> Happy to hear opinions.
>> 
>> On 18/10/22 10:56, Berenguer Blasi wrote:
>>> Hi,
>>> 
>>> apologies for the late reply as I have been OOO. I have done some profiling and results look virtually identical on trunk and 14227. I have attached some screenshots to the ticket https://issues.apache.org/jira/browse/CASSANDRA-14227. Unless my eyes are fooling me, everything in the JFRs looks the same.
>>> 
>>> Regards
>>> 
>>> On 30/9/22 9:44, Berenguer Blasi wrote:
>>>> Hi Benedict,
>>>> 
>>>> thanks for the reply! Yes, some profiling is probably needed, then we can see if going down the delta-encoding big-refactor rabbit hole is worth it?
>>>> 
>>>> Let's see what other concerns people bring up.
>>>> 
>>>> Thx.
>>>> 
>>>> On 29/9/22 11:12, Benedict Elliott Smith wrote:
>>>>> My only slight concern with this approach is the additional memory pressure. Since 64yrs should be plenty at any moment in time, I wonder if it wouldn’t be better to represent these times as deltas from the nowInSec being used to process the query. So, long math would only be used to normalise the times to this nowInSec (from whatever is stored in the sstable) within a method, and ints would be stored in memtables and any objects used for processing.
>>>>> 
>>>>> This might admittedly be more work, but I don’t believe it should be too challenging - we can introduce a method deletionTime(int nowInSec) that returns a long value by adding nowInSec to the deletionTime, and make the underlying value private, refactoring call sites?
>>>>> 
>>>>>> On 29 Sep 2022, at 09:37, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> I have taken a stab at a PR, which you can find attached in the ticket. Mainly:
>>>>>> 
>>>>>> - I have moved deletion times, gc and nowInSec timestamps to long. That should get us past the 2038 limit.
>>>>>> 
>>>>>> - TTL is now capped at 68y. Think CQL API compatibility and a sort of 'free' guardrail.
>>>>>> 
>>>>>> - A new NONE overflow policy is the default, but everything is backwards compatible by keeping the previous ones in place. Think upgrade scenarios or apps relying on the previous behavior.
>>>>>> 
>>>>>> - The new limit is around year 292,471,208,677, which sounds ok given the Sun will start collapsing in 3 to 5 billion years :-)
>>>>>> 
>>>>>> - Please feel free to drop by the ticket and take a look at the PR, even if it's cursory.
>>>>>> 
>>>>>> Thx in advance.
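(For reference, a minimal sketch of the delta-from-nowInSec accessor Benedict describes above: an int delta held in memory, the field kept private, and long math only at the point of normalisation. Names and types are hypothetical, and nowInSec is shown here as a long; this is not the actual CASSANDRA-14227 patch.)

    // Hypothetical sketch of the proposed accessor: the object stores an int delta
    // relative to the nowInSec of the query it was normalised against, and callers
    // obtain an absolute time as a long via deletionTime(nowInSec). The invariant is
    // that the same nowInSec is used for normalisation and for the accessor calls.
    public class ExpiringCellSketch
    {
        // Seconds between the absolute local deletion time and the reference nowInSec.
        private final int deletionTimeDelta;

        public ExpiringCellSketch(long absoluteDeletionTime, long nowInSec)
        {
            // Fits in an int for any deletion time within ~68 years of nowInSec.
            this.deletionTimeDelta = Math.toIntExact(absoluteDeletionTime - nowInSec);
        }

        // Long math only happens here, when widening back to an absolute time for
        // comparisons, serialization, or returning results to clients.
        public long deletionTime(long nowInSec)
        {
            return nowInSec + deletionTimeDelta;
        }
    }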