Adam Horky created KAFKA-20732:
----------------------------------

             Summary: Data loss: tiered size-retention deletes in-retention 
data from both tiers after leader election when highestOffsetInRemoteStorage is 
unset (-1)
                 Key: KAFKA-20732
                 URL: https://issues.apache.org/jira/browse/KAFKA-20732
             Project: Kafka
          Issue Type: Bug
          Components: Tiered-Storage
    Affects Versions: 3.9.0, 4.3.0
            Reporter: Adam Horky


Size-based remote retention deletes the oldest data when 
{{onlyLocalLogSegmentsSize + remoteLogSizeBytes > retention.bytes}}. 
{{onlyLocalLogSegmentsSize}} sums segments with {{baseOffset >= 
highestOffsetInRemoteStorage()}}. That field is in-memory, initialised to 
{{-1}}, not recovered on startup, and seeded only by the copy and follower 
tasks.

After a leader election the copy and expiration tasks both start at {{-1}} and 
are scheduled with {{initialDelay = 0}} on separate thread pools, so they race. 
If the expiration task runs first, the filter {{baseOffset >= -1}} matches the 
entire local log, and those segments are also counted in {{remoteLogSizeBytes}} 
— the same offsets counted twice. The apparent size roughly doubles, 
{{retention.bytes}} appears breached, and the broker deletes the oldest remote 
segments and advances {{logStartOffset}}. {{logStartOffset}} replicates to 
followers, so each replica deletes local segments below the new floor. 
In-retention data is removed from both tiers on all replicas; consumers receive 
{{OFFSET_OUT_OF_RANGE}} for the removed offsets.

The follower path already seeds this offset before it can cause harm: 
{{RLMFollowerTask.execute()}} does so with the comment _"so that the local log 
segments are not deleted before they are copied to remote storage."_ The 
leader/expiration path, which runs size-retention, has no equivalent step.

h2. Example — one affected partition

A 24-partition test topic at its cap ({{retention.bytes = 134217728}} = 128 
MiB, {{local.retention.bytes = -2}}, {{retention.ms = -1}}); both tiers held 
the full window (remote ≈ local ≈ cap, copy-lag 0). One broker is restarted and 
becomes the fresh leader of partition {{...-1}}; ~40 s later its expiration 
task runs while {{highestOffsetInRemoteStorage}} is still {{-1}}:

{code}
# leader shared-sasl-kafka-1 — size-retention on a partition whose unique data 
is ~128 MiB:
08:43:59.420  RLMExpirationTask ...repro.v1-1] About to delete remote log 
segment ... due to
              retention size 134217728 breach. Log size after deletion will be 
282854919
              #   cap = 134217728 (128 MiB); computed ~270 MiB = local(~128) + 
remote(~128), counted twice
              #   ... deletes 9 remote segments, recomputing downward until 
just above the cap:
08:43:59.430  ... due to retention size 134217728 breach. Log size after 
deletion will be 149820412
08:43:59.430  UnifiedLog ...repro.v1-1] Incremented log start offset to 289616 
due to segment deletion

# the new log-start floor replicates to the followers:
08:43:59.680  shared-sasl-kafka-0 ...repro.v1-1] Incremented log start offset 
to 289616 due to leader offset increment
08:44:36.319  shared-sasl-kafka-2 ...repro.v1-1] Incremented log start offset 
to 289616 due to leader offset increment

# each replica then deletes its local segments below the floor — data now gone 
from BOTH tiers:
08:44:21.146  UnifiedLog ...repro.v1-1] Deleting segments due to local log 
start offset 289616 breach:
              LogSegment(baseOffset=144975,...), ... ,(baseOffset=273618,...)   
# 9 segments, then LocalLog "Deleting segment files"
{code}

{{logStartOffset}} jumped 144975 → 289616: ~144,641 records (~141 MiB) removed 
from both tiers, even though the true unique size (~128 MiB) was within the cap.

h2. Reproductions

* *Staging (3.9.0, KRaft, RF=3, Aiven RSM):* a rolling restart over-deleted 6 
of 12 partitions of a topic that sat at its {{retention.bytes}}; on each, the 
new {{logStartOffset}} equalled {{highestCopiedRemoteOffset + 1}} (the whole 
copied window expired).
* *Test cluster:* the topic above, rolling-restarted, over-deleted 13 of 24 
partitions (example shown). A control topic with {{retention.bytes = -1}} 
(size-retention disabled) under the identical restart was unaffected.
* *Unit test (trunk and 3.9.0):* two tests with identical remote metadata and 
{{retention.bytes}}, differing only in {{onlyLocalLogSegmentsSize}} ({{0}} when 
seeded → no deletion; whole-log when {{-1}} → both segments deleted). A 
companion {{UnifiedLogTest}} confirms the real {{UnifiedLog}} returns the whole 
local log at {{-1}}.

h2. Fix

Seeding {{highestOffsetInRemoteStorage}} from {{__remote_log_metadata}} before 
size-retention (mirroring the follower path) makes the unit test pass and 
removes the over-deletion on the test cluster (same fill + restart, zero 
{{logStartOffset}} jumps).

* failing test (repro, no fix), diff vs trunk: 
[https://github.com/apache/kafka/compare/trunk...horkyada:kafka:tiered-size-retention-unseeded-highest-offset-repro]
* proposed fix, diff vs trunk: 
[https://github.com/apache/kafka/compare/trunk...horkyada:kafka:fix-seed-highest-remote-offset]

(Happy to open a PR against apache/kafka once I have an ICLA.)

h2. Scope

Size-retention only — {{retention.ms}} is unaffected (time-retention never 
reads {{onlyLocalLogSegmentsSize}}/{{highestOffsetInRemoteStorage}}). The loss 
requires a populated remote tier on a topic at/near its cap (with {{R ≈ 0}} 
there is nothing to double-count). Magnitude scales with the local-tier size; 
with the default {{local.retention.bytes = -2}} it is the whole window.

h2. Related issues (same area, different root cause — not a duplicate)

* *KAFKA-17212* — same method, but a {{>=}}-vs-{{>}} off-by-one double-counting 
a _single_ segment while the offset is correct; {{>=}}→{{>}} does not fix this 
(with {{-1}}, {{baseOffset > -1}} still matches everything). This is a 
whole-log over-count.
* *KAFKA-16711* — same {{-1}} stale-offset class, different trigger (logDir 
altering), and it _blocks_ cleanup rather than over-deleting.
* *KAFKA-20148* — same {{RLMExpirationTask}} data-loss class, but the trigger 
is _disabling_ remote storage (cancel-vs-execute). This needs no disable.

(Happy to provide the source line-by-line walk-through and the full 
reproduction setup if useful.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to