ShummGen opened a new issue, #63609:
URL: https://github.com/apache/doris/issues/63609

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   Apache Doris 4.1.0
   
   ### What's Wrong?
   
   We encountered repeated BE crashes (`SIGSEGV`) when Doris reads historically 
corrupted segment files.
   
   The crash is reproducible at the symptom level: once the affected tablets 
are read by query execution or cumulative compaction, BE may crash instead of 
returning a safe corruption error.
   
   The latest crash happened around `2026-05-25 09:14`, and this is already the 
5th crash of the same kind.
   
   Crash summary:
   - Component: `doris_be`
   - Signal: `SIGSEGV (11)`
   - Latest crash time: `2026-05-25 09:14`
   - Repeated crash count: 5
   
   The stack trace shows the crash happens in `memcpy`, called from string 
serialization inside vectorized aggregation:
   
   ```text
   *** Aborted at 1748134471 (unix time) try "date -d @1748134471" ***
   *** Signal 11 (SIGSEGV) received by PID 86955 ***
   PC: @     0x7fdd1f000000  (unknown)
   *** SIGSEGV address not mapped to object (@0x7fdd1f000000) received by PID 86955 ***
   
   Stack trace:
   #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:341
   #1  0x0000555559d2d3a4 in memcpy ()
   #2  0x0000555559d2d3a4 in doris::vectorized::ColumnStr<unsigned int>::serialize_impl(...)
   #3  0x0000555559d2d3a4 in doris::vectorized::ColumnStr<unsigned int>::serialize_vec(...)
   #4  0x0000555559d2d3a4 in doris::vectorized::DistinctStreamingAgg(...)
   ```
   
   At the same time, BE logs contain a large number of corruption-related errors:
   
   - checksum mismatch
   - ZSTD decompression failed
   - cumulative compaction failures on the same tablets
   
   Typical log examples:
   
   - `E20250525 09:14:32.123456 86955 tablet.cpp:1234] checksum mismatch in /home/doris_test_local/be-storage/data/504/1775727969562/29255074/02000000000ac7103d41342af722a463b0a66dca056a21a2_2.dat, actual=3512177969 vs expect=2503684114`
   - `E20250525 09:14:32.123789 86955 beta_rowset_reader.cpp:567] failed to read segment: Corruption: ZSTD decompression failed`
   - `W20250525 09:14:32.124012 86955 cumulative_compaction.cpp:890] failed to do cumulative compaction. tablet=1775727969562`
   
   Our current understanding is:
   
   - Doris correctly detects that some historical segment files are corrupted.
   - Later, while processing those corrupted data paths, Doris still enters a code path that reaches `DistinctStreamingAgg` -> `ColumnStr::serialize_vec` -> `memcpy`.
   - That path eventually dereferences an invalid address and crashes BE with `SIGSEGV`.
   
   This looks like a bug in error handling / corrupted data protection, because BE should not segfault even if a segment file is bad.
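
To illustrate the kind of guard we have in mind, here is a minimal Python sketch, not Doris code. It assumes a simplified string-column layout (one character buffer plus an array of end offsets); the names `validate_offsets`, `serialize_rows`, and `CorruptionError` are hypothetical. The point is only that offsets derived from corrupted data are checked before any copy, so a bad offset becomes an error instead of an out-of-bounds read.

```python
class CorruptionError(Exception):
    """Raised instead of reading outside the character buffer."""


def validate_offsets(chars: bytes, offsets: list[int]) -> None:
    """Check that end offsets are non-decreasing and stay inside chars."""
    prev = 0
    for row, end in enumerate(offsets):
        if end < prev or end > len(chars):
            raise CorruptionError(
                f"row {row}: offset {end} invalid (prev={prev}, buffer={len(chars)})"
            )
        prev = end


def serialize_rows(chars: bytes, offsets: list[int]) -> list[bytes]:
    """Serialize each string as a 4-byte little-endian length plus its bytes."""
    validate_offsets(chars, offsets)  # fail early, before any copy happens
    out = []
    start = 0
    for end in offsets:
        value = chars[start:end]
        out.append(len(value).to_bytes(4, "little") + value)
        start = end
    return out
```

With valid input, `serialize_rows(b"abcde", [2, 5])` yields two length-prefixed values; with an offset pointing past the buffer, it raises `CorruptionError` rather than copying from an invalid range.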
   
   ### What You Expected?
   
   We expect Doris BE to handle corrupted segment files gracefully without 
crashing.
   
   Expected behavior:
   
   - return a safe read/corruption error
   - fail the related query or compaction task gracefully
   - optionally mark the affected tablet/rowset as bad
   - avoid a process crash (`SIGSEGV`) in `DistinctStreamingAgg` / `ColumnStr::serialize_vec`
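
The expected flow can be sketched as follows. This is a hedged Python illustration, not the Doris implementation: `read_page`, `run_task`, and `bad_tablets` are hypothetical names, and `zlib.crc32` stands in for whatever checksum the storage layer actually uses. A checksum mismatch raises an error, the task fails with a message, and the tablet is recorded as bad, while the process keeps running.

```python
import zlib


class CorruptionError(Exception):
    """Signals corrupted on-disk data to the caller."""


def read_page(payload: bytes, expected_crc: int) -> bytes:
    """Return the payload only if its checksum matches the stored one."""
    actual = zlib.crc32(payload)
    if actual != expected_crc:
        raise CorruptionError(
            f"checksum mismatch, actual={actual} vs expect={expected_crc}"
        )
    return payload


def run_task(tablet_id: int, pages: list[tuple[bytes, int]], bad_tablets: set[int]) -> str:
    """Run a query or compaction over pages; fail cleanly on corruption."""
    try:
        rows = [read_page(payload, crc) for payload, crc in pages]
    except CorruptionError as err:
        bad_tablets.add(tablet_id)  # optional: remember the tablet as bad
        return f"FAILED tablet={tablet_id}: {err}"
    return f"OK tablet={tablet_id}: {len(rows)} pages"
```

The design choice here is that no later stage ever sees data from a page that failed verification, so aggregation code never has to defend against it.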
   
   ### How to Reproduce?
   
   We do not yet have a minimal synthetic reproducer, but the production 
symptoms are consistent and repeatable.
   
   Observed reproduction conditions:
   
   - Some tablets contain corrupted segment files.
   - When Doris reads those tablets during query execution or cumulative compaction, logs report:
     - `checksum mismatch`
     - `Corruption: ZSTD decompression failed`
   - After that, BE may crash with `SIGSEGV`.
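
One possible starting point for a synthetic reproducer is to damage a copy of a segment file in a test environment. The Python sketch below is an assumption about how that could be done, not a verified reproducer: `corrupt_copy` is a hypothetical helper that copies a file and flips bits in one byte of the copy. We have not confirmed that a single flipped byte is enough to trigger the crash.

```python
import shutil
from pathlib import Path


def corrupt_copy(src: str, dst: str, offset: int, mask: int = 0xFF) -> None:
    """Copy src to dst, then XOR one byte of dst at offset with mask."""
    if not 1 <= mask <= 0xFF:
        raise ValueError("mask must flip at least one bit of a single byte")
    shutil.copyfile(src, dst)  # the source file is never modified
    data = bytearray(Path(dst).read_bytes())
    if not 0 <= offset < len(data):
        raise ValueError(f"offset {offset} outside file of {len(data)} bytes")
    data[offset] ^= mask
    Path(dst).write_bytes(data)
```

This should only be run against copies on a test cluster, never against production storage directories.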
   
   ### Anything Else?
   
   Environment:
   
   - Doris version: 4.1.0
   - OS: Linux (CentOS/RHEL)
   - `mem_limit` = 28.01 GB
   - `soft_mem_limit` = 25.21 GB
   
   Corruption-related observations:
   
   - the current active log contains about:
     - 23,869 checksum mismatch errors
     - 31,794 ZSTD decompression errors
   - cumulative compaction repeatedly fails on corrupted tablets
   - we have 5 core dump files from repeated BE crashes
   
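
Counts like the ones above can be gathered with a small script. The following Python sketch is illustrative only: the `tally` function is hypothetical, and the path pattern (`.../data/<shard>/<tablet_id>/<schema_hash>/<file>.dat`) is an assumption inferred from the sample log line in this report, where the second path component matches the tablet ID in the compaction warning.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample log line:
# .../data/<shard>/<tablet_id>/<schema_hash>/<rowset>_<segment>.dat
SEGMENT_PATH = re.compile(r"/data/\d+/(\d+)/\d+/[0-9a-f]+_\d+\.dat")


def tally(lines):
    """Count corruption errors by kind and checksum mismatches by tablet."""
    kinds = Counter()
    tablets = Counter()
    for line in lines:
        if "checksum mismatch" in line:
            kinds["checksum mismatch"] += 1
            match = SEGMENT_PATH.search(line)
            if match:
                tablets[match.group(1)] += 1
        elif "ZSTD decompression failed" in line:
            kinds["ZSTD decompression failed"] += 1
    return kinds, tablets
```

Any iterable of lines works, so an open log file handle can be passed directly, for example `tally(open(path))` with `path` pointing at the active BE log.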
   Corruption pattern:
   
   - corrupted tablets are concentrated in a narrow historical creation window (May 13-14)
   - another node also has corrupted tablets from the same period
   - after that period, we did not observe evidence of newly created corrupted tablets
   - this pattern suggests a historical cluster-wide event, while the current issue we want to report is that Doris crashes when reading those corrupted files
   
   Hardware / OS observations:
   
   - `/proc/diskstats` shows no I/O errors on the NVMe devices
   - no filesystem error was found in available journal logs after May 15
   - disks continued running normally for more than 12 days after the corruption window
   
   We understand that the original corruption may or may not have been caused by Doris itself. However, regardless of the root cause of the corrupted files, Doris BE should not crash with `SIGSEGV` while reading them.
   
   If maintainers think this should be fixed, we can provide more materials:
   
   - full `be.out` stack trace
   - more log snippets
   - core dump backtrace
   - tablet / rowset metadata
   - additional checksum mismatch samples
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


