ShummGen opened a new issue, #63609: URL: https://github.com/apache/doris/issues/63609
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

Apache Doris 4.1.0

### What's Wrong?

We encountered repeated BE crashes (`SIGSEGV`) when Doris reads historically corrupted segment files.

The crash is reproducible at the symptom level: once the affected tablets are read by query execution or cumulative compaction, BE may crash instead of returning a safe corruption error. The latest crash happened around `2026-05-25 09:14`, and this is already the 5th crash of the same kind.

Crash summary:

- Component: `doris_be`
- Signal: `SIGSEGV (11)`
- Latest crash time: `2026-05-25 09:14`
- Repeated crash count: 5

The stack trace shows the crash happens in `memcpy`, called from string serialization inside vectorized aggregation:

```text
*** Aborted at 1748134471 (unix time) try "date -d @1748134471" ***
*** Signal 11 (SIGSEGV) received by PID 86955 ***
PC: @ 0x7fdd1f000000 (unknown)
*** SIGSEGV address not mapped to object (@0x7fdd1f000000) received by PID 86955 ***
Stack trace:
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:341
#1 0x0000555559d2d3a4 in memcpy ()
#2 0x0000555559d2d3a4 in doris::vectorized::ColumnStr<unsigned int>::serialize_impl(...)
#3 0x0000555559d2d3a4 in doris::vectorized::ColumnStr<unsigned int>::serialize_vec(...)
#4 0x0000555559d2d3a4 in doris::vectorized::DistinctStreamingAgg(...)
```
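The top frames suggest that `memcpy` was handed a source range lying outside the column's character buffer. We have not confirmed this against the Doris source, so the following is only a minimal sketch of the kind of guard we have in mind, not Doris's actual `ColumnStr` code. `StringColumn`, `validate_offsets`, and `serialize_row` are hypothetical names, and the layout (row `i` ends at `offsets[i]`, with `uint32_t` offsets) is an assumption based on the `ColumnStr<unsigned int>` frames.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a string column (NOT Doris's ColumnStr):
// row i occupies chars[offsets[i-1], offsets[i]), with an implicit start of 0.
struct StringColumn {
    std::vector<char> chars;
    std::vector<uint32_t> offsets;
};

// Checks every offset once. Returns an empty string if the column is
// consistent, otherwise a corruption message describing the first problem.
std::string validate_offsets(const StringColumn& col) {
    uint32_t prev = 0;
    for (std::size_t i = 0; i < col.offsets.size(); ++i) {
        const uint32_t end = col.offsets[i];
        if (end < prev) {
            return "Corruption: offset decreases at row " + std::to_string(i);
        }
        if (end > col.chars.size()) {
            return "Corruption: offset past end of buffer at row " + std::to_string(i);
        }
        prev = end;
    }
    return std::string();
}

// Copies one row into `out`. The row's byte range is checked against the
// buffer first, so memcpy is never called on an untrusted range.
// Returns false (and leaves `out` untouched) when the range is invalid.
bool serialize_row(const StringColumn& col, std::size_t row, std::string* out) {
    if (row >= col.offsets.size()) {
        return false;
    }
    const uint32_t begin = (row == 0) ? 0 : col.offsets[row - 1];
    const uint32_t end = col.offsets[row];
    if (begin > end || end > col.chars.size()) {
        return false;
    }
    const uint32_t len = end - begin;
    out->resize(len);
    if (len > 0) {
        std::memcpy(&(*out)[0], col.chars.data() + begin, len);
    }
    return true;
}
```

With a check like this, a column whose offsets were damaged would yield an error for the caller to propagate rather than a read from an unmapped address.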
At the same time, BE logs contain a large number of corruption-related errors:

- checksum mismatch
- ZSTD decompression failed
- cumulative compaction failures on the same tablets

Typical log examples:

- `E20250525 09:14:32.123456 86955 tablet.cpp:1234] checksum mismatch in /home/doris_test_local/be-storage/data/504/1775727969562/29255074/02000000000ac7103d41342af722a463b0a66dca056a21a2_2.dat, actual=3512177969 vs expect=2503684114`
- `E20250525 09:14:32.123789 86955 beta_rowset_reader.cpp:567] failed to read segment: Corruption: ZSTD decompression failed`
- `W20250525 09:14:32.124012 86955 cumulative_compaction.cpp:890] failed to do cumulative compaction. tablet=1775727969562`

Our current understanding is:

- Doris correctly detects that some historical segment files are corrupted.
- But later, while processing those corrupted data paths, Doris still enters a code path that reaches `DistinctStreamingAgg` -> `ColumnStr::serialize_vec` -> `memcpy`.
- That path eventually dereferences an invalid address and crashes BE with `SIGSEGV`.

This looks like a bug in error handling / corrupted data protection, because BE should not segfault even if a segment file is bad.

### What You Expected?

We expect Doris BE to handle corrupted segment files gracefully without crashing.

Expected behavior:

- return a safe read/corruption error
- fail the related query or compaction task gracefully
- optionally mark the affected tablet/rowset as bad
- avoid process crash (`SIGSEGV`) in `DistinctStreamingAgg` / `ColumnStr::serialize_vec`

### How to Reproduce?

We do not yet have a minimal synthetic reproducer, but the production symptoms are consistent and repeatable.

Observed reproduction conditions:

1. Some tablets contain corrupted segment files.
2. When Doris reads those tablets during query execution or cumulative compaction, logs report:
   - `checksum mismatch`
   - `Corruption: ZSTD decompression failed`
3. After that, BE may crash with `SIGSEGV`.

### Anything Else?
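To make the expected behavior concrete, here is a minimal sketch of the control flow we would hope for: a corrupt page produces an error status, the task stops at that point, and the tablet is recorded as bad, so no downstream stage ever sees the page. Everything in it is hypothetical. `Status` here is a toy type and not Doris's own `Status` class, `toy_checksum` is an FNV-1a stand-in rather than the checksum Doris uses, and `Page`, `read_page`, and `run_task` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Toy status type for illustration only.
struct Status {
    bool ok = true;
    std::string msg;
    static Status OK() { return Status(); }
    static Status Corruption(const std::string& m) {
        Status s;
        s.ok = false;
        s.msg = "Corruption: " + m;
        return s;
    }
};

// Stand-in checksum (32-bit FNV-1a); an assumption, not Doris's algorithm.
uint32_t toy_checksum(const std::vector<uint8_t>& data) {
    uint32_t h = 2166136261u;
    for (uint8_t b : data) {
        h ^= b;
        h *= 16777619u;
    }
    return h;
}

struct Page {
    std::vector<uint8_t> bytes;
    uint32_t expected_checksum;
};

// Verifies the page before handing any bytes to the caller.
Status read_page(const Page& page, std::vector<uint8_t>* out) {
    const uint32_t actual = toy_checksum(page.bytes);
    if (actual != page.expected_checksum) {
        return Status::Corruption("checksum mismatch, actual=" + std::to_string(actual) +
                                  " vs expect=" + std::to_string(page.expected_checksum));
    }
    *out = page.bytes;
    return Status::OK();
}

// Runs one task (query scan or compaction) over a tablet's pages. On the
// first corrupt page it marks the tablet as bad and returns the error.
Status run_task(int64_t tablet_id, const std::vector<Page>& pages,
                std::set<int64_t>* bad_tablets, std::size_t* bytes_processed) {
    for (const Page& page : pages) {
        std::vector<uint8_t> decoded;
        const Status st = read_page(page, &decoded);
        if (!st.ok) {
            bad_tablets->insert(tablet_id);
            return st;
        }
        // Placeholder for the real downstream stage (e.g. aggregation).
        *bytes_processed += decoded.size();
    }
    return Status::OK();
}
```

The point of the sketch is only the ordering: verification happens before the data reaches any consumer, and the failure is returned to the caller instead of being logged and ignored.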
Environment:

- Doris version: 4.1.0
- OS: Linux (CentOS/RHEL)
- `mem_limit` = 28.01 GB
- `soft_mem_limit` = 25.21 GB

Corruption-related observations:

- current active log contains about:
  - 23,869 checksum mismatch errors
  - 31,794 ZSTD decompression errors
- cumulative compaction repeatedly fails on corrupted tablets
- we have 5 core dump files from repeated BE crashes

Corruption pattern:

- corrupted tablets are concentrated in a narrow historical creation window (May 13-14)
- another node also has corrupted tablets from the same period
- after that period, we did not observe evidence of newly created corrupted tablets
- this pattern suggests a historical cluster-wide event, while the current issue we want to report is that Doris crashes when reading those corrupted files

Hardware / OS observations:

- `/proc/diskstats` shows no I/O errors on the NVMe devices
- no filesystem error was found in available journal logs after May 15
- disks continued running normally for more than 12 days after the corruption window

We understand that the original corruption may or may not have been caused by Doris itself. However, regardless of the root cause of the corrupted files, Doris BE should not crash with `SIGSEGV` while reading them.

If maintainers think this should be fixed, we can provide more materials:

- full `be.out` stack trace
- more log snippets
- core dump backtrace
- tablet / rowset metadata
- additional checksum mismatch samples

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
