zclllyybb commented on issue #64700:
URL: https://github.com/apache/doris/issues/64700#issuecomment-4767954566

   Breakwater-GitHub-Analysis-Slot: slot_6612ad2e0ca8
   This content is generated by AI for reference only.
   
   Initial triage:
   
   I cannot confirm a specific Doris 4.1.2 root cause from the pasted 
application logs alone.
   
   The before/after samples are not the same workload shape. In the "before 
upgrade" section, the shown listener batches are roughly 200-658 source rows 
and each serialized application cycle is mostly about 2.7-4.7s. In the "after 
upgrade" section, the shown listener batches are about 895-1095 source rows, 
and some derived Stream Load calls reach about 1.9k-2.7k rows. Since the 
application appears to send many Stream Load requests sequentially on one Kafka 
listener thread, part of the end-to-end increase can come from larger batches 
plus serialized per-table loads.
   
   That said, the evidence still points to a possible shared Doris-side or 
environment bottleneck, not just higher row counts. After the upgrade, even 
very small loads of 1-29 rows often take about 500-1300ms, and 
`device_inv_realtime` reaches 2.2s, 3.6s, and 6.2s in the pasted samples. That 
pattern is consistent with transaction publish delay, backend 
write/resource pressure, compaction/tablet-version pressure, 
network/request-body delay, or client-side queuing, but the current logs do not 
identify which one.
   
   Code-side anchor for maintainers: in Doris 4.1.2, the Stream Load BE 
response and BE INFO log split the latency into `LoadTimeMs`, `BeginTxnTimeMs`, 
`StreamLoadPutTimeMs`, `ReceiveDataTimeMs`, `ReadDataTimeMs`, 
`WriteDataTimeMs`, and `CommitAndPublishTimeMs` 
(`be/src/load/stream_load/stream_load_context.cpp`, plus the `finished to 
execute stream load` log in `be/src/service/http/action/stream_load.cpp`). 
Those fields are the key evidence needed here:
   
   - If `CommitAndPublishTimeMs` is high, focus on FE transaction publish, 
tablet version count, compaction backlog, or master FE pressure.
   - If `WriteDataTimeMs` is high, focus on BE ingestion/storage/write path, 
schema/index cost, payload size, and resource pressure.
   - If `ReceiveDataTimeMs` or the gap between client time and `LoadTimeMs` is 
high, focus on client serialization, HTTP request upload, redirect/network 
path, or the single Kafka listener thread.
   - If `LoadTimeMs` is low while the application "Doris Stream Load" duration 
is high, the bottleneck is likely outside the Doris Stream Load execution 
itself.
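
The decision rules above can be sketched as a small helper, in case it helps reporters pre-sort their slow labels. This is only an illustration: the function name `triage`, the `gap_ratio` threshold, and the `"outside-doris"` / `"unknown"` labels are assumptions made for this sketch, not anything Doris provides. It assumes the Stream Load JSON response carries the timing fields listed above and that the client measured its own wall-clock duration for the same request.

```python
import json

# Phase fields as named in the comment above. BeginTxnTimeMs and
# StreamLoadPutTimeMs are included so a slow phase outside the four
# bullets is still surfaced.
PHASE_FIELDS = (
    "BeginTxnTimeMs",
    "StreamLoadPutTimeMs",
    "ReceiveDataTimeMs",
    "ReadDataTimeMs",
    "WriteDataTimeMs",
    "CommitAndPublishTimeMs",
)


def triage(response_json: str, client_ms: float, gap_ratio: float = 0.5) -> str:
    """Return the name of the largest phase, or a coarse label.

    gap_ratio is a hypothetical threshold: if at least this share of the
    client-observed time is not covered by LoadTimeMs, the load is
    classified as slow outside Doris Stream Load execution.
    """
    resp = json.loads(response_json)
    load_ms = float(resp.get("LoadTimeMs", 0))

    # Time the client observed that the server-side total does not explain.
    outside_ms = max(client_ms - load_ms, 0.0)
    if client_ms > 0 and outside_ms / client_ms >= gap_ratio:
        return "outside-doris"

    phases = {name: float(resp.get(name, 0)) for name in PHASE_FIELDS}
    if not any(phases.values()):
        return "unknown"  # response did not include usable phase timings

    # Phases may overlap, so report the largest one instead of summing them.
    return max(phases, key=phases.get)
```

A result of `CommitAndPublishTimeMs` or `WriteDataTimeMs` maps directly onto the first two bullets; `outside-doris` corresponds to the last one. The output is a starting point for which logs to pull, not a diagnosis.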
   
   Information needed to make this actionable:
   
   1. The exact previous Doris version and whether any table schema, replica 
count, bucket count, BE/FE count, hardware, network path, or Stream Load client 
settings changed during the upgrade.
   2. Full Stream Load JSON responses before and after upgrade for the same 
table and comparable payload size, including the timing fields above and the 
request label.
   3. Matching BE logs around `finished to execute stream load. label=...` and 
FE master logs around begin/commit/publish for several slow labels.
   4. DDL for the affected tables, especially whether they are Unique Key/MOW 
tables, whether partial update or sequence columns are used, indexes, 
partitions, buckets, and replication settings.
   5. Stream Load request headers, especially `group_commit`, 
`two_phase_commit`, `merge_type`, `columns`, `partial_columns`, and format 
settings.
   6. Cluster state during the slow window: `SHOW BACKENDS`, CPU/IO/network 
utilization, compaction backlog, tablet version count or tablet health output, 
and whether other loads/queries were running.
   7. A minimal repro or a controlled comparison that sends the same payload to 
the same table before and after upgrade, preferably with the full server-side 
Stream Load timing fields.
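
For items 2 and 7, one way to summarize a controlled comparison is to take the median of each timing field over a set of "before" responses and a set of "after" responses for the same table and payload. The sketch below assumes the responses have already been parsed into dictionaries; the helper names `median_by_field` and `compare` are hypothetical, and no real measurements are implied.

```python
from statistics import median

TIMING_FIELDS = (
    "LoadTimeMs",
    "BeginTxnTimeMs",
    "StreamLoadPutTimeMs",
    "ReceiveDataTimeMs",
    "ReadDataTimeMs",
    "WriteDataTimeMs",
    "CommitAndPublishTimeMs",
)


def median_by_field(responses, fields=TIMING_FIELDS):
    """Median of each timing field over a list of parsed Stream Load responses."""
    result = {}
    for field in fields:
        values = [float(r[field]) for r in responses if field in r]
        if values:  # skip fields that no response reported
            result[field] = median(values)
    return result


def compare(before, after, fields=TIMING_FIELDS):
    """Per-field change in median (after minus before), in milliseconds."""
    b = median_by_field(before, fields)
    a = median_by_field(after, fields)
    return {f: a[f] - b[f] for f in fields if f in a and f in b}
```

The field with the largest positive difference is the first place to look. The comparison is only meaningful if both sets use the same table, payload, and request headers.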
   
   Next suggested maintainer step: ask for the full Stream Load JSON responses 
and the matching BE/FE logs first. Without those timing fields, this issue 
should remain open as a possible write-latency regression, but the current 
public evidence is insufficient to assign it to a concrete Doris 4.1.2 code bug.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

