rinchinov opened a new pull request, #19592: URL: https://github.com/apache/druid/pull/19592
Fixes #18606. ### Description #### Bug `DeltaInputSourceIterator.hasNext()` used a **local variable** for the per-file `CloseableIterator<FilteredColumnarBatch>`. When the method returned `true` after finding the first non-empty batch of a file, that iterator went out of scope. The next `hasNext()` call advanced to the **next file**, skipping all remaining batches of the current file. With the Delta kernel's default batch size of **1024 rows**, this produced exactly `1024 × numFiles` rows regardless of actual file size — matching the symptom reported in #18606. #### Fix Promoted `filteredBatchIterator` to a class field (`currentFileIterator`). `hasNext()` now drains all batches of the current file before advancing to the next one. Also fixed `close()` to close `currentFileIterator` and drain all remaining file iterators (the original only closed one). #### Regression test Added `LargeRowGroupDeltaTable` (2 Parquet files × 2000 rows = 4000 total) and `BatchDrainRegressionTests` inside `DeltaInputSourceTest`. - **Without the fix**: `1024 × 2 = 2048` rows returned - **With the fix**: `4000` rows returned #### Release note Fixed a bug in the Delta Lake input source where only 1024 rows per Parquet file were ingested. Ingestion tasks now return all rows from each file. --- ##### Key changed/added classes in this PR - `DeltaInputSourceReader` (fix) - `LargeRowGroupDeltaTable` (new test descriptor) - `DeltaInputSourceTest` (new `BatchDrainRegressionTests` inner class) - `src/test/resources/large-row-group-table` (new test Delta table) --- This PR has: - [x] been self-reviewed. - [x] a release note entry in the PR description. - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader. - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
