rinchinov opened a new pull request, #19592:
URL: https://github.com/apache/druid/pull/19592

   Fixes #18606.
   
   ### Description
   
   #### Bug
   
   `DeltaInputSourceIterator.hasNext()` used a **local variable** for the 
per-file
   `CloseableIterator<FilteredColumnarBatch>`. When the method returned `true` 
after
   finding the first non-empty batch of a file, that iterator went out of scope.
   The next `hasNext()` call advanced to the **next file**, skipping all 
remaining
   batches of the current file.
   
   With the Delta kernel's default batch size of **1024 rows**, this produced 
exactly
   `1024 × numFiles` rows regardless of actual file size — matching the symptom
   reported in #18606.
   
   #### Fix
   
   Promoted `filteredBatchIterator` to a class field (`currentFileIterator`).
   `hasNext()` now drains all batches of the current file before advancing to 
the
   next one. Also fixed `close()` to close `currentFileIterator` and drain all
   remaining file iterators (the original only closed one).
   
   #### Regression test
   
   Added `LargeRowGroupDeltaTable` (2 Parquet files × 2000 rows = 4000 total) 
and
   `BatchDrainRegressionTests` inside `DeltaInputSourceTest`.
   
   - **Without the fix**: `1024 × 2 = 2048` rows returned
   - **With the fix**: `4000` rows returned
   
   #### Release note
   
   Fixed a bug in the Delta Lake input source where only 1024 rows per Parquet 
file
   were ingested. Ingestion tasks now return all rows from each file.
   
   ---
   
   ##### Key changed/added classes in this PR
   - `DeltaInputSourceReader` (fix)
   - `LargeRowGroupDeltaTable` (new test descriptor)
   - `DeltaInputSourceTest` (new `BatchDrainRegressionTests` inner class)
   - `src/test/resources/large-row-group-table` (new test Delta table)
   
   ---
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] a release note entry in the PR description.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to