[ 
https://issues.apache.org/jira/browse/IMPALA-13548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021946#comment-18021946
 ] 

ASF subversion and git services commented on IMPALA-13548:
----------------------------------------------------------

Commit e05d92cb3d0aa46c7eed8e30a8e580b01254ea34 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e05d92cb3 ]

IMPALA-13548: Schedule scan ranges oldest to newest for tuple caching

Scheduling does not sort scan ranges by modification time. When a new
file is added to a table, its order in the list of scan ranges is
not based on modification time. Instead, it is based on which partition
it belongs to and what its filename is. A new file that is added early
in the list of scan ranges can cause cascading differences in scheduling.
For tuple caching, this means that multiple runtime cache keys could
change due to adding a single file.

To minimize that disruption, this adds the ability to sort the scan
ranges by modification time and schedule scan ranges oldest to newest.
This enables it for scan nodes that feed into tuple cache nodes
(similar to deterministic scan range assignment).
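
As an illustration of the effect described above, here is a minimal
sketch, not the Impala scheduler itself. It assumes a simple round-robin
single-pass assignment as a stand-in for the real scheduling logic, and
the ScanRange record, its field names, and the helper functions are all
hypothetical. It compares how many nodes see a different set of files
after one new file is added, under each ordering.

```python
from collections import namedtuple

# Hypothetical scan range record; field names are illustrative only.
ScanRange = namedtuple("ScanRange", ["partition", "filename", "mtime"])


def assign_single_pass(scan_ranges, num_nodes):
    """Hand out scan ranges in list order, round-robin.

    Assumption: round-robin is a simplified stand-in for the real
    single-pass scheduler, used only to show the ordering effect.
    """
    assignment = {node: [] for node in range(num_nodes)}
    for i, sr in enumerate(scan_ranges):
        assignment[i % num_nodes].append(f"{sr.partition}/{sr.filename}")
    return assignment


def changed_nodes(before, after):
    """Return the nodes whose list of assigned files differs."""
    return [node for node in before if before[node] != after[node]]


def by_partition_and_name(ranges):
    return sorted(ranges, key=lambda sr: (sr.partition, sr.filename))


def oldest_to_newest(ranges):
    # Tie-break on partition and filename so the order stays deterministic.
    return sorted(ranges, key=lambda sr: (sr.mtime, sr.partition, sr.filename))


existing = [
    ScanRange("p=1", "b.parq", 100),
    ScanRange("p=1", "c.parq", 200),
    ScanRange("p=2", "a.parq", 300),
    ScanRange("p=2", "d.parq", 400),
    ScanRange("p=3", "a.parq", 500),
    ScanRange("p=3", "b.parq", 600),
]
# Newest file, but it sorts first by partition and filename.
new_file = ScanRange("p=1", "a.parq", 700)

changed_by_name = changed_nodes(
    assign_single_pass(by_partition_and_name(existing), 3),
    assign_single_pass(by_partition_and_name(existing + [new_file]), 3),
)
changed_by_mtime = changed_nodes(
    assign_single_pass(oldest_to_newest(existing), 3),
    assign_single_pass(oldest_to_newest(existing + [new_file]), 3),
)

print(changed_by_name)   # [0, 1, 2]: every node's assignment shifted
print(changed_by_mtime)  # [0]: only the node that received the new file
```

In this toy setup, the new file lands at the front of the name-ordered
list and shifts every later assignment, while in the time-ordered list
it lands at the end and touches a single node.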

Testing:
 - Modified TestTupleCacheFullCluster::test_scan_range_distributed
   to have stricter checks about how many cache keys change after
   an insert (only one should change)
 - Modified TupleCacheTest#testDeterministicScheduling to verify that
   oldest to newest scheduling is also enabled.
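
The "only one should change" check can be pictured with the sketch
below. It is not the code from the tests named above. It assumes a
hypothetical per-node key derived by hashing the node's assigned files,
which only stands in for Impala's real runtime cache key computation,
and the node and file names are made up.

```python
import hashlib


def runtime_key(files):
    """Hash a node's assigned files into a stand-in cache key.

    Assumption: the real runtime cache key is computed differently;
    this only needs to change whenever the assigned files change.
    """
    joined = "\n".join(sorted(files))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def count_changed_keys(before, after):
    """Count nodes whose stand-in key differs between two assignments."""
    nodes = set(before) | set(after)
    return sum(
        1
        for node in nodes
        if runtime_key(before.get(node, [])) != runtime_key(after.get(node, []))
    )


# Hypothetical assignments before and after inserting one file (f7).
before = {"node0": ["f1", "f4"], "node1": ["f2", "f5"], "node2": ["f3", "f6"]}
after = {"node0": ["f1", "f4", "f7"], "node1": ["f2", "f5"], "node2": ["f3", "f6"]}

# With oldest to newest scheduling, only the node that got f7 should differ.
assert count_changed_keys(before, after) == 1
```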

Change-Id: Ia4108c7a00c6acf8bbfc036b2b76e7c02ae44d47
Reviewed-on: http://gerrit.cloudera.org:8080/23228
Reviewed-by: Michael Smith <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Add a mode to schedule scan ranges in order of modification time
> ----------------------------------------------------------------
>
>                 Key: IMPALA-13548
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13548
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>    Affects Versions: Impala 4.5.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>
> When a file gets added to a table, the scheduler can be unstable in 
> how it assigns scan ranges. The scheduler walks through the scan ranges 
> and hands them out in a single pass. If the new scan range is at the end of 
> the list, then there is minimal disruption: every assignment stays the 
> same except for the node that gets the new scan range. However, if the new 
> scan range is early in the list, its assignment can change subsequent 
> assignments of other scan ranges. This can cascade and result in an entirely 
> different assignment.
> This is bad for the tuple cache, because it makes it difficult to get cache 
> hits for a table that is ingesting data.
> If the scan ranges were ordered by modification time (ascending), then new 
> scan ranges for an ingest would be at the end of the list and cause minimal 
> disruption.
> We should add a mode that does this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
