[
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280518#comment-16280518
]
Xiang Li commented on HBASE-15482:
----------------------------------
[~davelatham], thanks very much for the comment and guidance.
I have uploaded the first patch, 000:
* The conf key is "hbase.TableSnapshotInputFormat.locality", which defaults to
true. That is, locality is always considered and the block locations are
calculated unless the key is explicitly set to false. When it is set to false,
the logic containing getBestLocations() is skipped and a new
TableSnapshotInputFormatImpl.InputSplit is created with locations set to null.
* The access modifier of both the conf key and the default value is set to
public, so that they can be accessed by test classes in other packages.
* The UTs are embedded into existing test cases.
** The test case in the mapred package covers the scenario where the conf key is
not specified and the default value of true is used.
** The test cases in the mapreduce package cover the scenarios where the conf key
is explicitly set to true or false.
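For reference, the switch described above can be sketched roughly as follows.
This is only an illustration, not the code in the patch: a plain Map stands in
for the Hadoop Configuration, a Supplier stands in for the getBestLocations()
logic, and the names LocalitySwitchSketch, getBoolean and locationsForSplit are
hypothetical. Only the key name and the default of true come from the
description above.

```java
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch only; names other than the conf key are hypothetical.
final class LocalitySwitchSketch {
    // Conf key and default value as described in the comment.
    static final String LOCALITY_KEY = "hbase.TableSnapshotInputFormat.locality";
    static final boolean LOCALITY_DEFAULT = true;

    private LocalitySwitchSketch() {
    }

    // Stand-in for a boolean lookup with a default on a configuration object.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    // Returns the hosts for a split, or null when locality is switched off.
    // The supplier represents the expensive block-location calculation and is
    // only invoked when locality is enabled.
    static String[] locationsForSplit(Map<String, String> conf, Supplier<String[]> bestLocations) {
        if (!getBoolean(conf, LOCALITY_KEY, LOCALITY_DEFAULT)) {
            return null; // skip the block-location calculation entirely
        }
        return bestLocations.get();
    }
}
```

With an empty map the supplier is invoked, which mirrors the default of true.
With the key set to "false" the supplier is never invoked and null is returned.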
Hi [~davelatham], [~liyin], [~ted_yu], could you please help review the
patch at your earliest convenience?
> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the
> splits based on the block locations in order to get best locality. However,
> this process may take a long time for large snapshots.
> In some setups, the computing layer (Spark, Hive, or Presto) could run outside
> of the HBase cluster. In these scenarios, the block locality doesn't matter.
> Therefore, it would be great to have an option to skip calculating the block
> locations for every job. That will be super useful for the Hive/Presto/Spark
> connectors.