[
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283035#comment-16283035
]
Xiang Li commented on HBASE-15482:
----------------------------------
[~tedyu], thanks very much for your comments!
Patch 001 addresses your comments as well as the errors reported by
checkstyle.
* "hbase.TableSnapshotInputFormat.locality" is renamed to
"hbase.TableSnapshotInputFormat.locality.enable".
* The truncation of locations is moved into getBestLocations().
* The errors reported by checkstyle are corrected.
Regarding {{moving the truncation of locations into getBestLocations()}}:
The code now follows different logic for different combinations of
hostAndWeights.length and numTopsAtMost.
There is also a small behavior change in getBestLocations() when
hostAndWeights.length is 0:
* Originally, it returned an empty list.
* After the change, it returns null. I think we do not need to allocate an
empty list here, as the locations will be used to construct
TableSnapshotInputFormatImpl.InputSplit, where null is checked as follows:
{code:title=hbase/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java|borderStyle=solid}
public InputSplit(TableDescriptor htd, HRegionInfo regionInfo, List<String> locations,
    Scan scan, Path restoreDir) {
  this.htd = htd;
  this.regionInfo = regionInfo;
  if (locations == null || locations.isEmpty()) { // <--- here
    this.locations = new String[0];
  } else {
    this.locations = locations.toArray(new String[locations.size()]);
  }
  try {
    this.scan = scan != null ? TableMapReduceUtil.convertScanToString(scan) : "";
  } catch (IOException e) {
    LOG.warn("Failed to convert Scan to String", e);
  }
  this.restoreDir = restoreDir.toString();
}
{code}
Also, TableSnapshotInputFormatImpl is @InterfaceAudience.Private, and there are
no other callers of getBestLocations() in the whole HBase project except UTs.
A UT is updated to match the change above.
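To make the behavior change concrete, here is a minimal sketch of the idea, not the code from the patch. The class name BestLocationsSketch, the helper toLocationArray(), and the use of a plain String[] of host names sorted by descending weight are all assumptions made for illustration; clamping a limit below 1 up to 1 is likewise an assumption. The real method works on host/weight pairs and may apply further filtering.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for illustration; not the code from the patch.
final class BestLocationsSketch {

  // hostsByWeightDesc: host names ordered from highest to lowest weight
  // (an assumed input shape; the real code works on host/weight pairs).
  static List<String> getBestLocations(String[] hostsByWeightDesc, int numTopsAtMost) {
    if (hostsByWeightDesc.length == 0) {
      // No empty list is allocated; the caller is expected to handle null.
      return null;
    }
    // Assumption: a limit below 1 is treated as 1.
    int top = Math.min(Math.max(numTopsAtMost, 1), hostsByWeightDesc.length);
    List<String> locations = new ArrayList<>(top);
    for (int i = 0; i < top; i++) {
      locations.add(hostsByWeightDesc[i]);
    }
    return locations;
  }

  // Mirrors the null/empty check in the InputSplit constructor quoted above.
  static String[] toLocationArray(List<String> locations) {
    if (locations == null || locations.isEmpty()) {
      return new String[0];
    }
    return locations.toArray(new String[0]);
  }
}
```

With this shape, a null result and an empty list both end up as a zero-length array in the split, which is why skipping the allocation should be safe for the one non-test caller.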
> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch
>
>
> When an MR job is reading from SnapshotInputFormat, it needs to calculate the
> splits based on the block locations in order to get the best locality. However,
> this process may take a long time for large snapshots.
> In some setups, the computing layer (Spark, Hive, or Presto) could run outside
> of the HBase cluster. In these scenarios, block locality doesn't matter.
> Therefore, it would be great to have an option to skip calculating the block
> locations for every job. That would be very useful for the Hive/Presto/Spark
> connectors.