Hive TableSample with number of rows.

Sandeep Khurana Thu, 24 Mar 2016 03:08:26 -0700

Hello

Hive supports table sampling by number of rows. The documentation
is at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling#LanguageManualSampling-BlockSampling


It states

"For example, the following query will take the first 10 rows from each
input split.
SELECT * FROM source TABLESAMPLE(10 ROWS);
"

But when I look at the code in FetchOperator.java at
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java

I see the method below; note the lines marked with asterisks. It looks like
sampling stops as soon as the required number of records (size) has been
collected from the splits, i.e. if the first input split provides the needed
data, it won't go over the rest of the splits, and only records from the
first split will be returned. This contradicts what the documentation states.

When I run a query using TABLESAMPLE with a number of rows, all the rows
also come from the same split. I validated this by selecting
"INPUT__FILE__NAME" as well (my data on HDFS has thousands of files).

Am I missing something or is it a bug?

private FetchInputFormatSplit[] splitSampling(SplitSample splitSample,
    FetchInputFormatSplit[] splits) {
  long totalSize = 0;
  for (FetchInputFormatSplit split: splits) {
    totalSize += split.getLength();
  }
  List<FetchInputFormatSplit> result = new ArrayList<FetchInputFormatSplit>(splits.length);
  long targetSize = splitSample.getTargetSize(totalSize);  // *
  int startIndex = splitSample.getSeedNum() % splits.length;
  long size = 0;
  for (int i = 0; i < splits.length; i++) {
    FetchInputFormatSplit split = splits[(startIndex + i) % splits.length];
    result.add(split);
    long splitgLength = split.getLength();
    if (size + splitgLength >= targetSize) {
      if (size + splitgLength > targetSize) {  // *
        split.shrinkedLength = targetSize - size;  // *
      }  // *
      break;  // *
    }  // *
    size += splitgLength;
  }
  return result.toArray(new FetchInputFormatSplit[result.size()]);
}
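
To make the early exit concrete, here is a minimal standalone sketch of the
same loop, assuming plain byte lengths in place of FetchInputFormatSplit
objects. The class name SplitSamplingSketch, the method sample, and the split
sizes are hypothetical and not part of Hive; it only mirrors the control flow
of the method pasted above.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSamplingSketch {

    /**
     * Mirrors the loop of the posted splitSampling method on plain lengths.
     * Returns {splitIndex, bytesTaken} pairs for the splits that are kept.
     */
    public static long[][] sample(long[] splitLengths, long targetSize, int seedNum) {
        List<long[]> result = new ArrayList<>();
        int startIndex = seedNum % splitLengths.length;
        long size = 0;
        for (int i = 0; i < splitLengths.length; i++) {
            int index = (startIndex + i) % splitLengths.length;
            long length = splitLengths[index];
            if (size + length >= targetSize) {
                // Target reached: keep only the needed part of this split and stop.
                result.add(new long[] {index, targetSize - size});
                break;
            }
            result.add(new long[] {index, length});
            size += length;
        }
        return result.toArray(new long[0][]);
    }

    public static void main(String[] args) {
        long[] splitLengths = {100, 100, 100}; // hypothetical split sizes in bytes
        for (long[] kept : sample(splitLengths, 50, 0)) {
            System.out.println("split " + kept[0] + ": " + kept[1] + " bytes");
        }
    }
}
```

With a target smaller than the first split, only that split is returned and
the remaining splits are never visited, which is the behaviour I am asking
about.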

The related Hive JIRA issue is https://issues.apache.org/jira/browse/HIVE-3401 .


-- 
Thanks and regards
Sandeep Khurana
