Hello,

Hive supports table sampling by number of rows. The documentation is at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling#LanguageManualSampling-BlockSampling
It states:

"For example, the following query will take the first 10 rows from each input split.
SELECT * FROM source TABLESAMPLE(10 ROWS);"

But when I look at the code in FetchOperator.java at https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java I see the method below; check the lines marked with "// <--". It looks like sampling exits as soon as the required number of records (size) has been obtained from the splits, i.e. if the first input split provides the needed data, it won't go over the rest of the splits, and only records from the first split are returned. This contradicts what the documentation states.

When I run a TABLESAMPLE query with a number of rows, I also get all the rows from the same split. I validated this by selecting INPUT__FILE__NAME as well (my data on HDFS has thousands of files).

Am I missing something, or is this a bug?

    private FetchInputFormatSplit[] splitSampling(SplitSample splitSample,
        FetchInputFormatSplit[] splits) {
      long totalSize = 0;
      for (FetchInputFormatSplit split : splits) {
        totalSize += split.getLength();
      }
      List<FetchInputFormatSplit> result =
          new ArrayList<FetchInputFormatSplit>(splits.length);
      long targetSize = splitSample.getTargetSize(totalSize);   // <--
      int startIndex = splitSample.getSeedNum() % splits.length;
      long size = 0;
      for (int i = 0; i < splits.length; i++) {
        FetchInputFormatSplit split = splits[(startIndex + i) % splits.length];
        result.add(split);
        long splitgLength = split.getLength();
        if (size + splitgLength >= targetSize) {
          if (size + splitgLength > targetSize) {                // <--
            split.shrinkedLength = targetSize - size;            // <--
          }                                                      // <--
          break;                                                 // <--
        }                                                        // <--
        size += splitgLength;
      }
      return result.toArray(new FetchInputFormatSplit[result.size()]);
    }

The Hive bug for this is https://issues.apache.org/jira/browse/HIVE-3401

--
Thanks and regards,
Sandeep Khurana
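P.S. To make the early-exit behaviour concrete, here is a small standalone sketch of the same loop. It is only a simplified model, not Hive's code: the class name SplitSamplingSketch, the sample() helper, and the split lengths are made up for illustration, and splits are represented as plain long lengths instead of FetchInputFormatSplit objects. The target size is passed in directly rather than computed by SplitSample.getTargetSize().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the loop in splitSampling().
public class SplitSamplingSketch {

    // Returns one {splitIndex, lengthToRead} pair per selected split.
    static List<long[]> sample(long[] lengths, long targetSize, int seedNum) {
        List<long[]> result = new ArrayList<>();
        int startIndex = seedNum % lengths.length;
        long size = 0;
        for (int i = 0; i < lengths.length; i++) {
            int idx = (startIndex + i) % lengths.length;
            long len = lengths[idx];
            if (size + len >= targetSize) {
                // Target reached: shrink this split if needed and stop.
                // Later splits are never visited.
                result.add(new long[] {idx, targetSize - size});
                break;
            }
            result.add(new long[] {idx, len});
            size += len;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input: four splits of length 100 each, target size 50.
        long[] lengths = {100, 100, 100, 100};
        for (long[] s : sample(lengths, 50, 0)) {
            System.out.println("split " + s[0] + ": " + s[1]);
        }
        // prints "split 0: 50"
    }
}
```

With these made-up numbers only the first split is selected, which matches what I see when I select INPUT__FILE__NAME. With a target of 250 the sketch selects splits 0, 1 and 2 (the last one shrunk to 50), so more than one split is touched only when the first one is too small to satisfy the target.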