Along similar lines, I want Hive to include subdirectories. That is, I have an external table partitioned by month (the data for each month sits under its own folder). Under the current month's folder I want to keep adding new folders daily. Is this possible without having to subclass InputFormat?

On Aug 19, 2011, at 1:22 PM, Dave wrote:

> I solved my own problem. For anyone who's curious:
>
> It turns out that subclassing an InputFormat allows one to override the
> listStatus method, which returns the list of files for Hive (or mapreduce in
> general) to process. All I had to do was subclass
> org.apache.hadoop.mapred.TextInputFormat and override the listStatus method
> and voila; I was able to make it ignore directories. Here's the java code
> that I used:
>
> public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
>     @Override
>     protected FileStatus[] listStatus (JobConf job) throws IOException {
>         FileStatus[] files = super.listStatus(job);
>         List<FileStatus> newFiles = new ArrayList<FileStatus>();
>         int len = files.length;
>         for (int i = 0; i < len; ++i) {
>             FileStatus file = files[i];
>             if (!file.isDir()) {
>                 newFiles.add(file);
>             }
>         }
>
>         files = new FileStatus[newFiles.size()];
>         for (int i = 0; i < newFiles.size(); ++i) {
>             files[i] = newFiles.get(i);
>         }
>
>         return files;
>     }
> }
>
> And the HiveQL code I used to define the table:
>
> CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT
> 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/data/test/users';
>
> Hope this saves someone else the trouble of figuring it out...
>
> -Dave
>
> On Thu, Aug 18, 2011 at 3:53 PM, Dave <drive...@gmail.com> wrote:
> Hi,
>
> I have a partitioned external table in Hive, and in the partition directories
> there are other subdirectories that are not related to the table itself. Hive
> seems to want to scan those directories, as I am getting an error message
> when trying to do a SELECT on the table:
>
> Failed with exception java.io.IOException:java.io.IOException: Not a file:
> hdfs://path/to/partition/path/to/subdir
>
> Also, it seems to ignore directories prefixed by an underscore (_directory).
>
> I am using hive 0.7.1 on Hadoop 0.20.2.
>
> Is there a way to force Hive to ignore all subdirectories in external tables
> and only look at files?
>
> Thanks in advance,
> -Dave
>

Sam William
sa...@stumbleupon.com
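
For illustration only, here is a rough sketch of the opposite behaviour from the quoted listStatus override: instead of dropping directories, it descends into them and returns only the plain files underneath. It is written against the local filesystem with java.nio so that it runs on its own; it is not Hadoop or Hive code and has not been tried with Hive 0.7.1. The class name RecursiveFileLister and its methods are made up for this example. Skipping names that start with an underscore mirrors the behaviour noted in the thread; skipping names that start with a dot is an extra assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Hive.
public class RecursiveFileLister {

    // Assumption: treat names starting with "_" or "." as hidden and skip them.
    public static boolean isHidden(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith("_") || name.startsWith(".");
    }

    // Returns every non-hidden regular file under dir, at any depth, sorted by path.
    public static List<Path> listFilesRecursively(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        collect(dir, result);
        Collections.sort(result);
        return result;
    }

    private static void collect(Path dir, List<Path> out) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (isHidden(entry)) {
                    continue;              // skip _directory style entries
                }
                if (Files.isDirectory(entry)) {
                    collect(entry, out);   // descend instead of failing on a directory
                } else {
                    out.add(entry);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway "month" folder with two daily folders and one hidden folder.
        Path month = Files.createTempDirectory("month-2011-08-");
        for (String day : new String[] {"2011-08-18", "2011-08-19", "_tmp"}) {
            Path dayDir = Files.createDirectories(month.resolve(day));
            Files.write(dayDir.resolve("part-00000"), "1\tdave\n".getBytes());
        }
        for (Path file : listFilesRecursively(month)) {
            System.out.println(month.relativize(file));
        }
    }
}
```

In an InputFormat subclass like the one quoted above, the same loop would presumably sit inside the listStatus override, using FileStatus.isDir() and FileSystem.listStatus(Path) in place of the java.nio calls, and expanding each directory returned by super.listStatus(job) rather than discarding it.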