Along similar lines, I want Hive to include subdirectories. That is:

I have an external table partitioned by month (the data for each month lives 
under its own folder). Under the current month, I want to keep adding folders 
daily. Is this possible without having to subclass InputFormat?
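
For reference, the subclassing route I would like to avoid would presumably mirror Dave's filter quoted below, except that it would descend into subdirectories instead of dropping them. Here is a minimal sketch of that recursion in plain Java. It is not Hive or Hadoop code: java.nio.file stands in for the Hadoop FileSystem API, the class and method names (RecursiveFileLister, listFiles) are hypothetical, and skipping names that start with "_" or "." is an assumption based on the hidden-directory behaviour mentioned in the thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists every regular file under a partition folder,
// including files in nested (for example, daily) subfolders.
public class RecursiveFileLister {

    public static List<Path> listFiles(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        collect(root, files);
        return files;
    }

    private static void collect(Path dir, List<Path> out) throws IOException {
        List<Path> children;
        // Files.list returns a lazy stream that must be closed.
        try (Stream<Path> stream = Files.list(dir)) {
            children = stream.sorted().collect(Collectors.toList());
        }
        for (Path child : children) {
            String name = child.getFileName().toString();
            // Assumption: treat "_" and "." prefixed entries as hidden.
            if (name.startsWith("_") || name.startsWith(".")) {
                continue;
            }
            if (Files.isDirectory(child)) {
                collect(child, out);   // descend into the subfolder
            } else {
                out.add(child);        // keep plain files
            }
        }
    }
}
```

A real version would do the same walk inside an overridden listStatus, which is exactly the subclassing I am hoping a configuration setting can replace.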




On Aug 19, 2011, at 1:22 PM, Dave wrote:

> I solved my own problem. For anyone who's curious:
> 
> It turns out that subclassing an InputFormat allows one to override the 
> listStatus method, which returns the list of files for Hive (or MapReduce in 
> general) to process. All I had to do was subclass 
> org.apache.hadoop.mapred.TextInputFormat and override the listStatus method, 
> and voila: I was able to make it ignore directories. Here's the Java code 
> that I used:
> 
> package com.example.mapreduce.input;
> 
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> 
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
> 
> // TextInputFormat variant that drops directories from the input listing.
> public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
>     @Override
>     protected FileStatus[] listStatus(JobConf job) throws IOException {
>         // Keep only the plain files from the parent's listing.
>         List<FileStatus> newFiles = new ArrayList<FileStatus>();
>         for (FileStatus file : super.listStatus(job)) {
>             if (!file.isDir()) {
>                 newFiles.add(file);
>             }
>         }
>         return newFiles.toArray(new FileStatus[newFiles.size()]);
>     }
> }
> 
> And the HiveQL code I used to define the table:
> 
> CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT 
> 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/data/test/users';
> 
> Hope this saves someone else the trouble of figuring it out...
> 
> -Dave
> 
> On Thu, Aug 18, 2011 at 3:53 PM, Dave <drive...@gmail.com> wrote:
> Hi,
> 
> I have a partitioned external table in Hive, and in the partition directories 
> there are other subdirectories that are not related to the table itself. Hive 
> seems to want to scan those directories, as I am getting an error message 
> when trying to do a SELECT on the table:
> 
> Failed with exception java.io.IOException:java.io.IOException: Not a file: 
> hdfs://path/to/partition/path/to/subdir
> 
> Also, it seems to ignore directories prefixed by an underscore (_directory).
> 
> I am using Hive 0.7.1 on Hadoop 0.20.2.
> 
> Is there a way to force Hive to ignore all subdirectories in external tables 
> and only look at files?
> 
> Thanks in advance,
> -Dave
> 

Sam William
sa...@stumbleupon.com


