Hello All,

I have been using the AWS setup for EMR for some time and am now implementing
Spark/Shark on my own cluster. I am installing from
https://github.com/downloads/mesos/spark/spark-0.6.0-sources.tar.gz, which
includes Hive 0.9.0. I am using this with S3 and am unable to recover
partitions from a directory that contains a series of other directories (the
partitions). I want two partitions, ds=2012-10-25 and ds=2012-10-26, each
containing its respective files. For example, I have the following files
located at s3://varickTest3/nn/:


drwxrwxrwx   -          0 1970-01-01 00:00 /nn/ds=2012-10-25
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00000
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00001
drwxrwxrwx   -          0 1970-01-01 00:00 /nn/ds=2012-10-26
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00000
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00001


When I run the following query in Hive (not Shark):


CREATE EXTERNAL TABLE wiki(id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://varickTest3/nn';

ALTER TABLE wiki RECOVER PARTITIONS;


This will result in an empty table.
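
For reference, I'm confirming the emptiness with standard Hive commands (just a quick sanity check):

SHOW PARTITIONS wiki;
SELECT COUNT(1) FROM wiki;

SHOW PARTITIONS returns no rows and the count comes back 0.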


I have tried many variations of this and nothing has worked so far, including
adding:

MSCK REPAIR TABLE wiki;

I have also tried using s3 rather than s3n (credentials for both schemes are
set in core-site.xml, roughly as below).
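
For completeness, the core-site.xml entries look roughly like this (actual keys elided; assuming the stock Hadoop 1.x S3 property names):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>...</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>...</value>
</property>

with the matching fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey pair for the s3 scheme.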


I have also tried setting the options:

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;


However, if I use:

LOCATION 's3n://varickTest3/nn/*'


then the table does have content, but I am still unable to recover the partitions.


Is there any way, using settings or the data layout (rather than writing a
script), to partition the table from these directories, as I can on AWS EMR?
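
For what it's worth, I know the partitions can be added by hand with something like the following (a sketch of the per-partition approach I'm trying to avoid scripting):

ALTER TABLE wiki ADD PARTITION (ds='2012-10-25') LOCATION 's3n://varickTest3/nn/ds=2012-10-25';
ALTER TABLE wiki ADD PARTITION (ds='2012-10-26') LOCATION 's3n://varickTest3/nn/ds=2012-10-26';

With many dates, though, that means generating one statement per directory from a bucket listing, which is exactly the script I'd like to avoid.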


Thank you for any help anyone can give me.
