Hello,

Following on from my earlier post about syncing Hive data from an
on-premise cluster to the cloud, I've been experimenting with the
IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to
Amazon EMR. I started out with some simple exports and imports, as these are
the core operations on which replication would be built. This worked fine
between on-premise clusters running HDP-2.2.4.


// on cluster 1

EXPORT TABLE my_table PARTITION (year_month='2015-12')
TO '/exports/my_table'
FOR REPLICATION ('1');

// Copy from cluster1:/exports/my_table to cluster2:/staging/my_table

// on cluster 2

IMPORT FROM '/staging/my_table'
LOCATION '/warehouse/my_table';

// Table created, partition created, data relocated to
// /warehouse/my_table/year_month=2015-12


I next tried something similar with HDP-2.2.4 → EMR (4.2.0), like so:

// On premise HDP2.2.4
SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

EXPORT TABLE my_table PARTITION (year_month='2015-12')
TO 's3n://API_KEY:SECRET_KEY@exports-bucket/my_table';

// on EMR
SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

IMPORT FROM 's3n://exports-bucket/my_table'
LOCATION 's3n://hive-warehouse-bucket/my_table';


The IMPORT behaviour I see is bizarre:

   1. Creates the folder 's3n://hive-warehouse-bucket/my_table'
   2. Copies the part file from
   's3n://exports-bucket/my_table/year_month=2015-12' to
   's3n://exports-bucket/my_table' (i.e. to the parent)
   3. Fails with: "ERROR exec.Task: Failed with exception checkPaths:
   s3n://exports-bucket/my_table has nested
   directorys3n://exports-bucket/my_table/year_month=2015-12"

It is as if it is attempting to set the final partition location to
's3n://exports-bucket/my_table' and not
's3n://hive-warehouse-bucket/my_table/year_month=2015-12' as happens with
HDP → HDP.
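
For anyone wanting to compare the two cases, the location Hive has
registered for a partition can be inspected after the import. This is only
a minimal sketch: it assumes the table and partition names from the
examples above, and that the import got far enough to register the
partition (which may not be true in the failing EMR case).

```sql
-- List the partitions Hive knows about for the imported table.
SHOW PARTITIONS my_table;

-- Show the partition's metadata; the "Location:" field in the output is the
-- path Hive registered for it. In the HDP → HDP case above this should be
-- /warehouse/my_table/year_month=2015-12.
DESCRIBE FORMATTED my_table PARTITION (year_month='2015-12');
```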

I've tried variations (specifying the partition on import, omitting the
location), all with the same result. Any thoughts or assistance would be
appreciated.

Thanks - Elliot.
