Teng,

There is no need to alter hive.metastore.warehouse.dir.   Leave it as is and 
just create external tables with LOCATION pointing to S3.   What I suspect you 
are seeing is that spark-sql writes to a temp directory within S3 and then 
issues a rename to the final location, as it would do on HDFS.  But S3 has no 
rename operation, so there is a performance hit while S3 performs a copy and 
then a delete.  I tested 1TB from/to S3 external tables and it worked; there is 
just the additional delay for the rename (copy).
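
For example, a minimal sketch (the bucket, table, and column names here are 
placeholders, not from your setup):

CREATE EXTERNAL TABLE my_table (
  id BIGINT,
  payload STRING
)
STORED AS TEXTFILE
LOCATION 's3://my-bucket/warehouse/my_table/';

INSERT OVERWRITE TABLE my_table
SELECT id, payload FROM source_table;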

EMR has modified Hive to avoid the expensive rename, and you can take advantage 
of this with Spark SQL too by copying the EMR Hive jars into the Spark 
classpath, like:

/bin/ls /home/hadoop/.versions/hive-*/lib/*.jar | xargs -n 1 -I %% cp %% ~/spark/classpath/emr
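
The idea is that the EMR Hive jars then shadow the Hive classes bundled with 
Spark, so Spark SQL picks up EMR's S3-aware write path instead of the 
rename-based one.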

Please note that since EMR Hive is 0.13 at this time, this does break some 
features that spark-sql already supports via its built-in Hive library (for 
example, Avro support).  So if you use this workaround to get better 
performance when writing to S3, be sure to test your use case.
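
For instance, a rough sanity check after swapping in the jars (table names are 
placeholders): re-run a representative write to S3 and a read through any 
format you depend on.

-- Representative S3 write with the EMR Hive jars in place
INSERT OVERWRITE TABLE s3_external_table
SELECT * FROM staging_table LIMIT 1000;

-- Verify any format-specific reads you rely on still work
SELECT COUNT(*) FROM existing_avro_table;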

Thanks
Christopher


-----Original Message-----
From: chutium [mailto:teng....@gmail.com] 
Sent: Wednesday, April 01, 2015 9:34 AM
To: user@spark.apache.org
Subject: Issue on Spark SQL insert or create table with Spark running on AWS 
EMR -- s3n.S3NativeFileSystem: rename never finished

Hi,

We keep running into issues when inserting into or creating tables with the 
Amazon EMR Spark version: when inserting a result set of about 1GB, the Spark 
SQL query never finishes.

Inserting a smaller result set (around 500MB) works fine.

Leaving *spark.sql.shuffle.partitions* at its default of 200 or setting *set 
spark.sql.shuffle.partitions=1* does not help.

The log stops at:

15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename
s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-10000
s3://hive-db/db_xxx/some_huge_table/

After that there are only metrics.MetricsSaver logs.

We set

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3://hive-db</value>
  </property>

but hive.exec.scratchdir is not set; I have no idea why the tmp files were 
created in s3://hive-db/tmp/hive-hadoop/.

We just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6 
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md),
and it still does not work.

Has anyone hit the same issue? Any idea how to fix it?

I believe Amazon EMR's Spark build uses 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3, not the 
original Hadoop s3n implementation, right?

Both /home/hadoop/spark/classpath/emr/*
and /home/hadoop/spark/classpath/emrfs/*
are on the classpath.

By the way, is there any plan to use the new Hadoop s3a implementation instead of s3n?

Thanks for any help.

Teng




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-on-Spark-SQL-insert-or-create-table-with-Spark-running-on-AWS-EMR-s3n-S3NativeFileSystem-renamd-tp22340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


