Teng,

There is no need to alter hive.metastore.warehouse.dir. Leave it as is and just create external tables with a LOCATION pointing to S3.

What I suspect you are seeing is that spark-sql writes to a temp directory within S3 and then issues a rename to the final location, as it would do on HDFS. But S3 has no rename operation, so there is a performance hit while S3 performs a copy and then a delete. I tested 1TB from/to S3 external tables and it worked; there is just the additional delay for the rename (copy).
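For example, something like this (the table name, columns, format, and bucket are just placeholders):

    -- illustrative only: adjust the schema, storage format, and S3 path to your data
    CREATE EXTERNAL TABLE my_events (
      id BIGINT,
      payload STRING
    )
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/my_events/';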
EMR has modified Hive to avoid the expensive rename, and you can take advantage of this with Spark SQL too, by copying the EMR Hive jars into the Spark classpath. Like:

    /bin/ls /home/hadoop/.versions/hive-*/lib/*.jar | xargs -n 1 -I %% cp %% ~/spark/classpath/emr

Please note that since EMR Hive is 0.13 at this time, this does break some other features already supported by spark-sql when using the built-in Hive library (for example, Avro support). So if you use this workaround to get better performance when writing to S3, be sure to test your use case.

Thanks,
Christopher

-----Original Message-----
From: chutium [mailto:teng....@gmail.com]
Sent: Wednesday, April 01, 2015 9:34 AM
To: user@spark.apache.org
Subject: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

Hi,

we always run into issues when inserting into or creating tables with the Amazon EMR Spark version: when inserting a result set of about 1GB, the Spark SQL query never finishes. Inserting a small result set (like 500MB) works fine. Leaving spark.sql.shuffle.partitions at its default of 200, or setting spark.sql.shuffle.partitions=1, does not help.

The log stops at:

    15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-10000 s3://hive-db/db_xxx/some_huge_table/

and after that there are only metrics.MetricsSaver logs.

We set:

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>s3://hive-db</value>
    </property>

but hive.exec.scratchdir is not set; I have no idea why the temp files were created in s3://hive-db/tmp/hive-hadoop/.

We just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6 (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md), and it still does not work.

Has anyone seen the same issue? Any idea how to fix it?

I believe Amazon EMR's Spark build uses com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3, rather than the original Hadoop s3n implementation, right? /home/hadoop/spark/classpath/emr/* and /home/hadoop/spark/classpath/emrfs/* are on the classpath, by the way.

Is there any plan to use the new Hadoop s3a implementation instead of s3n? (The standard Apache Hadoop s3a wiring is sketched below.)

Thanks for any help.

Teng

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-on-Spark-SQL-insert-or-create-table-with-Spark-running-on-AWS-EMR-s3n-S3NativeFileSystem-renamd-tp22340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
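For reference, on stock Apache Hadoop 2.6+ (not the EMR-patched filesystem), s3a is enabled with the standard keys below. This is only a sketch under that assumption, with placeholder credentials, and untested on the EMR AMIs:

    <!-- core-site.xml: standard Apache Hadoop 2.6+ s3a setup; credential values are placeholders -->
    <property>
      <name>fs.s3a.impl</name>
      <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>

With that in place, paths written as s3a://bucket/path go through S3AFileSystem rather than the s3n connector.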