I think that’s a known performance issue(Compared to Hive) of Spark SQL in multi-inserts. A workaround is create a temp cached table for the projection first, and then do the multiple inserts base on the cached table.
We are actually working on the POC of some similar cases, hopefully it comes out soon. Hao From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: Friday, October 9, 2015 3:08 AM To: user Subject: Re: Insert via HiveContext is slow Forgot to mention that my insert is a multi table insert : sqlContext2.sql("""from avro_events lateral view explode(usChnlList) usParamLine as usParamLine lateral view explode(dsChnlList) dsParamLine as dsParamLine insert into table UpStreamParam partition(day_ts, cmtsid) select cmtstimestamp,datats,macaddress, usParamLine['chnlidx'] chnlidx, usParamLine['modulation'] modulation, usParamLine['severity'] severity, usParamLine['rxpower'] rxpower, usParamLine['sigqnoise'] sigqnoise, usParamLine['noisedeviation'] noisedeviation, usParamLine['prefecber'] prefecber, usParamLine['postfecber'] postfecber, usParamLine['txpower'] txpower, usParamLine['txpowerdrop'] txpowerdrop, usParamLine['nmter'] nmter, usParamLine['premtter'] premtter, usParamLine['postmtter'] postmtter, usParamLine['unerroreds'] unerroreds, usParamLine['corrected'] corrected, usParamLine['uncorrectables'] uncorrectables, from_unixtime(cast(datats/1000 as bigint),'yyyyMMdd') day_ts, cmtsid insert into table DwnStreamParam partition(day_ts, cmtsid) select cmtstimestamp,datats,macaddress, dsParamLine['chnlidx'] chnlidx, dsParamLine['modulation'] modulation, dsParamLine['severity'] severity, dsParamLine['rxpower'] rxpower, dsParamLine['sigqnoise'] sigqnoise, dsParamLine['noisedeviation'] noisedeviation, dsParamLine['prefecber'] prefecber, dsParamLine['postfecber'] postfecber, dsParamLine['sigqrxmer'] sigqrxmer, dsParamLine['sigqmicroreflection'] sigqmicroreflection, dsParamLine['unerroreds'] unerroreds, dsParamLine['corrected'] corrected, dsParamLine['uncorrectables'] uncorrectables, from_unixtime(cast(datats/1000 as bigint),'yyyyMMdd') day_ts, cmtsid """) On Thu, Oct 8, 2015 at 9:51 PM, Daniel Haviv <daniel.ha...@veracity-group.com<mailto:daniel.ha...@veracity-group.com>> wrote: Hi, I'm inserting into a partitioned ORC table using an insert sql statement passed via HiveContext. The performance I'm getting is pretty bad and I was wondering if there are ways to speed things up. Would saving the DF like this df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Tablename") be faster ? Thank you. Daniel