I think that’s a known performance issue(Compared to Hive) of Spark SQL in 
multi-inserts.
A workaround is create a temp cached table for the projection first, and then 
do the multiple inserts base on the cached table.

We are actually working on the POC of some similar cases, hopefully it comes 
out soon.

Hao

From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: Friday, October 9, 2015 3:08 AM
To: user
Subject: Re: Insert via HiveContext is slow

Forgot to mention that my insert is a multi table insert :
sqlContext2.sql("""from avro_events
               lateral view explode(usChnlList) usParamLine as usParamLine
               lateral view explode(dsChnlList) dsParamLine as dsParamLine
               insert into table UpStreamParam partition(day_ts, cmtsid)
               select cmtstimestamp,datats,macaddress,
                    usParamLine['chnlidx'] chnlidx,
                    usParamLine['modulation'] modulation,
                    usParamLine['severity'] severity,
                    usParamLine['rxpower'] rxpower,
                    usParamLine['sigqnoise'] sigqnoise,
                    usParamLine['noisedeviation'] noisedeviation,
                    usParamLine['prefecber'] prefecber,
                    usParamLine['postfecber'] postfecber,
                    usParamLine['txpower'] txpower,
                    usParamLine['txpowerdrop'] txpowerdrop,
                    usParamLine['nmter'] nmter,
                    usParamLine['premtter'] premtter,
                    usParamLine['postmtter'] postmtter,
                    usParamLine['unerroreds'] unerroreds,
                    usParamLine['corrected'] corrected,
                    usParamLine['uncorrectables'] uncorrectables,
                    from_unixtime(cast(datats/1000 as bigint),'yyyyMMdd') 
day_ts,
                                                            cmtsid
               insert into table DwnStreamParam partition(day_ts, cmtsid)
               select  cmtstimestamp,datats,macaddress,
                    dsParamLine['chnlidx'] chnlidx,
                    dsParamLine['modulation'] modulation,
                    dsParamLine['severity'] severity,
                    dsParamLine['rxpower'] rxpower,
                    dsParamLine['sigqnoise'] sigqnoise,
                    dsParamLine['noisedeviation'] noisedeviation,
                    dsParamLine['prefecber'] prefecber,
                    dsParamLine['postfecber'] postfecber,
                    dsParamLine['sigqrxmer'] sigqrxmer,
                    dsParamLine['sigqmicroreflection'] sigqmicroreflection,
                    dsParamLine['unerroreds'] unerroreds,
                    dsParamLine['corrected'] corrected,
                    dsParamLine['uncorrectables'] uncorrectables,
                    from_unixtime(cast(datats/1000 as bigint),'yyyyMMdd') 
day_ts,
                                                            cmtsid
                                                                                
    """)



On Thu, Oct 8, 2015 at 9:51 PM, Daniel Haviv 
<daniel.ha...@veracity-group.com<mailto:daniel.ha...@veracity-group.com>> wrote:
Hi,
I'm inserting into a partitioned ORC table using an insert sql statement passed 
via HiveContext.
The performance I'm getting is pretty bad and I was wondering if there are ways 
to speed things up.
Would saving the DF like this 
df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Tablename") 
be faster ?


Thank you.
Daniel

Reply via email to