Re: Performance issue with Spark's foreachPartition method

2015-07-27 Thread diplomatic Guru
Bagavath, sometimes we need to merge with existing records because the whole dataset is recomputed. I don't think we can achieve this with a pure insert, or is there a way? On 24 July 2015 at 08:53, Bagavath wrote: > Try using insert instead of merge. Typically we use insert append to do > bulk
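
[Editor's note] The merge-vs-insert trade-off being discussed can be made concrete. Below is a minimal Python sketch contrasting the two Oracle statements; the table name `daily_stats` and its columns are hypothetical, purely for illustration, since the real schema is not shown in the thread:

```python
# Hypothetical table (daily_stats) and columns (log_key, hits), for
# illustration only -- the real schema is not shown in the thread.

# MERGE (upsert): updates a row when the key already exists, inserts
# otherwise.  Each row requires a key lookup, which is comparatively slow.
MERGE_SQL = """\
MERGE INTO daily_stats t
USING (SELECT :log_key AS log_key, :hits AS hits FROM dual) s
ON (t.log_key = s.log_key)
WHEN MATCHED THEN UPDATE SET t.hits = t.hits + s.hits
WHEN NOT MATCHED THEN INSERT (log_key, hits) VALUES (s.log_key, s.hits)"""

# Direct-path bulk insert ("insert append"): no per-row matching, so it
# is much cheaper, but it cannot update rows that already exist -- which
# is exactly the limitation raised above when the data is recomputed.
INSERT_SQL = ("INSERT /*+ APPEND */ INTO daily_stats (log_key, hits) "
              "VALUES (:log_key, :hits)")
```

The APPEND hint only pays off for true bulk loads; if existing rows must be updated, MERGE (or insert into a staging table followed by a single set-based MERGE) is the usual compromise.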

Re: Performance issue with Spark's foreachPartition method

2015-07-24 Thread Bagavath
Try using insert instead of merge. Typically we use an insert with the APPEND hint to do bulk inserts into Oracle. On Thu, Jul 23, 2015 at 1:12 AM, diplomatic Guru wrote: > Thanks Robin for your reply. > > I'm pretty sure that writing to Oracle is taking longer as when writing to > HDFS is only taking ~5minutes.

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
Thanks Robin for your reply. I'm pretty sure that writing to Oracle is what takes longer, as writing to HDFS takes only ~5 minutes. The job writes about ~5 million records. I've set the job to call executeBatch() when the batch size reaches 200,000 records, so I assume that commit wil
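
[Editor's note] The batching scheme described above (buffer rows, flush via executeBatch() every 200,000) can be sketched in plain Python; `execute_batch` is a hypothetical callable standing in for the JDBC PreparedStatement executeBatch() call:

```python
def write_in_batches(records, execute_batch, batch_size=200_000):
    """Buffer records and flush every batch_size rows, mirroring the
    JDBC addBatch()/executeBatch() pattern described in the thread.
    execute_batch is a stand-in for the real JDBC call."""
    batch = []
    flushes = 0
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            execute_batch(batch)
            flushes += 1
            batch = []
    if batch:  # flush the trailing partial batch
        execute_batch(batch)
        flushes += 1
    return flushes
```

With ~5 million records and a batch size of 200,000 this yields about 25 flushes; whether each flush also commits (and how expensive that is) is the question Robin raises below.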

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread Robin East
The first question I would ask is: have you determined whether you have a performance issue writing to Oracle? In particular, how many commits are you making? If you are issuing a lot of commits, that would be a performance problem. Robin > On 22 Jul 2015, at 19:11, diplomatic Guru wrote: > > He
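
[Editor's note] Robin's point about commit frequency can be illustrated with a small sketch; `FakeConnection` is a stand-in for a real JDBC connection that merely counts commits, so the difference between the two strategies is visible without a database:

```python
class FakeConnection:
    """Stand-in for a JDBC/DB-API connection; only counts commits."""
    def __init__(self):
        self.commits = 0
    def commit(self):
        self.commits += 1

def write_commit_per_row(conn, rows):
    # Anti-pattern: one commit (a synchronous redo flush) per row.
    for _ in rows:
        conn.commit()

def write_commit_per_batch(conn, rows, batch_size):
    # One commit per batch: far fewer synchronous waits on the DB.
    for _ in range(0, len(rows), batch_size):
        conn.commit()
```

For 1,000 rows, the per-row version issues 1,000 commits while the per-batch version with batch_size=200 issues only 5; at the ~5 million rows mentioned in this thread the gap is large enough to dominate the job's runtime.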

Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
Hello all, We are having a major performance issue with Spark, which is holding us back from going live. We have a job that carries out computation on log files and writes the results into an Oracle DB. The reducer 'reduceByKey' has been set to a parallelism of 4 as we don't want to establish too man
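
[Editor's note] For reference, the foreachPartition pattern the subject line refers to amounts to doing setup per partition rather than per record, e.g. opening one DB connection per partition. A minimal Python sketch of that control flow (the real job is presumably Scala/Java on Spark; all names here are illustrative):

```python
def for_each_partition(partitions, f):
    """Mimics Spark's foreachPartition: f is invoked once per
    partition with an iterator over that partition's records."""
    for part in partitions:
        f(iter(part))

opened = []  # records how many "connections" were opened

def write_partition(rows):
    opened.append("conn")  # stands in for opening one JDBC connection
    for _ in rows:
        pass               # stands in for addBatch()/executeBatch()

# Four partitions, matching the parallelism of 4 mentioned above.
partitions = [[1, 2], [3], [4, 5, 6], [7]]
for_each_partition(partitions, write_partition)
```

With a parallelism of 4 this opens exactly 4 connections regardless of record count, which is the usual reason for choosing foreachPartition over foreach when writing to a database.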