Hi Naveen,

I do not think that it is prudent to use the PK as the partitionColumn.
That is too many partitions for any system to handle. Also, numPartitions
behaves quite differently in the JDBC case than you might expect.
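
For reference, here is roughly how the JDBC source turns those options into
queries: it splits the [lowerBound, upperBound] range of the partitionColumn
into numPartitions strides and issues one SELECT per stride. A minimal sketch
of the stride arithmetic, based on my reading of JDBCRelation.columnPartition
in Spark 2.x (the boundary values below are illustrative, not authoritative):

    // Inputs taken from your snippet.
    val lowerBound = 1L
    val upperBound = 500000L
    val numPartitions = 30

    // Spark 2.x computes the stride roughly like this:
    val stride = upperBound / numPartitions - lowerBound / numPartitions  // 16666

    // It then issues one query per stride, approximately:
    //   partition 0:  SELECT ... WHERE id < 16667 OR id IS NULL
    //   partition 1:  SELECT ... WHERE id >= 16667 AND id < 33333
    //   ...
    //   partition 29: SELECT ... WHERE id >= 483315

Note that lowerBound and upperBound do not filter rows; they only shape the
strides, so ids outside the range all land in the first or last partition.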

Please keep me updated on how things go.


Regards,
Gourav Sengupta

On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadh...@umail.iu.edu>
wrote:

>
> Hi,
>
>
> I am trying to fetch data from an Oracle DB using a subquery, and I am
> experiencing a lot of performance issues.
>
> Below is the query I am using:
>
> Using Spark 2.0.2
>
> val df = spark_session.read.format("jdbc")
>   .option("driver", "oracle.jdbc.OracleDriver")
>   .option("url", jdbc_url)
>   .option("user", user)
>   .option("password", pwd)
>   .option("dbtable", "subquery")
>   .option("partitionColumn", "id")  // primary key column, uniformly distributed
>   .option("lowerBound", "1")
>   .option("upperBound", "500000")
>   .option("numPartitions", 30)
>   .load()
>
> The above query is configured to run with 30 partitions, but when I look at
> the UI, only 1 partition is being used to run it.
>
> Can anyone tell me if I am missing anything, or whether I need to do
> anything else to tune the performance of the query?
>
> Thanks
>
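
One more thing worth double-checking in the snippet above: the dbtable value
must be either a table name or a parenthesised subquery with an alias, since
Spark wraps it in a SELECT, and the four partitioning options must all be set
on the same reader; if Spark does not see the complete set, it reads the whole
table in a single partition. A minimal sketch of what I mean, assuming a
hypothetical table my_table and reusing jdbc_url/user/pwd from your snippet:

    // Hypothetical subquery; Spark runs SELECT ... FROM (<dbtable>) so a
    // subquery must be parenthesised and given an alias.
    val query = "(SELECT id, col1, col2 FROM my_table WHERE region = 'US') t"

    val df = spark_session.read.format("jdbc")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("url", jdbc_url)
      .option("user", user)
      .option("password", pwd)
      .option("dbtable", query)
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "500000")
      .option("numPartitions", "30")
      .load()

    // Quick sanity check: should print 30 if the options took effect.
    println(df.rdd.getNumPartitions)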
