Are we sure the UI is actually showing only one partition running the query? The original poster hasn't replied yet.
My assumption is that there's only one executor configured / deployed, but we only know what the OP stated, which isn't enough to be sure of anything. Why are you suggesting that partitioning on the PK isn't prudent? And did you mean to say that 30 partitions were far too many for any system to handle? (I'm assuming you misread the original code.)

Gary

On 25 October 2017 at 13:21, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Lucas,
>
> So if I am assuming things, can you please explain why the UI is showing
> only one partition running the query?
>
> Regards,
> Gourav Sengupta
>
> On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com <
> lucas.g...@gmail.com> wrote:
>
>> Gourav, I'm assuming you misread the code. It's 30 partitions, which
>> isn't a ridiculous value. Maybe you misread the upperBound for the
>> partitions? (That would be ridiculous.)
>>
>> Why not use the PK as the partition column? Obviously it depends on the
>> downstream queries. If you're going to be performing joins (which I assume
>> is the case), then partitioning on the join column would be advisable, but
>> what about the case where the join column is heavily skewed?
>>
>> Thanks!
>>
>> Gary
>>
>> On 24 October 2017 at 23:41, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>>> Hi Naveen,
>>>
>>> I do not think that it is prudent to use the PK as the partitionColumn;
>>> that is too many partitions for any system to handle. numPartitions
>>> behaves quite differently in the JDBC case.
>>>
>>> Please keep me updated on how things go.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadh...@umail.iu.edu>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to fetch data from an Oracle DB using a subquery and am
>>>> experiencing a lot of performance issues.
>>>>
>>>> Below is the query I am using.
>>>>
>>>> Using Spark 2.0.2:
>>>>
>>>> val df = spark_session.read.format("jdbc")
>>>>   .option("driver", "oracle.jdbc.OracleDriver")
>>>>   .option("url", jdbc_url)
>>>>   .option("user", user)
>>>>   .option("password", pwd)
>>>>   .option("dbtable", "subquery")
>>>>   .option("partitionColumn", "id")  // primary key column, uniformly distributed
>>>>   .option("lowerBound", "1")
>>>>   .option("upperBound", "500000")
>>>>   .option("numPartitions", 30)
>>>>   .load()
>>>>
>>>> The above query is configured to run with 30 partitions, but when I
>>>> look at the UI it is only using 1 partition to run the query.
>>>>
>>>> Can anyone tell me if I am missing anything, or whether I need to do
>>>> anything else to tune the performance of the query?
>>>>
>>>> Thanks
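For what it's worth, here is a minimal sketch of how I'd sanity-check the partition count from the driver instead of the UI. It assumes the same options as Naveen's snippet; the subquery body, table name, and connection values (jdbc_url, user, pwd) are placeholders, not details from the thread. Note that when "dbtable" is a subquery rather than a table name, it has to be parenthesised (and, for most databases, aliased) so Spark can splice it into the SELECT it generates:

    // Same read as the original post, with the subquery placeholder
    // written the way the JDBC source expects it. "some_table" and the
    // bounds are assumptions, not the OP's real schema.
    val df = spark_session.read.format("jdbc")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("url", jdbc_url)
      .option("user", user)
      .option("password", pwd)
      .option("dbtable", "(SELECT id, payload FROM some_table) t")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "500000")
      .option("numPartitions", "30")
      .load()

    // Should print 30 if the partitioned read was set up correctly;
    // 1 means Spark fell back to a single-partition scan.
    println(df.rdd.getNumPartitions)

If that prints 30 but the UI still shows one long-running task, the problem is data skew across the id range rather than the read setup.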
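And on my own question about skewed join keys: one common workaround I've seen is salting, i.e. spreading the hot key over N buckets on one side and replicating the other side N times. A rough sketch, where leftDf, rightDf, and the join column "id" are made-up names for illustration:

    import org.apache.spark.sql.functions._

    val salts = 8  // number of buckets to spread each key across

    // Random salt on the large, skewed side.
    val saltedLeft = leftDf.withColumn("salt", (rand() * salts).cast("int"))

    // Replicate every row on the other side once per salt value.
    val saltedRight = rightDf.withColumn(
      "salt", explode(array((0 until salts).map(lit): _*)))

    val joined = saltedLeft.join(saltedRight, Seq("id", "salt"))

This trades a larger right-hand side for evenly sized tasks, so it only pays off when the skew is severe.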