Hi Naveen,

Could you please paste the lines from your original email again? Perhaps then Lucas can go through them completely, and kindly stop assuming that others are responding without reading.
On the other hand, please do let me know how things are going. There was another post on this a few weeks back, and several other users find this issue equally interesting to resolve. I might just have the solution for it.

Regards,
Gourav Sengupta

On Wed, Oct 25, 2017 at 9:26 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:

> Are we seeing that the UI is showing only one partition running the query?
> The original poster hasn't replied yet.
>
> My assumption is that there's only one executor configured / deployed. But
> we only know what the OP stated, which wasn't enough to be sure of anything.
>
> Why are you suggesting that partitioning on the PK isn't prudent? And did
> you mean to say that 30 partitions were far too many for any system to
> handle? (I'm assuming you misread the original code.)
>
> Gary
>
> On 25 October 2017 at 13:21, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi Lucas,
>>
>> So if I am the one assuming things, can you please explain why the UI is
>> showing only one partition running the query?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>>
>>> Gourav, I'm assuming you misread the code. It's 30 partitions, which
>>> isn't a ridiculous value. Maybe you misread the upperBound as the number
>>> of partitions? (That would be ridiculous.)
>>>
>>> Why not use the PK as the partition column? Obviously it depends on the
>>> downstream queries. If you're going to be performing joins (which I
>>> assume is the case), then partitioning on the join column would be
>>> advisable; but what about the case where the join column is heavily
>>> skewed?
>>>
>>> Thanks!
>>>
>>> Gary
>>>
>>> On 24 October 2017 at 23:41, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi Naveen,
>>>>
>>>> I do not think it is prudent to use the PK as the partitionColumn;
>>>> that would be too many partitions for any system to handle.
>>>> numPartitions behaves quite differently in the case of JDBC sources.
>>>>
>>>> Please keep me updated on how things go.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadh...@umail.iu.edu> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to fetch data from an Oracle DB using a subquery and am
>>>>> experiencing a lot of performance issues.
>>>>>
>>>>> Below is the query I am using (Spark 2.0.2):
>>>>>
>>>>> val df = spark_session.read.format("jdbc")
>>>>>   .option("driver", "oracle.jdbc.OracleDriver")
>>>>>   .option("url", jdbc_url)
>>>>>   .option("user", user)
>>>>>   .option("password", pwd)
>>>>>   .option("dbtable", "subquery")
>>>>>   .option("partitionColumn", "id") // primary key column, uniformly distributed
>>>>>   .option("lowerBound", "1")
>>>>>   .option("upperBound", "500000")
>>>>>   .option("numPartitions", 30)
>>>>>   .load()
>>>>>
>>>>> The above query should run with 30 partitions, but when I look at the
>>>>> UI it is only using 1 partition to run the query.
>>>>>
>>>>> Can anyone tell me if I am missing anything, or whether I need to do
>>>>> anything else to tune the performance of the query?
>>>>>
>>>>> Thanks
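
For anyone picking this thread up later, here is a minimal sketch of a partitioned JDBC read that makes the partition count easy to verify. The connection details, the orders table, and its numeric id column are hypothetical stand-ins, not from the original post. Two things worth checking in situations like Naveen's: a subquery passed as dbtable must be parenthesised and aliased, and partitionColumn, lowerBound, upperBound and numPartitions must all be supplied together; if partitionColumn is absent, Spark plans a single unpartitioned scan.

    import org.apache.spark.sql.SparkSession

    object JdbcPartitionCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-partition-check")
          .getOrCreate()

        // Hypothetical connection details -- replace with your own.
        val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/service"

        val df = spark.read.format("jdbc")
          .option("driver", "oracle.jdbc.OracleDriver")
          .option("url", jdbcUrl)
          .option("user", "scott")
          .option("password", "tiger")
          // A subquery must be parenthesised and aliased so Spark can
          // embed it in the per-partition SELECTs it generates.
          .option("dbtable", "(SELECT id, amount FROM orders) t")
          .option("partitionColumn", "id") // must be numeric in Spark 2.0.x
          .option("lowerBound", "1")
          .option("upperBound", "500000")
          .option("numPartitions", "30")
          .load()

        // Should print 30 if the partitioning options were picked up;
        // 1 means Spark planned a single, unpartitioned scan.
        println(s"partitions = ${df.rdd.getNumPartitions}")

        spark.stop()
      }
    }

Note that lowerBound and upperBound only control the stride of the generated WHERE clauses; rows outside that range are not filtered out, they simply all land in the first and last partitions.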
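
On Gary's question about heavily skewed join columns: a standard workaround (not discussed further in this thread) is key salting, where the skewed side gets a random salt appended to the join key and the smaller side is replicated across every salt value so each salted key still finds its match. A rough sketch with toy data, suitable for pasting into spark-shell; the facts/dims shapes and the salt count of 8 are illustrative assumptions only.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("salted-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: id = 1 is heavily over-represented on the fact side.
    val facts = Seq.fill(1000)((1, "hot")).toDF("id", "payload")
      .union(Seq((2, "cold"), (3, "cold")).toDF("id", "payload"))
    val dims = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    val numSalts = 8 // tune to the observed skew

    // Skewed side: scatter each hot key across numSalts sub-keys.
    val saltedFacts = facts.withColumn("salt", (rand() * numSalts).cast("int"))

    // Small side: replicate each row once per salt value so every
    // (id, salt) pair on the fact side finds a match.
    val saltedDims = dims.withColumn("salt",
      explode(array((0 until numSalts).map(lit): _*)))

    val joined = saltedFacts.join(saltedDims, Seq("id", "salt")).drop("salt")
    joined.groupBy("label").count().show()

The trade-off is that the small side is replicated numSalts times, so this only pays off when one side is genuinely small or the skew is severe.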