Hi Naveen,

Could you please paste the lines from your original email again? Perhaps then Lucas can go through them completely, and kindly stop assuming that others are responding without reading.
On the other hand, please do let me know how things are going. There was another post on this a few weeks back, and several other users find this issue equally interesting to resolve. I might just have the solution for it.

Regards,
Gourav Sengupta

On Wed, Oct 25, 2017 at 9:26 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:

> Are we seeing that the UI is showing only one partition running the query?
> The original poster hasn't replied yet.
>
> My assumption is that there's only one executor configured / deployed. But
> we only know what the OP stated, which wasn't enough to be sure of anything.
>
> Why are you suggesting that partitioning on the PK isn't prudent? And did
> you mean to say that 30 partitions were far too many for any system to
> handle? (I'm assuming you misread the original code.)
>
> Gary
>
> On 25 October 2017 at 13:21, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi Lucas,
>>
>> So if I am the one assuming things, can you please explain why the UI is
>> showing only one partition running the query?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>>
>>> Gourav, I'm assuming you misread the code. It's 30 partitions, which
>>> isn't a ridiculous value. Maybe you misread the upperBound as the number
>>> of partitions? (That would be ridiculous.)
>>>
>>> Why not use the PK as the partition column? Obviously it depends on the
>>> downstream queries. If you're going to be performing joins (which I
>>> assume is the case), then partitioning on the join column would be
>>> advisable; but what about the case where the join column is heavily
>>> skewed?
>>>
>>> Thanks!
>>>
>>> Gary
>>>
>>> On 24 October 2017 at 23:41, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi Naveen,
>>>>
>>>> I do not think it is prudent to use the PK as the partitionColumn;
>>>> that would be too many partitions for any system to handle.
>>>> numPartitions behaves quite differently in the case of JDBC sources.
>>>>
>>>> Please keep me updated on how things go.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadh...@umail.iu.edu> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to fetch data from an Oracle DB using a subquery and am
>>>>> experiencing a lot of performance issues.
>>>>>
>>>>> Below is the query I am using (Spark 2.0.2):
>>>>>
>>>>> val df = spark_session.read.format("jdbc")
>>>>>   .option("driver", "oracle.jdbc.OracleDriver")
>>>>>   .option("url", jdbc_url)
>>>>>   .option("user", user)
>>>>>   .option("password", pwd)
>>>>>   .option("dbtable", "subquery")
>>>>>   .option("partitionColumn", "id") // primary key column, uniformly distributed
>>>>>   .option("lowerBound", "1")
>>>>>   .option("upperBound", "500000")
>>>>>   .option("numPartitions", 30)
>>>>>   .load()
>>>>>
>>>>> The above query should run with 30 partitions, but when I look at the
>>>>> UI it is only using 1 partition to run the query.
>>>>>
>>>>> Can anyone tell me if I am missing anything, or whether I need to do
>>>>> anything else to tune the performance of the query?
>>>>>
>>>>> Thanks
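
For anyone picking this thread up later, here is a minimal sketch of a partitioned JDBC read that makes the partition count easy to verify. The connection details, the orders table, and its numeric id column are hypothetical stand-ins, not from the original post. Two things worth checking in situations like Naveen's: a subquery passed as dbtable must be parenthesised and aliased, and partitionColumn, lowerBound, upperBound and numPartitions must all be supplied together; if partitionColumn is absent, Spark plans a single unpartitioned scan.

    import org.apache.spark.sql.SparkSession

    object JdbcPartitionCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-partition-check")
          .getOrCreate()

        // Hypothetical connection details -- replace with your own.
        val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/service"

        val df = spark.read.format("jdbc")
          .option("driver", "oracle.jdbc.OracleDriver")
          .option("url", jdbcUrl)
          .option("user", "scott")
          .option("password", "tiger")
          // A subquery must be parenthesised and aliased so Spark can
          // embed it in the per-partition SELECTs it generates.
          .option("dbtable", "(SELECT id, amount FROM orders) t")
          .option("partitionColumn", "id") // must be numeric in Spark 2.0.x
          .option("lowerBound", "1")
          .option("upperBound", "500000")
          .option("numPartitions", "30")
          .load()

        // Should print 30 if the partitioning options were picked up;
        // 1 means Spark planned a single, unpartitioned scan.
        println(s"partitions = ${df.rdd.getNumPartitions}")

        spark.stop()
      }
    }

Note that lowerBound and upperBound only control the stride of the generated WHERE clauses; rows outside that range are not filtered out, they simply all land in the first and last partitions.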
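
On Gary's question about heavily skewed join columns: a standard workaround (not discussed further in this thread) is key salting, where the skewed side gets a random salt appended to the join key and the smaller side is replicated across every salt value so each salted key still finds its match. A rough sketch with toy data, suitable for pasting into spark-shell; the facts/dims shapes and the salt count of 8 are illustrative assumptions only.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("salted-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: id = 1 is heavily over-represented on the fact side.
    val facts = Seq.fill(1000)((1, "hot")).toDF("id", "payload")
      .union(Seq((2, "cold"), (3, "cold")).toDF("id", "payload"))
    val dims = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    val numSalts = 8 // tune to the observed skew

    // Skewed side: scatter each hot key across numSalts sub-keys.
    val saltedFacts = facts.withColumn("salt", (rand() * numSalts).cast("int"))

    // Small side: replicate each row once per salt value so every
    // (id, salt) pair on the fact side finds a match.
    val saltedDims = dims.withColumn("salt",
      explode(array((0 until numSalts).map(lit): _*)))

    val joined = saltedFacts.join(saltedDims, Seq("id", "salt")).drop("salt")
    joined.groupBy("label").count().show()

The trade-off is that the small side is replicated numSalts times, so this only pays off when one side is genuinely small or the skew is severe.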