Gourav, I'm assuming you misread the code. It's 30 partitions, which isn't a ridiculous value. Maybe you mistook the upperBound for the number of partitions? (That would be ridiculous.)
Why not use the PK as the partition column? Obviously it depends on the
downstream queries. If you're going to be performing joins (which I assume
is the case) then partitioning on the join column would be advisable, but
what about the case where the join column is heavily skewed?

Thanks!
Gary

On 24 October 2017 at 23:41, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Naveen,
>
> I do not think that it is prudent to use the PK as the partitionColumn.
> That is too many partitions for any system to handle. numPartitions
> behaves quite differently in the JDBC case.
>
> Please keep me updated on how things go.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadh...@umail.iu.edu> wrote:
>
>> Hi,
>>
>> I am trying to fetch data from an Oracle DB using a subquery and am
>> experiencing a lot of performance issues.
>>
>> Below is the code I am using (Spark 2.0.2):
>>
>> val df = spark_session.read.format("jdbc")
>>   .option("driver", "oracle.jdbc.OracleDriver")
>>   .option("url", jdbc_url)
>>   .option("user", user)
>>   .option("password", pwd)
>>   .option("dbtable", "subquery")
>>   .option("partitionColumn", "id") // primary key column, uniformly distributed
>>   .option("lowerBound", "1")
>>   .option("upperBound", "500000")
>>   .option("numPartitions", 30)
>>   .load()
>>
>> The above read is configured for 30 partitions, but when I look at the UI
>> it is only using 1 partition to run the query.
>>
>> Can anyone tell if I am missing anything, or do I need to do anything
>> else to tune the performance of the query?
>>
>> Thanks
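
P.S. Naveen, a couple of things worth double-checking. If "dbtable" really
is a subquery, it has to be parenthesised and given an alias, e.g.
"(select ...) t", because Spark embeds whatever you pass into a SELECT of
its own. Also, a common cause of the single-partition behaviour is the
partitionColumn option not being picked up at all (a typo in the option
name, for instance), in which case Spark silently reads everything in one
partition. The explicit jdbc() overload takes the same settings as typed
arguments, so that failure mode disappears. A rough sketch -- the table
and column names below are placeholders, not your actual values:

    import java.util.Properties

    // Connection details -- substitute your own.
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", pwd)
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // The subquery must be parenthesised and aliased so Spark can wrap it,
    // roughly as: SELECT * FROM (select ...) t WHERE <partition predicate>
    val table = "(select id, col_a, col_b from some_table) t"

    val df = spark_session.read.jdbc(
      jdbc_url,
      table,
      "id",      // partition column: numeric, roughly uniform
      1L,        // lowerBound
      500000L,   // upperBound
      30,        // numPartitions
      props)

If that works you should see 30 tasks for the scan in the UI, each one
issuing a WHERE clause over its own slice of the id range.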