You have two options:

a) Don't use partitioning. If the table is small, Spark will use only
one task to load it:
  val jdbcDF = sqlContext.read
    .format("jdbc")
    .options(Map(
      "url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename"))
    .load()

b) Create a view that includes a hash code column, similar to
ora_hash(r.rowid, 100), and use that column for partitioning (a sketch
of both options follows at the end of this thread).

Deenar

On 20 October 2015 at 15:19, satish chandra j
<jsatishchan...@gmail.com> wrote:

> Hi Deenar,
> Thanks for your valuable inputs.
>
> Here is a situation: a source table has no column (with unique,
> numeric, sequential values) that is suitable as the partition column
> to be specified to the JdbcRDD constructor or the DataSource API. How
> should I proceed in this scenario, and what default approach does
> Spark take if I do not supply "lowerBound" and "upperBound" to the
> JdbcRDD constructor or the DataSource API?
>
> Thanks in advance for your inputs.
>
> Regards,
> Satish Chandra J
>
>
> On Thu, Sep 24, 2015 at 10:18 PM, Deenar Toraskar <
> deenar.toras...@thinkreactive.co.uk> wrote:
>
>> You are interpreting the JdbcRDD API incorrectly. If you want to
>> use partitions, the column used to partition (and present in the
>> where clause) must be numeric, and the lower bound and upper bound
>> must be the min and max values of that column. Spark distributes
>> the range equally over the number of partitions selected. In your
>> query the first placeholder is the LIMIT and the second the OFFSET,
>> so each partition fetches a window of rows, and the windows overlap.
>>
>> numPartitions 1 - your query is called once, with the first
>> placeholder 100 and the second 0, which explains how you get 100
>> rows:
>>
>>   select * from schema.Table LIMIT 100 OFFSET 0
>>
>> numPartitions 2 - your query is called twice (100 + 51 rows), which
>> explains why you get 151 records:
>>
>>   select * from schema.Table LIMIT 100 OFFSET 50
>>   select * from schema.Table LIMIT 51 OFFSET 0
>>
>> numPartitions 3 - your query is called three times (100 + 67 + 34
>> rows), which explains why you get 201 records:
>>
>>   select * from schema.Table LIMIT 100 OFFSET 66
>>   select * from schema.Table LIMIT 67 OFFSET 33
>>   select * from schema.Table LIMIT 34 OFFSET 0
>>
>> You need to amend the query and provide correct lower and upper
>> bounds aligned to the column used in the where clause.
>>
>> See
>> http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
>>
>> Deenar
>>
>> Think Reactive Ltd
>> deenar.toras...@thinkreactive.co.uk
>> 07714140812
>>
>>
>> On 24 September 2015 at 11:55, satish chandra j
>> <jsatishchan...@gmail.com> wrote:
>>
>>> Hi Deenar,
>>>
>>> Please find the SQL query below:
>>>
>>>   var SQL_RDD = new JdbcRDD(
>>>     sc,
>>>     () => DriverManager.getConnection(url, user, pass),
>>>     "select col1, col2, col3, ... col37 from schema.Table LIMIT ? OFFSET ?",
>>>     100, 0, 1,
>>>     (r: ResultSet) =>
>>>       (r.getInt("col1"), r.getInt("col2"), ..., r.getInt("col37")))
>>>
>>> With the arguments 100, 0, 1 above, I get SQL_RDD.count as 100.
>>> When set to 100, 0, 2 I get SQL_RDD.count as 151.
>>> When set to 100, 0, 3 I get SQL_RDD.count as 201.
>>>
>>> I expect the count to be 100 on every execution; let me know if I
>>> am missing anything here.
>>>
>>> Regards,
>>> Satish Chandra
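The counts above follow directly from how JdbcRDD splits the inclusive
range [lowerBound, upperBound] into per-partition bounds and binds each
pair to the query's two '?' placeholders. The sketch below mirrors the
Spark 1.x range-splitting arithmetic; it is an illustrative standalone
re-implementation, not the actual Spark source:

  // Each partition i gets an equal slice of the inclusive range; the
  // slice's (lower, upper) pair is bound to the two placeholders.
  object JdbcRddSplitSketch {
    def partitionBounds(lowerBound: Long, upperBound: Long,
                        numPartitions: Int): Seq[(Long, Long)] = {
      // inclusive length: 1 + upper - lower (negative here, since the
      // thread passes lowerBound = 100 and upperBound = 0)
      val length = BigInt(1) + upperBound - lowerBound
      (0 until numPartitions).map { i =>
        val start = lowerBound + ((i * length) / numPartitions)
        val end = lowerBound + (((i + 1) * length) / numPartitions) - 1
        (start.toLong, end.toLong)
      }
    }

    def main(args: Array[String]): Unit = {
      // The query above is "... LIMIT ? OFFSET ?", so with arguments
      // (100, 0, n) the first bound becomes LIMIT, the second OFFSET.
      for (n <- 1 to 3) {
        val bounds = partitionBounds(100, 0, n)
        bounds.foreach { case (limit, offset) =>
          println(s"numPartitions=$n: ... LIMIT $limit OFFSET $offset")
        }
        // With well over 100 rows in the table every LIMIT is fully
        // satisfied, so the total count is the sum of the limits.
        println(s"total rows fetched: ${bounds.map(_._1).sum}")
      }
    }
  }

Running it prints limits summing to 100, 151 and 201 for one, two and
three partitions respectively, matching the counts reported above.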
>>>
>>>
>>> On Thu, Sep 24, 2015 at 12:48 AM, Deenar Toraskar <
>>> deenar.toras...@thinkreactive.co.uk> wrote:
>>>
>>>> Satish
>>>>
>>>> Can you post the SQL query you are using?
>>>>
>>>> The SQL query must have two placeholders, and both of them should
>>>> form an inclusive range (<= and >=), e.g.
>>>>
>>>>   select title, author from books where ? <= id and id <= ?
>>>>
>>>> Are you doing this?
>>>>
>>>> Deenar
>>>>
>>>> Think Reactive Ltd
>>>> deenar.toras...@thinkreactive.co.uk
>>>> 07714140812
>>>>
>>>>
>>>> On 23 September 2015 at 13:47, satish chandra j
>>>> <jsatishchan...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> Could anybody provide inputs if they have come across a similar
>>>>> issue?
>>>>>
>>>>> @Rishitesh
>>>>> Could you provide any sample code for using JdbcRDDSuite?
>>>>>
>>>>> Regards,
>>>>> Satish Chandra
>>>>>
>>>>> On Wed, Sep 23, 2015 at 5:14 PM, Rishitesh Mishra
>>>>> <rishi80.mis...@gmail.com> wrote:
>>>>>
>>>>>> I am using Spark 1.5. I always get count = 100, irrespective of
>>>>>> the number of partitions.
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 5:00 PM, satish chandra j
>>>>>> <jsatishchan...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am currently using Spark 1.2.2. Could you please let me know
>>>>>>> the output count you got by using JdbcRDDSuite?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Satish Chandra
>>>>>>>
>>>>>>> On Wed, Sep 23, 2015 at 4:02 PM, Rishitesh Mishra
>>>>>>> <rishi80.mis...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Which version of Spark are you using? I get correct results
>>>>>>>> using JdbcRDD. In fact there is a test suite precisely for
>>>>>>>> this (JdbcRDDSuite). I changed it according to your input and
>>>>>>>> got correct results from that test suite.
>>>>>>>>
>>>>>>>> On Wed, Sep 23, 2015 at 11:00 AM, satish chandra j
>>>>>>>> <jsatishchan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> The JdbcRDD constructor has the following parameters:
>>>>>>>>>
>>>>>>>>>   JdbcRDD(SparkContext sc,
>>>>>>>>>           scala.Function0<java.sql.Connection> getConnection,
>>>>>>>>>           String sql,
>>>>>>>>>           long lowerBound,
>>>>>>>>>           long upperBound,
>>>>>>>>>           int numPartitions,
>>>>>>>>>           scala.Function1<java.sql.ResultSet, T> mapRow,
>>>>>>>>>           scala.reflect.ClassTag<T> evidence$1)
>>>>>>>>>
>>>>>>>>> (see
>>>>>>>>> https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/rdd/JdbcRDD.html)
>>>>>>>>>
>>>>>>>>> where lowerBound refers to the lower boundary of the entire
>>>>>>>>> data, upperBound refers to the upper boundary of the entire
>>>>>>>>> data, and numPartitions refers to the number of partitions.
>>>>>>>>>
>>>>>>>>> The source table from which JdbcRDD fetches data in Oracle DB
>>>>>>>>> has more than 500 records, but the results are confusing when
>>>>>>>>> I try several executions changing the numPartitions parameter:
>>>>>>>>>
>>>>>>>>> lowerBound, upperBound, numPartitions : output count
>>>>>>>>> 0, 100, 1 : 100
>>>>>>>>> 0, 100, 2 : 151
>>>>>>>>> 0, 100, 3 : 201
>>>>>>>>>
>>>>>>>>> Please help me understand why the output count is 151 if
>>>>>>>>> numPartitions is 2, and 201 if numPartitions is 3.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Satish Chandra
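To close the thread out, here is a minimal sketch of each of the two
options from the first message. The connection URL, credentials, table
and column names, and the id range 1..500 are illustrative assumptions,
not values taken from the thread.

First, a corrected JdbcRDD call: an inclusive range predicate on a
numeric column, with lowerBound/upperBound set to that column's actual
min and max, so each partition reads a disjoint slice and the count no
longer depends on numPartitions:

  import java.sql.{DriverManager, ResultSet}
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.JdbcRDD

  def loadBooks(sc: SparkContext): JdbcRDD[(Int, String)] =
    new JdbcRDD(
      sc,
      () => DriverManager.getConnection(
        "jdbc:oracle:thin:@dbserver:1521:orcl", "user", "pass"),
      "select id, title from books where ? <= id and id <= ?",
      1,    // lowerBound = min(id), assumed
      500,  // upperBound = max(id), assumed
      3,    // numPartitions: three disjoint id sub-ranges
      (r: ResultSet) => (r.getInt("id"), r.getString("title")))

And option (b), for a table with no suitable numeric column (assuming
an Oracle source, where ora_hash is available, and the Spark 1.4+
DataFrame API used in the first snippet): expose a synthetic hash
column through a view and partition on it.

  import java.util.Properties
  import org.apache.spark.sql.SQLContext

  // The view is created on the database side, e.g.:
  //
  //   create or replace view schema.tablename_hashed as
  //     select t.*, ora_hash(t.rowid, 100) as part_col
  //     from schema.tablename t;
  //
  // ora_hash(rowid, 100) yields values 0..100 inclusive.
  def loadHashed(sqlContext: SQLContext) = {
    val props = new Properties()
    props.setProperty("user", "user")          // assumed credentials
    props.setProperty("password", "pass")
    sqlContext.read.jdbc(
      "jdbc:oracle:thin:@dbserver:1521:orcl",  // assumed URL
      "schema.tablename_hashed",
      "part_col",  // partition column from the view
      0L,          // lowerBound: min hash bucket
      100L,        // upperBound: max hash bucket
      10,          // numPartitions
      props)
  }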