You have two options:

a) Don't use partitioning. If the table is small, Spark will use only
one task to load it:
  val jdbcDF = sqlContext.read
    .format("jdbc")
    .options(Map(
      "url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename"))
    .load()

b) Create a view that includes a hash code column, similar to
ora_hash(r.rowid, 100), and use that column for partitioning (a sketch
of both options follows at the end of this thread).

Deenar

On 20 October 2015 at 15:19, satish chandra j
<jsatishchan...@gmail.com> wrote:

> Hi Deenar,
> Thanks for your valuable inputs.
>
> Here is a situation: a source table has no column (with unique,
> numeric, sequential values) that is suitable as the partition column
> to be specified to the JdbcRDD constructor or the DataSource API. How
> should I proceed in this scenario, and what default approach does
> Spark take if I do not supply "lowerBound" and "upperBound" to the
> JdbcRDD constructor or the DataSource API?
>
> Thanks in advance for your inputs.
>
> Regards,
> Satish Chandra J
>
>
> On Thu, Sep 24, 2015 at 10:18 PM, Deenar Toraskar <
> deenar.toras...@thinkreactive.co.uk> wrote:
>
>> You are interpreting the JdbcRDD API incorrectly. If you want to
>> use partitions, the column used to partition (and present in the
>> where clause) must be numeric, and the lower bound and upper bound
>> must be the min and max values of that column. Spark distributes
>> the range equally over the number of partitions selected. In your
>> query the first placeholder is the LIMIT and the second the OFFSET,
>> so each partition fetches a window of rows, and the windows overlap.
>>
>> numPartitions 1 - your query is called once, with the first
>> placeholder 100 and the second 0, which explains how you get 100
>> rows:
>>
>>   select * from schema.Table LIMIT 100 OFFSET 0
>>
>> numPartitions 2 - your query is called twice (100 + 51 rows), which
>> explains why you get 151 records:
>>
>>   select * from schema.Table LIMIT 100 OFFSET 50
>>   select * from schema.Table LIMIT 51 OFFSET 0
>>
>> numPartitions 3 - your query is called three times (100 + 67 + 34
>> rows), which explains why you get 201 records:
>>
>>   select * from schema.Table LIMIT 100 OFFSET 66
>>   select * from schema.Table LIMIT 67 OFFSET 33
>>   select * from schema.Table LIMIT 34 OFFSET 0
>>
>> You need to amend the query and provide correct lower and upper
>> bounds aligned to the column used in the where clause.
>>
>> See
>> http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
>>
>> Deenar
>>
>> Think Reactive Ltd
>> deenar.toras...@thinkreactive.co.uk
>> 07714140812
>>
>>
>> On 24 September 2015 at 11:55, satish chandra j
>> <jsatishchan...@gmail.com> wrote:
>>
>>> Hi Deenar,
>>>
>>> Please find the SQL query below:
>>>
>>>   var SQL_RDD = new JdbcRDD(
>>>     sc,
>>>     () => DriverManager.getConnection(url, user, pass),
>>>     "select col1, col2, col3, ... col37 from schema.Table LIMIT ? OFFSET ?",
>>>     100, 0, 1,
>>>     (r: ResultSet) =>
>>>       (r.getInt("col1"), r.getInt("col2"), ..., r.getInt("col37")))
>>>
>>> With the arguments 100, 0, 1 above, I get SQL_RDD.count as 100.
>>> When set to 100, 0, 2 I get SQL_RDD.count as 151.
>>> When set to 100, 0, 3 I get SQL_RDD.count as 201.
>>>
>>> I expect the count to be 100 on every execution; let me know if I
>>> am missing anything here.
>>>
>>> Regards,
>>> Satish Chandra
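The counts above follow directly from how JdbcRDD splits the inclusive
range [lowerBound, upperBound] into per-partition bounds and binds each
pair to the query's two '?' placeholders. The sketch below mirrors the
Spark 1.x range-splitting arithmetic; it is an illustrative standalone
re-implementation, not the actual Spark source:

  // Each partition i gets an equal slice of the inclusive range; the
  // slice's (lower, upper) pair is bound to the two placeholders.
  object JdbcRddSplitSketch {
    def partitionBounds(lowerBound: Long, upperBound: Long,
                        numPartitions: Int): Seq[(Long, Long)] = {
      // inclusive length: 1 + upper - lower (negative here, since the
      // thread passes lowerBound = 100 and upperBound = 0)
      val length = BigInt(1) + upperBound - lowerBound
      (0 until numPartitions).map { i =>
        val start = lowerBound + ((i * length) / numPartitions)
        val end = lowerBound + (((i + 1) * length) / numPartitions) - 1
        (start.toLong, end.toLong)
      }
    }

    def main(args: Array[String]): Unit = {
      // The query above is "... LIMIT ? OFFSET ?", so with arguments
      // (100, 0, n) the first bound becomes LIMIT, the second OFFSET.
      for (n <- 1 to 3) {
        val bounds = partitionBounds(100, 0, n)
        bounds.foreach { case (limit, offset) =>
          println(s"numPartitions=$n: ... LIMIT $limit OFFSET $offset")
        }
        // With well over 100 rows in the table every LIMIT is fully
        // satisfied, so the total count is the sum of the limits.
        println(s"total rows fetched: ${bounds.map(_._1).sum}")
      }
    }
  }

Running it prints limits summing to 100, 151 and 201 for one, two and
three partitions respectively, matching the counts reported above.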
>>>
>>>
>>> On Thu, Sep 24, 2015 at 12:48 AM, Deenar Toraskar <
>>> deenar.toras...@thinkreactive.co.uk> wrote:
>>>
>>>> Satish
>>>>
>>>> Can you post the SQL query you are using?
>>>>
>>>> The SQL query must have two placeholders, and both of them should
>>>> form an inclusive range (<= and >=), e.g.
>>>>
>>>>   select title, author from books where ? <= id and id <= ?
>>>>
>>>> Are you doing this?
>>>>
>>>> Deenar
>>>>
>>>> Think Reactive Ltd
>>>> deenar.toras...@thinkreactive.co.uk
>>>> 07714140812
>>>>
>>>>
>>>> On 23 September 2015 at 13:47, satish chandra j
>>>> <jsatishchan...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> Could anybody provide inputs if they have come across a similar
>>>>> issue?
>>>>>
>>>>> @Rishitesh
>>>>> Could you provide any sample code for using JdbcRDDSuite?
>>>>>
>>>>> Regards,
>>>>> Satish Chandra
>>>>>
>>>>> On Wed, Sep 23, 2015 at 5:14 PM, Rishitesh Mishra
>>>>> <rishi80.mis...@gmail.com> wrote:
>>>>>
>>>>>> I am using Spark 1.5. I always get count = 100, irrespective of
>>>>>> the number of partitions.
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 5:00 PM, satish chandra j
>>>>>> <jsatishchan...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am currently using Spark 1.2.2. Could you please let me know
>>>>>>> the output count you got by using JdbcRDDSuite?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Satish Chandra
>>>>>>>
>>>>>>> On Wed, Sep 23, 2015 at 4:02 PM, Rishitesh Mishra
>>>>>>> <rishi80.mis...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Which version of Spark are you using? I get correct results
>>>>>>>> using JdbcRDD. In fact there is a test suite precisely for
>>>>>>>> this (JdbcRDDSuite). I changed it according to your input and
>>>>>>>> got correct results from that test suite.
>>>>>>>>
>>>>>>>> On Wed, Sep 23, 2015 at 11:00 AM, satish chandra j
>>>>>>>> <jsatishchan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> The JdbcRDD constructor has the following parameters:
>>>>>>>>>
>>>>>>>>>   JdbcRDD(SparkContext sc,
>>>>>>>>>           scala.Function0<java.sql.Connection> getConnection,
>>>>>>>>>           String sql,
>>>>>>>>>           long lowerBound,
>>>>>>>>>           long upperBound,
>>>>>>>>>           int numPartitions,
>>>>>>>>>           scala.Function1<java.sql.ResultSet, T> mapRow,
>>>>>>>>>           scala.reflect.ClassTag<T> evidence$1)
>>>>>>>>>
>>>>>>>>> (see
>>>>>>>>> https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/rdd/JdbcRDD.html)
>>>>>>>>>
>>>>>>>>> where lowerBound refers to the lower boundary of the entire
>>>>>>>>> data, upperBound refers to the upper boundary of the entire
>>>>>>>>> data, and numPartitions refers to the number of partitions.
>>>>>>>>>
>>>>>>>>> The source table from which JdbcRDD fetches data in Oracle DB
>>>>>>>>> has more than 500 records, but the results are confusing when
>>>>>>>>> I try several executions changing the numPartitions parameter:
>>>>>>>>>
>>>>>>>>> lowerBound, upperBound, numPartitions : output count
>>>>>>>>> 0, 100, 1 : 100
>>>>>>>>> 0, 100, 2 : 151
>>>>>>>>> 0, 100, 3 : 201
>>>>>>>>>
>>>>>>>>> Please help me understand why the output count is 151 if
>>>>>>>>> numPartitions is 2, and 201 if numPartitions is 3.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Satish Chandra
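To close the thread out, here is a minimal sketch of each of the two
options from the first message. The connection URL, credentials, table
and column names, and the id range 1..500 are illustrative assumptions,
not values taken from the thread.

First, a corrected JdbcRDD call: an inclusive range predicate on a
numeric column, with lowerBound/upperBound set to that column's actual
min and max, so each partition reads a disjoint slice and the count no
longer depends on numPartitions:

  import java.sql.{DriverManager, ResultSet}
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.JdbcRDD

  def loadBooks(sc: SparkContext): JdbcRDD[(Int, String)] =
    new JdbcRDD(
      sc,
      () => DriverManager.getConnection(
        "jdbc:oracle:thin:@dbserver:1521:orcl", "user", "pass"),
      "select id, title from books where ? <= id and id <= ?",
      1,    // lowerBound = min(id), assumed
      500,  // upperBound = max(id), assumed
      3,    // numPartitions: three disjoint id sub-ranges
      (r: ResultSet) => (r.getInt("id"), r.getString("title")))

And option (b), for a table with no suitable numeric column (assuming
an Oracle source, where ora_hash is available, and the Spark 1.4+
DataFrame API used in the first snippet): expose a synthetic hash
column through a view and partition on it.

  import java.util.Properties
  import org.apache.spark.sql.SQLContext

  // The view is created on the database side, e.g.:
  //
  //   create or replace view schema.tablename_hashed as
  //     select t.*, ora_hash(t.rowid, 100) as part_col
  //     from schema.tablename t;
  //
  // ora_hash(rowid, 100) yields values 0..100 inclusive.
  def loadHashed(sqlContext: SQLContext) = {
    val props = new Properties()
    props.setProperty("user", "user")          // assumed credentials
    props.setProperty("password", "pass")
    sqlContext.read.jdbc(
      "jdbc:oracle:thin:@dbserver:1521:orcl",  // assumed URL
      "schema.tablename_hashed",
      "part_col",  // partition column from the view
      0L,          // lowerBound: min hash bucket
      100L,        // upperBound: max hash bucket
      10,          // numPartitions
      props)
  }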