Hi,

I would suggest you look at Sqoop as well. Essentially, you can provide a
split-by column, which Sqoop uses to distribute the data among your stated
number of mappers.
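For reference, a minimal Sqoop invocation along those lines might look like
the following; the connection string, username, table name, and target
directory are placeholders, and the split column mirrors the ID primary key
discussed below. Sqoop queries MIN/MAX of the split-by column itself and
divides that range evenly across the mappers, which is essentially what the
JDBC options further down do by hand:

    sqoop import \
      --connect "jdbc:oracle:thin:@//oracle-host:1521/ORCL" \
      --username scott \
      -P \
      --table T1 \
      --split-by ID \
      --num-mappers 20 \
      --target-dir /data/t1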
On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:

> Hi Mich,
>
> I have a question below.
>
> I want to join two tables and return the result based on an input value.
> In this case, how do we specify the lower bound and upper bound values?
>
> select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where
> t1.transactionid = 11111 and t1.id = t2.id
>
> 11111 => dynamic input value.
>
> Regards,
> Rajesh
>
> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> If you have your RDBMS table partitioned, then you need to consider how
>> much data you want to extract, in other words the result set returned by
>> the JDBC call.
>>
>> If you want all the data, then the number of partitions specified in the
>> JDBC call should be equal to the number of partitions in your RDBMS
>> table.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 14 August 2016 at 21:44, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>
>>> Thank you very much sir.
>>>
>>> I forgot to mention that two of these Oracle tables are range
>>> partitioned. In that case, what would be the optimum number of
>>> partitions, if you can share?
>>>
>>> Warmest
>>>
>>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> If you have primary keys on these tables, then you can parallelise
>>> reading the data.
>>>
>>> You have to be careful not to set the number of partitions too high.
>>> There is certainly a balance between the number of partitions supplied
>>> to JDBC and the load on the network and the source DB.
>>>
>>> Assuming that your underlying table has primary key ID, this will
>>> create 20 parallel connections to the Oracle DB:
>>>
>>> val d = HiveContext.read.format("jdbc").options(
>>>   Map("url" -> _ORACLEserver,
>>>     // the subquery is pushed down and executed on Oracle
>>>     "dbtable" -> "(SELECT <COL1>, <COL2>, ... FROM <TABLE>)",
>>>     "partitionColumn" -> "ID",        // numeric column to split on
>>>     "lowerBound" -> "1",              // smallest ID expected
>>>     "upperBound" -> maxID.toString,   // largest ID, as a numeric string
>>>     "numPartitions" -> "20",
>>>     "user" -> _username,
>>>     "password" -> _password)).load
>>>
>>> assuming your upper bound on ID is held in a variable maxID.
>>>
>>> This will open multiple connections to the RDBMS, each fetching a
>>> subset of the data that you want.
>>>
>>> You need to test it to ensure that numPartitions is optimal and that
>>> you don't overload any component.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
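A sketch of one way to fill in those bounds dynamically rather than
hard-coding them, which also covers Rajesh's dynamic-filter question above:
run a one-off JDBC query for MIN/MAX of the partition column over exactly
the rows you want, then feed the results into the partitioned read. The
connection variables follow Mich's snippet and the table and column names
come from Rajesh's query; everything else is illustrative and assumes the
filter matches at least one row.

    // One-off, single-partition JDBC query for the real bounds. The inline
    // view is pushed down and runs entirely on Oracle.
    val bounds = HiveContext.read.format("jdbc").options(
      Map("url" -> _ORACLEserver,
          "dbtable" -> "(SELECT MIN(t1.id) AS min_id, MAX(t1.id) AS max_id FROM t1 WHERE t1.transactionid = 11111)",
          "user" -> _username,
          "password" -> _password)).load

    val row = bounds.head
    val minID = row.getDecimal(0).longValue   // Oracle NUMBER arrives as java.math.BigDecimal
    val maxID = row.getDecimal(1).longValue

    // Partitioned read of the join itself, split on t1.id.
    val joined = HiveContext.read.format("jdbc").options(
      Map("url" -> _ORACLEserver,
          "dbtable" -> ("(SELECT t1.id, t1.name, t2.course, t2.qualification " +
                        "FROM t1, t2 WHERE t1.transactionid = 11111 AND t1.id = t2.id)"),
          "partitionColumn" -> "id",
          "lowerBound" -> minID.toString,
          "upperBound" -> maxID.toString,
          "numPartitions" -> "20",
          "user" -> _username,
          "password" -> _password)).load

Note that lowerBound and upperBound only shape how the range is split; rows
whose ids fall outside the range still land in the first and last
partitions, so no data is lost if the bounds are approximate.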
>>> On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>
>>> Hi,
>>>
>>> There are 4 tables ranging from 10 million to 100 million rows, but
>>> they all have primary keys.
>>>
>>> The network is fine, but our Oracle is RAC and we can only connect to a
>>> designated Oracle node (where we have a DQ account only).
>>>
>>> We have a limited time window of a few hours to get the required data
>>> out.
>>>
>>> Thanks
>>>
>>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> How big are your tables, and is there any issue with the network
>>> between your Spark nodes and your Oracle DB that adds to the problem?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>
>>> Hi Gurus,
>>>
>>> I have a few large tables in an RDBMS (ours is Oracle). We want to
>>> access these tables through Spark JDBC.
>>>
>>> What is the quickest way of getting the data into a Spark DataFrame,
>>> say with multiple connections from Spark?
>>>
>>> Thanking you

--
Best Regards,
Ayan Guha
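For range-partitioned source tables like the ones Ashok describes, a
further option is the predicates overload of DataFrameReader.jdbc, which
takes one WHERE-clause fragment per desired Spark partition. Aligning the
fragments with the Oracle partition boundaries yields one connection per
partition and needs no numeric bounds at all. A minimal sketch, reusing the
connection variables from Mich's snippet; the column name and date ranges
are purely illustrative:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", _username)
    props.setProperty("password", _password)

    // One WHERE-clause fragment per Oracle range partition; Spark opens
    // one JDBC connection per array element.
    val predicates = Array(
      "created_date <  DATE '2016-01-01'",
      "created_date >= DATE '2016-01-01' AND created_date < DATE '2016-07-01'",
      "created_date >= DATE '2016-07-01'")

    val d = HiveContext.read.jdbc(_ORACLEserver, "<TABLE>", predicates, props)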