Hi,

I would suggest you look at Sqoop as well. Essentially, you can provide a
split-by column, which Sqoop uses to distribute the data among your stated
number of mappers.
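For reference, a minimal Sqoop invocation along those lines might look like
the following; the connection string, username, table name, and target
directory are placeholders, and the split column mirrors the ID primary key
discussed below. Sqoop queries MIN/MAX of the split-by column itself and
divides that range evenly across the mappers, which is essentially what the
JDBC options further down do by hand:

    sqoop import \
      --connect "jdbc:oracle:thin:@//oracle-host:1521/ORCL" \
      --username scott \
      -P \
      --table T1 \
      --split-by ID \
      --num-mappers 20 \
      --target-dir /data/t1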
On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:

> Hi Mich,
>
> I have a question below.
>
> I want to join two tables and return the result based on an input value.
> In this case, how do we specify the lower bound and upper bound values?
>
> select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where
> t1.transactionid = 11111 and t1.id = t2.id
>
> 11111 => dynamic input value.
>
> Regards,
> Rajesh
>
> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> If you have your RDBMS table partitioned, then you need to consider how
>> much data you want to extract, in other words the result set returned by
>> the JDBC call.
>>
>> If you want all the data, then the number of partitions specified in the
>> JDBC call should be equal to the number of partitions in your RDBMS
>> table.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 14 August 2016 at 21:44, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>
>>> Thank you very much sir.
>>>
>>> I forgot to mention that two of these Oracle tables are range
>>> partitioned. In that case, what would be the optimum number of
>>> partitions, if you can share?
>>>
>>> Warmest
>>>
>>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> If you have primary keys on these tables, then you can parallelise
>>> reading the data.
>>>
>>> You have to be careful not to set the number of partitions too high.
>>> There is certainly a balance between the number of partitions supplied
>>> to JDBC and the load on the network and the source DB.
>>>
>>> Assuming that your underlying table has primary key ID, this will
>>> create 20 parallel connections to the Oracle DB:
>>>
>>> val d = HiveContext.read.format("jdbc").options(
>>>   Map("url" -> _ORACLEserver,
>>>     // the subquery is pushed down and executed on Oracle
>>>     "dbtable" -> "(SELECT <COL1>, <COL2>, ... FROM <TABLE>)",
>>>     "partitionColumn" -> "ID",        // numeric column to split on
>>>     "lowerBound" -> "1",              // smallest ID expected
>>>     "upperBound" -> maxID.toString,   // largest ID, as a numeric string
>>>     "numPartitions" -> "20",
>>>     "user" -> _username,
>>>     "password" -> _password)).load
>>>
>>> assuming your upper bound on ID is held in a variable maxID.
>>>
>>> This will open multiple connections to the RDBMS, each fetching a
>>> subset of the data that you want.
>>>
>>> You need to test it to ensure that numPartitions is optimal and that
>>> you don't overload any component.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
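A sketch of one way to fill in those bounds dynamically rather than
hard-coding them, which also covers Rajesh's dynamic-filter question above:
run a one-off JDBC query for MIN/MAX of the partition column over exactly
the rows you want, then feed the results into the partitioned read. The
connection variables follow Mich's snippet and the table and column names
come from Rajesh's query; everything else is illustrative and assumes the
filter matches at least one row.

    // One-off, single-partition JDBC query for the real bounds. The inline
    // view is pushed down and runs entirely on Oracle.
    val bounds = HiveContext.read.format("jdbc").options(
      Map("url" -> _ORACLEserver,
          "dbtable" -> "(SELECT MIN(t1.id) AS min_id, MAX(t1.id) AS max_id FROM t1 WHERE t1.transactionid = 11111)",
          "user" -> _username,
          "password" -> _password)).load

    val row = bounds.head
    val minID = row.getDecimal(0).longValue   // Oracle NUMBER arrives as java.math.BigDecimal
    val maxID = row.getDecimal(1).longValue

    // Partitioned read of the join itself, split on t1.id.
    val joined = HiveContext.read.format("jdbc").options(
      Map("url" -> _ORACLEserver,
          "dbtable" -> ("(SELECT t1.id, t1.name, t2.course, t2.qualification " +
                        "FROM t1, t2 WHERE t1.transactionid = 11111 AND t1.id = t2.id)"),
          "partitionColumn" -> "id",
          "lowerBound" -> minID.toString,
          "upperBound" -> maxID.toString,
          "numPartitions" -> "20",
          "user" -> _username,
          "password" -> _password)).load

Note that lowerBound and upperBound only shape how the range is split; rows
whose ids fall outside the range still land in the first and last
partitions, so no data is lost if the bounds are approximate.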
>>> On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>
>>> Hi,
>>>
>>> There are 4 tables ranging from 10 million to 100 million rows, but
>>> they all have primary keys.
>>>
>>> The network is fine, but our Oracle is RAC and we can only connect to a
>>> designated Oracle node (where we have a DQ account only).
>>>
>>> We have a limited time window of a few hours to get the required data
>>> out.
>>>
>>> Thanks
>>>
>>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> How big are your tables, and is there any issue with the network
>>> between your Spark nodes and your Oracle DB that adds to the problem?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>
>>> Hi Gurus,
>>>
>>> I have a few large tables in an RDBMS (ours is Oracle). We want to
>>> access these tables through Spark JDBC.
>>>
>>> What is the quickest way of getting the data into a Spark DataFrame,
>>> say with multiple connections from Spark?
>>>
>>> Thanking you

--
Best Regards,
Ayan Guha
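For range-partitioned source tables like the ones Ashok describes, a
further option is the predicates overload of DataFrameReader.jdbc, which
takes one WHERE-clause fragment per desired Spark partition. Aligning the
fragments with the Oracle partition boundaries yields one connection per
partition and needs no numeric bounds at all. A minimal sketch, reusing the
connection variables from Mich's snippet; the column name and date ranges
are purely illustrative:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", _username)
    props.setProperty("password", _password)

    // One WHERE-clause fragment per Oracle range partition; Spark opens
    // one JDBC connection per array element.
    val predicates = Array(
      "created_date <  DATE '2016-01-01'",
      "created_date >= DATE '2016-01-01' AND created_date < DATE '2016-07-01'",
      "created_date >= DATE '2016-07-01'")

    val d = HiveContext.read.jdbc(_ORACLEserver, "<TABLE>", predicates, props)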