Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich, Thank you. Regards, Rajesh On Mon, Aug 15, 2016 at 6:35 PM, Mich Talebzadeh wrote: > Ok Rajesh > > This is standalone. > > In that case it ought to be at least 4 connections as one executor will > use one worker. > > I am hesitant in here as you can see with (at least) as with Standal

Re: parallel processing with JDBC

2016-08-15 Thread Mich Talebzadeh
Ok Rajesh. This is standalone. In that case it ought to be at least 4 connections, as one executor will use one worker. I am hesitant here, as you can see from the "(at least)", because in Standalone mode you may end up with more executors on each worker. But try it and see whether "numPartitions" -> "4"

Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich, Thank you for the detailed explanation. One more question: in my cluster, I have one master and 4 workers. In this case, will 4 connections be opened to Oracle? Regards, Rajesh On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh wrote: > It happens that the number of parallel processes open

Re: parallel processing with JDBC

2016-08-15 Thread Mich Talebzadeh
It happens that the number of parallel processes open from Spark to RDBMS is determined by the number of executors. I just tested this. With Yarn client using two executors I see two connections to RDBMS: EXECUTIONS USERNAME SID SERIAL# USERS_EXECUTING SQL_TEXT -- -- ---

Re: parallel processing with JDBC

2016-08-15 Thread Mich Talebzadeh
Hi. This is a very good question. I did some tests on this. If you are joining two tables then you are creating a result set based on some conditions. In this case what I normally do is specify an ID column from either table and base my partitioning on that ID column. This is pretty stra
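As a rough illustration of partitioning a join on an ID column, here is a minimal sketch that wraps the join as a subquery and assembles Spark JDBC read options around it. The option keys (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are Spark's JDBC data source options. The helper name jdbc_read_options, the connection URL, and the bound values are assumptions made up for this example, and the query uses only the part of Rajesh's statement that is visible in this archive (the rest is cut off).

```python
def jdbc_read_options(url, query, partition_column, lower, upper, num_partitions):
    """Build Spark JDBC options that partition a subquery on one numeric column.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    return {
        "url": url,
        # Wrap the join so Spark treats the result set as a table.
        "dbtable": f"({query}) q",
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


# Visible part of the join from the thread; the original is truncated.
join_query = (
    "select t1.id, t1.name, t2.course, t2.qualification "
    "from t1, t2 where t1.transactionid = 1 and t1.id = t2.id"
)

opts = jdbc_read_options(
    url="jdbc:oracle:thin:@//dbhost:1521/service",  # assumed placeholder URL
    query=join_query,
    partition_column="ID",
    lower=1,          # assumed minimum of the ID column
    upper=1000000,    # assumed maximum of the ID column
    num_partitions=4,
)

# With a live SparkSession and the Oracle JDBC driver on the classpath,
# the options could be passed as:
#   df = spark.read.format("jdbc").options(**opts).load()
```

In practice the bounds would normally come from a min/max query on the ID column rather than being hard-coded, and credentials would be added to the same options.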

Re: parallel processing with JDBC

2016-08-15 Thread ayan guha
Hi, I would suggest you look at Sqoop as well. Essentially, you can provide a splitBy/partitionBy column, using which data will be distributed among your stated number of mappers. On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi Mich, > > I have a bel

Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich, I have a question below. I want to join two tables and return the result based on the input value. In this case, how do we need to specify the lower bound and upper bound values? select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where t1.transactionid=*1* and t1.id = t2.id *

Re: parallel processing with JDBC

2016-08-14 Thread Mich Talebzadeh
If you have your RDBMS table partitioned, then you need to consider how much data you want to extract, in other words the result set returned by the JDBC call. If you want all the data, then the number of partitions specified in the JDBC call should be equal to the number of partitions in your RDBM

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Thank you very much sir. I forgot to mention that two of these Oracle tables are range partitioned. In that case what would be the optimum number of partitions if you can share? Warmest On Sunday, 14 August 2016, 21:37, Mich Talebzadeh wrote: If you have primary keys on these tables th

Re: parallel processing with JDBC

2016-08-14 Thread Mich Talebzadeh
If you have primary keys on these tables then you can parallelise the process of reading data. You have to be careful not to set the number of partitions too high. Certainly there is a balance between the number of partitions supplied to JDBC and the load on the network and the source DB. Assuming t
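To make the effect of the partition count concrete, here is a simplified sketch of how a numeric key range can be cut into one WHERE clause per partition, with each clause read over its own connection. It is a model of the idea, not Spark's exact code: the function name partition_predicates and the integer stride arithmetic are assumptions for this example. The first and last clauses are left open-ended, which mirrors the documented Spark behaviour that lowerBound and upperBound decide only the stride and do not filter rows.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Return one SQL predicate per partition covering the whole key space.

    Simplified illustration: integer keys, equal-width strides.
    """
    if num_partitions < 1 or lower >= upper:
        raise ValueError("need num_partitions >= 1 and lower < upper")

    stride = max((upper - lower) // num_partitions, 1)
    # Interior cut points between consecutive partitions.
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    if not bounds:
        return ["1 = 1"]  # single partition: read everything in one query

    predicates = []
    for i in range(num_partitions):
        if i == 0:
            # Open lower end, so keys below `lower` and NULLs are not lost.
            predicates.append(f"{column} < {bounds[0]} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Open upper end, so keys above `upper` are not lost.
            predicates.append(f"{column} >= {bounds[-1]}")
        else:
            predicates.append(
                f"{column} >= {bounds[i - 1]} AND {column} < {bounds[i]}"
            )
    return predicates


preds = partition_predicates("ID", 1, 101, 4)
for p in preds:
    print(p)
```

Each predicate becomes a separate query against the source database, so doubling the partition count roughly doubles the number of queries Oracle has to serve for the same data, which is the load trade-off described above.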

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Hi, There are 4 tables ranging from 10 million to 100 million rows, but they all have primary keys. The network is fine, but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only). We have a limited time window of a few hours to get the required data o

Re: parallel processing with JDBC

2016-08-14 Thread Mich Talebzadeh
How big are your tables, and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the problem? HTH Dr Mich Talebzadeh