Well, you seem to have both performance and consistency problems. Using a CDC tool that fits your database, you might be able to fix both. However, streaming the change events from the database log might be a bit more complicated. Tools like https://debezium.io/ could be useful, depending on your source database.
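Just as an illustration, and not a working config: a minimal sketch of consuming SQL Server change events with Debezium's embedded engine could look roughly like the code below. The engine API and property names vary between Debezium versions, all connection details are made-up placeholders, and SQL Server additionally needs CDC enabled on the source database, so please check the Debezium docs before relying on any of it.

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SqlServerCdcSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Connector and offset bookkeeping (property names depend on the
        // Debezium version, treat them as assumptions).
        props.setProperty("name", "sqlserver-cdc-sketch");
        props.setProperty("connector.class",
            "io.debezium.connector.sqlserver.SqlServerConnector");
        props.setProperty("offset.storage",
            "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.history",
            "io.debezium.relational.history.FileDatabaseHistory");
        props.setProperty("database.history.file.filename", "/tmp/dbhistory.dat");

        // Source database -- placeholder values only.
        props.setProperty("database.hostname", "dbhost");
        props.setProperty("database.port", "1433");
        props.setProperty("database.user", "etl_user");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "mydb");
        props.setProperty("database.server.name", "mydb-server");
        props.setProperty("table.whitelist", "dbo.my_table");

        // Every change event carries the operation type plus before/after row
        // images read from the transaction log, so inserts, updates and
        // deletes are all captured without re-scanning the table.
        DebeziumEngine<ChangeEvent<String, String>> engine =
            DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(record -> System.out.println(record.value()))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
    }
}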
Best,
Georg

Am Mo., 25. Mai 2020 um 08:16 Uhr schrieb Manjunath Shetty H <
manjunathshe...@live.com>:

> Hi Georg,
>
> Thanks for the response, can you please elaborate what you mean by change
> data capture?
>
> Thanks
> Manjunath
> ------------------------------
> *From:* Georg Heiler <georg.kf.hei...@gmail.com>
> *Sent:* Monday, May 25, 2020 11:14 AM
> *To:* Manjunath Shetty H <manjunathshe...@live.com>
> *Cc:* Mike Artz <michaelea...@gmail.com>; user <user@spark.apache.org>
> *Subject:* Re: Parallelising JDBC reads in spark
>
> Why don't you apply proper change data capture?
> This will be more complex though.
>
> Am Mo., 25. Mai 2020 um 07:38 Uhr schrieb Manjunath Shetty H <
> manjunathshe...@live.com>:
>
> Hi Mike,
>
> Thanks for the response.
>
> Even with that flag set, data can still be missed, right? As the fetch is
> based on the last watermark (the maximum timestamp of the rows the last
> batch job fetched), take a scenario like this with a table:
>
> a : 1
> b : 2
> c : 3
> d : 4
> *f : 6*
> g : 7
> h : 8
> e : 5
>
> - a, b, c, d, e get picked by the first task
> - by the time the second task starts, e has been updated, so the row
>   order changes
> - as f moves up, it will be missed completely in the fetch
>
> Thanks
> Manjunath
>
> ------------------------------
> *From:* Mike Artz <michaelea...@gmail.com>
> *Sent:* Monday, May 25, 2020 10:50 AM
> *To:* Manjunath Shetty H <manjunathshe...@live.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: Parallelising JDBC reads in spark
>
> Does anything different happen when you set the isolationLevel to do
> dirty reads, i.e. "READ_UNCOMMITTED"?
>
> On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H <
> manjunathshe...@live.com> wrote:
>
> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how to
> parallelise reads of a single table into multiple tasks without missing
> any data.
>
> We have tried this:
>
> - Use the `ROW_NUMBER` window function in the SQL query
> - Then do:
>
> DataFrame df =
>     hiveContext
>         .read()
>         .jdbc(
>             <url>,
>             query,
>             "row_num",
>             1,
>             <upper_limit>,
>             noOfPartitions,
>             jdbcOptions);
>
> The problem with this approach is that if our tables get updated in SQL
> Server while the tasks are still running, the `ROW_NUMBER` values change
> and we may miss some records.
>
> Any approach to fix this issue? Any pointers will be helpful.
>
> *Note*: I am on Spark 1.6
>
> Thanks
>
> Manjunath Shetty
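PS: purely for illustration, the ROW_NUMBER-partitioned read described in the original mail might look roughly like the sketch below in the Spark 1.6 Java API. The table, columns, timestamps, bounds and connection details are made-up placeholders, and bounding the query by a fixed watermark window still does not protect against rows moving while the tasks run, which is exactly why change data capture came up.

import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class PartitionedJdbcReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("partitioned-jdbc-read");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Placeholder connection details.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb";
        Properties jdbcOptions = new Properties();
        jdbcOptions.put("user", "etl_user");
        jdbcOptions.put("password", "secret");

        // Derived table that assigns ROW_NUMBER within a fixed watermark
        // window; the JDBC source requires the subquery to be aliased.
        // Column names and timestamps are placeholders.
        String query =
            "(SELECT *, ROW_NUMBER() OVER (ORDER BY modified_ts, id) AS row_num"
          + " FROM dbo.my_table"
          + " WHERE modified_ts >  '2020-05-25 07:45:00'"
          + "   AND modified_ts <= '2020-05-25 08:00:00') AS t";

        long upperLimit = 1000000L;   // assumed upper bound on row_num
        int noOfPartitions = 8;

        // Spark turns this into noOfPartitions parallel queries, each one
        // reading a contiguous range of row_num values.
        DataFrame df = hiveContext.read().jdbc(
            url, query, "row_num", 1L, upperLimit, noOfPartitions, jdbcOptions);

        df.write().mode("append").parquet("/staging/my_table");

        sc.stop();
    }
}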