Hi Georg,

Thanks for the response. Can you please elaborate on what you mean by change data capture?
Thanks,
Manjunath

________________________________
From: Georg Heiler <georg.kf.hei...@gmail.com>
Sent: Monday, May 25, 2020 11:14 AM
To: Manjunath Shetty H <manjunathshe...@live.com>
Cc: Mike Artz <michaelea...@gmail.com>; user <user@spark.apache.org>
Subject: Re: Parallelising JDBC reads in spark

Why don't you apply proper change data capture? This will be more complex, though.

On Mon, 25 May 2020 at 07:38, Manjunath Shetty H <manjunathshe...@live.com> wrote:

Hi Mike,

Thanks for the response. Even with that flag set, data can still be missed, right? The fetch is based on the last watermark (the maximum timestamp of the rows the last batch job fetched). Take a scenario like this with a table:

a : 1
b : 2
c : 3
d : 4
e : 5
f : 6
g : 7
h : 8

* a, b, c, d, e get picked by the first task
* by the time the second task starts, e has been updated, so the row order changes
* as f moves up, it gets missed completely in the fetch

Thanks
Manjunath

________________________________
From: Mike Artz <michaelea...@gmail.com>
Sent: Monday, May 25, 2020 10:50 AM
To: Manjunath Shetty H <manjunathshe...@live.com>
Cc: user <user@spark.apache.org>
Subject: Re: Parallelising JDBC reads in spark

Does anything different happen when you set the isolationLevel to do dirty reads, i.e. "READ_UNCOMMITTED"?

On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H <manjunathshe...@live.com> wrote:

Hi,

We are writing an ETL pipeline using Spark that fetches data from SQL Server in batch mode (every 15 minutes). The problem we are facing is how to parallelise reads of a single table into multiple tasks without missing any data.

We have tried this:

* Use the `ROW_NUMBER` window function in the SQL query
* Then do:

    DataFrame df = hiveContext
        .read()
        .jdbc(
            <url>,
            query,
            "row_num",
            1,
            <upper_limit>,
            noOfPartitions,
            jdbcOptions);

The problem with this approach is that if our table gets updated in SQL Server while tasks are still running, the `ROW_NUMBER` values will change and we may miss some records.

Is there any approach to fix this issue? Any pointers will be helpful.

Note: I am on Spark 1.6.

Thanks
Manjunath Shetty
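For reference, a fleshed-out version of the partitioned read from the original message might look like the sketch below (Spark 1.6 Java API; the URL, table, and column names such as dbo.my_table, mod_ts, and id are hypothetical). It also shows why the data miss happens: each of the noOfPartitions tasks re-executes the subquery with its own row_num range predicate, so a concurrent update can reshuffle row numbers between tasks and drop a row such as f in the scenario above.

    import java.util.Properties;

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class PartitionedJdbcReadSketch {
        public static DataFrame readByRowNumber(HiveContext hiveContext, long upperLimit) {
            // Hypothetical connection details; replace with your own.
            String url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=etl;password=secret";

            // Wrap the table in a subquery that assigns a dense, ordered row_num.
            // Spark splits [1, upperLimit] into noOfPartitions ranges and pushes one
            // "row_num >= x AND row_num < y" predicate per task, so each task
            // re-executes this subquery and re-ranks the table independently.
            String query = "(SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.mod_ts, t.id) AS row_num"
                         + " FROM dbo.my_table t) AS src";

            Properties jdbcOptions = new Properties();
            jdbcOptions.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver");

            int noOfPartitions = 8;
            return hiveContext.read().jdbc(
                    url, query, "row_num", 1, upperLimit, noOfPartitions, jdbcOptions);
        }
    }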
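On the READ_UNCOMMITTED suggestion: the isolationLevel JDBC option in later Spark versions applies to writes, and Spark 1.6's reader does not appear to expose an isolation-level setting for reads, so on SQL Server dirty reads would more likely come from a NOLOCK table hint inside the pushed-down subquery. A sketch under the same hypothetical names:

    import java.util.Properties;

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class DirtyReadSketch {
        public static DataFrame readUncommitted(HiveContext hiveContext, long upperLimit) {
            // Hypothetical connection details; replace with your own.
            String url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=etl;password=secret";

            // WITH (NOLOCK) makes this table reference behave like READ UNCOMMITTED:
            // the scan no longer blocks on writers, but it may see uncommitted
            // (possibly rolled-back) rows, and it does NOT by itself fix the
            // row_num reshuffling problem described above.
            String query = "(SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.mod_ts, t.id) AS row_num"
                         + " FROM dbo.my_table t WITH (NOLOCK)) AS src";

            Properties jdbcOptions = new Properties();
            jdbcOptions.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver");

            return hiveContext.read().jdbc(
                    url, query, "row_num", 1, upperLimit, 8, jdbcOptions);
        }
    }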
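On the change-data-capture suggestion: with SQL Server this could be built on the Change Tracking feature (or the heavier Change Data Capture feature). Each batch asks the server which keys changed since the last synchronized version, so rows are addressed by key and version rather than by a volatile row number, and reordering cannot cause misses. A sketch assuming Change Tracking is enabled on a hypothetical dbo.my_table with primary key id:

    import java.util.Properties;

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class ChangeTrackingSketch {
        // Assumes lastVersion and currentVersion were read on the server via
        // CHANGE_TRACKING_CURRENT_VERSION() and persisted between batches
        // (e.g. in a checkpoint table). All names are hypothetical.
        public static DataFrame readChanges(HiveContext hiveContext,
                                            long lastVersion, long currentVersion) {
            // Hypothetical connection details; replace with your own.
            String url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=etl;password=secret";

            // CHANGETABLE returns the primary key and change metadata of every row
            // modified after lastVersion; the LEFT JOIN pulls the current column
            // values (NULL for deletes).
            String query = "(SELECT ct.SYS_CHANGE_VERSION AS change_version,"
                         + "        ct.SYS_CHANGE_OPERATION AS change_op, t.*"
                         + " FROM CHANGETABLE(CHANGES dbo.my_table, " + lastVersion + ") AS ct"
                         + " LEFT JOIN dbo.my_table t ON t.id = ct.id) AS changes";

            Properties jdbcOptions = new Properties();
            jdbcOptions.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver");

            // Partition the batch on the change version instead of a volatile row number.
            return hiveContext.read().jdbc(url, query, "change_version",
                    lastVersion + 1, currentVersion, 8, jdbcOptions);
        }
    }

The extra complexity Georg alludes to is mostly operational: Change Tracking has to be enabled and retained on the server, and the pipeline has to persist the synchronized version between batches.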