Well, you seem to have both performance and consistency problems. Using a CDC tool that fits your database, you might be able to fix both. However, streaming the change events from the database log might be a bit more complicated. Tools like https://debezium.io/ could be useful, depending on your source database.
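Just as an illustration, and not a working config: a minimal sketch of consuming SQL Server change events with Debezium's embedded engine could look roughly like the code below. The engine API and property names vary between Debezium versions, all connection details are made-up placeholders, and SQL Server additionally needs CDC enabled on the source database, so please check the Debezium docs before relying on any of it.

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SqlServerCdcSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Connector and offset bookkeeping (property names depend on the
        // Debezium version, treat them as assumptions).
        props.setProperty("name", "sqlserver-cdc-sketch");
        props.setProperty("connector.class",
            "io.debezium.connector.sqlserver.SqlServerConnector");
        props.setProperty("offset.storage",
            "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.history",
            "io.debezium.relational.history.FileDatabaseHistory");
        props.setProperty("database.history.file.filename", "/tmp/dbhistory.dat");

        // Source database -- placeholder values only.
        props.setProperty("database.hostname", "dbhost");
        props.setProperty("database.port", "1433");
        props.setProperty("database.user", "etl_user");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "mydb");
        props.setProperty("database.server.name", "mydb-server");
        props.setProperty("table.whitelist", "dbo.my_table");

        // Every change event carries the operation type plus before/after row
        // images read from the transaction log, so inserts, updates and
        // deletes are all captured without re-scanning the table.
        DebeziumEngine<ChangeEvent<String, String>> engine =
            DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(record -> System.out.println(record.value()))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
    }
}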
Best,
Georg

Am Mo., 25. Mai 2020 um 08:16 Uhr schrieb Manjunath Shetty H <
manjunathshe...@live.com>:

> Hi Georg,
>
> Thanks for the response, can you please elaborate what you mean by change
> data capture?
>
> Thanks
> Manjunath
> ------------------------------
> *From:* Georg Heiler <georg.kf.hei...@gmail.com>
> *Sent:* Monday, May 25, 2020 11:14 AM
> *To:* Manjunath Shetty H <manjunathshe...@live.com>
> *Cc:* Mike Artz <michaelea...@gmail.com>; user <user@spark.apache.org>
> *Subject:* Re: Parallelising JDBC reads in spark
>
> Why don't you apply proper change data capture?
> This will be more complex though.
>
> Am Mo., 25. Mai 2020 um 07:38 Uhr schrieb Manjunath Shetty H <
> manjunathshe...@live.com>:
>
> Hi Mike,
>
> Thanks for the response.
>
> Even with that flag set, data can still be missed, right? As the fetch is
> based on the last watermark (the maximum timestamp of the rows the last
> batch job fetched), take a scenario like this with a table:
>
> a : 1
> b : 2
> c : 3
> d : 4
> *f : 6*
> g : 7
> h : 8
> e : 5
>
> - a, b, c, d, e get picked by the first task
> - by the time the second task starts, e has been updated, so the row
>   order changes
> - as f moves up, it will be missed completely in the fetch
>
> Thanks
> Manjunath
>
> ------------------------------
> *From:* Mike Artz <michaelea...@gmail.com>
> *Sent:* Monday, May 25, 2020 10:50 AM
> *To:* Manjunath Shetty H <manjunathshe...@live.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: Parallelising JDBC reads in spark
>
> Does anything different happen when you set the isolationLevel to do
> dirty reads, i.e. "READ_UNCOMMITTED"?
>
> On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H <
> manjunathshe...@live.com> wrote:
>
> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we are facing is how to
> parallelise reads of a single table into multiple tasks without missing
> any data.
>
> We have tried this:
>
> - Use the `ROW_NUMBER` window function in the SQL query
> - Then do:
>
> DataFrame df =
>     hiveContext
>         .read()
>         .jdbc(
>             <url>,
>             query,
>             "row_num",
>             1,
>             <upper_limit>,
>             noOfPartitions,
>             jdbcOptions);
>
> The problem with this approach is that if our tables get updated in SQL
> Server while the tasks are still running, the `ROW_NUMBER` values change
> and we may miss some records.
>
> Any approach to fix this issue? Any pointers will be helpful.
>
> *Note*: I am on Spark 1.6
>
> Thanks
>
> Manjunath Shetty
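PS: purely for illustration, the ROW_NUMBER-partitioned read described in the original mail might look roughly like the sketch below in the Spark 1.6 Java API. The table, columns, timestamps, bounds and connection details are made-up placeholders, and bounding the query by a fixed watermark window still does not protect against rows moving while the tasks run, which is exactly why change data capture came up.

import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class PartitionedJdbcReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("partitioned-jdbc-read");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Placeholder connection details.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb";
        Properties jdbcOptions = new Properties();
        jdbcOptions.put("user", "etl_user");
        jdbcOptions.put("password", "secret");

        // Derived table that assigns ROW_NUMBER within a fixed watermark
        // window; the JDBC source requires the subquery to be aliased.
        // Column names and timestamps are placeholders.
        String query =
            "(SELECT *, ROW_NUMBER() OVER (ORDER BY modified_ts, id) AS row_num"
          + " FROM dbo.my_table"
          + " WHERE modified_ts >  '2020-05-25 07:45:00'"
          + "   AND modified_ts <= '2020-05-25 08:00:00') AS t";

        long upperLimit = 1000000L;   // assumed upper bound on row_num
        int noOfPartitions = 8;

        // Spark turns this into noOfPartitions parallel queries, each one
        // reading a contiguous range of row_num values.
        DataFrame df = hiveContext.read().jdbc(
            url, query, "row_num", 1L, upperLimit, noOfPartitions, jdbcOptions);

        df.write().mode("append").parquet("/staging/my_table");

        sc.stop();
    }
}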