Hi,

We are writing an ETL pipeline using Spark that fetches data from SQL Server 
in batch mode (every 15 minutes). The problem we are facing is how to 
parallelise a single table read into multiple tasks without missing any data.

This is what we have tried:


  *   Use the `ROW_NUMBER` window function in the SQL query (a sketch of the
      `query` string is shown after the snippet below)
  *   Then do:

DataFrame df =
    hiveContext
        .read()
        .jdbc(
            <url>,            // JDBC URL of the SQL Server instance
            query,            // derived table that adds row_num via ROW_NUMBER()
            "row_num",        // column to partition the read on
            1,                // lower bound of row_num
            <upper_limit>,    // upper bound (total row count)
            noOfPartitions,   // number of parallel read tasks
            jdbcOptions);     // java.util.Properties with connection options
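
For reference, the `query` above is built along these lines (the table and
key names here are placeholders, not our real schema):

// Hypothetical derived table: SQL Server requires ROW_NUMBER() to have
// an ORDER BY, and the subquery must be aliased so the JDBC reader can
// treat it as a table.
String query =
    "(SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.id) AS row_num"
  + "   FROM dbo.my_table t) AS numbered";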



The problem with this approach: if the table gets updated in SQL Server while 
the tasks are still running, the `ROW_NUMBER` values shift, so rows can move 
between partitions and we may miss some records.
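
Would partitioning on a stable, indexed key column (instead of the computed
row_num) be the right fix? A minimal sketch of what I mean, with a
hypothetical numeric primary-key column `id` and placeholder bounds:

long minId = 1L;        // in practice: SELECT MIN(id) FROM dbo.my_table
long maxId = 1000000L;  // in practice: SELECT MAX(id) FROM dbo.my_table

// Sketch only: each row keeps its id for the whole read, so concurrent
// inserts/updates cannot shuffle rows between partitions the way
// ROW_NUMBER renumbering can.
DataFrame df =
    hiveContext
        .read()
        .jdbc(
            <url>,
            "dbo.my_table",   // placeholder table name
            "id",             // stable partition column
            minId,
            maxId,
            noOfPartitions,
            jdbcOptions);

I am not sure this is the right direction, though, hence this mail.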


Is there any approach to fix this issue? Any pointers would be helpful.


Note: I am on Spark 1.6.


Thanks

Manjiunath Shetty
