Well, you seem to have both performance and consistency problems. Using a CDC
tool that fits your database, you might be able to fix both.
However, streaming the change events from the database log might be a bit
more complicated. Tools like https://debezium.io/ could be useful -
depending on your source
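For illustration, once a tool like Debezium has published the change events to
Kafka, a Spark job could consume them roughly like this (broker address and
topic name are placeholders):

changes = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1") \
  .option("subscribe", "dbserver1.dbo.my_table") \
  .option("startingOffsets", "earliest") \
  .load() \
  .selectExpr("CAST(value AS STRING) AS change_event")
# each change_event is a serialized change record (JSON by default with Debezium)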
Hi Georg,
Thanks for the response. Can you please elaborate on what you mean by change
data capture?
Thanks,
Manjunath
From: Georg Heiler
Sent: Monday, May 25, 2020 11:14 AM
To: Manjunath Shetty H
Cc: Mike Artz ; user
Subject: Re: Parallelising JDBC reads in spark
Why don't you apply proper change data capture?
This will be more complex though.
On Mon, 25 May 2020 at 07:38, Manjunath Shetty H <
manjunathshe...@live.com> wrote:
> Hi Mike,
>
> Thanks for the response.
>
> Even with that flag set, a data miss can happen, right? As the fetch is
> based on
Hi Mike,
Thanks for the response.
Even with that flag set, a data miss can happen, right? As the fetch is based on
the last watermark (the maximum timestamp of the rows that the last batch job fetched),
take a scenario like this with a table (a rough sketch of the fetch follows the example):
a : 1
b : 2
c : 3
d : 4
f : 6
g : 7
h : 8
e : 5
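For illustration only, the watermark-based fetch is roughly like this (table,
column, and connection details are placeholders):

last_watermark = 8  # maximum timestamp returned by the previous batch
df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db") \
  .option("dbtable", "(SELECT * FROM my_table WHERE ts > {}) AS src".format(last_watermark)) \
  .option("user", "...") \
  .option("password", "...") \
  .load()
# If row e (ts = 5) commits only after the watermark has already advanced to 8,
# the next fetch with "ts > 8" never sees it, hence the possible data miss.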
Does anything different happen when you set the isolationLevel to do
dirty reads, i.e. "READ_UNCOMMITTED"?
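As a rough sketch, one way to request dirty reads on a SQL Server JDBC read from
Spark is the sessionInitStatement option (connection details are placeholders):

df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db") \
  .option("dbtable", "my_table") \
  .option("sessionInitStatement", "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED") \
  .option("user", "...") \
  .option("password", "...") \
  .load()
# The statement runs on each JDBC session before the read starts.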
On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H
wrote:
> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we
Hi,
We are writing an ETL pipeline using Spark that fetches data from SQL Server
in batch mode (every 15 mins). The problem we are facing is how to parallelise
single-table reads into multiple tasks without missing any data.
We have tried this (a rough sketch follows):
* Use the `ROW_NUMBER` window function in th
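For illustration only, a ROW_NUMBER-based parallel JDBC read could look roughly
like this (table name, ordering column, bounds, and connection details are all
placeholders, not our actual values):

numbered = """(SELECT *, ROW_NUMBER() OVER (ORDER BY modified_ts) AS rn
               FROM my_table) AS numbered"""
df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db") \
  .option("dbtable", numbered) \
  .option("partitionColumn", "rn") \
  .option("lowerBound", "1") \
  .option("upperBound", "1000000") \
  .option("numPartitions", "8") \
  .option("user", "...") \
  .option("password", "...") \
  .load()
# Spark issues one query per partition with a WHERE clause on rn,
# so the single-table read is split across parallel tasks.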
On Sat, 16 May 2020, 22:34 Punna Yenumala, wrote:
Hi Avadhut Narayan Joshi,
The use case is achievable using Spark. Connection to SQL Server is possible,
as Mich mentioned below, as long as there is a JDBC driver that can connect to
SQL Server. For production workloads, important points to consider:
>> what are the QoS requirements for your case? at least o
How a Spark job reads data sources depends on the underlying source system and
on the job configuration, i.e. the number of executors and cores per executor.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets
About Shuffle operations.
https://spark.apache.org/docs/latest/rdd-p
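For illustration, the executor-related part of that configuration might be
supplied when building the session (values are placeholders; depending on the
cluster manager these are often set at submit time instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("jdbc-etl") \
  .config("spark.executor.instances", "4") \
  .config("spark.executor.cores", "2") \
  .getOrCreate()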
I am writing something that partitions a data set and then trains a machine
learning model on the data in each partition.
The resulting model is very big, and right now I am storing it in an RDD as
a pair of:
partition_id and very_big_model_that_is_hundreds_of_megabytes_big
but it is becoming in
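For illustration, the per-partition training pattern described above might look
roughly like this (train_model and data_rdd are hypothetical stand-ins for the
actual training code and input RDD):

def train_per_partition(partition_id, rows):
    # train one model on all rows of this partition
    model = train_model(list(rows))  # placeholder training call
    yield (partition_id, model)

models = data_rdd.mapPartitionsWithIndex(train_per_partition)
# models is an RDD of (partition_id, model) pairs, as described above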
Hi,
While reading streaming data from Kafka we use the following API:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.option("startingOffsets", "earliest") \
.load()
My question is how to see