Well, you seem to have both performance and consistency problems. Using a CDC
tool that fits your database, you might be able to fix both.
However, streaming the change events from the database log might be a bit
more complicated. Tools like https://debezium.io/ could be useful -
depending on your source
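For illustration, once a tool like Debezium has published the change events to
Kafka, a Spark job could consume them roughly like this (broker address and
topic name are placeholders):

changes = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1") \
  .option("subscribe", "dbserver1.dbo.my_table") \
  .option("startingOffsets", "earliest") \
  .load() \
  .selectExpr("CAST(value AS STRING) AS change_event")
# each change_event is a serialized change record (JSON by default with Debezium)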
Hi Georg,
Thanks for the response. Can you please elaborate on what you mean by change
data capture?
Thanks,
Manjunath
From: Georg Heiler
Sent: Monday, May 25, 2020 11:14 AM
To: Manjunath Shetty H
Cc: Mike Artz ; user
Subject: Re: Parallelising JDBC reads in spark
Why don't you apply proper change data capture?
This will be more complex though.
On Mon, 25 May 2020 at 07:38, Manjunath Shetty H <
manjunathshe...@live.com> wrote:
> Hi Mike,
>
> Thanks for the response.
>
> Even with that flag set, a data miss can happen, right? As the fetch is
> based on
Hi Mike,
Thanks for the response.
Even with that flag set, a data miss can happen, right? As the fetch is based on
the last watermark (the maximum timestamp of the rows that the last batch job fetched),
take a scenario like this with a table (a rough sketch of the fetch follows the example):
a : 1
b : 2
c : 3
d : 4
f : 6
g : 7
h : 8
e : 5
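For illustration only, the watermark-based fetch is roughly like this (table,
column, and connection details are placeholders):

last_watermark = 8  # maximum timestamp returned by the previous batch
df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db") \
  .option("dbtable", "(SELECT * FROM my_table WHERE ts > {}) AS src".format(last_watermark)) \
  .option("user", "...") \
  .option("password", "...") \
  .load()
# If row e (ts = 5) commits only after the watermark has already advanced to 8,
# the next fetch with "ts > 8" never sees it, hence the possible data miss.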
Does anything different happen when you set the isolationLevel to do
dirty reads, i.e. "READ_UNCOMMITTED"?
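As a rough sketch, one way to request dirty reads on a SQL Server JDBC read from
Spark is the sessionInitStatement option (connection details are placeholders):

df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db") \
  .option("dbtable", "my_table") \
  .option("sessionInitStatement", "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED") \
  .option("user", "...") \
  .option("password", "...") \
  .load()
# The statement runs on each JDBC session before the read starts.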
On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H
wrote:
> Hi,
>
> We are writing an ETL pipeline using Spark that fetches data from SQL
> Server in batch mode (every 15 mins). The problem we
Hi,
We are writing an ETL pipeline using Spark that fetches data from SQL Server
in batch mode (every 15 mins). The problem we are facing is how to parallelise
single-table reads into multiple tasks without missing any data.
We have tried this (a rough sketch follows):
* Use the `ROW_NUMBER` window function in th
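For illustration only, a ROW_NUMBER-based parallel JDBC read could look roughly
like this (table name, ordering column, bounds, and connection details are all
placeholders, not our actual values):

numbered = """(SELECT *, ROW_NUMBER() OVER (ORDER BY modified_ts) AS rn
               FROM my_table) AS numbered"""
df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db") \
  .option("dbtable", numbered) \
  .option("partitionColumn", "rn") \
  .option("lowerBound", "1") \
  .option("upperBound", "1000000") \
  .option("numPartitions", "8") \
  .option("user", "...") \
  .option("password", "...") \
  .load()
# Spark issues one query per partition with a WHERE clause on rn,
# so the single-table read is split across parallel tasks.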
On Sat, 16 May 2020, 22:34 Punna Yenumala, wrote:
Hi Avadhut Narayan Joshi,
The use case is achievable using Spark. Connection to SQL Server is possible,
as Mich mentioned below, as long as there is a JDBC driver that can connect to
SQL Server. For production workloads, important points to consider:
>> what are the QoS requirements for your case? at least o
How a Spark job reads data sources depends on the underlying source system and
on the job configuration, i.e. the number of executors and cores per executor.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets
About Shuffle operations.
https://spark.apache.org/docs/latest/rdd-p
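For illustration, the executor-related part of that configuration might be
supplied when building the session (values are placeholders; depending on the
cluster manager these are often set at submit time instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("jdbc-etl") \
  .config("spark.executor.instances", "4") \
  .config("spark.executor.cores", "2") \
  .getOrCreate()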
I am writing something that partitions a data set and then trains a machine
learning model on the data in each partition.
The resulting model is very big, and right now I am storing it in an RDD as
a pair of:
partition_id and very_big_model_that_is_hundreds_of_megabytes_big
but it is becoming in
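For illustration, the per-partition training pattern described above might look
roughly like this (train_model and data_rdd are hypothetical stand-ins for the
actual training code and input RDD):

def train_per_partition(partition_id, rows):
    # train one model on all rows of this partition
    model = train_model(list(rows))  # placeholder training call
    yield (partition_id, model)

models = data_rdd.mapPartitionsWithIndex(train_per_partition)
# models is an RDD of (partition_id, model) pairs, as described above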
Hi,
While reading streaming data from Kafka we use the following API:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.option("startingOffsets", "earliest") \
.load()
My question is how to see