Spark Streaming with Kafka and Python

2020-08-12 Thread Hamish Whittal
Hi folks, Thought I would ask here because it's somewhat confusing. I'm using Spark 2.4.5 on EMR 5.30.1 with Amazon MSK. The version of Scala used is 2.11.12. I'm using this version of the library: spark-streaming-kafka-0-8_2.11-2.4.5.jar. Now I'm wanting to read from Kafka topics using Python (
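[For reference, a minimal untested sketch of what the Kafka 0.8 DStream integration that jar exposes to Python looks like; broker and topic names are placeholders, and the jar must be supplied to spark-submit, e.g. via --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.5:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-dstream-sketch")
ssc = StreamingContext(sc, 10)  # 10-second batches

# broker list and topic name below are placeholders
stream = KafkaUtils.createDirectStream(
    ssc, ["my_topic"], {"metadata.broker.list": "broker1:9092"})

# each record arrives as a (key, value) pair; print the values
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()]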

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread German Schiavon
Hey, maybe I'm missing some restriction with EMR, but have you tried using Structured Streaming instead of Spark Streaming? https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html Regards
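[For reference, a minimal sketch of the Structured Streaming approach the linked page describes; broker and topic names are placeholders, and it assumes the Kafka source package is supplied to spark-submit, e.g. via --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
      .option("subscribe", "my_topic")                    # placeholder
      .load())

# Kafka keys/values arrive as binary; cast to strings before use
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())

query.awaitTermination()]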

How can I use pyspark to upsert one row without replacing entire table

2020-08-12 Thread Siavash Namvar
Hi, I have a use case where I read data from a db table and need to update a few rows based on the primary key without replacing the entire table. For instance, if I have the 3 following rows: --- id | fname --- 1 | jo

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread Sean Owen
What supports Python in (Kafka?) 0.8? I don't think Spark ever had a specific Python-Kafka integration. But you have always been able to use it to read DataFrames as in Structured Streaming. Kafka 0.8 support is deprecated (gone in 3.0), but 0.10 means 0.10+, which works with the latest 2.x. What is the

Re: How can I use pyspark to upsert one row without replacing entire table

2020-08-12 Thread Sean Owen
It's not so much Spark as the data format, and whether it supports upserts. Parquet, CSV, JSON, etc. would not. That is what Delta, Hudi et al. are for, and yes, you can upsert them in Spark.

Re: How can I use pyspark to upsert one row without replacing entire table

2020-08-12 Thread Siavash Namvar
Thanks Sean. Do you have any URL or reference to help me with how to upsert in Spark? I need to update a Sybase db.

Re: How can I use pyspark to upsert one row without replacing entire table

2020-08-12 Thread Nicholas Gustafson
The Delta docs have examples of upserting: https://docs.delta.io/0.4.0/delta-update.html#upsert-into-a-table-using-merge
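[Following the linked docs, a minimal sketch of a MERGE-based upsert with the Delta Lake Python API; the table path and column names are placeholders, and it assumes the delta-core package (e.g. io.delta:delta-core_2.11:0.4.0) is on the classpath:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/tmp/delta/people")  # placeholder path
updates = spark.createDataFrame([(1, "joe")], ["id", "fname"])

# update matching rows in place, insert the rest; other rows are untouched
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdate(set={"fname": "u.fname"})
 .whenNotMatchedInsert(values={"id": "u.id", "fname": "u.fname"})
 .execute())]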

[Spark SQL]: Rationale for access modifiers and qualifiers in Spark

2020-08-12 Thread 김민우
https://gist.github.com/JoeyValentine/23821c27c1f540a4ac63e446e2243dbc I wonder why the constructor and stateStoreCoordinator are private[sql]. In other words, I want to know why the scope has to be [sql] and not [streaming]. And the next question is: "The reason for using private[sql

Re: How can I use pyspark to upsert one row without replacing entire table

2020-08-12 Thread ed elliott
You’ll need to do an insert and use a trigger on the table to change it into an upsert; also make sure your mode is append rather than overwrite. Ed
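[For the Spark side of that suggestion, a minimal sketch of an append-mode JDBC write; the Sybase URL, table, and credentials are placeholders, and the Sybase jConnect JDBC driver jar is assumed to be on the classpath. A database-side trigger would then turn each INSERT into an upsert:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.createDataFrame([(1, "joe")], ["id", "fname"])

(updates.write
 .format("jdbc")
 .option("url", "jdbc:sybase:Tds:dbhost:5000/mydb")  # placeholder
 .option("dbtable", "people")                        # placeholder
 .option("user", "dbuser")                           # placeholder
 .option("password", "dbpass")                       # placeholder
 .mode("append")  # append, not overwrite
 .save())]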

Re: [SPARK-STRUCTURED-STREAMING] IllegalStateException: Race while writing batch 4

2020-08-12 Thread Jungtaek Lim
File stream sink doesn't support the functionality. There are several approaches to do so: 1) two queries write to Kafka (or any intermediate storage which allows concurrent writes), and let the next Spark application read and write to the final path; 2) two queries write to two different directories, a