Re: OOM concern

2024-05-27 Thread Perez
Thank you, everyone, for your responses. I am not getting any errors as of now. I am just trying to choose the right tool for my task, which is loading data from an external source into S3 via Glue/EMR. I think a Glue job would be the best fit for me because I can calculate the DPUs needed (maybe keeping s…
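
As a rough illustration of that DPU calculation, here is a back-of-the-envelope sketch only: the 16 GB figure is the published size of a Glue G.1X worker (1 DPU = 4 vCPU / 16 GB), but the 4x memory headroom factor is an assumption, not an AWS recommendation.

    # Rough Glue DPU sizing; illustrative arithmetic only.
    data_gb = 100                 # historical load size from the thread
    mem_per_dpu_gb = 16           # one G.1X worker = 1 DPU = 4 vCPU / 16 GB
    headroom = 4                  # assumed comfort factor, not an AWS figure
    dpus = -(-data_gb * headroom // mem_per_dpu_gb)   # ceiling division
    print(f"~{dpus} G.1X DPUs for a {data_gb} GB load")   # ~25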

Re: Spark Protobuf Deserialization

2024-05-27 Thread Sandish Kumar HN
Did you try using to_protobuf and from_protobuf? https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html On Mon, May 27, 2024 at 15:45 Satyam Raj wrote: > Hello guys, we're using Spark 3.5.0 for processing a Kafka source that contains protobuf-serialized data. The format is as f…
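
For reference, a minimal sketch of the suggested API (available since Spark 3.4); the DataFrame, descriptor path, and message name below are hypothetical placeholders:

    from pyspark.sql.protobuf.functions import from_protobuf

    # kafka_df is assumed to have a binary `value` column from the Kafka source
    parsed = kafka_df.select(
        from_protobuf(kafka_df.value, "Request",
                      descFilePath="/path/to/request.desc").alias("req"))
    flat = parsed.select("req.sent_ts", "req.event")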

Re: OOM concern

2024-05-27 Thread Russell Jurney
If you're using EMR and Spark, you need to choose nodes with enough RAM to accommodate any given partition in your data, or you can get an OOM error. Not sure if this job involves a reduce, but I would choose a single 128GB+ memory-optimized instance and then adjust parallelism per the Spark docs.
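
A hedged sketch of that kind of sizing (instance size and the numbers below are illustrative, not a recommendation):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("db-to-s3")
             # leave headroom below the node's physical RAM for OS/YARN overhead
             .config("spark.executor.memory", "100g")
             .config("spark.executor.cores", "16")
             # more partitions keeps any single partition comfortably in memory
             .config("spark.sql.shuffle.partitions", "400")
             .getOrCreate())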

Re: OOM concern

2024-05-27 Thread Meena Rajani
What exactly is the error? Is it erroring out while reading the data from the DB? How are you partitioning the data? How much memory do you currently have? What is the network timeout? Regards, Meena On Mon, May 27, 2024 at 4:22 PM Perez wrote: > Hi Team, > I want to extract the data from DB and just dump it into S3…
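
One common answer to the partitioning question for JDBC sources looks like the sketch below (connection details, table, column, and bounds are all hypothetical):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "...")
          .option("partitionColumn", "order_id")  # must be numeric, date, or timestamp
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "64")          # 64 concurrent reads/partitions
          .option("fetchsize", "10000")           # rows fetched per round trip
          .load())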

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it arrives, regardless of whether certain columns, such as the timestamp column, contain NULL values. This behavior is useful when you want to handle incomplete or missing data gracefully within your stateful processing logic.
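
A minimal sketch of the behavior being described (column and variable names are hypothetical; `events` is assumed to be a streaming DataFrame with columns `id string, ts timestamp`):

    import pandas as pd
    from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

    def count_rows(key, pdfs, state: GroupState):
        total = state.get[0] if state.exists else 0
        for pdf in pdfs:
            total += len(pdf)   # rows with ts == NULL arrive here too
        state.update((total,))
        yield pd.DataFrame({"id": [key[0]], "n": [total]})

    out = (events
           .withWatermark("ts", "10 minutes")  # NULL ts never advances this
           .groupBy("id")
           .applyInPandasWithState(count_rows,
                                   outputStructType="id string, n long",
                                   stateStructType="n long",
                                   outputMode="update",
                                   timeoutConf=GroupStateTimeout.NoTimeout))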

Spark Protobuf Deserialization

2024-05-27 Thread Satyam Raj
Hello guys, we're using Spark 3.5.0 for processing a Kafka source that contains protobuf-serialized data. The format is as follows: message Request { int64 sent_ts = 1; repeated Event event = 2; } message Event { string event_name = 1; bytes event_bytes = 2; } The event_bytes field contains the data for t…

[Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Juan Casse
I am using applyInPandasWithState in PySpark 3.5.0. I noticed that records with timestamp==NULL are processed (i.e., they trigger a call to the stateful function) and, as you would expect, do not advance the watermark. I am taking advantage of this in my application. My question: is this a supported…

OOM concern

2024-05-27 Thread Perez
Hi Team, I want to extract the data from DB and just dump it into S3. I don't have to perform any transformations on the data yet. My data size would be ~100 GB (historical load). Choosing the right DPUs (Glue jobs) should solve this problem, right? Or should I move to EMR? I don't feel the need to…
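
Whichever engine is chosen, the Spark side of such a dump is small; a hedged sketch, assuming `df` is the DataFrame already read from the source DB and with a hypothetical bucket path:

    # ~100 GB at a ~128 MB target file size suggests on the order of 800 files
    (df.repartition(800)
       .write.mode("overwrite")
       .parquet("s3://my-bucket/landing/orders/"))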

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
Seen this before; had a very(!) complex plan behind a DataFrame, to the point where any additional transformation went OOM on the driver. A quick and ugly solution was to break the plan: convert the DataFrame to an RDD and back to a DataFrame at certain points to make the plan shorter. This has obvious drawbacks…
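
A minimal sketch of that trick (assuming an existing `spark` session and DataFrame `df`); PySpark's `df.localCheckpoint()` truncates lineage to similar effect:

    # Round-tripping through an RDD discards the accumulated logical plan,
    # at the cost of extra serialization and losing cross-boundary optimization.
    df_short_plan = spark.createDataFrame(df.rdd, schema=df.schema)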

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
A few ideas off the top of my head for how to go about solving the problem: 1. Try with subsets: try reproducing the issue with smaller subsets of your data to pinpoint the specific operation causing the memory problems. 2. Explode or flatten nested structures: if your DataFrame schema involves…
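
A small sketch of the explode/flatten suggestion (field names are hypothetical; assumes an `events` column of type array of struct):

    from pyspark.sql import functions as F

    flat = (df
            .withColumn("event", F.explode("events"))   # one row per array element
            .select("id",
                    F.col("event.name").alias("event_name"),
                    F.col("event.payload").alias("event_payload")))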

Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Gaurav Madan
Dear Community, I'm reaching out to seek your assistance with a memory issue we've been facing while processing certain large and nested DataFrames using Apache Spark. We have encountered a scenario where the driver runs out of memory when applying the `withColumn` method on specific DataFrames in…
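
One pattern that commonly produces exactly this driver-side blow-up, with a hedged alternative (column names are hypothetical):

    from pyspark.sql import functions as F

    cols = ["a", "b", "c"]   # stand-in for a long list of columns

    # Anti-pattern, shown for illustration: each withColumn adds another
    # projection to the logical plan, and hundreds of them can exhaust
    # driver memory during analysis/optimization.
    slow = df
    for c in cols:
        slow = slow.withColumn(c + "_clean", F.trim(F.col(c)))

    # Preferred: one select (or withColumns, Spark 3.3+) builds a single projection.
    fast = df.select("*", *[F.trim(F.col(c)).alias(c + "_clean") for c in cols])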