Thank you everyone for your responses.
I am not getting any errors as of now. I am just trying to choose the right
tool for my task, which is loading data from an external source into S3 via
Glue/EMR.
I think a Glue job would be the best fit for me because I can calculate the
DPUs needed (maybe keeping s
Did you try using to_protobuf and from_protobuf?
https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html
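For what it's worth, a minimal sketch of what that could look like on your
stream. Everything here is a placeholder: the descriptor file "request.desc"
(generated with protoc --descriptor_set_out), the broker address and the topic
name, and it assumes the spark-protobuf module is on the classpath.

# Sketch only: descriptor file, broker and topic are placeholders, and
# spark-protobuf (org.apache.spark:spark-protobuf_2.12) must be on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.appName("proto-decode").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "requests")
    .load()
)

# Decode the protobuf-serialized Kafka value into a struct column "request".
decoded = raw.select(
    from_protobuf("value", "Request", descFilePath="request.desc").alias("request")
)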
On Mon, May 27, 2024 at 15:45 Satyam Raj wrote:
> Hello guys,
> We're using Spark 3.5.0 for processing Kafka source that contains protobuf
> serialized data. The format is as f
If you’re using EMR and Spark, you need to choose nodes with enough RAM to
accommodate any given partition in your data, or you can get an OOM error.
Not sure if this job involves a reduce, but I would choose a single 128GB+
memory-optimized instance and then adjust parallelism as per the Spark docs.
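Purely as an illustration (the numbers below are placeholders, not
recommendations), the kind of knobs involved would be something like:

# Illustrative only; on EMR these normally go into spark-defaults or the
# spark-submit command rather than the application code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "96g")          # leave headroom on a ~128GB node
    .config("spark.executor.memoryOverhead", "8g")
    .config("spark.sql.shuffle.partitions", "400")   # more partitions -> smaller partitions
    .getOrCreate()
)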
What exactly is the error? Is it erroring out while reading the data from
the DB? How are you partitioning the data?
How much memory do you currently have? What is the network timeout?
Regards,
Meena
On Mon, May 27, 2024 at 4:22 PM Perez wrote:
> Hi Team,
>
> I want to extract the data from DB a
When you use applyInPandasWithState, Spark processes each input row as it
arrives, regardless of whether certain columns, such as the timestamp
column, contain NULL values. This behavior is useful when you want to
handle incomplete or missing data gracefully within your stateful
processing logic.
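A minimal sketch of the shape this takes; the streaming DataFrame `events`,
its columns key/ts, and the counting logic are all hypothetical:

from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_per_key(key: Tuple, pdfs: Iterator[pd.DataFrame], state: GroupState) -> Iterator[pd.DataFrame]:
    # This function is invoked for every group in a micro-batch, including rows
    # whose ts column is NULL; only the watermark computation ignores such rows.
    running = state.get[0] if state.exists else 0
    for pdf in pdfs:
        running += len(pdf)
    state.update((running,))
    yield pd.DataFrame({"key": [key[0]], "count": [running]})

result = (
    events  # hypothetical streaming DataFrame with columns key, ts (timestamp), ...
    .withWatermark("ts", "10 minutes")
    .groupBy("key")
    .applyInPandasWithState(
        count_per_key,
        outputStructType="key string, count long",
        stateStructType="count long",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
)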
Hello guys,
We're using Spark 3.5.0 for processing Kafka source that contains protobuf
serialized data. The format is as follows:
message Request {
  int64 sent_ts = 1;
  repeated Event event = 2;
}

message Event {
  string event_name = 1;
  bytes event_bytes = 2;
}
The event_bytes contains the data for t
I am using applyInPandasWithState in PySpark 3.5.0.
I noticed that records with timestamp==NULL are processed (i.e., they trigger
a call to the stateful function) and, as you would expect, do not advance
the watermark.
I am taking advantage of this in my application.
My question: Is this a support
Hi Team,
I want to extract the data from a DB and just dump it into S3. I
don't have to perform any transformations on the data yet. My data size
would be ~100 GB (historical load).
Choosing the right DPUs (Glue jobs) should solve this problem, right? Or
should I move to EMR?
I don't feel the need t
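For what it's worth, the job itself is roughly a partitioned JDBC read followed
by an S3 write, whether it runs on Glue or EMR. Everything below is a
placeholder (URL, table, partition column and bounds, bucket path):

# Sketch only; connection details, partitioning and output path depend on the
# source table. On Glue, glueContext.spark_session gives the same SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db-to-s3").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/mydb")
    .option("dbtable", "public.events")
    .option("user", "user")
    .option("password", "password")
    .option("partitionColumn", "id")    # read in parallel chunks so no single
    .option("lowerBound", "1")          # partition has to hold too much data
    .option("upperBound", "100000000")
    .option("numPartitions", "100")
    .load()
)

df.write.mode("overwrite").parquet("s3://my-bucket/historical-load/")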
Seen this before; had a very(!) complex plan behind a DataFrame, to the point
where any additional transformation went OOM on the driver.
A quick and ugly solution was to break the plan - convert the DataFrame to an RDD
and back to a DataFrame at certain points to make the plan shorter. This has obvious
dr
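Concretely, the "break" amounts to something like the line below;
checkpoint() is the cleaner variant if a checkpoint directory is configured
(df here is whatever DataFrame carries the long plan):

# Round-tripping through an RDD forces Spark to start a fresh plan from here;
# it costs a serialization pass but keeps the driver-side plan small.
shortened = spark.createDataFrame(df.rdd, schema=df.schema)

# Alternatively, with spark.sparkContext.setCheckpointDir(...) configured:
# shortened = df.checkpoint()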
A few ideas off the top of my head for how to go about solving the problem:
1. Try with subsets: Try reproducing the issue with smaller subsets of
your data to pinpoint the specific operation causing the memory problems.
2. Explode or Flatten Nested Structures: If your DataFrame schema
involv
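A small illustration of the second point, assuming a hypothetical DataFrame
`df` with a struct column "payload" and an array column "events":

from pyspark.sql import functions as F

# Promote nested struct fields to top-level columns and turn the array into
# one row per element, so later withColumn calls operate on a flat schema.
flat = (
    df.withColumn("event", F.explode("events"))
      .select("id", "payload.*", "event")
)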
Dear Community,
I'm reaching out to seek your assistance with a memory issue we've been
facing while processing certain large and nested DataFrames using Apache
Spark. We have encountered a scenario where the driver runs out of memory
when applying the `withColumn` method on specific DataFrames in