Catching up a bit late on this. I mentioned optimising RocksDB as below in
my earlier thread, specifically:
# Add RocksDB configurations here
spark.conf.set("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
If you use the RocksDB state store provider, you can turn on changelog
checkpointing so that Spark writes a single changelog file per partition
per batch. With changelog checkpointing disabled, Spark uploads the newly
created SST files and some log files, and if compaction has happened, most
SST files have to be re-uploaded.
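For reference, a minimal sketch of enabling it (assuming Spark 3.4+, where
changelog checkpointing was introduced, and an existing SparkSession named
spark):

# Assumes Spark 3.4+ and an existing SparkSession `spark`.
spark.conf.set("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
# With this enabled, each micro-batch uploads a small changelog file per
# partition instead of re-uploading SST files on every checkpoint.
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true")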
Yes, I agree. But apart from maintaining this state internally (in memory,
or in memory plus disk in the case of RocksDB), on every trigger it saves
some information about this state to a checkpoint location. I'm afraid we
can't do much about this checkpointing operation. I'll continue looking for
information.
Hi,
You may have a point on scenario 2.
Caching streaming DataFrames: in Spark Streaming, each batch of data is
processed incrementally, and it may not fit the typical caching we
discussed. Instead, Spark Streaming has its own mechanisms to manage and
optimize the processing of streaming data.
Hey,
Yes, that's how I understood it (scenario 1). However, I'm not sure if
scenario 2 is possible. I think cache on a streaming DataFrame is supported
only inside foreachBatch (where it's actually no longer a streaming
DataFrame).
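For what it's worth, a minimal sketch of what I mean by caching inside
foreachBatch; the broker, topic, and sink paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Inside foreachBatch the DataFrame is a static batch, so persist() and
# unpersist() behave as in a regular batch job.
def write_batch(batch_df, batch_id):
    batch_df.persist()  # reuse the same micro-batch for multiple writes
    batch_df.write.mode("append").parquet("s3://bucket/sink_a")  # hypothetical path
    batch_df.write.mode("append").parquet("s3://bucket/sink_b")  # hypothetical path
    batch_df.unpersist()

(spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
    .writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3://bucket/ckpt/q1")  # hypothetical path
    .start())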
On Wed, 10 Jan 2024 at 15:01, Mich Talebzadeh wrote:
Hi,
With regard to your point:
- Caching: Can you please explain what you mean by caching? I know that
when you have batch and streaming sources in a streaming query, you can
try to cache the batch ones to save on reads. But I'm not sure if that's
what you mean, and I don't know how to apply it here.
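For illustration, a minimal sketch of the stream-static join case where
caching the batch side can help; the broker, topic, paths, and column
names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Static (batch) side of the join: cache it so it is not re-read from
# storage on every micro-batch.
dim = spark.read.parquet("s3://bucket/dim_users")  # hypothetical path
dim.cache()
dim.count()  # materialize the cache

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
    .selectExpr("CAST(key AS STRING) AS user_id",
                "CAST(value AS STRING) AS payload"))

# Stream-static join: the cached dim is reused across micro-batches.
joined = events.join(dim, "user_id")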
Thank you very much for your suggestions. Yes, my main concern is
checkpointing costs.
I went through your suggestions and here are my comments:
- Caching: Can you please explain what you mean by caching? I know that
when you have batch and streaming sources in a streaming query, you can
try to cache the batch ones to save on reads.
OK, I assume that your main concern is checkpointing costs.
- Caching: If your queries read the same data multiple times, caching the
data might reduce the amount of data that needs to be checkpointed.
- Optimize checkpointing frequency, i.e. use a longer trigger interval if
your latency requirements allow, so checkpoint files are written less
often (see the sketch after this list).
- Consider changelog checkpointing with RocksDB. This writes a small
changelog file per batch instead of re-uploading full SST files.
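A minimal sketch of the trigger-interval point; the broker, topic, and
paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load())

# Checkpoint files are written once per micro-batch, so a longer trigger
# interval means fewer S3 requests, at the price of higher latency.
query = (df.writeStream
    .format("parquet")
    .option("path", "s3://bucket/out")                 # hypothetical path
    .option("checkpointLocation", "s3://bucket/ckpt")  # hypothetical path
    .trigger(processingTime="30 seconds")
    .start())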
Usually one or two topics per query. Each query has its own checkpoint
directory. Each topic has a few partitions.
Performance-wise I don't experience any bottlenecks in terms of
checkpointing. It's all about the number of requests (including a high
number of LIST requests) and the associated costs.
How many topics and checkpoint directories are you dealing with?
Does each topic have its own checkpoint on S3?
All these checkpoints are sequential writes, so even SSD would not really
help.
HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom
Hey,
I'm running a few Structured Streaming jobs (with Spark 3.5.0) that require
near-real-time accuracy, with trigger intervals on the order of 5-10
seconds. I usually run 3-6 streaming queries as part of the job, and each
query includes at least one stateful operation (and usually two or more).