Hi all, I've got a problem that really has me stumped. I'm running a
Structured Streaming query that reads from Kafka, performs some
transformations and stateful aggregations (using flatMapGroupsWithState),
and outputs any updated aggregates to another Kafka topic.
I'm running this job using Spark
I noticed that Spark 2.4.0 implemented support for reading only committed
messages in Kafka, and was excited. Are there currently any plans to update
the Kafka output sink to support exactly-once delivery?
Thanks,
Will
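For anyone who lands on this thread later, here is a rough sketch of the shape of pipeline being described, as I understand it. None of it is Will's actual code: the topics, brokers, paths, and the Event/Agg classes are placeholders, and the exactly-once question stands, since the Kafka sink below is still only at-least-once as far as I know.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, amount: Double)
case class Agg(key: String, total: Double)

object KafkaAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-agg-sketch").getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")    // placeholder
      .option("subscribe", "events-in")                     // placeholder topic
      .option("kafka.isolation.level", "read_committed")    // 2.4+: skip uncommitted records
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]
      .map { case (k, v) => Event(k, v.toDouble) }           // toy "parsing"

    // Emit an updated aggregate whenever a group receives new events.
    def update(key: String, rows: Iterator[Event], state: GroupState[Agg]): Iterator[Agg] = {
      val prev = state.getOption.getOrElse(Agg(key, 0.0))
      val next = Agg(key, prev.total + rows.map(_.amount).sum)
      state.update(next)
      Iterator(next)
    }

    val updated = events
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(update)

    // The Kafka sink needs a string or binary "value" column; it is at-least-once.
    updated
      .select(to_json(struct($"key", $"total")).alias("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "aggregates-out")                     // placeholder topic
      .option("checkpointLocation", "/tmp/checkpoints/agg")  // placeholder path
      .outputMode("update")
      .start()
      .awaitTermination()
  }
}

The kafka.isolation.level line is my reading of the 2.4 read-committed support mentioned above; worth double-checking the option name against the Kafka integration guide for your version.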
I recently upgraded a Structured Streaming application from Spark 2.2.1 ->
Spark 2.3.0. This application runs in yarn-cluster mode, and it made use of
the spark.yarn.{driver|executor}.memoryOverhead properties. I noticed the
job started crashing unexpectedly, and after doing a bunch of digging, it
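For later readers: whatever the root cause turned out to be here, one thing worth checking after a 2.2 -> 2.3 upgrade is that the overhead settings moved out of the spark.yarn.* namespace. My understanding of the new names, with placeholder sizes, written as spark-defaults.conf entries:

# Spark 2.3+ property names; sizes below are placeholders
spark.driver.memoryOverhead    1g
spark.executor.memoryOverhead  2g

The old spark.yarn.driver.memoryOverhead / spark.yarn.executor.memoryOverhead names are deprecated as of 2.3; please verify against the running-on-yarn docs for your version.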
I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The
job sources data from a Kafka topic, performs a variety of filters and
transformations, and sinks data back into a different Kafka topic.
Once per day, we stop the query in order to merge the namenode edit logs
with the fsimage.
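A rough sketch of the kind of daily stop/restart cycle being described; the function name is invented, and the only assumption is that the query was started with a checkpointLocation:

import org.apache.spark.sql.streaming.StreamingQuery

// Stop the query before the HDFS maintenance window. Restarting it later with
// the same checkpointLocation resumes from the saved offsets and state.
def pauseForMaintenance(query: StreamingQuery): Unit = {
  query.stop()  // blocks until the query's execution threads have terminated
}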
When submitting to YARN, you can specify two different operation modes for
the driver with the "--master" parameter: yarn-client or yarn-cluster. For
more information on submitting to YARN, see this page in the Spark docs:
http://spark.apache.org/docs/latest/running-on-yarn.html
yarn-cluster mode
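For example (the class and jar names are placeholders; newer Spark versions spell this as --master yarn --deploy-mode client|cluster):

# Driver runs on the machine you submit from:
spark-submit --master yarn-client --class com.example.MyApp my-app.jar

# Driver runs inside a YARN container on the cluster:
spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar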
Could you share your pattern matching expression that is failing?
On Tue, Aug 18, 2015, 3:38 PM wrote:
> Hi all,
>
> I am trying to run a Spark job in which I receive *java.math.BigDecimal*
> objects instead of the Scala equivalents, and I am trying to convert them
> into Doubles.
> If I t
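In the meantime, in case a concrete snippet helps, converting through the Java API is usually enough. The value below is invented for illustration:

// Hypothetical input; in the real job this would be whatever carries the BigDecimal.
val raw: Any = new java.math.BigDecimal("12.34")

val asDouble: Double = raw match {
  case b: java.math.BigDecimal  => b.doubleValue()  // plain Java API conversion
  case b: scala.math.BigDecimal => b.toDouble
  case d: Double                => d
  case other => throw new IllegalArgumentException(s"unexpected value: $other")
}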
ives a few
suggestions on how to increase the number of partitions.
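The usual knobs look roughly like this; the RDD, path, and partition counts are placeholders, not tuned recommendations:

val widened  = someRdd.repartition(200)  // full shuffle up to 200 partitions
val narrowed = someRdd.coalesce(50)      // merge down without a full shuffle
val fromRead = sc.textFile("hdfs:///data/input", minPartitions = 200)  // hint at read time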
-Will
On Mon, Jun 15, 2015 at 5:00 PM, William Briggs wrote:
> There are a lot of variables to consider. I'm not an expert on Spark, and
> my ML knowledge is rudimentary at best, but here are some questions whose
> answers m
There are a lot of variables to consider. I'm not an expert on Spark, and
my ML knowledge is rudimentary at best, but here are some questions whose
answers might help us to help you:
- What type of Spark cluster are you running (e.g., Stand-alone, Mesos,
YARN)?
- What does the HTTP UI tell you?
I don't know anything about your use case, so take this with a grain of
salt, but typically if you are operating at a scale that benefits from
Spark, then you likely will not want to write your output records as
individual files into HDFS. Spark has built-in support for the Hadoop
"SequenceFile" co
Hi Lee, I'm stuck with only mobile devices for correspondence right now, so
I can't get to a shell to play with this issue - this is all supposition; I
think that the lambdas are closing over the context because it's a
constructor parameter to your Runnable class, which is why inlining the
lambdas in
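To make the supposition concrete, a rough, invented illustration of the capture problem and the usual workaround:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class EventJob(sc: SparkContext, events: RDD[String]) extends Runnable {
  override def run(): Unit = {
    // Problem: using the constructor parameter `sc` inside a lambda compiles to
    // `this.sc`, so the closure captures `this` and Spark then tries to serialize
    // the non-serializable SparkContext along with it.
    // val bad = events.filter(line => line.nonEmpty && sc.isLocal)

    // Workaround: copy what the lambda needs into a local val so only that value
    // is captured.
    val marker = "ERROR"
    println(events.filter(line => line.contains(marker)).count())
  }
}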
Hi Lee,
You should be able to create a PairRDD using the Nonce as the key, and the
AnalyticsEvent as the value. I'm very new to Spark, but here is some
uncompilable pseudo code that may or may not help:
events.map(event => (event.getNonce, event))
  .reduceByKey((a, b) => a)  // arbitrarily keep one event per nonce
  .map(_._2)
The above cod
Hi Kaspar,
This is definitely doable, but in my opinion, it's important to remember
that, at its core, Spark is based around a functional programming paradigm
- you're taking input sets of data and, by applying various
transformations, you end up with a dataset that represents your "answer".
Witho
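A toy example of that shape, with made-up data and an assumed SparkContext sc: every step but the last is a transformation, and only the final action materializes the answer:

val purchases = sc.parallelize(Seq(("alice", 20.0), ("bob", 5.0), ("alice", 7.5)))
val perUser   = purchases.reduceByKey(_ + _)  // transformation: describes the result
val answer    = perUser.collect()             // action: actually computes it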