Hi all,
What is the correct way to schedule multiple jobs inside the foreachRDD method
and, importantly, await their results to ensure those jobs have completed
successfully?
E.g.:
kafkaDStream.foreachRDD { rdd =>
  val rdd1 = rdd.map(...)
  val rdd2 = rdd1.map(...)
  val job1Future = Future { rdd1.saveToCassandra(...) }
  val job2Future = Future { rdd2.saveToCassandra(...) }
  // how to correctly await both jobs here?
}
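For what it's worth, here is a minimal sketch of one way to do this (not from
the original thread): run each output action in a Future and block before the
batch ends. It assumes an implicit ExecutionContext, the
spark-cassandra-connector for saveToCassandra, and made-up transform names,
keyspace/table names, and timeout.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import com.datastax.spark.connector._  // adds saveToCassandra to RDDs

kafkaDStream.foreachRDD { rdd =>
  val rdd1 = rdd.map(record => transform1(record))  // hypothetical transforms
  val rdd2 = rdd1.map(record => transform2(record))
  rdd1.cache()  // rdd2 also reads rdd1, so avoid recomputing it
  // Each action submitted from a separate thread becomes a separate Spark job.
  val job1 = Future { rdd1.saveToCassandra("my_ks", "table1") }  // hypothetical names
  val job2 = Future { rdd2.saveToCassandra("my_ks", "table2") }
  // Future.sequence fails if either job fails, so the error propagates to
  // the streaming batch instead of being swallowed.
  Await.result(Future.sequence(Seq(job1, job2)), 10.minutes)
  rdd1.unpersist()
}

Blocking inside foreachRDD is deliberate here: it keeps the batch from
completing until both writes have finished.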
> If you coalesce to a smaller number of partitions, you're potentially doing
> less work in parallel, depending on your cluster setup.
>
> On May 23, 2017, at 4:23 PM, Andrii Biletskyi <andrii.bilets...@yahoo.com.invalid> wrote:
>
> No, I didn't try to use repartition. How exactly does it impact the
> parallelism?
> In my understanding ...
A coalesce without a shuffle gets pushed up into the preceding computation,
which limits the parallelism of that whole stage.
Have you tried using repartition instead?
On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi wrote:
Hi all,
I'm trying to understand the impact of the coalesce operation on Spark job
performance.
As a side note: we are using emrfs (i.e. AWS S3) as source and a target
for the job.
Omitting unnecessary details, the job can be described as: join a 200M-record
DataFrame stored in ORC format on emrfs with ...
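For anyone reading along, here is a short sketch of the distinction being
discussed (the paths and the partition count of 10 are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()
val df = spark.read.orc("s3://my-bucket/input")  // hypothetical path

// coalesce is a narrow transformation: no shuffle is inserted, so the
// reduced partition count can propagate upstream and the whole final
// stage may run with only 10 tasks.
df.coalesce(10).write.orc("s3://my-bucket/out-coalesce")

// repartition always shuffles: the upstream computation keeps its full
// parallelism, and only the post-shuffle stage runs with 10 tasks.
df.repartition(10).write.orc("s3://my-bucket/out-repartition")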
Cody Koeninger wrote:
> You may know that those streams share the same keys, but Spark doesn't
> unless you tell it.
>
> mapWithState takes a StateSpec, which should allow you to specify a
> partitioner.
>
> On Mon, Oct 31, 2016 at 9:40 AM, Andrii Biletskyi wrote:
Hi all,
I'm using the Spark Streaming mapWithState operation to do stateful
processing of my Kafka stream (though I think similar arguments would apply
to any source).
I'm trying to understand how to control mapWithState's partitioning scheme.
My transformations are simple:
1) create KafkaDStream
2) ...
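For reference, a minimal sketch of Cody's suggestion (the mapping function,
key/value types, and partition count are illustrative, and keyedDStream is
assumed to be a DStream[(String, String)]):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}

// Hypothetical stateful function: keep a running count per key.
def countEvents(key: String, value: Option[String], state: State[Long]): (String, Long) = {
  val count = state.getOption.getOrElse(0L) + 1L
  state.update(count)
  (key, count)
}

// StateSpec accepts an explicit partitioner, so the state RDD can be
// partitioned the same way as the DStream feeding it.
val spec = StateSpec.function(countEvents _).partitioner(new HashPartitioner(16))

val stateful = keyedDStream.mapWithState(spec)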