Thanks for adding the RDD lineage graph.
I could see 18 parallel tasks for the HDFS read; was it changed?
What is the Spark job configuration, i.e. how many executors and how many
cores per executor?
I would say keep the partitioning a multiple of (number of executors * cores)
for all the RDDs.
if you have 3 executors wit
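A minimal sketch of that suggestion in Scala; the 3 executors x 6 cores shape and the HDFS path below are placeholder assumptions, not the actual configuration:

import org.apache.spark.sql.SparkSession

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()

    // Placeholder cluster shape: 3 executors x 6 cores = 18 task slots.
    val executors = 3
    val coresPerExecutor = 6
    val slots = executors * coresPerExecutor

    // Keep the partition count a multiple of the slot count, so every wave of
    // tasks fills all cores and no executor sits idle on the last wave.
    val df = spark.read.parquet("hdfs:///path/to/data")   // hypothetical path
    val rdd = df.rdd.repartition(slots * 5)               // e.g. 90 partitions for 18 slots
    println(s"partitions = ${rdd.getNumPartitions}")

    spark.stop()
  }
}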
Hi Vijay,
Thanks for the follow-up.
The reason we have 90 HDFS files (hence the parallelism of 90 for the HDFS
read stage) is that we load the same HDFS data in different jobs, and these
jobs have parallelisms (executors x cores) of 9, 18, and 30. The uneven
assignment problem that we had before
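As a sketch of that layout (90 divides evenly by 9, 18 and 30), writing the shared data set as 90 Parquet files could look roughly like this; the paths are placeholders:

import org.apache.spark.sql.SparkSession

object WriteSharedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-shared-sketch").getOrCreate()

    // 90 output files: jobs with 9, 18 or 30 task slots get 10, 5 or 3 files per slot.
    val source = spark.read.parquet("hdfs:///path/to/source")   // hypothetical path
    source.repartition(90)
      .write.mode("overwrite")
      .parquet("hdfs:///path/to/shared")                        // hypothetical path

    spark.stop()
  }
}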
Apologies for the long answer.
Understanding the partitioning at each stage of the RDD graph/lineage is
important for efficient parallelism and keeping the load balanced. This applies
to working with any source, streaming or static.
You have a tricky situation here with one source (Kafka) with 9 partitions
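One rough way to act on that advice is to print the partition count at each stage of the lineage; the transformations and path in this sketch are placeholders, not the actual job:

import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-sketch").getOrCreate()

    val df = spark.read.parquet("hdfs:///path/to/data")   // hypothetical path
    val base = df.rdd
    val keyed = base.keyBy(_.hashCode)                     // placeholder keying step
    val reduced = keyed.reduceByKey((a, _) => a)           // shuffle boundary

    println(s"read:    ${base.getNumPartitions} partitions")
    println(s"keyed:   ${keyed.getNumPartitions} partitions")
    println(s"reduced: ${reduced.getNumPartitions} partitions")
    println(reduced.toDebugString)                         // full lineage, stage by stage

    spark.stop()
  }
}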
Hi Vijay,
Thank you very much for your reply. Setting the number of partitions
explicitly in the join, and the influence of memory pressure on partitioning,
were definitely very good insights.
In the end, we avoided the issue of uneven load balancing completely by doing
the following two things:
a) Reducing the
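For reference, a hedged sketch of the two knobs mentioned above (an explicit partition count in an RDD join, and the SQL shuffle-partition setting); the data and the numbers are placeholders:

import org.apache.spark.sql.SparkSession

object JoinPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-partitions-sketch").getOrCreate()
    val sc = spark.sparkContext

    // RDD API: pass the partition count explicitly to the join.
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y"))
    val joined = left.join(right, 18)                      // 18 partitions, placeholder value
    println(s"joined: ${joined.getNumPartitions} partitions")

    // DataFrame/SQL joins shuffle into spark.sql.shuffle.partitions partitions.
    spark.conf.set("spark.sql.shuffle.partitions", "18")

    spark.stop()
  }
}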
Summarizing:
1) The static data set, read from Parquet files in HDFS as a DataFrame, has an
initial parallelism of 90 (based on the number of input files).
2) The static data set DataFrame is converted to an RDD, and the RDD has a
parallelism of 18; this was not expected.
dataframe.rdd is lazily evaluated; there must be some operation y
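A quick way to see where the partition count changes is to ask the converted RDD for its count and dump its lineage; a sketch, with a placeholder path:

import org.apache.spark.sql.SparkSession

object PartitionCheckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-check-sketch").getOrCreate()

    val df = spark.read.parquet("hdfs:///path/to/data")   // hypothetical path
    val rdd = df.rdd

    // If this number differs from the file-based parallelism of the read (90 in
    // this thread), some operation sits between the scan and the conversion;
    // toDebugString shows the lineage where it happens.
    println(s"df.rdd partitions: ${rdd.getNumPartitions}")
    println(rdd.toDebugString)

    spark.stop()
  }
}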
Hello everyone,
We are running Spark Streaming jobs in Spark 2.1 in cluster mode on YARN. We
have an RDD (3GB) that we periodically (every 30 min) refresh by reading from
HDFS. Namely, we create a DataFrame /df/ using /sqlContext.read.parquet/,
and then we create /RDD rdd = df.as[T].rdd/. The firs
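For concreteness, a minimal sketch of the refresh pattern described above; the case class fields and the HDFS path are placeholders, not the actual schema:

import org.apache.spark.sql.SparkSession

// Placeholder element type; the real T and its fields are not shown in the thread.
case class T(id: Long, value: String)

object RefreshSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("refresh-sketch").getOrCreate()
    import spark.implicits._

    // Every 30 minutes the job re-reads the Parquet data and rebuilds the RDD.
    val df  = spark.read.parquet("hdfs:///path/to/data")   // hypothetical path
    val rdd = df.as[T].rdd                                  // DataFrame -> Dataset[T] -> RDD[T]
    println(s"refreshed RDD has ${rdd.getNumPartitions} partitions")

    spark.stop()
  }
}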