Thanks Tianlang. I saw the DAG on YARN, but what really solved my problem
was adding intermediate steps and evaluating them eagerly to find out where
the bottleneck was.
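For anyone hitting the same issue: the "evaluate eagerly" trick is just forcing an action (e.g. `.count()`) on each intermediate DataFrame and timing it, since Spark is lazy and otherwise all the cost shows up at the final action. A minimal sketch of that pattern (the helper name and the stage names are my own, not from the original job):

```python
import time

def time_stage(name, action):
    """Force evaluation of one pipeline stage and report how long it took.

    `action` is any zero-argument callable that triggers a Spark action,
    e.g. lambda: df.count(). Counting forces the whole lineage up to `df`
    to be computed, so per-stage timings expose where the bottleneck is.
    """
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f}s (result={result})")
    return elapsed

# Hypothetical usage inside a Spark job (DataFrame names are assumptions):
#   time_stage("build small table", lambda: small_df.count())
#   time_stage("join with large table", lambda: joined_df.count())
```

Persisting each intermediate DataFrame before timing it keeps later stages from re-running the earlier ones.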
My process now runs in 6 min. :D
Thanks for the help.
[]s
On Thu, 15 Aug 2019 at 07:25, Tianlang wrote:
Hi,
Maybe you can look at the Spark UI. The physical plan has no timing
information.
On 2019/8/13 at 10:45 PM, Marcelo Valle wrote:
Hi,
I have a job running on AWS EMR. It's basically a join between 2 tables
(Parquet files on S3), one somewhat large (around 50 GB) and the other small
(less than 1 GB).
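For a join this lopsided, Spark can broadcast the small side to every executor instead of shuffling the 50 GB side (in PySpark, `pyspark.sql.functions.broadcast(small_df)` hints this explicitly). A pure-Python sketch of the broadcast-hash-join idea, with made-up rows, just to show what happens under the hood:

```python
# Sketch of a broadcast hash join: load the small table into an in-memory
# map, then stream the large table past it. The large side is never
# shuffled, which is why this is fast when one side fits in memory.
def broadcast_hash_join(large_rows, small_rows, key):
    # Build a lookup map from the small ("broadcast") side.
    small_map = {row[key]: row for row in small_rows}
    # Stream the large side; emit merged rows for matching keys (inner join).
    for row in large_rows:
        match = small_map.get(row[key])
        if match is not None:
            joined = dict(row)
            joined.update({k: v for k, v in match.items() if k != key})
            yield joined
```

In Spark the same decision is controlled by `spark.sql.autoBroadcastJoinThreshold`; checking the query plan or the SQL tab of the UI shows whether a broadcast or a sort-merge join was actually chosen.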
The small table is the result of other operations, but it was a dataframe
persisted with `.persist(StorageLevel.MEMORY_AND_DISK_SER)`, and the c