Re: help understanding physical plan

2019-08-16 Thread Marcelo Valle
Thanks, Tianlang. I saw the DAG on YARN, but what really solved my problem was adding intermediate steps and evaluating them eagerly to find out where the bottleneck was. My process now runs in 6 min. :D Thanks for the help. []s On Thu, 15 Aug 2019 at 07:25, Tianlang wrote: > Hi, > > Maybe you c
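
A minimal sketch of the "evaluate intermediate steps eagerly" approach described above, not the code used in the original job. The helper name `time_stage` and the DataFrame names in the comments are hypothetical; the idea is only that each intermediate result is forced with an action and the wall-clock time is recorded, so the slowest step stands out.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def time_stage(name: str, action: Callable[[], T], timings: Dict[str, float]) -> T:
    """Run `action`, which should force evaluation of a lazy step,
    and store the elapsed wall-clock seconds under `name`."""
    start = time.perf_counter()
    result = action()
    timings[name] = time.perf_counter() - start
    return result


# Hypothetical usage with Spark DataFrames (all names are illustrative):
#
#   timings = {}
#   time_stage("small_table", lambda: small_df.count(), timings)
#   joined = large_df.join(small_df, "key")
#   time_stage("join", lambda: joined.count(), timings)
#   slowest = max(timings, key=timings.get)
```

Each forced action re-runs whatever upstream work is not cached, so the recorded times are cumulative unless the intermediate results are persisted first.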

Re: help understanding physical plan

2019-08-14 Thread Tianlang
Hi, Maybe you can look at the Spark UI. The physical plan does not include any timing information. On 2019/8/13 at 10:45 PM, Marcelo Valle wrote: Hi, I have a job running on AWS EMR. It's basically a join between 2 tables (parquet files on s3), one somewhat large (around 50 GB) and the other small (less than

help understanding physical plan

2019-08-13 Thread Marcelo Valle
Hi, I have a job running on AWS EMR. It's basically a join between 2 tables (parquet files on s3), one somewhat large (around 50 GB) and the other small (less than 1 GB). The small table is the result of other operations, but it was a DataFrame with `.persist(StorageLevel.MEMORY_AND_DISK_SER)` and the c