Hi,
A few thoughts to add to Nicholas' apt reply.
We were loading multiple files from AWS S3 in our Spark application. When
the Spark step that loads the files was called, the driver spent significant
time fetching the exact paths of the files from AWS S3,
especially because we specified S3 paths like regex
One potential cause is the optimizer being a little overzealous in
determining whether a table can be broadcast. Have you
checked the UI or query plan to see if any steps include a
BroadcastHashJoin? It's possible that the optimizer thinks that it should be
able to fit the t
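
For what it's worth, here is a minimal sketch of the query-plan check suggested above. It assumes you have already captured the plan as plain text (for example by copying what df.explain() prints). The helper name find_broadcast_joins and the sample plan are made up for illustration and are not output from a real job.

```python
import re


def find_broadcast_joins(plan_text):
    """Return the plan lines that mention a BroadcastHashJoin operator."""
    pattern = re.compile(r"\bBroadcastHashJoin\b")
    return [line.strip() for line in plan_text.splitlines() if pattern.search(line)]


# Made-up excerpt shaped like a physical plan; not taken from a real job.
sample_plan = """\
== Physical Plan ==
*(2) BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
:- *(2) Range (0, 100, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *(1) Range (0, 10, step=1, splits=8)
"""

hits = find_broadcast_joins(sample_plan)
for line in hits:
    print(line)
```

If a broadcast join does show up and looks suspect, one way to test the theory is to set spark.sql.autoBroadcastJoinThreshold to -1, which turns off automatic broadcasting, and rerun the job to compare.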
Hi,
We have a rather long-winded Spark application that we inherited, with many
stages. When we run it on our Spark cluster, things start off well enough.
Workers are busy, lots of progress is made, etc. However, 30 minutes into
processing, we
see CPU usage of the workers drop drastically. At this time,