Hi all,

I am running a Spark job that gets stuck while joining two dataframes. The dataframes are not very large: one has about 2 million rows, the other a couple of thousand, and the joined result should be roughly the size of the smaller dataframe. I have tried triggering execution with the first() action, which, as far as I understand, should not require materialising the entire joined dataframe (though I may be mistaken about that). The Spark UI is not telling me anything useful; it just shows the task as stuck.
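Here is a minimal sketch of the job. The dataframe construction is a stand-in (the real data comes from our pipeline, and the column names are placeholders), but the shape of the join is the same:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("join-debug").getOrCreate()

    # Stand-ins for my real data: ~2 million rows and ~2 thousand rows,
    # keyed so the join result is about the size of the smaller side
    big_df = spark.range(2_000_000).withColumnRenamed("id", "key")
    small_df = (spark.range(2_000)
                .withColumn("key", F.col("id") * 1000)
                .drop("id")
                .withColumn("value", F.lit("x")))

    # Inner join on the shared key; the result should be ~2,000 rows
    joined = big_df.join(small_df, on="key", how="inner")

    # Trigger execution; as I understand it, first() should not need to
    # materialise the full result, but the real job hangs here
    print(joined.first())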
When I run the exact same job on a slightly smaller dataset, it completes without hanging. I have used the same environment to run joins on much larger dataframes, so I am confused as to why this particular job hangs. I have also run the same join in the pyspark shell on two 2-million-row dataframes (just like the larger one in the job that gets stuck) and it ran successfully. I tried caching the joined dataframe to see how much memory it needs, but the job gets stuck on that action too, and persisting the join to memory and disk hangs in the same way; both attempts are sketched below.

Any pointers as to where to look for the source of the problem would be much appreciated.
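For completeness, this is roughly what I tried on top of the join in the sketch above (again simplified in the same way):

    from pyspark import StorageLevel

    # Caching the joined dataframe to see how much memory it needs;
    # the job hangs on this action as well
    joined.cache()
    joined.count()

    # Explicitly persisting to memory and disk hangs all the same
    joined.unpersist()
    joined.persist(StorageLevel.MEMORY_AND_DISK)
    joined.count()

Cheers,
Tamara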