Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Dillon Dukek
You keep mentioning that you're viewing this after the fact in the Spark history server. Also, the spark-shell isn't a UI, so I'm not sure what you mean when you say the Storage tab is blank in the spark-shell. Just so I'm clear about what you're doing: are you looking at this info while your application is running?

Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Venkat Dabri
The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server

On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri wrote:
>
> I did try that mechanism
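Both links point at the same underlying behavior: the Storage tab is built from executor block updates, which the live UI receives directly but which are not written to the event log by default, so the history server has nothing to show there. A sketch of the relevant settings in spark-defaults.conf, assuming Spark 2.3+ where the block-update flag exists (note that logging block updates can grow the event log considerably):

    spark.eventLog.enabled                   true
    spark.eventLog.logBlockUpdates.enabled   true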

Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Venkat Dabri
I did try that mechanism before, but the data never shows up in the Storage tab. The Storage tab is always blank. I have tried it in Zeppelin as well as spark-shell.

scala> val classCount = spark.read.parquet("s3:// /classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab.
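When the Storage tab shows nothing, the cache information can also be read programmatically from the driver. A minimal sketch in the same spark-shell session (getRDDStorageInfo is a developer API and only lists RDDs that are actually cached, so an empty result means nothing was materialized):

scala> classCount.persist
scala> classCount.count
scala> spark.sparkContext.getRDDStorageInfo.foreach { i =>
     |   println(s"${i.name}: mem=${i.memSize} bytes, disk=${i.diskSize} bytes, " +
     |     s"cached ${i.numCachedPartitions}/${i.numPartitions} partitions")
     | }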

Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-15 Thread Dillon Dukek
In your program, persist the smaller table and use count to force it to materialize. Then, in the Spark UI, go to the Storage tab; the size of your table as Spark sees it should be displayed there. Out of curiosity, what version/language of Spark are you using?

On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri wrote:
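If the Storage tab stays blank, as reported in the replies above, a rough driver-side alternative is Spark's SizeEstimator. A sketch, with the caveat that collect pulls the whole table onto the driver, so this is only safe for a genuinely small table; SizeEstimator is a developer API, so treat the number as an approximation (classCount as in the spark-shell session above):

scala> import org.apache.spark.util.SizeEstimator
scala> val rows = classCount.collect()
scala> println(SizeEstimator.estimate(rows) + " bytes on the driver")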

Spark seems to think that a particular broadcast variable is large in size

2018-10-15 Thread Venkat Dabri
I am trying to do a broadcast join on two tables. The size of the smaller table will vary based upon the parameters, but the size of the larger table is close to 2TB. What I have noticed is that if I don't set spark.sql.autoBroadcastJoinThreshold to 10G, some of these operations do a SortMergeJoin.
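For reference, the planner's broadcast decision is based on the optimizer's sizeInBytes estimate for the plan, not on any cached size, and a single join can be pinned to a broadcast without raising the global threshold. A minimal sketch, assuming Spark 2.3+ (in 2.2 the call is stats(conf)); the second path, the large table, and the join key are hypothetical placeholders:

scala> import org.apache.spark.sql.functions.broadcast
scala> val small = spark.read.parquet("s3:// /classCount")
scala> println(small.queryExecution.optimizedPlan.stats.sizeInBytes)  // what the planner compares to spark.sql.autoBroadcastJoinThreshold
scala> val large = spark.read.parquet("s3:// /bigTable")              // hypothetical stand-in for the ~2TB table
scala> val joined = large.join(broadcast(small), Seq("key"))          // "key" is a placeholder join column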