The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server

On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri <venkatda...@gmail.com> wrote:
>
> I did try that mechanism before, but the data never shows up in the
> Storage tab. The Storage tab is always blank. I have tried it in
> Zeppelin as well as spark-shell.
>
> scala> val classCount = spark.read.parquet("s3:// ..../classCount")
> scala> classCount.persist
> scala> classCount.count
>
> Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
> However, I have several running applications in production that do
> show the data in cache. I am using Scala and Spark 2.2.1 on EMR. Are
> there any workarounds to see the data in cache?
>
> On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek <dillon.du...@placed.com> wrote:
> >
> > In your program, persist the smaller table and use count to force it to
> > materialize. Then, in the Spark UI, go to the Storage tab. The size of your
> > table as Spark sees it should be displayed there. Out of curiosity, what
> > version/language of Spark are you using?
> >
> > On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <venkatda...@gmail.com> wrote:
> >>
> >> I am trying to do a broadcast join on two tables. The size of the
> >> smaller table will vary based upon the parameters, but the size of the
> >> larger table is close to 2 TB. What I have noticed is that if I don't
> >> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
> >> operations do a SortMergeJoin instead of a broadcast join. But the
> >> size of the smaller table shouldn't be that big at all. I wrote the
> >> smaller table to an S3 folder and it took only 12.6 MB of space. I
> >> did some operations on the smaller table so that the shuffle size
> >> appears on the Spark History Server, and the size in memory seemed to
> >> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> >> smaller table it takes a long time to broadcast, leading me to think
> >> that the table might not be just 150 MB in size. What would be a good
> >> way to figure out the actual size that Spark sees and uses to decide
> >> whether it crosses spark.sql.autoBroadcastJoinThreshold?
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
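
[Editor's note] One way to see the size estimate the planner actually compares against spark.sql.autoBroadcastJoinThreshold is to read it off the optimized logical plan. A minimal sketch, assuming Spark 2.2.x, where the stats method takes the session's SQLConf (on Spark 2.3+ it takes no arguments):

scala> val classCount = spark.read.parquet("s3:// ..../classCount")
scala> // Estimated plan size in bytes; this is the figure the planner compares
scala> // against spark.sql.autoBroadcastJoinThreshold when choosing a broadcast join.
scala> classCount.queryExecution.optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
scala> // On Spark 2.3+: classCount.queryExecution.optimizedPlan.stats.sizeInBytes

If that number is far above the 12.6 MB on disk (Parquet is compressed and column-pruned, so the in-memory estimate is often much larger), that would explain why a 10G threshold was needed before the broadcast kicked in.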
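
[Editor's note] On the blank Storage tab: independently of the UI, the cache can be checked programmatically, which at least confirms whether the data was materialized. A small sketch; Dataset.storageLevel exists since Spark 2.1, and getPersistentRDDs lists the RDDs the SparkContext is tracking as persistent:

scala> classCount.persist
scala> classCount.count
scala> classCount.storageLevel               // StorageLevel.NONE would mean nothing is cached
scala> spark.sparkContext.getPersistentRDDs  // ids and storage levels of persisted RDDs

If these report the data as cached while the Storage tab stays empty, the problem is with the UI or event logging rather than with the cache itself.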