I did try that mechanism before, but the data never shows up in the Storage tab; the tab is always blank. I have tried it in Zeppelin as well as in spark-shell.
scala> val classCount = spark.read.parquet("s3:// ..../classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
However, I have several running applications in production that do show
the data in the cache. I am using Scala and Spark 2.2.1 on EMR. Are there
any workarounds to see the data in the cache?

On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek <dillon.du...@placed.com> wrote:
>
> In your program, persist the smaller table and use count to force it to
> materialize. Then go to the Storage tab in the Spark UI. The size of your
> table as Spark sees it should be displayed there. Out of curiosity, what
> version/language of Spark are you using?
>
> On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <venkatda...@gmail.com> wrote:
>>
>> I am trying to do a broadcast join on two tables. The size of the
>> smaller table will vary based upon the parameters, but the size of the
>> larger table is close to 2 TB. What I have noticed is that if I don't
>> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
>> operations do a SortMergeJoin instead of a broadcast join, even though
>> the smaller table shouldn't be anywhere near that big. I wrote the
>> smaller table out to an S3 folder and it took only 12.6 MB of space. I
>> also ran some operations on the smaller table so that the shuffle size
>> would appear on the Spark History Server, and the size in memory seemed
>> to be 150 MB, nowhere near 10G. On the other hand, if I force a
>> broadcast join on the smaller table, it takes a long time to broadcast,
>> leading me to think that the table might not be just 150 MB in size.
>> What would be a good way to figure out the actual size that Spark sees
>> when deciding whether it crosses spark.sql.autoBroadcastJoinThreshold?
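One workaround for a blank Storage tab is to ask Spark directly whether the
Dataset is cached, instead of relying on the UI. A minimal sketch, assuming
the classCount DataFrame from above (getRDDStorageInfo is a developer API
that only lists RDDs currently holding cached blocks):

scala> classCount.storageLevel   // a real level, e.g. memory+disk, if persist took effect; StorageLevel(1 replicas) means not cached
scala> spark.sparkContext.getRDDStorageInfo.foreach { info =>
     |   println(s"${info.name}: ${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
     | }

If storageLevel reports a real level but the Storage tab stays empty, the
issue may be the UI itself (for instance, looking at the wrong
application's UI behind the EMR proxy) rather than the caching.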
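On the original question of the size Spark is actually seeing: the
optimizer's own estimate, which is what gets compared against
spark.sql.autoBroadcastJoinThreshold, can be read off the optimized plan. A
sketch with a hypothetical smallDF; if I remember right, in Spark 2.2.x
stats takes the session's SQLConf (in 2.3+ it is just .stats):

scala> val estimate = smallDF.queryExecution.optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
scala> println(s"optimizer size estimate: ${estimate / (1024 * 1024)} MB")   // estimate is a BigInt of bytes

A much larger number here than the 12.6 MB on disk would explain the
SortMergeJoin: size estimates for Parquet sources can be far bigger than
the compressed on-disk footprint.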
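And to confirm which strategy the planner actually picks, there is the
explicit broadcast hint plus explain. A sketch, with hypothetical largeDF,
smallDF, and key names:

scala> import org.apache.spark.sql.functions.broadcast
scala> val joined = largeDF.join(broadcast(smallDF), Seq("key"))
scala> joined.explain()   // look for BroadcastHashJoin vs SortMergeJoin in the physical plan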