The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server

On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri <venkatda...@gmail.com> wrote:
>
> I did try that mechanism before, but the data never shows up in the
> Storage tab. The Storage tab is always blank. I have tried it in
> Zeppelin as well as spark-shell.
>
> scala> val classCount = spark.read.parquet("s3:// ..../classCount")
> scala> classCount.persist
> scala> classCount.count
>
> Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
> However, I have several running applications in production that do
> show the data in cache. I am using Scala and Spark 2.2.1 on EMR. Are
> there any workarounds to see the data in cache?
>
> On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek <dillon.du...@placed.com> wrote:
> >
> > In your program, persist the smaller table and use count to force it to
> > materialize. Then, in the Spark UI, go to the Storage tab. The size of your
> > table as Spark sees it should be displayed there. Out of curiosity, what
> > version/language of Spark are you using?
> >
> > On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <venkatda...@gmail.com> wrote:
> >>
> >> I am trying to do a broadcast join on two tables. The size of the
> >> smaller table will vary based upon the parameters, but the size of the
> >> larger table is close to 2 TB. What I have noticed is that if I don't
> >> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
> >> operations do a SortMergeJoin instead of a broadcast join. But the
> >> size of the smaller table shouldn't be that big at all. I wrote the
> >> smaller table to an S3 folder and it took only 12.6 MB of space. I
> >> did some operations on the smaller table so that the shuffle size
> >> appears on the Spark History Server, and the size in memory seemed to
> >> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> >> smaller table it takes a long time to broadcast, leading me to think
> >> that the table might not be just 150 MB in size. What would be a good
> >> way to figure out the actual size that Spark sees and uses to decide
> >> whether it crosses spark.sql.autoBroadcastJoinThreshold?
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
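
[Editor's note] One way to see the size estimate the planner actually compares against spark.sql.autoBroadcastJoinThreshold is to read it off the optimized logical plan. A minimal sketch, assuming Spark 2.2.x, where the stats method takes the session's SQLConf (on Spark 2.3+ it takes no arguments):

scala> val classCount = spark.read.parquet("s3:// ..../classCount")
scala> // Estimated plan size in bytes; this is the figure the planner compares
scala> // against spark.sql.autoBroadcastJoinThreshold when choosing a broadcast join.
scala> classCount.queryExecution.optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
scala> // On Spark 2.3+: classCount.queryExecution.optimizedPlan.stats.sizeInBytes

If that number is far above the 12.6 MB on disk (Parquet is compressed and column-pruned, so the in-memory estimate is often much larger), that would explain why a 10G threshold was needed before the broadcast kicked in.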
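
[Editor's note] On the blank Storage tab: independently of the UI, the cache can be checked programmatically, which at least confirms whether the data was materialized. A small sketch; Dataset.storageLevel exists since Spark 2.1, and getPersistentRDDs lists the RDDs the SparkContext is tracking as persistent:

scala> classCount.persist
scala> classCount.count
scala> classCount.storageLevel               // StorageLevel.NONE would mean nothing is cached
scala> spark.sparkContext.getPersistentRDDs  // ids and storage levels of persisted RDDs

If these report the data as cached while the Storage tab stays empty, the problem is with the UI or event logging rather than with the cache itself.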