I did try that mechanism before, but the data never shows up in the
Storage tab. The Storage tab is always blank. I have tried it in
Zeppelin as well as in spark-shell.

scala> val classCount = spark.read.parquet("s3:// ..../classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
However, I have several running applications in production that do
show the data in the cache. I am using Scala and Spark 2.2.1 on EMR.
Are there any workarounds to see the data in the cache?
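
One workaround I am considering is checking the cache programmatically
instead of relying on the Storage tab. A rough sketch of what I have in
mind, assuming Spark 2.2.x / Scala and a placeholder S3 path (not my
real bucket):

import org.apache.spark.storage.StorageLevel

val classCount = spark.read.parquet("s3://<bucket>/classCount") // placeholder path

classCount.persist(StorageLevel.MEMORY_AND_DISK)
classCount.count() // forces the cache to materialize

// StorageLevel.NONE here would mean the Dataset was never marked for caching.
println(classCount.storageLevel)

// Per-RDD storage info (cached partitions, memory and disk sizes),
// roughly the same numbers the Storage tab would show.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: memSize=${info.memSize} B, diskSize=${info.diskSize} B")
}

If getRDDStorageInfo also comes back empty, that would suggest the
cache is genuinely not being populated rather than just not being
displayed.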
On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek <dillon.du...@placed.com> wrote:
>
> In your program, persist the smaller table and use count to force it to
> materialize. Then, in the Spark UI, go to the Storage tab. The size of your
> table as Spark sees it should be displayed there. Out of curiosity, what
> version/language of Spark are you using?
>
> On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri <venkatda...@gmail.com> wrote:
>>
>> I am trying to do a broadcast join on two tables. The size of the
>> smaller table will vary based upon the parameters, but the size of the
>> larger table is close to 2 TB. What I have noticed is that if I don't
>> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
>> operations do a SortMergeJoin instead of a broadcast join. But the
>> size of the smaller table shouldn't be anywhere near that big. I wrote the
>> smaller table to an S3 folder and it took only 12.6 MB of space. I
>> did some operations on the smaller table so that the shuffle size
>> appears on the Spark History Server, and the size in memory seemed to
>> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
>> smaller table it takes a long time to broadcast, leading me to think
>> that the table might not be just 150 MB in size. What would be a good
>> way to figure out the actual size that Spark sees when deciding
>> whether it crosses spark.sql.autoBroadcastJoinThreshold?
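
For reference, on the original question of the size Spark actually
uses for this decision: my understanding is that it can be read off
the optimized logical plan, which is what gets compared against
spark.sql.autoBroadcastJoinThreshold. A rough sketch, assuming Spark
2.2.x (where LogicalPlan.stats takes a SQLConf; later versions use a
no-argument stats) and a placeholder path:

val small = spark.read.parquet("s3://<bucket>/classCount") // placeholder path

// Size estimate (in bytes) that the optimizer uses for the broadcast decision.
val sizeInBytes =
  small.queryExecution.optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
println(s"planner estimate: $sizeInBytes bytes")

// The threshold it is compared against.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// An explicit hint sidesteps the threshold check entirely
// (largeDf and join_key are placeholders):
import org.apache.spark.sql.functions.broadcast
// val joined = largeDf.join(broadcast(small), Seq("join_key"))

If the smaller table is itself the result of joins or aggregations
rather than a plain read, this estimate can be much larger than the
12.6 MB on disk, which might explain why the threshold has to be
raised so high.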

