SizeEstimator.estimate(df) will not give the size of the DataFrame, right? I think it will give the size of the df object itself.

With an RDD, I sample(), collect(), and sum the size of each row, then extrapolate.
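Roughly like this (just a sketch; the example RDD and the 10% sample fraction are only for illustration):

import org.apache.spark.util.SizeEstimator

// Illustrative RDD; substitute your own.
val rdd = sc.parallelize(1 to 1000000).map(i => (i, "row-" + i))

// Sample ~10% without replacement, measure each sampled row on the
// driver, then scale up by the inverse of the sampling fraction.
val fraction = 0.1
val sample = rdd.sample(withReplacement = false, fraction).collect()
val sampleBytes = sample.map(r => SizeEstimator.estimate(r)).sum
val approxRddBytes = (sampleBytes / fraction).toLong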
If I do the same with a DataFrame, the result will no longer be its size when represented in columnar format. I'd also like to know how spark.sql.autoBroadcastJoinThreshold estimates the size of a DataFrame. Will it broadcast when the columnar storage size is less than 10 MB? (I've put a sketch of where that estimate seems to come from below the quoted thread.)

Srikanth

On Fri, Aug 7, 2015 at 2:51 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Have you tried calling SizeEstimator.estimate() on a DataFrame?
>
> I did the following in the REPL:
>
> scala> SizeEstimator.estimate(df)
> res1: Long = 17769680
>
> FYI
>
> On Fri, Aug 7, 2015 at 6:48 AM, Srikanth <srikanth...@gmail.com> wrote:
>
>> Hello,
>>
>> Is there a way to estimate the approximate size of a DataFrame? I know
>> we can cache and look at the size in the UI, but I'm trying to do this
>> programmatically. With an RDD, I can sample and sum up sizes using
>> SizeEstimator, then extrapolate to the entire RDD. That will give me the
>> approximate size of the RDD. With DataFrames, it's tricky due to columnar
>> storage. How do we do it?
>>
>> On a related note, I see the size of the RDD object to be ~60MB. Is that
>> the footprint of the RDD in the driver JVM?
>>
>> scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
>> scala> SizeEstimator.estimate(temp)
>> res13: Long = 69507320
>>
>> Srikanth
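The sketch mentioned above: as far as I can tell, Catalyst attaches a Statistics object with a sizeInBytes estimate to every logical plan, and the broadcast decision compares that estimate (not the in-memory columnar size) against the threshold. Reading it goes through queryExecution, which is an internal/developer API, so the exact field names may differ across versions; this assumes the 1.x layout and a REPL with df and sqlContext in scope:

// NOTE: queryExecution and statistics are internal APIs (Spark 1.x names).
// Catalyst's own size estimate for the DataFrame's logical plan:
val sizeEstimate: BigInt = df.queryExecution.analyzed.statistics.sizeInBytes

// The broadcast decision is effectively this comparison (default 10 MB):
val threshold = sqlContext.getConf("spark.sql.autoBroadcastJoinThreshold").toLong
val wouldBroadcast = sizeEstimate <= BigInt(threshold)

One caveat: for sources without statistics, sizeInBytes seems to fall back to a large default placeholder, so the estimate isn't always meaningful.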