From a quick glance at SparkStrategies.scala, when statistics.sizeInBytes
of a LogicalPlan is <= autoBroadcastJoinThreshold, that plan's output
would be used as the 'build' relation in a broadcast join.
FYI
On Mon, Aug 10, 2015 at 8:04 AM, Srikanth wrote:
> SizeEstimator.estimate(df) will not
SizeEstimator.estimate(df) will not give the size of the dataframe's data,
right? I think it will just give the size of the df object itself.
With an RDD, I sample(), collect(), and sum the size of each row. If I do
the same with a dataframe, the result will no longer match its size in the
columnar representation.
I'd also like to know how spark.sq
Have you tried calling SizeEstimator.estimate() on a DataFrame?
I did the following in REPL:
scala> SizeEstimator.estimate(df)
res1: Long = 17769680
FYI
On Fri, Aug 7, 2015 at 6:48 AM, Srikanth wrote:
> Hello,
>
> Is there a way to estimate the approximate size of a dataframe? I know we
> ca
Hello,
Is there a way to estimate the approximate size of a dataframe? I know we
can cache it and look at the size in the UI, but I'm trying to do this
programmatically. With an RDD, I can sample and sum up sizes using
SizeEstimator, then extrapolate to the entire RDD. That gives me the
approximate size of the RDD.
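The sample-and-extrapolate idea can be sketched in plain Scala with no Spark dependency. The names here (SizeExtrapolation, estimateRowSize, the UTF-8 byte-length per-row estimate) are illustrative stand-ins for RDD.sample plus SizeEstimator.estimate on each sampled row:

```scala
// Hedged sketch of sample-based size estimation: estimate the size of
// each sampled row, sum, and scale up by the inverse sampling fraction.
object SizeExtrapolation {
  // Rough per-row size: UTF-8 byte length of the row's string form.
  // A stand-in for SizeEstimator.estimate(row) on a real sampled row.
  def estimateRowSize(row: String): Long =
    row.getBytes("UTF-8").length.toLong

  // Scale the sampled total up to an estimate for the full dataset.
  def extrapolate(sampleSizes: Seq[Long], sampleFraction: Double): Long = {
    require(sampleFraction > 0.0 && sampleFraction <= 1.0)
    math.round(sampleSizes.sum / sampleFraction)
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 1000).map(i => s"row-$i,value-$i")
    val fraction = 0.1
    // Stand-in for rdd.sample(false, fraction).collect()
    val sample = rows.take((rows.size * fraction).toInt)
    val estimatedTotal = extrapolate(sample.map(estimateRowSize), fraction)
    println(estimatedTotal)
  }
}
```

Note this extrapolates row sizes in their JVM/string form, so as discussed above it will not reflect the size of the same data once cached in Spark's columnar format.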