Re: Estimate size of Dataframe programmatically

2015-08-10, Ted Yu
From a quick glance at SparkStrategies.scala: when statistics.sizeInBytes of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output would be used as the 'build' relation in a broadcast join. FYI. On Mon, Aug 10, 2015 at 8:04 AM, Srikanth wrote: ...
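
A minimal sketch of how to read that estimate yourself, assuming Spark 1.4/1.5-era developer APIs (df.queryExecution and LogicalPlan.statistics are internal and may differ in other versions); the 10 MB value is only a fallback default for the threshold:

    // Read the optimizer's size estimate that the broadcast-join decision is based on
    val sizeInBytes: BigInt = df.queryExecution.optimizedPlan.statistics.sizeInBytes
    // spark.sql.autoBroadcastJoinThreshold defaults to 10 MB
    val threshold = sqlContext.getConf(
      "spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString).toLong
    println(s"estimated size = $sizeInBytes bytes; " +
      s"broadcast candidate = ${sizeInBytes <= BigInt(threshold)}")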

Re: Estimate size of Dataframe programmatically

2015-08-10, Srikanth
SizeEstimator.estimate(df) will not give the size of the DataFrame, right? I think it will give the size of the df object. With an RDD, I sample(), collect(), and sum the size of each row. If I do the same with a DataFrame, the result will no longer match its size when represented in columnar format. I'd also like to know how spark.sq...
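
If the number that matters is the columnar, cached size (the figure shown on the Storage tab), one hedged option, not spelled out in this thread, is to cache and materialize the DataFrame and then read the storage info programmatically instead of from the UI; sc.getRDDStorageInfo is a developer API and only reflects partitions that have actually been cached:

    df.cache()
    df.count()  // force materialization into the in-memory columnar cache
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: memSize=${info.memSize} bytes, diskSize=${info.diskSize} bytes")
    }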

Re: Estimate size of Dataframe programmatically

2015-08-07, Ted Yu
Have you tried calling SizeEstimator.estimate() on a DataFrame? I did the following in the REPL:

    scala> SizeEstimator.estimate(df)
    res1: Long = 17769680

FYI. On Fri, Aug 7, 2015 at 6:48 AM, Srikanth wrote: ...
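
A self-contained version of that call, assuming org.apache.spark.util.SizeEstimator is visible in your build (it is exposed only as a developer API and its visibility has varied across Spark releases). As the reply above notes, this measures the driver-side DataFrame object graph, not the data it represents:

    import org.apache.spark.util.SizeEstimator

    val bytes: Long = SizeEstimator.estimate(df)
    println(s"SizeEstimator.estimate(df) = $bytes bytes")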

Estimate size of Dataframe programmatically

2015-08-07, Srikanth
Hello, Is there a way to estimate the approximate size of a DataFrame? I know we can cache it and look at the size in the UI, but I'm trying to do this programmatically. With an RDD, I can sample, sum up the sizes using SizeEstimator, and then extrapolate to the entire RDD. That gives me the approximate size of the RDD.
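
A hedged sketch of that sampling approach (the fraction and the use of df.rdd are illustrative, and SizeEstimator is a developer API): estimate a sample of rows with SizeEstimator and extrapolate. As the replies above point out, this measures the deserialized, row-format size rather than the columnar size of a cached DataFrame:

    import org.apache.spark.util.SizeEstimator

    val fraction = 0.01
    val sampledRows = df.rdd.sample(withReplacement = false, fraction).collect()
    val sampledBytes = sampledRows.map(row => SizeEstimator.estimate(row)).sum
    val approxTotalBytes =
      if (sampledRows.nonEmpty) (sampledBytes / fraction).toLong else 0L
    println(s"approximate size: $approxTotalBytes bytes")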