Don't do that. Union them all at once with SparkContext.union.

On Thu, Dec 29, 2016, 17:21 assaf.mendelson <assaf.mendel...@rsa.com> wrote:
> Hi,
>
> I have been playing around with doing unions of a large number of
> dataframes and saw that the performance of the actual union (not the
> action) is worse than O(N^2). Since a union basically just defines a lineage
> (i.e. the current plan with the other plan as a child of the union), this
> should be almost instantaneous; however, in practice it can be very costly.
>
> I was wondering why this is and if there is a way to fix this.
>
> A sample test:
>
> def testUnion(n: Int): Long = {
>   val dataframes = for {
>     x <- 0 until n
>   } yield spark.range(1000)
>
>   val t0 = System.currentTimeMillis()
>   val allDF = dataframes.reduceLeft(_.union(_))
>   val t1 = System.currentTimeMillis()
>   val totalTime = t1 - t0
>   println(s"$totalTime milliseconds")
>   totalTime
> }
>
> scala> testUnion(100)
> 193 milliseconds
> res5: Long = 193
>
> scala> testUnion(200)
> 759 milliseconds
> res1: Long = 759
>
> scala> testUnion(500)
> 4438 milliseconds
> res2: Long = 4438
>
> scala> testUnion(1000)
> 18441 milliseconds
> res6: Long = 18441
>
> scala> testUnion(2000)
> 88498 milliseconds
> res7: Long = 88498
>
> scala> testUnion(5000)
> 822305 milliseconds
> res8: Long = 822305
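For illustration only (not part of the thread), here is a minimal sketch of what "union them all at once" could look like for the poster's testUnion setup, assuming Spark 2.x and the spark-shell's `spark` SparkSession in scope as in the original snippet; the function name testUnionAtOnce is made up for this example.

```scala
// Minimal sketch (not from the thread): same benchmark shape as testUnion,
// but combining all inputs with a single SparkContext.union call instead of
// a reduceLeft chain of nested Union nodes. Assumes the spark-shell's
// `spark` (SparkSession) is in scope, as in the original snippet.
import org.apache.spark.sql.DataFrame

def testUnionAtOnce(n: Int): Long = {
  val dataframes: Seq[DataFrame] = (0 until n).map(_ => spark.range(1000).toDF())

  val t0 = System.currentTimeMillis()
  // Union the underlying RDDs in one call, then rebuild a DataFrame
  // using the schema shared by all inputs.
  val unionRDD = spark.sparkContext.union(dataframes.map(_.rdd))
  val allDF = spark.createDataFrame(unionRDD, dataframes.head.schema)
  val t1 = System.currentTimeMillis()

  println(s"${t1 - t0} milliseconds")
  t1 - t0
}
```

The idea behind going through SparkContext.union is that all n inputs are combined in one step, rather than building a plan that grows one level deeper on every union call, which is what the reduceLeft chain in the quoted test does.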