Don't do that. Union them all at once with SparkContext.union
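
Roughly something like this (an untested sketch mirroring the test below; the
unionAtOnce helper name and the conversion back to a Dataset via Encoders.LONG
are just one way to illustrate it):

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

def unionAtOnce(spark: SparkSession, n: Int): Dataset[java.lang.Long] = {
  val dataframes = (0 until n).map(_ => spark.range(1000))
  // One SparkContext.union over the underlying RDDs instead of n - 1
  // pairwise Dataset unions, so the logical plan stays flat.
  val unioned = spark.sparkContext.union(dataframes.map(_.rdd))
  // Rebuild a Dataset from the unioned RDD; skip this step if an RDD is
  // enough for the action you run afterwards.
  spark.createDataset(unioned)(Encoders.LONG)
}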

On Thu, Dec 29, 2016, 17:21 assaf.mendelson <assaf.mendel...@rsa.com> wrote:

> Hi,
>
>
>
> I have been playing around with unioning a large number of dataframes and
> saw that the performance of the actual union (not the action) is worse
> than O(N^2). Since a union basically just defines a lineage (i.e., the
> current plan plus the other dataframe's plan as a child), it should be
> almost instantaneous; in practice, however, it can be very costly.
>
>
>
> I was wondering why this is the case and whether there is a way to fix it.
>
>
>
> A sample test:
>
> def testUnion(n: Int): Long = {
>   val dataframes = for {
>     x <- 0 until n
>   } yield spark.range(1000)
>
>   val t0 = System.currentTimeMillis()
>   val allDF = dataframes.reduceLeft(_.union(_))
>   val t1 = System.currentTimeMillis()
>   val totalTime = t1 - t0
>   println(s"$totalTime milliseconds")
>   totalTime
> }
>
>
>
> scala> testUnion(100)
> 193 milliseconds
> res5: Long = 193
>
> scala> testUnion(200)
> 759 milliseconds
> res1: Long = 759
>
> scala> testUnion(500)
> 4438 milliseconds
> res2: Long = 4438
>
> scala> testUnion(1000)
> 18441 milliseconds
> res6: Long = 18441
>
> scala> testUnion(2000)
> 88498 milliseconds
> res7: Long = 88498
>
> scala> testUnion(5000)
> 822305 milliseconds
> res8: Long = 822305
