Hi Liang-Chi,

Thank you for the updates. This looks promising.
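For anyone who wants to reproduce the per-iteration numbers quoted below, here is a minimal sketch of one way to time how long preparing the executed plan takes. The helper name and measurement point are illustrative; the thread does not say exactly what instrumentation was used to produce those numbers.

import org.apache.spark.sql.DataFrame

// Forces analysis, optimization and physical planning for the given Dataset
// and returns the elapsed time in milliseconds. Accessing executedPlan is
// what triggers the preparation work being measured.
def timePlanPreparation(df: DataFrame): Long = {
  val start = System.nanoTime()
  df.queryExecution.executedPlan
  (System.nanoTime() - start) / 1000000
}

Calling something like this on the output of each successive stage in the example pipeline should show the same growth pattern as the timings quoted below, since every transform wraps the previous plan in additional nodes.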
On 02/03/2017 08:34 AM, Liang-Chi Hsieh wrote:
> Hi Maciej,
>
> FYI, this fix is submitted at https://github.com/apache/spark/pull/16785.
>
> Liang-Chi Hsieh wrote
>> Hi Maciej,
>>
>> After looking into the details of the time spent on preparing the executed
>> plan, the cause of the significant difference between 1.6 and the current
>> codebase when running the example is the optimization process that
>> generates constraints.
>>
>> A few of the operations involved in generating constraints are not
>> optimized. Combined with the fact that the query plan grows continuously,
>> the time spent on generating constraints increases more and more.
>>
>> I am trying to reduce the time cost. Although it will not be as low as 1.6,
>> because we can't remove the constraint generation process entirely, it is
>> significantly lower than the current codebase (74294 ms -> 2573 ms).
>>
>> With the fix, time to prepare the executed plan per iteration (ms):
>> 385, 107, 46, 58, 64, 105, 86, 122, 115, 114, 100, 109, 169, 196, 174,
>> 212, 290, 254, 318, 405, 347, 443, 432, 500, 544, 619, 697, 683, 807,
>> 802, 960, 1010, 1155, 1251, 1298, 1388, 1503, 1613, 2279, 2349, 2573
>>
>> Liang-Chi Hsieh wrote
>>> Hi Maciej,
>>>
>>> Thanks for the info you provided.
>>>
>>> I tried to run the same example on 1.6 and on the current branch and
>>> recorded the difference in the time spent preparing the executed plan.
>>>
>>> Current branch, per iteration (ms):
>>> 292, 95, 57, 34, 128, 120, 63, 106, 179, 159, 235, 260, 334, 464, 547,
>>> 719, 942, 1130, 1928, 1751, 2159, 2767, 3333, 4175, 5106, 6269, 7683,
>>> 9210, 10931, 13237, 15651, 19222, 23841, 26135, 31299, 38437, 47392,
>>> 51420, 60285, 69840, 74294
>>>
>>> 1.6, per iteration (ms):
>>> 3, 4, 10, 4, 17, 8, 12, 21, 15, 15, 19, 23, 28, 28, 58, 39, 43, 61, 56,
>>> 60, 81, 73, 100, 91, 96, 116, 111, 140, 127, 142, 148, 165, 171, 198,
>>> 200, 233, 237, 253, 256, 271, 292, 452
>>>
>>> Although both take more time with each iteration because of the growing
>>> query plan, the current branch clearly takes much more time than the 1.6
>>> branch. The optimizer and query planning in the current branch are much
>>> more complicated than in 1.6.
>>>
>>> zero323 wrote
>>>> Hi Liang-Chi,
>>>>
>>>> Thank you for your answer and PR, but I think I wasn't specific enough.
>>>> In hindsight I should have illustrated this better. What really troubles
>>>> me here is the pattern of growing delays. Difference between 1.6.3
>>>> (roughly 20 s of runtime since the first job):
>>>>
>>>> [1.6 timeline]
>>>>
>>>> vs 2.1.0 (45 minutes or so in a bad case):
>>>>
>>>> [2.1.0 timeline]
>>>>
>>>> The code is just an example and it is intentionally dumb. You can easily
>>>> mask this with caching, or by using significantly larger data sets. So I
>>>> guess the question I am really interested in is: what changed between
>>>> 1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
>>>> current master) to cause this, and more importantly, is it a feature or
>>>> is it a bug?
>>>>
>>>> I admit, I chose a lazy path here, and didn't spend much time (yet)
>>>> trying to dig deeper.
>>>>
>>>> I can see slightly higher memory usage and somewhat more intensive GC
>>>> activity, but nothing I would really blame for this behavior, and the
>>>> duration of individual jobs is comparable, slightly in favor of 2.x.
>>>> Neither StringIndexer nor OneHotEncoder changed much in 2.x. They used
>>>> RDDs for fitting in 1.6 and, as far as I can tell, they still do in 2.x.
>>>> And the problem doesn't look related to the data processing part in the
>>>> first place.
>>>>
>>>> On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
>>>>> Hi Maciej,
>>>>>
>>>>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>>>>
>>>>> Liang-Chi Hsieh wrote
>>>>>> Hi Maciej,
>>>>>>
>>>>>> Basically, the fitting algorithm in Pipeline is an iterative operation.
>>>>>> Running an iterative algorithm on a Dataset produces RDD lineages and
>>>>>> query plans that grow quickly. Without cache and checkpoint, it gets
>>>>>> slower as the number of iterations increases.
>>>>>>
>>>>>> I think that is why a Pipeline with a long chain of stages takes much
>>>>>> longer to finish. As it is not uncommon to have a long chain of stages
>>>>>> in a Pipeline, we should improve this. I will submit a PR for it.
>>>>>>
>>>>>> zero323 wrote
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> While experimenting with ML pipelines I'm experiencing a significant
>>>>>>> performance regression when switching from 1.6.x to 2.x.
>>>>>>>
>>>>>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>>>>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
>>>>>>>
>>>>>>> val df = (1 to 40).foldLeft(
>>>>>>>   Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
>>>>>>> )((df, i) => df.withColumn(s"x$i", $"x0"))
>>>>>>>
>>>>>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>>>>>   .setInputCol(c)
>>>>>>>   .setOutputCol(s"${c}_indexed")
>>>>>>>   .setHandleInvalid("skip"))
>>>>>>>
>>>>>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>>>>>   .setInputCol(indexer.getOutputCol)
>>>>>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>>>>>   .setDropLast(true))
>>>>>>>
>>>>>>> val assembler = new VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>>>>>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>>>>>
>>>>>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>>>>>
>>>>>>> Task execution time is comparable and executors are idle most of the
>>>>>>> time, so it looks like a problem with the optimizer. Is it a known
>>>>>>> issue? Are there any changes I've missed that could lead to this
>>>>>>> behavior?
>>>>>>>
>>>>>>> --
>>>>>>> Best,
>>>>>>> Maciej
>>>>>
>>>>> -----
>>>>> Liang-Chi Hsieh | @viirya
>>>>> Spark Technology Center
>>>>> http://www.spark.tc/
>>>>
>>>> --
>>>> Maciej Szymkiewicz
>>>>
>>>> nM15AWH.png (19K)
>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/0/nM15AWH.png>
>>>> KHZa7hL.png (26K)
>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/1/KHZa7hL.png>
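A side note on the cache-and-checkpoint remark in the quoted messages above: until planner-side improvements land, lineage and plan growth can be truncated manually between transformations that you control. A minimal sketch, assuming a SparkSession named spark and an illustrative checkpoint directory (Dataset.checkpoint is available as of 2.1):

import org.apache.spark.sql.DataFrame

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path

// Materializes df and returns a Dataset whose plan is rooted at the
// checkpointed data, so later transformations no longer re-analyze and
// re-optimize the accumulated history.
def truncateLineage(df: DataFrame): DataFrame =
  df.checkpoint(eager = true)

This only helps where you control the chain of transforms yourself; inside Pipeline.fit the chaining happens internally, which is what the PRs referenced in the thread aim to improve.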
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
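A closing note on the example itself: the test DataFrame is built with 40 chained withColumn calls, which already stacks one projection per added column in the analyzed plan before any pipeline stage runs. Below is a sketch of an equivalent, flatter construction, using the same column names as the example; it only flattens the input side and does not change the plan growth inside Pipeline.fit discussed above.

import org.apache.spark.sql.functions.col
import spark.implicits._  // assumes a SparkSession named spark, as in the shell

val base = Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")

// One select carrying all columns instead of 40 nested withColumn projections.
val cols = Seq(col("id"), col("x0")) ++ (1 to 40).map(i => col("x0").as(s"x$i"))
val flatDf = base.select(cols: _*)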