[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16098612#comment-16098612 ]
Maciej Bryński commented on SPARK-20392:
----------------------------------------

Is it safe to merge this into 2.2? I'm tracing Catalyst performance problems and this could be a solution.

Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
-------------------------------------------------------------------------------------------

                 Key: SPARK-20392
                 URL: https://issues.apache.org/jira/browse/SPARK-20392
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.1.0
            Reporter: Barry Becker
            Assignee: Liang-Chi Hsieh
            Priority: Blocker
         Attachments: blockbuster.csv, blockbuster_fewCols.csv, giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip

This started as a [question on stack overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], but it seems like a bug.

I am testing Spark pipelines using a simple dataset (attached) with 312 (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 minutes to apply my ML pipeline to it on a 24-core server with 60 GB of memory. This seems much too long for such a tiny dataset. Similar pipelines run quickly on datasets that have fewer columns and more rows; it's something about the number of columns that is causing the slow performance.

Here is a list of the stages in my pipeline:

{code}
000_strIdx_5708525b2b6c
001_strIdx_ec2296082913
002_bucketizer_3cbc8811877b
003_bucketizer_5a01d5d78436
004_bucketizer_bf290d11364d
005_bucketizer_c3296dfe94b2
006_bucketizer_7071ca50eb85
007_bucketizer_27738213c2a1
008_bucketizer_bd728fd89ba1
009_bucketizer_e1e716f51796
010_bucketizer_38be665993ba
011_bucketizer_5a0e41e5e94f
012_bucketizer_b5a3d5743aaa
013_bucketizer_4420f98ff7ff
014_bucketizer_777cc4fe6d12
015_bucketizer_f0f3a3e5530e
016_bucketizer_218ecca3b5c1
017_bucketizer_0b083439a192
018_bucketizer_4520203aec27
019_bucketizer_462c2c346079
020_bucketizer_47435822e04c
021_bucketizer_eb9dccb5e6e8
022_bucketizer_b5f63dd7451d
023_bucketizer_e0fd5041c841
024_bucketizer_ffb3b9737100
025_bucketizer_e06c0d29273c
026_bucketizer_36ee535a425f
027_bucketizer_ee3a330269f1
028_bucketizer_094b58ea01c0
029_bucketizer_e93ea86c08e2
030_bucketizer_4728a718bc4b
031_bucketizer_08f6189c7fcc
032_bucketizer_11feb74901e6
033_bucketizer_ab4add4966c7
034_bucketizer_4474f7f1b8ce
035_bucketizer_90cfa5918d71
036_bucketizer_1a9ff5e4eccb
037_bucketizer_38085415a4f4
038_bucketizer_9b5e5a8d12eb
039_bucketizer_082bb650ecc3
040_bucketizer_57e1e363c483
041_bucketizer_337583fbfd65
042_bucketizer_73e8f6673262
043_bucketizer_0f9394ed30b8
044_bucketizer_8530f3570019
045_bucketizer_c53614f1e507
046_bucketizer_8fd99e6ec27b
047_bucketizer_6a8610496d8a
048_bucketizer_888b0055c1ad
049_bucketizer_974e0a1433a6
050_bucketizer_e848c0937cb9
051_bucketizer_95611095a4ac
052_bucketizer_660a6031acd9
053_bucketizer_aaffe5a3140d
054_bucketizer_8dc569be285f
055_bucketizer_83d1bffa07bc
056_bucketizer_0c6180ba75e6
057_bucketizer_452f265a000d
058_bucketizer_38e02ddfb447
059_bucketizer_6fa4ad5d3ebd
060_bucketizer_91044ee766ce
061_bucketizer_9a9ef04a173d
062_bucketizer_3d98eb15f206
063_bucketizer_c4915bb4d4ed
064_bucketizer_8ca2b6550c38
065_bucketizer_417ee9b760bc
066_bucketizer_67f3556bebe8
067_bucketizer_0556deb652c6
068_bucketizer_067b4b3d234c
069_bucketizer_30ba55321538
070_bucketizer_ad826cc5d746
071_bucketizer_77676a898055
072_bucketizer_05c37a38ce30
073_bucketizer_6d9ae54163ed
074_bucketizer_8cd668b2855d
075_bucketizer_d50ea1732021
076_bucketizer_c68f467c9559
077_bucketizer_ee1dfc840db1
078_bucketizer_83ec06a32519
079_bucketizer_741d08c1b69e
080_bucketizer_b7402e4829c7
081_bucketizer_8adc590dc447
082_bucketizer_673be99bdace
083_bucketizer_77693b45f94c
084_bucketizer_53529c6b1ac4
085_bucketizer_6a3ca776a81e
086_bucketizer_6679d9588ac1
087_bucketizer_6c73af456f65
088_bucketizer_2291b2c5ab51
089_bucketizer_cb3d0fe669d8
090_bucketizer_e71f913c1512
091_bucketizer_156528f65ce7
092_bucketizer_f3ec5dae079b
093_bucketizer_809fab77eee1
094_bucketizer_6925831511e6
095_bucketizer_c5d853b95707
096_bucketizer_e677659ca253
097_bucketizer_396e35548c72
098_bucketizer_78a6410d7a84
099_bucketizer_e3ae6e54bca1
100_bucketizer_9fed5923fe8a
101_bucketizer_8925ba4c3ee2
102_bucketizer_95750b6942b8
103_bucketizer_6e8b50a1918b
104_bucketizer_36cfcc13d4ba
105_bucketizer_2716d0455512
106_bucketizer_9bcf2891652f
107_bucketizer_8c3d352915f7
108_bucketizer_0786c17d5ef9
109_bucketizer_f22df23ef56f
110_bucketizer_bad04578bd20
111_bucketizer_35cfbde7e28f
112_bucketizer_cf89177a528b
113_bucketizer_183a0d393ef0
114_bucketizer_467c78156a67
115_bucketizer_380345e651ab
116_bucketizer_0f39f6de1625
117_bucketizer_d8500b2c0c2f
118_bucketizer_dc5f1fd09ff1
119_bucketizer_eeaf9e6cdaef
120_bucketizer_5614cd4533d7
121_bucketizer_2f1230e2871e
122_bucketizer_f8bf9d47e57e
123_bucketizer_2df774393575
124_bucketizer_259320b7fc86
125_bucketizer_e334afc63030
126_bucketizer_f17d4d6b4d94
127_bucketizer_da7834230ecd
128_bucketizer_8dbb503f658e
129_bucketizer_e09e2eb2b181
130_bucketizer_faa04fa16f3c
131_bucketizer_d0bd348a5613
132_bucketizer_de6da796e294
133_bucketizer_0395526346ce
134_bucketizer_ea3b5eb6058f
135_bucketizer_ad83472038f7
136_bucketizer_4a17c440fd16
137_bucketizer_d468637d4b86
138_bucketizer_4fc473a72f1d
139_vecAssembler_bd87cd105650
140_nb_f134e0890a0d
141_sql_a8590b83c826
{code}

There are two string columns that are converted to ints with StringIndexerModel. Then there are bucketizers that bin all the numeric columns into 2 or 3 bins each. Is there a way to bin many columns at once with a single stage? I did not see a way. Next there is a VectorAssembler to combine all the columns into one for the NaiveBayes classifier. Lastly, there is a simple SQLTransformer to cast the prediction column to an int.

Here is what the metadata for the two StringIndexerModels looks like:

{code}
{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1492551461778,"sparkVersion":"2.1.1","uid":"strIdx_5708525b2b6c","paramMap":{"outputCol":"ADI_IDX__","handleInvalid":"skip","inputCol":"ADI_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1492551462004,"sparkVersion":"2.1.1","uid":"strIdx_ec2296082913","paramMap":{"outputCol":"State_IDX__","inputCol":"State_CLEANED__","handleInvalid":"skip"}}
{code}
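For concreteness, here is a minimal Scala sketch of how a pipeline with this shape is typically assembled. The column names follow the metadata in this ticket; the split points, the label wiring, the output alias, and the trainingDf reference are placeholder assumptions, not the reporter's actual code:

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{Bucketizer, SQLTransformer, StringIndexer, VectorAssembler}

// One StringIndexer per string column.
val indexers = Seq("ADI", "State").map { c =>
  new StringIndexer()
    .setInputCol(s"${c}_CLEANED__")
    .setOutputCol(s"${c}_IDX__")
    .setHandleInvalid("skip")
}

// One Bucketizer per numeric column. With ~300 numeric columns this yields
// the ~139 bucketizer stages listed above, and each stage layers another
// projection onto the query plan.
val numericCols = Seq("HH_02", "HH_97", "HH_90") // ...~300 more in the real dataset
val bucketizers = numericCols.map { c =>
  new Bucketizer()
    .setInputCol(s"${c}_CLEANED__")
    .setOutputCol(s"${c}_BINNED__")
    .setHandleInvalid("keep")
    .setSplits(Array(Double.NegativeInfinity, 100.0, 200.0, Double.PositiveInfinity)) // hypothetical splits
}

// Combine everything into a single feature vector for NaiveBayes.
val assembler = new VectorAssembler()
  .setInputCols((Seq("ADI_IDX__", "State_IDX__") ++ numericCols.map(_ + "_BINNED__")).toArray)
  .setOutputCol("_features_column__")

val nb = new NaiveBayes()
  .setModelType("multinomial")
  .setFeaturesCol("_features_column__")
  .setLabelCol("DAYPOP_BINNED__")
  .setPredictionCol("_prediction_column_")

// Final stage: cast the double prediction column to an int.
val caster = new SQLTransformer()
  .setStatement("SELECT *, CAST(_prediction_column_ AS INT) AS prediction_label FROM __THIS__")

val stages: Array[PipelineStage] = (indexers ++ bucketizers ++ Seq(assembler, nb, caster)).toArray
val model = new Pipeline().setStages(stages).fit(trainingDf) // trainingDf: the 312-column DataFrame
{code}

As an aside, multi-column support for Bucketizer (setInputCols/setOutputCols) only arrived later, in Spark 2.3 (SPARK-20542, if I recall correctly), so on 2.1/2.2 one stage per numeric column is unavoidable.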
The bucketizers all look very similar. Here is what the metadata for a few of them looks like:

{code}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462636,"sparkVersion":"2.1.1","uid":"bucketizer_bd728fd89ba1","paramMap":{"outputCol":"HH_02_BINNED__","inputCol":"HH_02_CLEANED__","handleInvalid":"keep","splits":["-Inf",7521.0,12809.5,20299.0,"Inf"]}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462711,"sparkVersion":"2.1.1","uid":"bucketizer_e1e716f51796","paramMap":{"splits":["-Inf",6698.0,13690.5,"Inf"],"handleInvalid":"keep","outputCol":"HH_97_BINNED__","inputCol":"HH_97_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462784,"sparkVersion":"2.1.1","uid":"bucketizer_38be665993ba","paramMap":{"splits":["-Inf",4664.0,7242.5,11770.0,14947.0,"Inf"],"outputCol":"HH_90_BINNED__","handleInvalid":"keep","inputCol":"HH_90_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462858,"sparkVersion":"2.1.1","uid":"bucketizer_5a0e41e5e94f","paramMap":{"splits":["-Inf",6107.5,10728.5,"Inf"],"outputCol":"HH_80_BINNED__","inputCol":"HH_80_CLEANED__","handleInvalid":"keep"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462931,"sparkVersion":"2.1.1","uid":"bucketizer_b5a3d5743aaa","paramMap":{"outputCol":"HHPG9702_BINNED__","splits":["-Inf",8.895000457763672,"Inf"],"handleInvalid":"keep","inputCol":"HHPG9702_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551463004,"sparkVersion":"2.1.1","uid":"bucketizer_4420f98ff7ff","paramMap":{"splits":["-Inf",54980.5,"Inf"],"outputCol":"MEDHI97_BINNED__","handleInvalid":"keep","inputCol":"MEDHI97_CLEANED__"}}
{code}

Here is the metadata for the NaiveBayes model:

{code}
{"class":"org.apache.spark.ml.classification.NaiveBayesModel","timestamp":1492551472568,"sparkVersion":"2.1.1","uid":"nb_f134e0890a0d","paramMap":{"modelType":"multinomial","probabilityCol":"_class_probability_column__","smoothing":1.0,"predictionCol":"_prediction_column_","rawPredictionCol":"rawPrediction","featuresCol":"_features_column__","labelCol":"DAYPOP_BINNED__"}}
{code}

and for the final SQLTransformer:

{code}
{"class":"org.apache.spark.ml.feature.SQLTransformer","timestamp":1492551472804,"sparkVersion":"2.1.1","uid":"sql_a8590b83c826","paramMap":{"statement":"SELECT *, CAST(_prediction_column_ AS INT) AS `_*_prediction_label_column_*__` FROM __THIS__"}}
{code}

Why does the pipeline become extremely slow with more than a couple hundred columns (and only a few rows), while millions of rows (with fewer columns) perform fine? In addition to being slow when applying this pipeline, it is also slow to create it: the fit and evaluate steps take a few minutes each. Is there anything that can be done to make it faster?

I get similar results using 2.1.1RC, 2.1.2 (tip) and 2.2.0 (tip). Spark 2.1.0 gives a Janino 64k limit error when trying to build this pipeline (see https://issues.apache.org/jira/browse/SPARK-16845).

I stepped through in the debugger when pipeline.fit was called and noticed that the query plan is a huge nested structure. I don't know how to interpret this plan, but it is likely related to the performance problem. It is attached.
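Until a Catalyst-side fix ships in a release, one workaround sometimes used for plans that grow with every stage is to truncate the lineage periodically when applying an already-fitted model (this does not speed up Pipeline.fit itself, where each estimator re-triggers analysis of the accumulated plan). A sketch only, assuming a fitted PipelineModel called fittedModel, an input DataFrame inputDf, a spark-shell-style SparkSession named spark, and an arbitrary chunk size of 50 (all hypothetical):

{code}
import org.apache.spark.sql.DataFrame

// checkpoint() needs a reliable directory; use an HDFS path on a real cluster.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// Apply the fitted stages in chunks, eagerly checkpointing between chunks.
// Dataset.checkpoint() (available since Spark 2.1) materializes the data and
// cuts the lineage, so the analyzed plan never accumulates hundreds of
// nested projections.
val scored: DataFrame = fittedModel.stages.grouped(50).foldLeft(inputDf) {
  (df, chunk) => chunk.foldLeft(df)((d, t) => t.transform(d)).checkpoint()
}
{code}

The trade-off is extra I/O for each checkpoint, which is negligible for a 421-row dataset but worth measuring on larger data.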