[PR] [SPARK-48892][ML] Avoid per-row param read in `Tokenizer` [spark]

via GitHub Sun, 14 Jul 2024 04:51:00 -0700


zhengruifeng opened a new pull request, #47342:
URL: https://github.com/apache/spark/pull/47342


   ### What changes were proposed in this pull request?
   Inspired by https://github.com/apache/spark/pull/47258, I checked other ML implementations and found that `Tokenizer` can be optimized in the same way.
   
   
   ### Why are the changes needed?
   The function `createTransformFunc` builds the UDF used by `UnaryTransformer.transform`:
   
https://github.com/apache/spark/blob/d679dabdd1b5ad04b8c7deb1c06ce886a154a928/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L118
   
   The existing implementation reads the params for each row.
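
The difference between reading params per row and reading them once can be sketched without Spark. This is a minimal illustration under stated assumptions, not the code changed in this PR: `CountingParams`, `perRowFunc`, and `hoistedFunc` are hypothetical names, and the read counter exists only to make the cost visible.

```scala
// Hypothetical stand-in for a transformer's params; not a Spark class.
final class CountingParams(pattern: String, toLowercase: Boolean) {
  var reads: Int = 0 // how many times any param has been looked up
  def getPattern: String = { reads += 1; pattern }
  def getToLowercase: Boolean = { reads += 1; toLowercase }
}

object ParamReadSketch {
  // Per-row style: the returned function looks up both params on every call.
  def perRowFunc(p: CountingParams): String => Seq[String] = { s =>
    val str = if (p.getToLowercase) s.toLowerCase else s
    str.split(p.getPattern).toSeq
  }

  // Hoisted style: params are read once, when the function is built,
  // and the returned closure captures only the local values.
  def hoistedFunc(p: CountingParams): String => Seq[String] = {
    val lower = p.getToLowercase
    val pattern = p.getPattern
    s => {
      val str = if (lower) s.toLowerCase else s
      str.split(pattern).toSeq
    }
  }
}
```

In this sketch, applying `perRowFunc` to N rows performs 2N param lookups, while `hoistedFunc` performs 2 lookups regardless of N. Both return the same tokens.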
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   CI and manual tests:
   
   Create the test dataset:
   ```
   spark.range(1000000).select(uuid().as("uuid")).write.mode("overwrite").parquet("/tmp/regex_tokenizer.parquet")
   ```
   
   Measure the duration:
   ```
   val df = spark.read.parquet("/tmp/regex_tokenizer.parquet")
   import org.apache.spark.ml.feature._
   val tokenizer = new RegexTokenizer().setPattern("-").setInputCol("uuid")
   Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()) // warm up
   val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
   ```
   
   Result (before this PR), in milliseconds:
   ```
   scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
   val tic: Long = 1720613235068
   val res5: Long = 50397
   ```
   
   Result (after this PR), in milliseconds:
   ```
   scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
   val tic: Long = 1720612871256
   val res5: Long = 43748
   ```
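
For reference, the relative improvement implied by the two timings above can be worked out as follows. This is plain arithmetic on the reported numbers, not an additional measurement, and the variable names are illustrative only.

```scala
// Reported wall-clock totals for 1000 transform + count iterations.
val beforeMs = 50397L
val afterMs = 43748L

// Absolute saving across the 1000 iterations, in milliseconds.
val savedMs = beforeMs - afterMs

// Relative reduction in total time, as a percentage (roughly 13%).
val reductionPct = 100.0 * savedMs / beforeMs

// Average time per iteration, in milliseconds.
val beforePerIterMs = beforeMs / 1000.0
val afterPerIterMs = afterMs / 1000.0
```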
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

