Re: [PR] [SPARK-51340][ML][CONNECT] Model size estimation for linear classification & regression models [spark]

via GitHub Wed, 05 Mar 2025 01:05:07 -0800


WeichenXu123 commented on code in PR #50106:
URL: https://github.com/apache/spark/pull/50106#discussion_r1980979392



##########
mllib/src/main/scala/org/apache/spark/ml/Estimator.scala:
##########
@@ -81,4 +81,21 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
   }
 
   override def copy(extra: ParamMap): Estimator[M]
+
+  /**
+   * For ml connect only.
+   * Estimate an upper-bound size of the model to be fitted in bytes, based on the
+   * parameters and the dataset, e.g., using $(k) and numFeatures to estimate a
+   * k-means model size.
+   * 1, Only driver side memory usage is counted, distributed objects (like DataFrame,
+   * RDD, Graph, Summary) are ignored.
+   * 2, Lazy vals are not counted, e.g., an auxiliary object used in prediction.
+   * 3, If there is no enough information to get an accurate size, try to estimate the
+   * upper-bound size, e.g.
+   *    - Given a LogisticRegression estimator, assume the coefficients are dense, even
+   *      though the actual fitted model might be sparse (by L1 penalty).
+   *    - Given a tree model, assume all underlying trees are complete binary trees, even
+   *      though some branches might be pruned or truncated.

Review Comment:
    For tree models, we will set a model-size threshold that stops training
    early, instead of estimating an upper bound on the model size before
    training (a complete-binary-tree bound would overestimate by far too much).
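
    To illustrate the difference between the two approaches, here is a rough
    sketch. It is not the code from this PR: the object name, the helper
    names, the 8-bytes-per-Double cost, and the per-node byte cost are all
    hypothetical and chosen only for the arithmetic.

```scala
// Hypothetical sketch for illustration only; not the implementation in PR #50106.
object ModelSizeSketch {
  // Assumption: one Double in a dense array costs 8 bytes; object headers are ignored.
  val BytesPerDouble: Long = 8L

  /** Upper bound for a linear model: dense coefficient matrix plus one intercept per row. */
  def denseLinearModelBytes(numClasses: Int, numFeatures: Int): Long = {
    val coefficients = numClasses.toLong * numFeatures.toLong
    val intercepts = numClasses.toLong
    (coefficients + intercepts) * BytesPerDouble
  }

  /** Node count of a complete binary tree of the given depth (root at depth 0). */
  def completeTreeNodes(maxDepth: Int): Long = (1L << (maxDepth + 1)) - 1L

  /** Pre-training upper bound for an ensemble, assuming a fixed (hypothetical) per-node cost. */
  def completeEnsembleBytes(numTrees: Int, maxDepth: Int, bytesPerNode: Long): Long =
    numTrees.toLong * completeTreeNodes(maxDepth) * bytesPerNode

  /** Early-stop style check: compare the size grown so far against a threshold. */
  def exceedsThreshold(nodesSoFar: Long, bytesPerNode: Long, thresholdBytes: Long): Boolean =
    nodesSoFar * bytesPerNode > thresholdBytes
}
```

    Under these assumptions the dense bound for a linear model stays close to
    the real size (3 classes x 1000 features gives 24024 bytes), while the
    complete-tree bound for 100 trees of depth 30 at 64 bytes per node comes
    to roughly 13.7 TB, which is why a threshold checked during training
    seems more practical for tree models.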



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

