WeichenXu123 commented on code in PR #50106:
URL: https://github.com/apache/spark/pull/50106#discussion_r1980979392
##########
mllib/src/main/scala/org/apache/spark/ml/Estimator.scala:
##########
@@ -81,4 +81,21 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
   }
 
   override def copy(extra: ParamMap): Estimator[M]
+
+  /**
+   * For ml connect only.
+   * Estimate an upper-bound size of the model to be fitted in bytes, based on the
+   * parameters and the dataset, e.g., using $(k) and numFeatures to estimate a
+   * k-means model size.
+   * 1, Only driver side memory usage is counted, distributed objects (like DataFrame,
+   * RDD, Graph, Summary) are ignored.
+   * 2, Lazy vals are not counted, e.g., an auxiliary object used in prediction.
+   * 3, If there is no enough information to get an accurate size, try to estimate the
+   * upper-bound size, e.g.
+   *    - Given a LogisticRegression estimator, assume the coefficients are dense, even
+   *      though the actual fitted model might be sparse (by L1 penalty).
+   *    - Given a tree model, assume all underlying trees are complete binary trees, even
+   *      though some branches might be pruned or truncated.

Review Comment:
   For tree models, we will set a model-size threshold that stops training early,
   instead of estimating a model-size upper bound before training (to avoid
   overestimating by too much).

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
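
To make the trade-off in the review comment concrete, here is a minimal sketch of the upper-bound arithmetic that the doc comment describes, next to the early-stop check that the comment proposes for trees. Everything in it is an assumption for illustration: `ModelSizeSketch`, its method names, and the `bytesPerNode` figure are hypothetical and are not part of the PR or of Spark's API.

```scala
// Hypothetical helper for illustration only; not part of Spark or of PR #50106.
object ModelSizeSketch {
  // A JVM Double occupies 8 bytes; object headers and array overhead are ignored here.
  private val DoubleBytes = 8L

  /** k-means: k dense cluster centers, each holding numFeatures doubles. */
  def kMeansUpperBound(k: Int, numFeatures: Int): Long =
    k.toLong * numFeatures * DoubleBytes

  /** Logistic regression: assume dense coefficients plus one intercept per coefficient set. */
  def logisticRegressionUpperBound(numCoefficientSets: Int, numFeatures: Int): Long =
    numCoefficientSets.toLong * (numFeatures + 1L) * DoubleBytes

  /** A complete binary tree of depth d has 2^(d+1) - 1 nodes. */
  def completeTreeNodes(maxDepth: Int): Long = {
    require(maxDepth >= 0 && maxDepth < 62, "maxDepth out of range for a Long node count")
    (1L << (maxDepth + 1)) - 1
  }

  /** Pre-training bound for a tree ensemble; bytesPerNode is an assumed per-node cost. */
  def treeEnsembleUpperBound(numTrees: Int, maxDepth: Int, bytesPerNode: Long): Long =
    numTrees.toLong * completeTreeNodes(maxDepth) * bytesPerNode

  /** Early-stop alternative: compare the size actually reached so far with a threshold. */
  def shouldStopGrowing(currentNodes: Long, bytesPerNode: Long, maxModelBytes: Long): Boolean =
    currentNodes * bytesPerNode >= maxModelBytes
}
```

With an assumed 64 bytes per node, `treeEnsembleUpperBound(1, 30, 64L)` comes to 137438953408 bytes (about 128 GiB) for a single tree of depth 30, although a fitted tree would normally be far smaller. That gap is the overestimation the comment wants to avoid by checking the size reached during training, as in `shouldStopGrowing`, rather than bounding it up front.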