Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

via GitHub Tue, 25 Feb 2025 16:44:30 -0800


zhengruifeng commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1970735576



##########
mllib/src/main/scala/org/apache/spark/ml/classification/FMClassifier.scala:
##########
@@ -235,6 +236,13 @@ class FMClassifier @Since("3.0.0") (
     model.setSummary(Some(summary))
   }
 
+  override def estimateModelSize(dataset: Dataset[_]): Long = {
+    val numFeatures = DatasetUtils.getNumFeatures(dataset, $(featuresCol))

Review Comment:
   DatasetUtils.getNumFeatures is quite cheap: it first tries to fetch 
`numFeatures` from the metadata, and if there is no such metadata, it just 
infers `numFeatures` from the first row.
   
   
https://github.com/apache/spark/blob/9cf6dc873ff34412df6256cdc7613eed40716570/mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala#L206-L214
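
   To illustrate the "metadata first, first row as fallback" behaviour 
described above, here is a minimal sketch in plain Scala. It does not call 
Spark: `numFeatures`, `fromMetadata` and `firstRow` are hypothetical names 
standing in for the real lookup, so treat it as a model of the control flow 
rather than the actual implementation.

```scala
object NumFeaturesSketch {
  // Hypothetical stand-in, not the Spark API.
  // `fromMetadata` models the size recorded in the column metadata, if any.
  // `firstRow` models reading the first feature vector; it is passed by name,
  // so it is evaluated only when the metadata is missing.
  def numFeatures(fromMetadata: Option[Int], firstRow: => Array[Double]): Int =
    fromMetadata.getOrElse(firstRow.length)

  def main(args: Array[String]): Unit = {
    // Metadata present: the first row is never read.
    println(numFeatures(Some(10), sys.error("first row should not be read")))
    // Metadata absent: fall back to the length of the first row.
    println(numFeatures(None, Array(1.0, 0.0, 3.0)))
  }
}
```

   The by-name parameter is what keeps the common case cheap in this model: 
when the metadata already carries the size, no data is touched at all.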



##########
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:
##########
@@ -504,6 +506,10 @@ object Vectors {
 
   /** Max number of nonzero entries used in computing hash code. */
   private[linalg] val MAX_HASH_NNZ = 128
+
+  private[ml] def getSparseSize(nnz: Long): Long = nnz * 12 + 20

Review Comment:
   SG
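
   For reference, a small sketch of what the formula in the diff evaluates 
to. The per-entry cost of 12 bytes is consistent with one 4-byte Int index 
plus one 8-byte Double value per nonzero; that reading is an assumption, and 
the diff does not explain how the constant 20 is made up.

```scala
object SparseSizeSketch {
  // Same formula as in the diff: 12 bytes per nonzero entry plus a constant 20.
  def getSparseSize(nnz: Long): Long = nnz * 12 + 20

  // Assumed breakdown of the 12 (not stated in the PR):
  // one Int index (4 bytes) plus one Double value (8 bytes).
  val assumedBytesPerEntry: Long = java.lang.Integer.BYTES + java.lang.Double.BYTES

  def main(args: Array[String]): Unit = {
    println(getSparseSize(0L))    // only the constant term remains
    println(getSparseSize(128L))  // 128 nonzeros
    println(assumedBytesPerEntry)
  }
}
```

   Taking `nnz` as a Long rather than an Int keeps the multiplication from 
overflowing for very large estimates.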



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


