I think that's true. You're welcome to open a pull request / JIRA to remove that requirement.
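For anyone following the thread, here is a rough sketch of what the check does and what relaxing it would mean. It is a toy model in plain Python, not Spark's implementation. The names `build_vocab`, `transform`, and `allow_empty` are hypothetical, `min_df` is treated only as an absolute document count, and the vocabulary is sorted alphabetically for simplicity.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(docs: Iterable[List[str]], min_df: int = 1,
                allow_empty: bool = False) -> List[str]:
    """Toy stand-in for the vocabulary step of a count vectorizer.

    Hypothetical helper for illustration only; not Spark's code.
    """
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Keep terms that appear in at least `min_df` documents.
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    if not vocab and not allow_empty:
        # Mirrors the quoted require(vocab.length > 0, ...) check.
        raise ValueError(
            "The vocabulary size should be > 0. Lower minDF as necessary.")
    return vocab


def transform(doc: List[str], vocab: List[str]) -> List[int]:
    # Count occurrences of each vocabulary term in the document.
    # With an empty vocabulary this simply yields an empty vector.
    counts = Counter(doc)
    return [counts[term] for term in vocab]
```

With `allow_empty=True`, an empty dataset (or a `min_df` that filters every term) produces an empty vocabulary and zero-length vectors instead of an exception, which is roughly the behaviour being asked for.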

On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri <purija...@gmail.com> wrote:
>
> Hello,
>
> This is wrt
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244
>
> require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
>
> Currently, if `CountVectorizer` is trained on an empty dataset, it fails with
> the following exception. But it is a perfectly valid use case to send it empty
> data (or for minDF to filter everything out).
> HashingTF works fine in such scenarios. CountVectorizer doesn't.
>
> Can we remove this constraint? Happy to send a pull request.
>
> java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
>   at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)