Re: Ability to have CountVectorizerModel vocab as empty

Sean Owen Wed, 19 Aug 2020 06:28:53 -0700

I think that's true. You're welcome to open a pull request / JIRA to
remove that requirement.


On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri <purija...@gmail.com> wrote:
>
> Hello,
>
> This is with regard to
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244
>
> require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
>
> Currently, training `CountVectorizer` on an empty dataset results in the
> exception below. But sending it empty data is a perfectly valid use case,
> and so is a minDF setting that filters out every term.
> HashingTF works fine in such scenarios; CountVectorizer doesn't.
>
> Can we remove this constraint? Happy to send a pull request.
>
> java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
> at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
> at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
> at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
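
For readers following the thread, the behavior under discussion can be sketched without Spark. This is a minimal illustration and not Spark's implementation: the object `EmptyVocabSketch` and the helpers `buildVocab`, `fitStrict`, `fitLenient`, and `transform` are hypothetical names invented for this example. Only the `require` check and its message come from the quoted source. It shows how empty input (or a high minDF) produces an empty vocabulary, how the check rejects it, and what dropping the check might look like.

```scala
// Hypothetical, Spark-free sketch. None of these names are Spark APIs.
object EmptyVocabSketch {

  /** Keep the terms that appear in at least `minDF` documents. */
  def buildVocab(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    docs
      .flatMap(_.distinct)               // count each term once per document
      .groupBy(identity)
      .collect { case (term, hits) if hits.size >= minDF => term }
      .toArray
      .sorted

  /** Mirrors the quoted check: an empty vocabulary is rejected. */
  def fitStrict(docs: Seq[Seq[String]], minDF: Int): Array[String] = {
    val vocab = buildVocab(docs, minDF)
    require(vocab.length > 0,
      "The vocabulary size should be > 0. Lower minDF as necessary.")
    vocab
  }

  /** Assumed shape of the proposed relaxation: an empty vocabulary is allowed. */
  def fitLenient(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    buildVocab(docs, minDF)

  /** Term counts against the vocabulary; an empty vocabulary gives an empty vector. */
  def transform(vocab: Array[String], doc: Seq[String]): Array[Int] =
    vocab.map(term => doc.count(_ == term))
}
```

With this sketch, `fitStrict(Seq.empty, 1)` throws an `IllegalArgumentException` like the one in the trace above, while `fitLenient(Seq.empty, 1)` returns an empty array and `transform` then yields zero-length count vectors. Whether zero-length vectors are acceptable to downstream stages is the question a real change would need to settle.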
