Spark-10848 bug in 2.4.4

2019-10-11 Thread Jatin Puri
Hi. This bug still exists in 2.4.4: https://issues.apache.org/jira/browse/SPARK-10848 The `nullable` value is always set to `true`, at least when reading via `json()`. Should I log a new issue? Is there a temporary workaround? Regards, Jatin

Retraining with (each document as separate file) creates OOME

2018-07-02 Thread Jatin Puri
Maybe this is a bug. The source can be found at: https://github.com/purijatin/spark-retrain-bug *Issue:* The program takes a set of documents as input, where each document is in a separate file. The Spark program computes the tf-idf of the terms (Tokenizer -> Stopword remover -> stemming -> tf -> tfidf). Once
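The first stages of the pipeline named in the post above (Tokenizer -> Stopword remover -> stemming -> tf) can be illustrated in plain Python. This is only a minimal sketch and not the code from the linked repository: the stop-word list, the suffix-stripping stemmer, the file names, and all function names are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "and", "in"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Crude stemmer: strip the first matching suffix from longer tokens."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def term_frequencies(text):
    """Run tokenize -> stop-word removal -> stemming and count the terms."""
    tokens = remove_stop_words(tokenize(text))
    return Counter(stem(t) for t in tokens)


# One entry per document, mirroring "each document in a separate file".
docs = {
    "doc1.txt": "The cats are chasing the mice",
    "doc2.txt": "A cat chased a mouse",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
```

Here `tf` maps each (hypothetical) file name to its per-document term counts, which is the input a tf-idf step would then reweight.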

[mllib] Document frequency

2019-01-14 Thread Jatin Puri
Hello. As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good idea to also expose: 1. Document frequency vector 2. Number of documents We get the above for free currently and they just need to be exposed as public val. This avoids re-implementation for someone who needs to comp
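To illustrate why the two values in the proposal above are useful, here is a minimal plain-Python sketch of how a document frequency vector and a document count combine into IDF weights. It assumes the smoothed formula described in the Spark MLlib feature documentation, idf = log((numDocs + 1) / (df + 1)); the function names and sample documents are hypothetical and are not part of `IDFModel`.

```python
import math


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for terms in docs:
        for term in set(terms):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return df


def idf(df, num_docs):
    """Smoothed IDF, assumed here to be log((num_docs + 1) / (df + 1))."""
    return {t: math.log((num_docs + 1) / (c + 1)) for t, c in df.items()}


# Hypothetical corpus: each inner list holds the terms of one document.
docs = [["spark", "ml", "idf"], ["spark", "sql"], ["spark", "ml"]]
num_docs = len(docs)
df = document_frequency(docs)
weights = idf(df, num_docs)
```

With `df` and `num_docs` available, the weights can be recomputed or adapted without a second pass over the corpus, which is the re-implementation the proposal aims to avoid.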

Re: [mllib] Document frequency

2019-01-14 Thread Jatin Puri
one long and is already computed. This would > have to be added to PySpark too. > > On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri wrote: > > > > Hello. > > > > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good > idea to also expose: > >

Re: [mllib] Document frequency

2019-01-14 Thread Jatin Puri
Thanks. Created: https://issues.apache.org/jira/browse/SPARK-26616 On Mon, Jan 14, 2019 at 9:19 PM Sean Owen wrote: > Yes that seems OK to me. > > On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri wrote: > > > > Thanks for the response. So do I go ahead and create a jira ticket