Hi,

I have been running into memory overflow issues while creating TFIDF vectors
for document classification using MLlib's Naive Bayes classification
implementation.

I followed the approach described in this blog post:
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

The memory overflow and GC issues occur while collecting the IDFs for all
the terms. To give an idea of scale, I am reading around 615,000 small
documents (around 4 GB of text data) from HBase and running the Spark
program with 8 cores and 6 GB of executor memory. I have tried increasing
the parallelism level and the shuffle memory fraction, but to no avail.

The new TFIDF generation APIs in the latest Spark release, 1.1.0, caught my
eye. The example in the official documentation creates TFIDF vectors using
the hashing trick. I want to know whether this will solve the problem above
by reducing memory consumption.
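My understanding of why the hashing trick should help (a rough sketch, based
on the 1.1.0 docs for org.apache.spark.mllib.feature.HashingTF; the feature
dimension below is just an illustrative choice, the default is 2^20):

```scala
import org.apache.spark.mllib.feature.HashingTF

// Each term is hashed directly to an index in a fixed-size sparse vector,
// so no global term-to-index dictionary ever has to be built or collected
// on the driver -- which is where my current job blows up.
val hashingTF = new HashingTF(numFeatures = 1 << 18)
val doc = Seq("this", "is", "doc1", "sample")
val tf = hashingTF.transform(doc) // sparse vector of length 2^18
```

Is it correct that this sidesteps the driver-side collection of term IDs
entirely?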

Also, the example does not show how to create labeled points for a corpus
of pre-classified documents. For example, my training input looks something
like this:

DocumentType | Content
-------------|-------------------------------
D1           | This is Doc1 sample.
D1           | This also belongs to Doc1.
D1           | Yet another Doc1 sample.
D2           | Doc2 sample.
D2           | Sample content for Doc2.
D3           | The only sample for Doc3.
D4           | Doc4 sample looks like this.
D4           | This is Doc4 sample content.

I want to create labeled points from this sample data for training. Once
the Naive Bayes model is trained, I will generate TFIDF vectors for the test
documents and predict the document type.

If the new API can solve my issue, how can I generate labeled points using
it? An example would be great.
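Here is my rough guess at the flow, pieced together from the 1.1.0 API docs
(the numeric label mapping and the zip of labels back onto the
IDF-transformed vectors are my assumptions; I am not sure zip is the right
way to keep labels and vectors aligned):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

// training: RDD[(String, Seq[String])] of (documentType, tokenized content),
// e.g. ("D1", Seq("this", "is", "doc1", "sample"))
val labels = training.map { case (docType, _) =>
  docType match {
    case "D1" => 0.0
    case "D2" => 1.0
    case "D3" => 2.0
    case "D4" => 3.0
  }
}

val hashingTF = new HashingTF()
val tf = hashingTF.transform(training.map(_._2))
tf.cache() // IDF makes two passes: fit, then transform

val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

// Assumption: zip preserves the original record order, since tfidf was
// derived from training by map-only transformations.
val data = labels.zip(tfidf).map { case (label, vector) =>
  LabeledPoint(label, vector)
}

val model = NaiveBayes.train(data)
```

Is this the intended usage, or is there a more direct way to carry the
label through the TFIDF pipeline?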

Also, I have a special requirement: terms that occur in fewer than two
documents must be ignored. This has important implications for the accuracy
of my use case, so it needs to be accommodated while generating the TFIDFs.
Is there a way to apply such a minimum document frequency with the new API?

Thanks,
Jatin



-----
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

