Hi,

I cannot speak for other use cases, but MLLib doesn’t support text 
classification out of the box. There was basic support in MLI (thanks 
Sean for correcting me that it is MLI, not MLLib), but I don’t know why it is 
no longer being developed.

For text classification in general, there are two major input formats: folders 
of text files and CSV files. I can use SparkContext.textFile to load them 
into an RDD. However, in the case of CSV, I need to parse the loaded data, which 
is additional overhead. Next, I need to build a dictionary of words and convert 
my documents into vector space using this dictionary. I’m currently trying to 
implement these utilities and will probably share the code.
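
As a rough illustration of the steps above (parse CSV, build a dictionary, map 
documents to vectors), here is a minimal sketch in plain Python rather than 
Spark. The column layout (label first, text second), the whitespace tokenizer, 
and all function names are assumptions made for the example; they are not the 
utilities mentioned in this message.

```python
import csv
import io
from collections import Counter


def parse_csv(text):
    """Parse CSV text into (label, document) pairs.

    Assumption: column 0 is the label and column 1 is the document text.
    """
    return [(row[0], row[1]) for row in csv.reader(io.StringIO(text)) if len(row) >= 2]


def tokenize(doc):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return doc.lower().split()


def build_dictionary(docs):
    """Map each distinct word to an integer index, sorted for determinism."""
    vocab = sorted({word for doc in docs for word in tokenize(doc)})
    return {word: i for i, word in enumerate(vocab)}


def to_sparse_vector(doc, dictionary):
    """Return sorted (index, count) pairs; unknown words are ignored."""
    counts = Counter(dictionary[w] for w in tokenize(doc) if w in dictionary)
    return sorted((i, float(c)) for i, c in counts.items())


raw = 'spam,"buy cheap pills now"\nham,"meeting moved to noon"\nspam,"cheap cheap offer"\n'
rows = parse_csv(raw)
dictionary = build_dictionary(doc for _, doc in rows)
vectors = [(label, to_sparse_vector(doc, dictionary)) for label, doc in rows]
```

In a Spark job the same functions would typically be applied per record with 
map, with the dictionary built once and shared with the workers.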

Best regards, Alexander

From: Debasish Das [mailto:debasish.da...@gmail.com]
Sent: Wednesday, June 25, 2014 8:08 PM
To: user@spark.apache.org
Subject: RE: Prediction using Classification with text attributes in Apache 
Spark MLLib


Libsvm dataset converters are data dependent since your input data can be in 
any serialization format and not necessarily csv...

We have flows that convert hdfs data to libsvm/sparse vector rdd, which is sent 
to mllib....
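
For illustration only, the libsvm end of such a flow can be sketched in plain 
Python. The LIBSVM text format writes one example per line as a label followed 
by index:value pairs with one-based, ascending indices. The function names and 
the zero-based (index, value) input representation are assumptions for this 
sketch, not part of the flows described here.

```python
def to_libsvm_line(label, sparse_vector):
    """Format one example as a LIBSVM line.

    sparse_vector holds zero-based (index, value) pairs in any order;
    the output uses one-based indices in ascending order.
    """
    features = " ".join(f"{i + 1}:{v!r}" for i, v in sorted(sparse_vector))
    return f"{label!r} {features}".rstrip()


def parse_libsvm_line(line):
    """Inverse of to_libsvm_line: returns (label, zero-based pairs)."""
    label, *items = line.split()
    pairs = [item.split(":") for item in items]
    return float(label), [(int(i) - 1, float(v)) for i, v in pairs]


line = to_libsvm_line(1.0, [(6, 1.0), (1, 2.0)])
```

The source-specific part (reading from hdfs, hbase, and so on) stays outside 
these helpers; only the final record-to-line step is shown.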

I am not sure if it will be easy to standardize a libsvm converter on data that 
can be on hdfs, hbase, cassandra or solr....but of course libsvm, netflix 
format, csv are standard for algorithms and mllib supports all 3...
On Jun 25, 2014 6:00 AM, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
Hi lmk,

I am not aware of any classifier in MLLib that accepts nominal data. 
They do accept an RDD of LabeledPoints, each of which is a label plus a vector 
of Doubles. So, you'll need to convert nominal values to doubles.
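
One common way to do that conversion is one-hot encoding, sketched below in 
plain Python. This is an assumed approach for illustration, not an MLLib API: 
the function names and the toy data are hypothetical, and each output pair only 
mimics the label + vector of Double shape of a LabeledPoint.

```python
def build_category_index(values):
    """Assign each distinct category a position, sorted for determinism."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Return a dense vector with 1.0 at the category's position."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec


# Toy rows of (nominal feature, nominal class label).
rows = [("red", "yes"), ("green", "no"), ("blue", "yes")]
color_index = build_category_index(color for color, _ in rows)
label_index = build_category_index(label for _, label in rows)

# Each pair is (label as a double, feature vector of doubles).
labeled_points = [
    (float(label_index[label]), one_hot(color, color_index))
    for color, label in rows
]
```

With many distinct categories the vectors become wide and mostly zero, so a 
sparse representation is usually preferable to the dense lists used here.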

Best regards, Alexander

-----Original Message-----
From: lmk [mailto:lakshmi.muralikrish...@gmail.com]
Sent: Wednesday, June 25, 2014 1:27 PM
To: u...@spark.incubator.apache.org
Subject: RE: Prediction using Classification with text attributes in Apache 
Spark MLLib

Hi Alexander,
Just one more question on a related note. Should I follow the same 
procedure even if my data is nominal (categorical) but has a lot of 
combinations? (In Weka I used to treat it as nominal data.)

Regards,
-lmk



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apache-Spark-MLLib-tp8166p8249.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
