TF-IDF Question

franco barrientos Thu, 04 Jun 2015 09:47:20 -0700

Hi all,

I have a .txt file where each row is the collection of terms of one
document, separated by spaces. For example:


1 "Hola spark"
2 ..

I followed the example on the Spark site,
https://spark.apache.org/docs/latest/mllib-feature-extraction.html, and I get
something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector =
(1048576,[35587,884670],[3.458767233,3.458767233])

I think this:

1. First parameter, "1048576": I don't know what it is, but it is always the
same number (maybe the number of terms).
2. Second parameter, "[35587,884670]": I think these are the terms of the
first line in my .txt file.
3. Third parameter, "[3.458767233,3.458767233]": I think these are the TF-IDF
values for my terms.

Does anyone know the exact interpretation of this output? And, regarding the
second point, if these values are the terms, how can I match them with the
original terms (e.g. "[35587=>Hola,884670=>spark]")?
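
To make the mapping I am after concrete, here is a minimal sketch in plain
Scala (no Spark dependency). It rests on two assumptions that are not
confirmed anywhere above: that 1048576 (2^20) is the size of a hashed feature
space, and that a term's index is its hash code taken modulo that size. The
names TermIndexSketch, indexOf and reverseIndex are hypothetical helpers, so
the indices it prints may not match the ones Spark produces. MLlib's HashingTF
does expose an indexOf(term) method, which could stand in for the helper here.

```scala
// Sketch only: plain Scala, no Spark dependency.
object TermIndexSketch {
  // Assumption: the feature space has 2^20 = 1048576 buckets.
  val numFeatures: Int = 1 << 20

  // Non-negative remainder, so negative hash codes still land in [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Assumption: a term's index is its hash code modulo numFeatures.
  def indexOf(term: String): Int = nonNegativeMod(term.##, numFeatures)

  // Reverse lookup: index -> every distinct term that hashes to it.
  // Several terms can share one index, hence a Seq per index.
  def reverseIndex(docs: Seq[Seq[String]]): Map[Int, Seq[String]] =
    docs.flatten.distinct.groupBy(indexOf)

  def main(args: Array[String]): Unit = {
    // One document per row, terms separated by spaces.
    val docs = Seq("Hola spark".split(" ").toSeq)
    val lookup = reverseIndex(docs)
    lookup.toSeq.sortBy(_._1).foreach { case (index, terms) =>
      println(s"$index => ${terms.mkString(", ")}")
    }
  }
}
```

If the assumptions hold, the lookup built over the whole corpus would turn
each index of a TF-IDF vector back into its candidate terms. Because of
collisions it may return more than one term per index.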

Regards and thanks in advance.

Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
