Which Mahout version are you running?

On Tue, Apr 21, 2015 at 6:37 AM, mw <[email protected]> wrote:

> Hello,
>
> I am trying to get tfidf vectors from a corpus of 100k documents. I
> noticed that the tfidf sequence file is empty, while the tf vectors are not.
>
> Here is the log from SparseVectorsFromSequenceFiles:
>
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Maximum
> n-gram size is: 1
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Minimum
> LLR value: 1.0
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Number
> of reduce tasks: 1
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles:
> Tokenizing documents in /opt/seq
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Creating
> Term Frequency Vectors
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles:
> Calculating IDF
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Pruning
>
> Here is the tfidf output dir:
>
> root@test:[/opt/sparse/tfidf-vectors] # ll
> total 20K
> drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
> drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
> -rw-r--r-- 1 tomcat7 tomcat7   90 Apr 21 12:27 part-r-00000
> -rw-r--r-- 1 tomcat7 tomcat7   12 Apr 21 12:27 .part-r-00000.crc
> -rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:27 _SUCCESS
> -rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:27 ._SUCCESS.crc
>
> Here is the tf output dir:
> root@test:[/opt/sparse/tf-vectors] # ll
> total 3.7M
> drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
> drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
> -rw-r--r-- 1 tomcat7 tomcat7 3.6M Apr 21 12:27 part-r-00000
> -rw-r--r-- 1 tomcat7 tomcat7  29K Apr 21 12:27 .part-r-00000.crc
> -rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:27 _SUCCESS
> -rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:27 ._SUCCESS.crc
>
> Here is the input dir:
> root@test:[/opt/seq] # ll
> total 81M
> drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:25 .
> drwxrwxrwx 9 tomcat7 root    4.0K Apr 21 12:25 ..
> -rw-r--r-- 1 tomcat7 tomcat7  31M Apr 21 12:25 part-m-00000
> -rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00000.crc
> -rw-r--r-- 1 tomcat7 tomcat7  31M Apr 21 12:25 part-m-00001
> -rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00001.crc
> -rw-r--r-- 1 tomcat7 tomcat7  20M Apr 21 12:25 part-m-00002
> -rw-r--r-- 1 tomcat7 tomcat7 155K Apr 21 12:25 .part-m-00002.crc
> -rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:25 _SUCCESS
> -rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:25 ._SUCCESS.crc
>
>
> I am running it using ToolRunner with the following parameters:
> -i /opt/seq -o /opt/sparse/ -nv --maxDFSigma 2.0 --weight tfidf
>
> Any hints as to why it might be failing?
>
> Best,
> Max
>
>
