Which version of Mahout are you running?

On Tue, Apr 21, 2015 at 6:37 AM, mw <[email protected]> wrote:
> Hello,
>
> I am trying to get tfidf vectors from a corpus of 100k documents. I
> noticed that tfidf sequence file is empty, while the tf vectors are not.
>
> Here is the log from SparseVectorsFromSequenceFiles:
>
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in /opt/seq
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
> INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Pruning
>
> Here is the tfidf output dir:
>
> root@test:[/opt/sparse/tfidf-vectors] # ll
> total 20K
> drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
> drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
> -rw-r--r-- 1 tomcat7 tomcat7   90 Apr 21 12:27 part-r-00000
> -rw-r--r-- 1 tomcat7 tomcat7   12 Apr 21 12:27 .part-r-00000.crc
> -rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:27 _SUCCESS
> -rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:27 ._SUCCESS.crc
>
> Here is the tf output dir:
>
> root@test:[/opt/sparse/tf-vectors] # ll
> total 3.7M
> drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
> drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
> -rw-r--r-- 1 tomcat7 tomcat7 3.6M Apr 21 12:27 part-r-00000
> -rw-r--r-- 1 tomcat7 tomcat7  29K Apr 21 12:27 .part-r-00000.crc
> -rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:27 _SUCCESS
> -rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:27 ._SUCCESS.crc
>
> Here is the input dir:
>
> root@test:[/opt/seq] # ll
> total 81M
> drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:25 .
> drwxrwxrwx 9 tomcat7 root    4.0K Apr 21 12:25 ..
> -rw-r--r-- 1 tomcat7 tomcat7  31M Apr 21 12:25 part-m-00000
> -rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00000.crc
> -rw-r--r-- 1 tomcat7 tomcat7  31M Apr 21 12:25 part-m-00001
> -rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00001.crc
> -rw-r--r-- 1 tomcat7 tomcat7  20M Apr 21 12:25 part-m-00002
> -rw-r--r-- 1 tomcat7 tomcat7 155K Apr 21 12:25 .part-m-00002.crc
> -rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:25 _SUCCESS
> -rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:25 ._SUCCESS.crc
>
> I am running it using the toolrunner with the following parameters:
> -i /opt/seq -o /opt/sparse/ -nv --maxDFSigma 2.0 --weight tfidf
>
> Any hints why it might be failing?
>
> Best,
> Max
