For the Mahout version you could run `mahout` and look for lines that include the version-jar name, such as: "MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.11.1-job.jar"
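A quick sketch of pulling the version out of that jar name; the parsing is shown against the sample line above so it runs without Mahout installed (in practice you would pipe `mahout 2>&1` through the same grep/sed):

```shell
# Extract the Mahout version from a MAHOUT-JOB line.
line="MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.11.1-job.jar"
echo "$line" | grep -o 'mahout-examples-[0-9.]*-job' \
  | sed 's/mahout-examples-\(.*\)-job/\1/'   # prints 0.11.1
```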
We don't have a -version flag that I can see, but I just opened
https://issues.apache.org/jira/browse/MAHOUT-1798 which you're free to take a
stab at.

On Wed, Feb 3, 2016 at 9:21 PM, Andrew Musselman <[email protected]> wrote:

> $ for i in `ls input-directory`; do sed -i '/^$/d' input-directory/$i; done
>
> On Wed, Feb 3, 2016 at 9:08 PM, Alok Tanna <[email protected]> wrote:
>
>> This command works, thank you. Yes, I am seeing a lot of empty lines in my
>> input files; any magic command to remove these lines would save a lot of
>> time. I will re-run this once I have removed the empty lines.
>>
>> It would be great if I can get this working in local mode, or else I will
>> have to spend a few days getting it working on a Hadoop/Spark cluster.
>>
>> Thanks,
>> Alok Tanna
>>
>> On Wed, Feb 3, 2016 at 11:38 PM, Andrew Musselman <[email protected]> wrote:
>>
>>> Ah; looks like that config can be set in Hadoop's core-site.xml, but if
>>> you're running Mahout in local mode that shouldn't help.
>>>
>>> Can you try this with local mode off, in other words on a running
>>> Hadoop/Spark cluster?
>>>
>>> Looking for empty lines could be done via a command like `grep -r "^$"
>>> input-file-directory`; blank lines will show up before your next prompt
>>> if there are any.
>>>
>>> On Wed, Feb 3, 2016 at 8:30 PM, Alok Tanna <[email protected]> wrote:
>>>
>>>> Thank you Andrew for the quick response. I have around 300 input files;
>>>> it would take a while for me to go through each file, but I will try to
>>>> look into that. I had successfully generated the sequence file using
>>>> mahout seqdirectory for the same dataset. How can I find which Mahout
>>>> release I am on? Also, how can I increase io.sort.mb = 100 when I have
>>>> Mahout running in local mode?
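For what it's worth, the cleanup loop above can be exercised safely on a scratch directory first. This sketch uses a temp directory, and widens the pattern to `/^[[:space:]]*$/` so whitespace-only lines (which `grep -r "^$"` would miss) are dropped as well:

```shell
# Demonstrate the empty-line cleanup on a throwaway directory.
dir=$(mktemp -d)
printf 'one\n\n   \ntwo\n' > "$dir/sample.txt"
# Same shape as the loop above; '/^[[:space:]]*$/d' also deletes
# whitespace-only lines. (GNU sed; BSD/macOS sed needs -i '' instead.)
for i in "$dir"/*; do sed -i '/^[[:space:]]*$/d' "$i"; done
cat "$dir/sample.txt"   # only "one" and "two" remain
rm -r "$dir"
```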
>>>>
>>>> In the earlier attached file you can see it says: 16/02/03 22:59:04 INFO
>>>> mapred.MapTask: Record too large for in-memory buffer: 99614722 bytes
>>>>
>>>> How can I increase the in-memory buffer for Mahout local mode?
>>>>
>>>> I hope this has nothing to do with this error.
>>>>
>>>> Thanks,
>>>> Alok Tanna
>>>>
>>>> On Wed, Feb 3, 2016 at 10:50 PM, Andrew Musselman <[email protected]> wrote:
>>>>
>>>>> Is it possible you have any empty lines or extra whitespace at the end
>>>>> or in the middle of any of your input files? I don't know for sure,
>>>>> but that's where I'd start looking.
>>>>>
>>>>> Are you on the most recent release?
>>>>>
>>>>> On Wed, Feb 3, 2016 at 7:33 PM, Alok Tanna <[email protected]> wrote:
>>>>>
>>>>> > Mahout in local mode
>>>>> >
>>>>> > I am able to successfully run the below command on a smaller data
>>>>> > set, but when I run it on a large data set I get the error below. It
>>>>> > looks like I need to increase the size of some parameter, but I am
>>>>> > not sure which one. It is failing with java.io.EOFException while
>>>>> > creating the dictionary-0 file.
>>>>> >
>>>>> > Please find the attached file for more details.
>>>>> >
>>>>> > command: mahout seq2sparse -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf
>>>>> >
>>>>> > Main error:
>>>>> >
>>>>> > 16/02/03 23:02:06 INFO mapred.LocalJobRunner: reduce > reduce
>>>>> > 16/02/03 23:02:17 INFO mapred.LocalJobRunner: reduce > reduce
>>>>> > 16/02/03 23:02:18 WARN mapred.LocalJobRunner: job_local1308764206_0003
>>>>> > java.io.EOFException
>>>>> >     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>>>> >     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
>>>>> >     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
>>>>> >     at org.apache.hadoop.io.Text.readFields(Text.java:263)
>>>>> >     at org.apache.mahout.common.StringTuple.readFields(StringTuple.java:142)
>>>>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>>>>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>>> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>>> >     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>>>> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>>>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Job complete: job_local1308764206_0003
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Counters: 20
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Output Format Counters
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Written=14923244
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   FileSystemCounters
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_READ=1412144036729
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=323876626568
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Input Format Counters
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Read=11885543289
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   Map-Reduce Framework
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input groups=223
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output materialized bytes=2214020551
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine output records=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map input records=223
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce output records=222
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Spilled Records=638
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output bytes=2214019100
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     CPU time spent (ms)=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=735978192896
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine input records=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output records=223
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=9100
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input records=222
>>>>> > Exception in thread "main" java.lang.IllegalStateException: Job failed!
>>>>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
>>>>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
>>>>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:274)
>>>>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
>>>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> >     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>> > .
>>>>> > .
>>>>> >
>>>>> > --
>>>>> > Thanks & Regards,
>>>>> >
>>>>> > Alok R. Tanna
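On the io.sort.mb question in the thread above: that buffer is a Hadoop MapReduce setting rather than a Mahout one, and (as an assumption, not something confirmed in this thread) it is usually raised in mapred-site.xml rather than core-site.xml, e.g.:

```xml
<!-- mapred-site.xml: raise the map-side sort buffer above the 100 MB
     default. The property is io.sort.mb on Hadoop 1.x; on Hadoop 2+ it
     was renamed to mapreduce.task.io.sort.mb. -->
<property>
  <name>io.sort.mb</name>
  <value>400</value>
</property>
```

Since Mahout's drivers run through Hadoop's ToolRunner, passing it per job as `mahout seq2sparse -Dio.sort.mb=400 ...` may also work. Note, too, that the "Record too large for in-memory buffer" line is logged at INFO: a record larger than the buffer is typically spilled straight to disk rather than failing the job outright.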
