This command works, thank you. Yes, I am seeing a lot of empty lines in my input files. Is there any magic command to remove these lines? That would save a lot of time. I will re-run this once I have removed the empty lines.
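In case it helps, here is a sketch of one way to find and strip blank or whitespace-only lines in one pass. It assumes GNU sed and plain-text inputs, and `raw-input-directory` is a placeholder for wherever the 300 raw files live; the `-i.bak` flag keeps a backup of each file, which is worth holding onto until the re-run succeeds:

    # list the files that contain blank or whitespace-only lines
    grep -rlE '^[[:space:]]*$' raw-input-directory

    # delete those lines in place, keeping a .bak copy of each file
    find raw-input-directory -type f -exec sed -i.bak '/^[[:space:]]*$/d' {} +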
It would be great if I could get this working in local mode; otherwise I will have to spend a few days getting it working on a Hadoop/Spark cluster.

Thanks,
Alok Tanna

On Wed, Feb 3, 2016 at 11:38 PM, Andrew Musselman <[email protected]> wrote:

> Ah; looks like that config can be set in Hadoop's core-site.xml, but if
> you're running Mahout in local mode that shouldn't help.
>
> Can you try this with local mode off, in other words on a running
> Hadoop/Spark cluster?
>
> Looking for empty lines could be done with a command like `grep -r "^$"
> input-file-directory`; blank lines will show up before your next prompt if
> there are any.
>
> On Wed, Feb 3, 2016 at 8:30 PM, Alok Tanna <[email protected]> wrote:
>
>> Thank you, Andrew, for the quick response. I have around 300 input files,
>> so it would take a while for me to go through each one. I will look into
>> that, but I had successfully generated the sequence files with mahout
>> seqdirectory for the same dataset. How can I find which Mahout release I
>> am on? Also, how can I increase io.sort.mb (currently 100) when Mahout is
>> running in local mode?
>>
>> In the earlier attached file you can see it says:
>>
>> 16/02/03 22:59:04 INFO mapred.MapTask: Record too large for in-memory
>> buffer: 99614722 bytes
>>
>> How can I increase the in-memory buffer for Mahout in local mode? I hope
>> this has nothing to do with the error.
>>
>> Thanks,
>> Alok Tanna
>>
>> On Wed, Feb 3, 2016 at 10:50 PM, Andrew Musselman <
>> [email protected]> wrote:
>>
>>> Is it possible you have empty lines or extra whitespace at the end, or
>>> in the middle, of any of your input files? I don't know for sure, but
>>> that's where I'd start looking.
>>>
>>> Are you on the most recent release?
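A note on the io.sort.mb question quoted above: 99614722 bytes is roughly 95 MB, and with io.sort.mb = 100 the map-side sort buffer only accepts a single record up to its spill threshold (io.sort.spill.percent, 0.80 by default, so about 80 MB), which is why a ~95 MB record is reported as too large for the in-memory buffer. Since seq2sparse runs through Hadoop's ToolRunner (visible in the stack trace below), the property can likely be raised per invocation with a generic -D option, even in local mode. A sketch, assuming Hadoop 1.x property names; in local mode the buffer lives on the client JVM heap, so the heap may need raising too (MAHOUT_HEAPSIZE is in MB):

    export MAHOUT_HEAPSIZE=4096
    mahout seq2sparse -Dio.sort.mb=512 \
      -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ \
      -lnorm -nv -wt tfidf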
>>> On Wed, Feb 3, 2016 at 7:33 PM, Alok Tanna <[email protected]> wrote:
>>>
>>> > Mahout in local mode
>>> >
>>> > I am able to run the command below successfully on a smaller data set,
>>> > but when I run it on a large data set I get the error below. It looks
>>> > like I need to increase the size of some parameter, but I am not sure
>>> > which one. The job fails with a java.io.EOFException while creating
>>> > the dictionary-0 file.
>>> >
>>> > Please find the attached file for more details.
>>> >
>>> > Command:
>>> >
>>> > mahout seq2sparse -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf
>>> >
>>> > Main error:
>>> >
>>> > 16/02/03 23:02:06 INFO mapred.LocalJobRunner: reduce > reduce
>>> > 16/02/03 23:02:17 INFO mapred.LocalJobRunner: reduce > reduce
>>> > 16/02/03 23:02:18 WARN mapred.LocalJobRunner: job_local1308764206_0003
>>> > java.io.EOFException
>>> >         at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>> >         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
>>> >         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
>>> >         at org.apache.hadoop.io.Text.readFields(Text.java:263)
>>> >         at org.apache.mahout.common.StringTuple.readFields(StringTuple.java:142)
>>> >         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>> >         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>> >         at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>>> >         at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>> >         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>> >         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Job complete: job_local1308764206_0003
>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Counters: 20
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Output Format Counters
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Written=14923244
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   FileSystemCounters
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_READ=1412144036729
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=323876626568
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Input Format Counters
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Read=11885543289
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   Map-Reduce Framework
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input groups=223
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output materialized bytes=2214020551
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine output records=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map input records=223
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce output records=222
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Spilled Records=638
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output bytes=2214019100
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     CPU time spent (ms)=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=735978192896
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine input records=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output records=223
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=9100
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input records=222
"main" java.lang.IllegalStateException: Job failed! >>> > at >>> > >>> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329) >>> > at >>> > >>> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199) >>> > at >>> > >>> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:274) >>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) >>> > at >>> > >>> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56) >>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> > at >>> > >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>> > at >>> > >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> > at java.lang.reflect.Method.invoke(Method.java:606) >>> > at >>> > >>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) >>> > at >>> > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) >>> > at >>> > org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195) >>> > . >>> > . >>> > >>> > >>> > >>> > -- >>> > Thanks & Regards, >>> > >>> > Alok R. Tanna >>> > >>> > >>> >> >> >> >> -- >> Thanks & Regards, >> >> Alok R. Tanna >> >> > > -- Thanks & Regards, Alok R. Tanna
