This command works, thank you. Yes, I am seeing a lot of empty lines in my input files. Is there any magic command to remove these lines? That would save a lot of time. I will re-run this once I have removed the empty lines.
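In case it helps, here is a sketch of one way to find and strip blank or whitespace-only lines in one pass. It assumes GNU sed and plain-text inputs, and `raw-input-directory` is a placeholder for wherever the 300 raw files live; the `-i.bak` flag keeps a backup of each file, which is worth holding onto until the re-run succeeds:

    # list the files that contain blank or whitespace-only lines
    grep -rlE '^[[:space:]]*$' raw-input-directory

    # delete those lines in place, keeping a .bak copy of each file
    find raw-input-directory -type f -exec sed -i.bak '/^[[:space:]]*$/d' {} +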
It would be great if I could get this working in local mode; otherwise I will have to spend a few days getting it working on a Hadoop/Spark cluster.

Thanks,
Alok Tanna

On Wed, Feb 3, 2016 at 11:38 PM, Andrew Musselman <[email protected]> wrote:

> Ah; looks like that config can be set in Hadoop's core-site.xml, but if
> you're running Mahout in local mode that shouldn't help.
>
> Can you try this with local mode off, in other words on a running
> Hadoop/Spark cluster?
>
> Looking for empty lines could be done with a command like `grep -r "^$"
> input-file-directory`; blank lines will show up before your next prompt if
> there are any.
>
> On Wed, Feb 3, 2016 at 8:30 PM, Alok Tanna <[email protected]> wrote:
>
>> Thank you, Andrew, for the quick response. I have around 300 input files,
>> so it would take a while for me to go through each one. I will look into
>> that, but I had successfully generated the sequence files with mahout
>> seqdirectory for the same dataset. How can I find which Mahout release I
>> am on? Also, how can I increase io.sort.mb (currently 100) when Mahout is
>> running in local mode?
>>
>> In the earlier attached file you can see it says:
>>
>> 16/02/03 22:59:04 INFO mapred.MapTask: Record too large for in-memory
>> buffer: 99614722 bytes
>>
>> How can I increase the in-memory buffer for Mahout in local mode? I hope
>> this has nothing to do with the error.
>>
>> Thanks,
>> Alok Tanna
>>
>> On Wed, Feb 3, 2016 at 10:50 PM, Andrew Musselman <
>> [email protected]> wrote:
>>
>>> Is it possible you have empty lines or extra whitespace at the end, or
>>> in the middle, of any of your input files? I don't know for sure, but
>>> that's where I'd start looking.
>>>
>>> Are you on the most recent release?
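A note on the io.sort.mb question quoted above: 99614722 bytes is roughly 95 MB, and with io.sort.mb = 100 the map-side sort buffer only accepts a single record up to its spill threshold (io.sort.spill.percent, 0.80 by default, so about 80 MB), which is why a ~95 MB record is reported as too large for the in-memory buffer. Since seq2sparse runs through Hadoop's ToolRunner (visible in the stack trace below), the property can likely be raised per invocation with a generic -D option, even in local mode. A sketch, assuming Hadoop 1.x property names; in local mode the buffer lives on the client JVM heap, so the heap may need raising too (MAHOUT_HEAPSIZE is in MB):

    export MAHOUT_HEAPSIZE=4096
    mahout seq2sparse -Dio.sort.mb=512 \
      -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ \
      -lnorm -nv -wt tfidf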
>>> On Wed, Feb 3, 2016 at 7:33 PM, Alok Tanna <[email protected]> wrote:
>>>
>>> > Mahout in local mode
>>> >
>>> > I am able to run the command below successfully on a smaller data set,
>>> > but when I run it on a large data set I get the error below. It looks
>>> > like I need to increase the size of some parameter, but I am not sure
>>> > which one. The job fails with a java.io.EOFException while creating
>>> > the dictionary-0 file.
>>> >
>>> > Please find the attached file for more details.
>>> >
>>> > Command:
>>> >
>>> > mahout seq2sparse -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf
>>> >
>>> > Main error:
>>> >
>>> > 16/02/03 23:02:06 INFO mapred.LocalJobRunner: reduce > reduce
>>> > 16/02/03 23:02:17 INFO mapred.LocalJobRunner: reduce > reduce
>>> > 16/02/03 23:02:18 WARN mapred.LocalJobRunner: job_local1308764206_0003
>>> > java.io.EOFException
>>> >         at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>> >         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
>>> >         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
>>> >         at org.apache.hadoop.io.Text.readFields(Text.java:263)
>>> >         at org.apache.mahout.common.StringTuple.readFields(StringTuple.java:142)
>>> >         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>> >         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>> >         at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>>> >         at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>> >         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>> >         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Job complete: job_local1308764206_0003
>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Counters: 20
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Output Format Counters
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Written=14923244
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   FileSystemCounters
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_READ=1412144036729
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=323876626568
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Input Format Counters
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Read=11885543289
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   Map-Reduce Framework
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input groups=223
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output materialized bytes=2214020551
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine output records=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map input records=223
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce output records=222
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Spilled Records=638
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output bytes=2214019100
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     CPU time spent (ms)=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=735978192896
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine input records=0
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output records=223
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=9100
>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input records=222
"main" java.lang.IllegalStateException: Job failed! >>> > at >>> > >>> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329) >>> > at >>> > >>> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199) >>> > at >>> > >>> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:274) >>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) >>> > at >>> > >>> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56) >>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> > at >>> > >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>> > at >>> > >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> > at java.lang.reflect.Method.invoke(Method.java:606) >>> > at >>> > >>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) >>> > at >>> > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) >>> > at >>> > org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195) >>> > . >>> > . >>> > >>> > >>> > >>> > -- >>> > Thanks & Regards, >>> > >>> > Alok R. Tanna >>> > >>> > >>> >> >> >> >> -- >> Thanks & Regards, >> >> Alok R. Tanna >> >> > > -- Thanks & Regards, Alok R. Tanna
