For the Mahout version you could run `mahout` and look for lines that include the version-jar name, such as: "MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.11.1-job.jar"
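A quick sketch of pulling the version out of that jar name; the parsing is shown against the sample line above so it runs without Mahout installed (in practice you would pipe `mahout 2>&1` through the same grep/sed):

```shell
# Extract the Mahout version from a MAHOUT-JOB line.
line="MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.11.1-job.jar"
echo "$line" | grep -o 'mahout-examples-[0-9.]*-job' \
  | sed 's/mahout-examples-\(.*\)-job/\1/'   # prints 0.11.1
```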
We don't have a -version flag that I can see, but I just opened
https://issues.apache.org/jira/browse/MAHOUT-1798 which you're free to take a
stab at.

On Wed, Feb 3, 2016 at 9:21 PM, Andrew Musselman <[email protected]> wrote:

> $ for i in `ls input-directory`; do sed -i '/^$/d' input-directory/$i; done
>
> On Wed, Feb 3, 2016 at 9:08 PM, Alok Tanna <[email protected]> wrote:
>
>> This command works, thank you. Yes, I am seeing a lot of empty lines in my
>> input files; any magic command to remove these lines would save a lot of
>> time. I will re-run this once I have removed the empty lines.
>>
>> It would be great if I can get this working in local mode, or else I will
>> have to spend a few days getting it working on a Hadoop/Spark cluster.
>>
>> Thanks,
>> Alok Tanna
>>
>> On Wed, Feb 3, 2016 at 11:38 PM, Andrew Musselman <[email protected]> wrote:
>>
>>> Ah; looks like that config can be set in Hadoop's core-site.xml, but if
>>> you're running Mahout in local mode that shouldn't help.
>>>
>>> Can you try this with local mode off, in other words on a running
>>> Hadoop/Spark cluster?
>>>
>>> Looking for empty lines could be done via a command like `grep -r "^$"
>>> input-file-directory`; blank lines will show up before your next prompt
>>> if there are any.
>>>
>>> On Wed, Feb 3, 2016 at 8:30 PM, Alok Tanna <[email protected]> wrote:
>>>
>>>> Thank you Andrew for the quick response. I have around 300 input files;
>>>> it would take a while for me to go through each file, but I will try to
>>>> look into that. I had successfully generated the sequence file using
>>>> mahout seqdirectory for the same dataset. How can I find which Mahout
>>>> release I am on? Also, how can I increase io.sort.mb = 100 when I have
>>>> Mahout running in local mode?
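For what it's worth, the cleanup loop above can be exercised safely on a scratch directory first. This sketch uses a temp directory, and widens the pattern to `/^[[:space:]]*$/` so whitespace-only lines (which `grep -r "^$"` would miss) are dropped as well:

```shell
# Demonstrate the empty-line cleanup on a throwaway directory.
dir=$(mktemp -d)
printf 'one\n\n   \ntwo\n' > "$dir/sample.txt"
# Same shape as the loop above; '/^[[:space:]]*$/d' also deletes
# whitespace-only lines. (GNU sed; BSD/macOS sed needs -i '' instead.)
for i in "$dir"/*; do sed -i '/^[[:space:]]*$/d' "$i"; done
cat "$dir/sample.txt"   # only "one" and "two" remain
rm -r "$dir"
```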
>>>>
>>>> In the earlier attached file you can see it says: 16/02/03 22:59:04 INFO
>>>> mapred.MapTask: Record too large for in-memory buffer: 99614722 bytes
>>>>
>>>> How can I increase the in-memory buffer for Mahout local mode?
>>>>
>>>> I hope this has nothing to do with this error.
>>>>
>>>> Thanks,
>>>> Alok Tanna
>>>>
>>>> On Wed, Feb 3, 2016 at 10:50 PM, Andrew Musselman <[email protected]> wrote:
>>>>
>>>>> Is it possible you have any empty lines or extra whitespace at the end
>>>>> or in the middle of any of your input files? I don't know for sure,
>>>>> but that's where I'd start looking.
>>>>>
>>>>> Are you on the most recent release?
>>>>>
>>>>> On Wed, Feb 3, 2016 at 7:33 PM, Alok Tanna <[email protected]> wrote:
>>>>>
>>>>> > Mahout in local mode
>>>>> >
>>>>> > I am able to successfully run the below command on a smaller data
>>>>> > set, but when I run it on a large data set I get the error below. It
>>>>> > looks like I need to increase the size of some parameter, but I am
>>>>> > not sure which one. It is failing with java.io.EOFException while
>>>>> > creating the dictionary-0 file.
>>>>> >
>>>>> > Please find the attached file for more details.
>>>>> >
>>>>> > command: mahout seq2sparse -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf
>>>>> >
>>>>> > Main error:
>>>>> >
>>>>> > 16/02/03 23:02:06 INFO mapred.LocalJobRunner: reduce > reduce
>>>>> > 16/02/03 23:02:17 INFO mapred.LocalJobRunner: reduce > reduce
>>>>> > 16/02/03 23:02:18 WARN mapred.LocalJobRunner: job_local1308764206_0003
>>>>> > java.io.EOFException
>>>>> >     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>>>> >     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
>>>>> >     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
>>>>> >     at org.apache.hadoop.io.Text.readFields(Text.java:263)
>>>>> >     at org.apache.mahout.common.StringTuple.readFields(StringTuple.java:142)
>>>>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>>>>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>>> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>>> >     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>>>> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>>>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Job complete: job_local1308764206_0003
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Counters: 20
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Output Format Counters
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Written=14923244
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   FileSystemCounters
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_READ=1412144036729
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=323876626568
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Input Format Counters
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Read=11885543289
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   Map-Reduce Framework
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input groups=223
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output materialized bytes=2214020551
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine output records=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map input records=223
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce output records=222
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Spilled Records=638
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output bytes=2214019100
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     CPU time spent (ms)=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=735978192896
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine input records=0
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output records=223
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=9100
>>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input records=222
>>>>> > Exception in thread "main" java.lang.IllegalStateException: Job failed!
>>>>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
>>>>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
>>>>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:274)
>>>>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
>>>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> >     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>> > .
>>>>> > .
>>>>> >
>>>>> > --
>>>>> > Thanks & Regards,
>>>>> >
>>>>> > Alok R. Tanna
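On the io.sort.mb question in the thread above: that buffer is a Hadoop MapReduce setting rather than a Mahout one, and (as an assumption, not something confirmed in this thread) it is usually raised in mapred-site.xml rather than core-site.xml, e.g.:

```xml
<!-- mapred-site.xml: raise the map-side sort buffer above the 100 MB
     default. The property is io.sort.mb on Hadoop 1.x; on Hadoop 2+ it
     was renamed to mapreduce.task.io.sort.mb. -->
<property>
  <name>io.sort.mb</name>
  <value>400</value>
</property>
```

Since Mahout's drivers run through Hadoop's ToolRunner, passing it per job as `mahout seq2sparse -Dio.sort.mb=400 ...` may also work. Note, too, that the "Record too large for in-memory buffer" line is logged at INFO: a record larger than the buffer is typically spilled straight to disk rather than failing the job outright.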
