I've also encountered a similar error once. It's really just the
FileSystem.get call that needs to be modified. I think it's a good idea
to walk through the codebase and refactor this where necessary.
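For context, the refactor described here is essentially what the error message asks for. A hedged before/after sketch (the Hadoop calls FileSystem.get(conf), FileSystem.get(uri, conf), and Path.toUri() are real API; the surrounding class and method names are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsRefactorSketch {
    // Before: always resolves to the default file system (fs.default.name),
    // e.g. hdfs://..., which later rejects s3:// paths in checkPath().
    static FileSystem before(Configuration conf) throws IOException {
        return FileSystem.get(conf);
    }

    // After: resolves the file system from the path's own URI scheme,
    // so an s3://bucket/key path yields an S3-backed FileSystem.
    static FileSystem after(Path path, Configuration conf) throws IOException {
        return FileSystem.get(path.toUri(), conf);
    }
}
```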
--sebastian
On 03/16/2014 05:16 PM, Andrew Musselman wrote:
Another wild guess: I've had issues trying to use the 's3' protocol from Hadoop
and got things working by using the 's3n' protocol instead.
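For what it's worth, the 's3n' scheme also needs its own credential properties in core-site.xml. A sketch (the property names are Hadoop's s3n settings; the values are placeholders):

```xml
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```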
On Mar 16, 2014, at 8:41 AM, Jay Vyas <[email protected]> wrote:
I specifically have fixed mapreduce jobs by doing what the error message
suggests.
But maybe (hopefully) there is another workaround that is configuration driven.
Just a hunch, but maybe Mahout needs to be refactored to create fs objects
using the get(uri, conf) calls?
As Hadoop evolves to support different flavors of HCFS, using API calls
that are more flexible (i.e. like the FileSystem.get(uri, conf) one) will
probably be a good thing to keep in mind.
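The mismatch behind the error can be illustrated with plain java.net.URI. This only mimics the scheme comparison that Hadoop's checkPath performs; SchemeCheck and sameScheme are made-up names for illustration:

```java
import java.net.URI;

public class SchemeCheck {
    // Mimics the scheme comparison behind DistributedFileSystem.checkPath:
    // a file system bound to hdfs://... will not accept an s3:// path.
    static boolean sameScheme(String fsUri, String pathUri) {
        return URI.create(fsUri).getScheme().equals(URI.create(pathUri).getScheme());
    }

    public static void main(String[] args) {
        // The situation from the stack trace: default fs is HDFS, path is S3.
        System.out.println(sameScheme("hdfs://172.31.41.65:9000", "s3://bucket/input"));
        // What FileSystem.get(uri, conf) effectively arranges: matching schemes.
        System.out.println(sameScheme("s3://bucket", "s3://bucket/input"));
    }
}
```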
On Mar 16, 2014, at 9:22 AM, Frank Scholten <[email protected]> wrote:
Hi Konstantin,
Good to hear from you.
The link you mentioned points to EigenSeedGenerator not
RandomSeedGenerator. The problem seems to be with the call to
fs.getFileStatus(input).isDir()
It's been a while and I don't remember but perhaps you have to set
additional Hadoop fs properties to use S3. See
https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause
of this by creating a small Java main app with that line of code and
running it in the debugger.
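A minimal standalone app along these lines might look like this (the Hadoop calls are real API; the class name and paths are placeholders, and it needs Hadoop plus the S3 credentials configured to actually run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3PathCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("s3://your-bucket/your-folder");

        // The call the error message recommends: bind the FileSystem
        // to the path's URI instead of the cluster default.
        FileSystem fs = FileSystem.get(input.toUri(), conf);

        // The line that fails inside RandomSeedGenerator when fs is HDFS.
        System.out.println(fs.getFileStatus(input).isDir());
    }
}
```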
Cheers,
Frank
On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
<[email protected]>wrote:
Hello!
I am running text-document clustering on a Hadoop cluster in Amazon Elastic
MapReduce. As input and output I use the Amazon S3 file system. I specify all
paths as "s3://bucket-name/folder-name".
SparseVectorsFromSequenceFile works correctly with S3,
but when I start the K-Means clustering job, I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: This
file system object (hdfs://172.31.41.65:9000) does not support access
to the request path
's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
You possibly called FileSystem.get(conf) when you should have called
FileSystem.get(uri, conf) to obtain a file system supporting your
path.
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
        at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
        at bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
        at bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
I checked RandomSeedGenerator.buildRandom
(http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f)
and I assume its code is correct:
FileSystem fs = FileSystem.get(output.toUri(), conf);
I cannot run the clustering because of this error. Maybe you have some
ideas on how to fix this?