The input must be tuples (if you're not using a filter), so the CLI expects
user and item IDs in the form

user-id1,item-id1
user-id500,item-id3000
…

The IDs must already be tokenized, because the CLI doesn't use a full CSV
parser; it only reads lines of delimited text.

If this doesn't help, can you supply a snippet of the input?
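As a rough sketch (plain Python, untested against your data; file names and the tab delimiter are assumptions based on the format you describe below), you could flatten the one-row-per-user input into the per-pair form like this:

```python
# Flatten "UserId<tab>ItemId1<tab>ItemId2..." rows into "userId,itemId"
# lines, one pair per line, as the CLI expects.
def flatten(lines):
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # assumed tab-delimited
        user, items = fields[0], fields[1:]
        for item in items:
            pairs.append(f"{user},{item}")
    return pairs

if __name__ == "__main__":
    rows = ["u1\ti1\ti2", "u2\ti3"]
    print(flatten(rows))  # ['u1,i1', 'u1,i2', 'u2,i3']
```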


On Apr 2, 2015, at 10:39 AM, Michael Kelly <[email protected]> wrote:

Hi all,

I'm running the spark-itemsimilarity job from the cli on an AWS emr
cluster, and I'm running into an exception.

The input file format is
UserId<tab>ItemId1<tab>ItemId2<tab>ItemId3......

There is only one row per user, and a total of 97,000 rows.

I also tried input with one row per UserId/ItemId pair, which had
about 250,000 rows, but I saw a similar exception; that time the
out-of-bounds index was around 110,000.

The input is stored in hdfs and this is the command I used to start the job -

mahout spark-itemsimilarity --input userItems --output output --master yarn-client

Any idea what the problem might be?

Thanks,

Michael



15/04/02 16:37:40 WARN TaskSetManager: Lost task 1.0 in stage 10.0 (TID 7631, ip-XX.XX.ec2.internal): org.apache.mahout.math.IndexException: Index 22050 is outside allowable range of [0,21997)
        org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
        org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
        org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
        org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
        scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
        scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
        scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
        scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
        scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
        scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
        scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
        scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
        scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
        org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
        org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
