You will get a 10x speedup in your MLlib KMeans run if you stop using the Mahout vector and use the Breeze-backed sparse vector from MLlib instead.
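Here is a minimal sketch of the conversion, not a tuned implementation. It assumes Mahout 0.8 or later (for `nonZeroes()`), Spark 1.0 MLlib, and that spark-shell was started with mahout-math on the classpath. The helper name `toMLlib` and the values k = 10 and maxIterations = 20 are placeholders I made up; the path is the one from the original question.

```scala
import org.apache.hadoop.io.Text
import org.apache.mahout.math.{Vector => MahoutVector, VectorWritable}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.collection.JavaConverters._

// Hypothetical helper: copy the non-zero entries of a Mahout vector
// into an MLlib sparse vector. Vectors.sparse sorts the (index, value)
// pairs itself, so iteration order does not matter.
def toMLlib(v: MahoutVector): Vector = {
  val nonZeros = v.nonZeroes().asScala.map(e => (e.index(), e.get())).toSeq
  Vectors.sparse(v.size(), nonZeros)
}

// Run inside spark-shell, where `sc` already exists.
val raw = sc.sequenceFile("/KMeans_dataset_seq/part-r-00000",
  classOf[Text], classOf[VectorWritable])

// Convert before caching: Hadoop reuses Writable instances between
// records, so the Mahout objects should not be cached directly.
val data = raw.values.map(vw => toMLlib(vw.get)).cache()

// k = 10 and maxIterations = 20 are placeholder values.
val model = KMeans.train(data, 10, 20)
```

Doing the conversion inside the first `map` also keeps the non-serializable `Text` keys and `VectorWritable` values from being cached or shuffled.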
@Xiangrui showed the comparison chart some time back...

On May 14, 2014 6:33 AM, "Xiangrui Meng" <[email protected]> wrote:
> You need
>
>     val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
>
> to load the data. After that, you can do
>
>     val data = raw.values.map(_.get)
>
> to get an RDD of Mahout's Vector. You can use `--jars mahout-math.jar`
> when you launch spark-shell to include mahout-math.
>
> Best,
> Xiangrui
>
> On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi <[email protected]> wrote:
> > Hi All,
> >
> > I am very new to Spark and trying to play around with MLlib, hence
> > apologies for the basic question.
> >
> > I am trying to run the KMeans algorithm using Mahout and Spark MLlib to
> > compare the performance. The initial data size was 10 GB. Mahout converts
> > the data into a Sequence File <Text,VectorWritable>, which is used for
> > KMeans clustering. The Sequence File created was ~6 GB in size.
> >
> > Now I wanted to know whether I can use the Mahout Sequence File in Spark
> > MLlib for KMeans. I have read that SparkContext.sequenceFile may be used
> > here. Hence I tried to read my Sequence File as below, but I am getting
> > the error:
> >
> > Command on Spark Shell:
> >
> > scala> val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-00000",String,VectorWritable)
> >
> > <console>:12: error: not found: type VectorWritable
> >
> >        val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-00000",String,VectorWritable)
> >
> > Here I have 2 questions:
> >
> > 1. Mahout has “Text” as the key, but Spark is printing “not found:
> >    type:Text”, hence I changed it to String. Is this correct?
> >
> > 2. How will VectorWritable be found in Spark? Do I need to include the
> >    Mahout jar in the classpath, or is there any other option?
> >
> > Please suggest.
> >
> > Regards,
> > Stuti Awasthi
