SparkKMeans is just example code showing a barebones implementation of k-means. To run k-means on big datasets, please use the KMeans implementation in MLlib directly: http://spark.apache.org/docs/latest/mllib-clustering.html
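For big datasets, the heap size and GC logging can also be set per job through spark-submit itself rather than a machine-wide _JAVA_OPTIONS export. A rough sketch using Spark 1.0's spark-submit flags (the jar name and the k-means arguments below are placeholders, not taken from this thread):

```shell
# Sketch: per-job driver heap and GC logging via spark-submit (Spark 1.0)
# instead of a global _JAVA_OPTIONS export. "myjar.jar" and the program
# arguments <file> <k> <convergeDist> are placeholders.
./bin/spark-submit --class SparkKMeans \
  --driver-memory 15g \
  --driver-java-options "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myjar.jar input.txt 8 0.01
```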
-Xiangrui

On Wed, Jul 2, 2014 at 9:50 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> I can run it now with the suggested method. However, I have encountered a
> new problem that I have not faced before (sent another email with that one
> but here it goes again ...)
>
> I ran SparkKMeans with a big file (~7 GB of data) for one iteration with
> spark-0.8.0 with this line in bash.rc:
> export _JAVA_OPTIONS="-Xmx15g -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails"
> It finished in a decent time, ~50 seconds, and I had only a few "Full GC ..."
> messages from Java (a max of 4-5).
>
> Now, using the same export in bash.rc but with spark-1.0.0 (and running it
> with spark-submit), the first loop never finishes and I get a lot of:
> "18.537: [GC (Allocation Failure) --[PSYoungGen:
> 11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311
> secs] [Times: user=5.81 sys=2.12, real=2.85 secs]"
> or
> "31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)]
> [ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace:
> 37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11,
> real=2.31 secs]"
>
> I tried passing different parameters for the JVM through spark-submit, but
> the results are the same.
> This happens with java 1.7 and also with java 1.8.
> I do not know what the "Ergonomics" stands for ...
>
> How can I get a decent performance from spark-1.0.0, considering that
> spark-0.8.0 did not need any fine tuning of the garbage collection method
> (the default worked well)?
>
> Thank you
>
> On Wednesday, July 2, 2014 4:45 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
> The scripts that Xiangrui mentions set up the classpath... Can you run
> ./run-example for the provided example successfully?
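As a sketch, that check looks roughly like this (assuming a spark-1.0.0 directory layout, with SparkPi standing in for any bundled example):

```shell
# Sketch: run a bundled example and have the 1.0 launch scripts print the
# exact java command line they use, so it can be copied and adapted.
cd ~/spark-1.0.0
SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/run-example SparkPi
```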
> What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call
> run-example -- that will show you the exact java command used to run
> the example at the start of execution. Assuming you can run examples
> successfully, you should be able to just copy that and add your jar to
> the front of the classpath. If that works, you can start removing extra
> jars (run-example puts all the example jars in the cp, which you won't
> need).
>
> As you said, the error you see is indicative of the class not being
> available/seen at runtime, but it's hard to tell why.
>
> On Wed, Jul 2, 2014 at 2:13 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
>> I want to make some minor modifications in SparkMeans.scala, so running
>> the basic example won't do.
>> I have also packed my code into a "jar" file with sbt. It completes
>> successfully, but when I try to run it with "java -jar myjar.jar" I get
>> the same error:
>> "Exception in thread "main" java.lang.NoClassDefFoundError:
>> breeze/linalg/Vector
>>     at java.lang.Class.getDeclaredMethods0(Native Method)
>>     at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
>>     at java.lang.Class.getMethod0(Class.java:2774)
>>     at java.lang.Class.getMethod(Class.java:1663)
>>     at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>>     at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>> "
>>
>> If "scalac -d classes/ SparkKMeans.scala" can't see my classpath, why does
>> it succeed in compiling and not give the same error?
>> The error itself, "NoClassDefFoundError", means that the files are available
>> at compile time but, for some reason I cannot figure out, not available at
>> run time. Does anyone know why?
>>
>> Thank you
>>
>> On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> You can use either bin/run-example or bin/spark-submit to run example
>> code. "scalac -d classes/ SparkKMeans.scala" doesn't recognize the Spark
>> classpath.
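A sketch of compiling and running with the Spark classpath passed explicitly instead of relying on the shell's CLASSPATH variable (the assembly jar path matches the build described in this thread; the program arguments are placeholders):

```shell
# Sketch: hand the assembly jar to scalac and java explicitly.
# Program arguments <file> <k> <convergeDist> are placeholders.
ASSEMBLY=/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
scalac -cp "$ASSEMBLY" -d classes/ SparkKMeans.scala
java -cp "classes:$ASSEMBLY" SparkKMeans input.txt 8 0.01
```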
>> There are examples in the official doc:
>> http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
>> -Xiangrui
>>
>> On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
>>> Hello,
>>>
>>> I have installed spark-1.0.0 with scala-2.10.3. I have built spark with
>>> "sbt/sbt assembly" and added
>>> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
>>> to my CLASSPATH variable.
>>> Then I went to
>>> "../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples",
>>> created a new directory "classes", and compiled SparkKMeans.scala with
>>> "scalac -d classes/ SparkKMeans.scala".
>>> Then I navigated to "classes" (I commented out this line in the scala file:
>>> package org.apache.spark.examples) and tried to run it with "java -cp .
>>> SparkKMeans", and I get the following error:
>>> "Exception in thread "main" java.lang.NoClassDefFoundError:
>>> breeze/linalg/Vector
>>>     at java.lang.Class.getDeclaredMethods0(Native Method)
>>>     at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
>>>     at java.lang.Class.getMethod0(Class.java:2774)
>>>     at java.lang.Class.getMethod(Class.java:1663)
>>>     at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>>>     at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>>> Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>     ...
>>> 6 more
>>> "
>>> The jar under
>>> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
>>> contains the breeze/linalg/Vector* path. I even tried to unpack it and
>>> put it in CLASSPATH, but it does not seem to be picked up.
>>>
>>> I am currently running java 1.8:
>>> "java version "1.8.0_05"
>>> Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)"
>>>
>>> What am I doing wrong?
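For what it's worth, the compile-time/run-time asymmetry described above has a mundane cause: `java -jar` builds its classpath only from the jar manifest's Class-Path entry and silently ignores both `-cp` and the CLASSPATH variable, while scalac does honor them. A sketch of an invocation that avoids the trap (assembly jar path from this thread; program arguments are placeholders):

```shell
# 'java -jar myjar.jar' ignores -cp and CLASSPATH entirely, so name the main
# class and put both jars on the classpath instead. Program arguments
# <file> <k> <convergeDist> are placeholders.
ASSEMBLY=/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
java -cp "myjar.jar:$ASSEMBLY" SparkKMeans input.txt 8 0.01
```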