Re: Changing number of workers for benchmarking purposes

2014-03-13 Thread Pierre Borckmans
Thanks Patrick. I could try that. But the idea was to be able to write a fully automated benchmark, varying the dataset size, the number of workers, the memory, … without having to stop/start the cluster each time. I was thinking something like SparkConf.set(“spark.max_number_workers”, n) wou
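
A hedged sketch of such a loop using only settings that exist today in standalone mode: `spark.cores.max` caps how many cores each run may use (a stand-in for "number of workers", not a literal worker count; no `spark.max_number_workers` property exists). The master URL, sizes, and memory values below are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object BenchmarkRunner {
      def main(args: Array[String]) {
        val datasetSizes = Seq(1000000, 10000000)  // placeholder sizes
        val coreCaps = Seq(4, 8, 16)               // proxy for "number of workers"

        for (size <- datasetSizes; cores <- coreCaps) {
          val conf = new SparkConf()
            .setMaster("spark://master:7077")        // placeholder master URL
            .setAppName("bench-" + size + "-" + cores)
            .set("spark.cores.max", cores.toString)  // cap the cores used by this run
            .set("spark.executor.memory", "4g")
          val sc = new SparkContext(conf)
          try {
            val start = System.nanoTime()
            val n = sc.parallelize(1 to size).map(_ * 2).count()
            println("size=" + size + " cores=" + cores +
              " count=" + n + " secs=" + (System.nanoTime() - start) / 1e9)
          } finally {
            sc.stop()  // release the cores before the next configuration runs
          }
        }
      }
    }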

How to monitor the communication process?

2014-03-13 Thread moxiecui
Hello everyone: Say I have an application running on GraphX, and I am trying to monitor the communication cost between machines and processes. I checked the web UI, only to find a very fuzzy shuffle read/write size. I want to know when the communication between machines or processes occurred. Is t

Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
Hi, I'm confused about the -Dspark.local.dir and SPARK_WORKER_DIR (--work-dir). What's the difference? I have set -Dspark.local.dir for all my worker nodes but I'm still seeing directories being created in /tmp when the job is running. I have also tried setting -Dspark.local.dir when I run the

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
I'm not 100% sure but I think it goes like this: spark.local.dir can and should be set both on the executors and on the driver (if the driver broadcasts variables, the files will be stored in this directory). The SPARK_WORKER_DIR is where the j
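
For what it's worth, a minimal sketch of setting the property programmatically on the driver (the worker-side directory is normally configured separately through `SPARK_WORKER_DIR` in spark-env.sh); the path is only an example and must exist on the machines that use it:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.local.dir is where shuffle output and broadcast files are written.
    val conf = new SparkConf()
      .setMaster("local[2]")                       // placeholder master
      .setAppName("local-dir-example")
      .set("spark.local.dir", "/data/spark-tmp")   // example path, not a default
    val sc = new SparkContext(conf)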

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
Also, I think the Jetty connector will create a small file or directory in /tmp regardless of spark.local.dir. It's very small, about 10KB. Guillaume I'm not 100% sure but I think it goes like this:

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
>> spark.local.dir can and should be set both on the executors and on the >> driver (if the driver broadcast variables, the files will be stored in this >> directory) Do you mean the worker nodes? Don’t think they are jetty connectors and the directories are empty: /tmp/spark-3e330cdc-7540-4313-

Spark Java example using external Jars

2014-03-13 Thread dmpour23
Hi, Can anyone point out any examples on the web other than the Java examples offered by the Spark documentation? To be more specific, an example using external jars and property files in the classpath. Thanks in advance Dimitri

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
spark.local.dir can and should be set both on the executors and on the driver (if the driver broadcasts variables, the files will be stored in this directory)

Large shuffle RDD

2014-03-13 Thread Domen Grabec
Hi, I am having problems with large inputs that cause an RDD to have a wide dependency, thus creating a shuffle RDD. Somehow shuffled partitions get lost and need to be refetched. In the web UI I see 3x the amount of successfully completed tasks ( picture

How to solve : java.io.NotSerializableException: org.apache.hadoop.io.Text ?

2014-03-13 Thread Jaonary Rabarisoa
Dear all, I have a SequenceFile[Text,BytesWritable] that I load with: val data = context.sequenceFile("data", classOf[Text], classOf[BytesWritable]) I want to view my data with data.collect().foreach { d => println(d) } but I get this java.io.NotSerializableException error.

Re: java.lang.ClassNotFoundException in spark 0.9.0, shark 0.9.0 (pre-release) and hadoop 2.2.0

2014-03-13 Thread pradeeps8
Hi All, We have found the actual problem. The problem was with the getList method in the Row class. Earlier, the Row class used to return java.util.List from getList, but the new source code (shark 0.9.0) returns a string. This is the commit log: https://github.com/amplab/shark/commit/

Re: How to solve : java.io.NotSerializableException: org.apache.hadoop.io.Text ?

2014-03-13 Thread Shixiong Zhu
Hi, Text and BytesWritable do not implement Serializable. You can convert them to String and Array[Byte]. E.g., data.map { case (text, bytes) => (text.toString, bytes.copyBytes) }.collect().foreach { d => println(d) } Best Regards, Shixiong Zhu 2014-03-13 20:36 GMT+08:00 Jaonary
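
Putting the question and the answer together, a self-contained sketch (this assumes Hadoop's `BytesWritable.copyBytes()`, present in Hadoop 0.23+/2.x; on older Hadoop, `bytes.getBytes.take(bytes.getLength)` does the same copy):

    import org.apache.hadoop.io.{BytesWritable, Text}
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "sequencefile-example")

    // Text and BytesWritable are not java.io.Serializable, and Hadoop also reuses
    // the same Writable objects between records, so copy into plain types first.
    val data = sc.sequenceFile("data", classOf[Text], classOf[BytesWritable])
    val plain = data.map { case (text, bytes) => (text.toString, bytes.copyBytes()) }

    plain.collect().foreach { case (key, value) =>
      println(key + " -> " + value.length + " bytes")
    }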

JVM memory in local threading (SparkLR example)

2014-03-13 Thread Tsai Li Ming
Hi, Couple of questions here: 0. I modified SparkLR.scala to change N (# of data points) and D (# of dimensions), and ran it with: # bin/run-example -Dspark.executor.memory=40g org.apache.spark.examples.SparkLR local[23] 500 And here’s the process table: /net/home/ltsai/jdk1.7.0_51/bin/jav

sample data for pagerank?

2014-03-13 Thread Diana Carroll
I'd like to play around with the PageRank example included with Spark, but I can't find any sample data included to work with. Am I missing it? Anyone got a sample file to share? Thanks, Diana

RDD partition task number

2014-03-13 Thread Jaonary Rabarisoa
Hi all, I have an RDD obtained from "context.sequenceFile" that I transform with a "map" function. By default this transformation uses only 2 CPU cores. How can I make it run on more than 2 CPUs? Best regards, Jaonary Rabarisoa
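
A hedged sketch of two ways to get more partitions (each partition becomes one task, so more partitions can keep more cores busy); the path and counts are placeholders:

    import org.apache.hadoop.io.{BytesWritable, Text}
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[8]", "partitions-example")

    // Option 1: ask for more input splits when reading (the last argument is a minimum).
    val rdd1 = sc.sequenceFile("data", classOf[Text], classOf[BytesWritable], 16)

    // Option 2: reshuffle an existing RDD into more partitions before the map.
    val rdd2 = sc.sequenceFile("data", classOf[Text], classOf[BytesWritable]).repartition(16)

    println(rdd1.partitions.size + " / " + rdd2.partitions.size)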

parsing json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
Hello, Is there anything special about calling functions that parse json lines from filter()? I have code that looks like this: def jsonMatches(line: String): Boolean = { // take a line in json format val jline = parse(line) val je = jline \ "event" if (je != JNothing && je.values.toString == user

Re: TriangleCount & Shortest Path under Spark

2014-03-13 Thread Keith Massey
The triangle count failed for me when I ran it on more than one node. There was this assertion in TriangleCount.scala: // double count should be even (divisible by two) assert((dblCount & 1) == 0) That did not hold true when I ran this on multiple nodes, even when following the gui
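
For comparison, the usage the GraphX guide describes for triangleCount, with edges loaded in canonical orientation (srcId < dstId) and the graph explicitly partitioned, looks roughly like this; the edge-list file name is a placeholder:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    val sc = new SparkContext("local[4]", "triangle-count-example")

    // canonicalOrientation = true rewrites every edge so that srcId < dstId,
    // which is what the double-count assertion in TriangleCount relies on.
    val graph = GraphLoader
      .edgeListFile(sc, "followers.txt", canonicalOrientation = true)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    graph.triangleCount().vertices.collect().foreach {
      case (vertexId, count) => println(vertexId + ": " + count)
    }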

Re: parsing json within rdd's filter()

2014-03-13 Thread Paul Brown
It's trying to send the closure over from the driver. You just need to have the jsonMatches function available on the worker side of the interaction rather than on the driver side, e.g., put it on an object CodeThatIsRemote that gets shipped with the JARs and then filter(CodeThatIsRemote.jsonMatches) and you should be off to the ra

Re: How to monitor the communication process?

2014-03-13 Thread Mayur Rustagi
You can check out Ganglia for network utilization. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 13, 2014 at 2:04 AM, moxiecui wrote: > Hello everyone: > > Say I have a application run on GraphX, and I am try

Re: Changing number of workers for benchmarking purposes

2014-03-13 Thread Mayur Rustagi
How about hacking your way around it? Start with the maximum number of workers & keep killing them off after each run. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 13, 2014 at 2:00 AM, Pierre Borckmans < pierre.borckm...@realim

Re: parsing json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
Hmm. The whole thing is packaged in a .jar file and I execute .addJar on the SparkContext. My expectation is that the whole jar together with that function is available on every worker automatically. Is that not a valid expectation? Ognen On 3/13/14, 11:09 AM, Paul Brown wrote: It's tryin

Re: parsing json within rdd's filter()

2014-03-13 Thread Paul Brown
Well, the question is how you're referencing it. If you reference it in a static fashion (function on an object, Scala-wise), then that's dereferenced on the worker side. If you reference it in a way that refers to something on the driver side, serializing the block will attempt to serialize the

Re: parsing json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
I must be really dense! :) Here is the most simplified version of the code, I removed a bunch of stuff and hard-coded the "event" and "Sign Up" lines. def jsonMatches(line:String):Boolean = { val jLine = parse(line) // extract the event: from the line val e = jLine \ "event"

Re: parsing json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
I even tried this: def jsonMatches(line:String):Boolean = true It is still failing with the same error. Ognen On 3/13/14, 11:45 AM, Ognen Duzlevski wrote: I must be really dense! :) Here is the most simplified version of the code, I removed a bunch of stuff and hard-coded the "event" and "S

How to work with ReduceByKey?

2014-03-13 Thread goi cto
Hi, I have an RDD with > which I want to reduceByKey to get i+i and a List of Lists (add the integers and build a list of the lists). BUT reduceByKey requires that the return value be of the same type as the input, so how can I combine the lists? JavaPairRDD>>> callCount = byCaller.*reduceByKey*( new

Re: SparkContext startup time out

2014-03-13 Thread velvia
I also hit this error.

Re: Spark Java example using external Jars

2014-03-13 Thread Adam Novak
Have a look at my project: . I use the SBT Native Packager, which dumps my jar and all its dependency jars into one directory. Then I have my code find the jar it's running from, and loop through tha
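
A rough sketch of that approach: pointing the context at whatever directory the packager dumped the dependency jars into. The master URL and directory path are only guesses at a typical layout:

    import java.io.File
    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "external-jars-example")

    // Directory holding the application jar plus its dependency jars.
    val libDir = new File("target/universal/stage/lib")  // placeholder layout

    libDir.listFiles()
      .filter(_.getName.endsWith(".jar"))
      .foreach(jar => sc.addJar(jar.getAbsolutePath))  // ship each jar to the executors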

Re: SparkContext startup time out

2014-03-13 Thread velvia
By the way, this is the underlying error for me: java.lang.VerifyError: (class: org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker signature: (Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;) Wrong return type in function at akka.r

Re: sample data for pagerank?

2014-03-13 Thread Mo
You can find it here: https://github.com/apache/incubator-spark/tree/master/graphx/data On Thu, Mar 13, 2014 at 10:13 AM, Diana Carroll wrote: > I'd like to play around with the Page Rank example included with Spark but > I can't find that any sample data to work with is included. Am I missing

Re: SparkContext startup time out

2014-03-13 Thread velvia
I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in a newer version of Netty.
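
In sbt terms the workaround described would look something like this (the message does not say which Akka artifact; akka-remote is assumed here since the stack trace points at akka.remote, and the Spark version string is only an example):

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"  % "0.9.0-incubating",
      // The poster credits the newer Netty pulled in via Akka 2.2.4 with fixing the VerifyError.
      "com.typesafe.akka" %% "akka-remote" % "2.2.4"
    )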

links for the old versions are broken

2014-03-13 Thread Walrus theCat
Sup, Where can I get Spark 0.7.3? It's 404 here: http://spark.apache.org/downloads.html Thanks

Kafka in Yarn

2014-03-13 Thread aecc
Hi, I would like to know the correct way to add Kafka to my project in standalone YARN mode, given that it now lives in a different artifact than the Spark core. I tried adding the dependency to my project but I get a ClassNotFoundException for my main class. Also, that makes my jar file very big, so
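
Since 0.9 the Kafka receiver ships as its own artifact, so a build roughly like the following is the usual starting point (versions are illustrative; marking the core artifacts as provided is one common choice when the cluster already ships them, and the jar-size concern is normally handled by an assembly/uber-jar plugin rather than by the dependency list itself):

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "0.9.0-incubating" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "0.9.0-incubating" % "provided",
      // The Kafka input DStream lives outside the core streaming artifact as of 0.9.
      "org.apache.spark" %% "spark-streaming-kafka" % "0.9.0-incubating"
    )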

Re: links for the old versions are broken

2014-03-13 Thread Aaron Davidson
Looks like everything from 0.8.0 and before errors similarly (though "Spark 0.3 for Scala 2.9" has a malformed link as well). On Thu, Mar 13, 2014 at 10:52 AM, Walrus theCat wrote: > Sup, > > Where can I get Spark 0.7.3? It's 404 here: > > http://spark.apache.org/downloads.html > > Thanks >

Reading back a sorted RDD

2014-03-13 Thread Aureliano Buendia
Hi, After sorting an RDD and writing it to Hadoop, will the RDD still be sorted when reading it back? Can sorting be guaranteed after reading back when the RDD was written as a single partition with rdd.coalesce(1)?

Re: parsing json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
OK, problem solved. Interesting thing: I separated the jsonMatches function below and put it as a method on a separate object in its own file. Once done that way, it all serializes and works. Ognen On 3/13/14, 11:52 AM, Ognen Duzlevski wrote: I even tried this: def jsonMatches(line:String):Boolea
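
For reference, a minimal version of that layout: the predicate lives on a top-level object, so the closure only captures a reference to that object rather than any driver-side state. This assumes the json4s-style `parse` and `\` used in the thread and hard-codes the "Sign Up" event from the earlier message; the input path is a placeholder.

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse
    import org.apache.spark.SparkContext

    // Top-level object: filter(JsonFilters.jsonMatches) serializes only a stable
    // reference, not the enclosing class the function used to be defined in.
    object JsonFilters {
      def jsonMatches(line: String): Boolean = {
        val jLine = parse(line)
        val event = jLine \ "event"
        event != JNothing && event.values.toString == "Sign Up"
      }
    }

    object FilterJob {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "json-filter-example")
        val signUps = sc.textFile("events.json").filter(JsonFilters.jsonMatches)  // placeholder path
        println(signUps.count())
        sc.stop()
      }
    }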

Round Robin Partitioner

2014-03-13 Thread David Thomas
Is it possible to partition the RDD elements in a round-robin fashion? Say I have 5 nodes in the cluster and 5 elements in the RDD. I need to ensure each element gets mapped to a different node in the cluster.

Re: Round Robin Partitioner

2014-03-13 Thread Patrick Wendell
In Spark 1.0 we've added better randomization to the scheduling of tasks so they are distributed more evenly by default. https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e However having specific policies like that isn't really supported unless you subclass the RDD it
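
If the goal is simply an even spread, one option today is a custom Partitioner keyed by element index. Note that this controls which partition an element lands in, not which physical node runs it, so it cannot strictly guarantee one element per machine. A hedged sketch:

    import org.apache.spark.{Partitioner, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD operations such as partitionBy

    // Sends key i to partition i % numPartitions, i.e. a round-robin spread.
    class RoundRobinPartitioner(val parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int = key match {
        case i: Int => i % parts
        case other  => math.abs(other.hashCode) % parts
      }
    }

    val sc = new SparkContext("local[5]", "round-robin-example")
    val elements = Seq("a", "b", "c", "d", "e")

    // Pair each element with its index, partition on the index, then drop it.
    val spread = sc.parallelize(elements.zipWithIndex.map(_.swap))
      .partitionBy(new RoundRobinPartitioner(5))
      .values

    println(spread.partitions.size)  // 5 partitions, one element each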

combining operations elegantly

2014-03-13 Thread Koert Kuipers
Not that long ago there was a nice example on here about how to combine multiple operations on a single RDD. So basically, if you want to do a count() and something else, how do you roll them into a single job? I think Patrick Wendell gave the examples. I can't find them anymore. Patrick, can you plea
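
One common way to fold several computations into one pass (not necessarily the exact example being remembered) is `aggregate`, which carries a combined accumulator through a single job; here a count and a sum computed together:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "combined-ops-example")
    val numbers = sc.parallelize(1 to 1000000)

    // The accumulator is (count, sum); both are updated in the same job instead of
    // running numbers.count() and a separate sum as two jobs over the data.
    val (count, sum) = numbers.aggregate((0L, 0L))(
      (acc, x) => (acc._1 + 1, acc._2 + x),     // fold one value into a partition's accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)    // merge accumulators across partitions
    )

    println("count=" + count + " sum=" + sum)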

Re: Local Standalone Application and shuffle spills

2014-03-13 Thread Aaron Davidson
The AMPLab Spark internals talk you mentioned is actually referring to the RDD persistence levels, where by default we do not persist RDDs to disk (https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence). "spark.shuffle.spill" refers to a different behavior -- if the "r

best practices for pushing an RDD into a database

2014-03-13 Thread Nicholas Chammas
My fellow welders, (Can we make that a thing? Let's make that a thing. :) I'm trying to wedge Spark into an existing model where we process and transform some data and then load it into an MPP database. I know that part of the sell of Spar

Re: best practices for pushing an RDD into a database

2014-03-13 Thread Patrick Wendell
Hey Nicholas, The best way to do this is to use rdd.mapPartitions() and pass a function that will open a JDBC connection to your database and write the range in each partition. On the input path there is something called JdbcRDD that is relevant: http://spark.incubator.apache.org/docs/latest/api/
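
A sketch of that mapPartitions pattern with a plain JDBC connection per partition; the URL, table, and schema are placeholders, and the trailing count() is only there because mapPartitions is lazy and needs an action to actually run the writes.

    import java.sql.DriverManager
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "jdbc-write-example")
    val rows = sc.parallelize(Seq((1, "alice"), (2, "bob")))

    rows.mapPartitions { partition =>
      // One connection per partition rather than per row.
      val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO people (id, name) VALUES (?, ?)")
      var written = 0
      try {
        partition.foreach { case (id, name) =>
          stmt.setInt(1, id)
          stmt.setString(2, name)
          stmt.executeUpdate()
          written += 1
        }
      } finally {
        stmt.close()
        conn.close()
      }
      Iterator(written)
    }.count()  // force the lazy mapPartitions to execute the writes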

Re: best practices for pushing an RDD into a database

2014-03-13 Thread Sandy Ryza
You can also call rdd.saveAsHadoopDataset and use the DBOutputFormat that Hadoop provides: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html On Thu, Mar 13, 2014 at 4:17 PM, Patrick Wendell wrote: > Hey Nicholas, > > The best way to do this is to do rd

NoClassFound Errors with Streaming Twitter

2014-03-13 Thread Paul Schooss
Hello folks, We have a strange issue going on with a Spark standalone cluster in which a simple test application is having a hard time using external classes. Here are the details. The application is located here: https://github.com/prantik/spark-example We use classes such as Spark's streaming

Help vote for Spark talks at the Hadoop Summit

2014-03-13 Thread Patrick Wendell
Hey All, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed Spark talks in each of the tracks -

best practices for pushing an RDD into a database

2014-03-13 Thread Nicholas Chammas
Thank you for the suggestions. I will look into both and report back. I'm looking at potentially a third option in Redshift's ability to COPY from SSH: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html Is there some relatively straightforward way a command sent via SSH to a worker node c

Re: best practices for pushing an RDD into a database

2014-03-13 Thread Christopher Nguyen
Nicholas, > (Can we make that a thing? Let's make that a thing. :) Yes, we're soon releasing something called Distributed DataFrame (DDF) to the community that will make this (among other useful idioms) "a (straightforward) thing" for Spark. Sent while mobile. Pls excuse typos etc. On Mar 13, 20

Re: How to work with ReduceByKey?

2014-03-13 Thread Shixiong Zhu
Hi, You can use "groupByKey + mapValues", e.g., JavaPairRDD>>> callCount = byCaller .groupByKey() .mapValues( new Function>>, Tuple2>>>() { @Override public Tuple2>> call( List>> values) throws Exception { int count = 0; List> l = new ArrayList>(); for (Tuple2> value : values) { count
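
The generic types above were mangled by the archive; the same idea in Scala with illustrative types: group first, then map the grouped values to whatever result type is needed, since mapValues (unlike reduceByKey) is not forced to return the input value type.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD operations such as groupByKey

    val sc = new SparkContext("local[2]", "group-then-map-example")

    // (caller, (count, items)) records; the types are illustrative, not the poster's exact ones.
    val byCaller = sc.parallelize(Seq(
      ("alice", (1, List("a"))),
      ("alice", (2, List("b", "c"))),
      ("bob",   (1, List("d")))
    ))

    val callCount = byCaller
      .groupByKey()
      .mapValues { values =>
        val total = values.map(_._1).sum      // add the integers
        val lists = values.map(_._2).toList   // build a list of the lists
        (total, lists)
      }

    callCount.collect().foreach(println)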