Broadcast RDD Lookup

2014-05-01 Thread vivek.ys
Hi All, I am facing an issue while performing the lookup. Please guide me on where the mistake is. val userCluster = sc.textFile("/vives/cluster2/day/users").map(_ match { case line : String => (line.split(',')(1).split(')')(0).trim.toInt, line.split(',')(0).split('(')(1).toInt) }) val
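The split logic above can be pulled into one small, testable function; a sketch assuming each input line looks like "(123,456)" (the format is inferred from the splits, not stated in the thread):

```scala
object PairParser {
  // Parse a line of the form "(a,b)" into (b, a), mirroring the split
  // logic in the snippet above; the input format is an assumption.
  def parse(line: String): (Int, Int) = {
    val parts = line.split(',')
    val a = parts(0).split('(')(1).trim.toInt  // drop the leading "("
    val b = parts(1).split(')')(0).trim.toInt  // drop the trailing ")"
    (b, a)
  }
}
```

With this, the map in the snippet would become `sc.textFile(path).map(PairParser.parse)`.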

Multiple Streams with Spark Streaming

2014-05-01 Thread Laeeq Ahmed
Hi all, Is it possible to read and process multiple streams with Spark? I have an EEG (brain waves) CSV file with 23 columns. Each column is one stream (wave) and each column has one million values. I know one way to do it is to take the transpose of the file and then give it to Spark, and each mapper w
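The transpose idea can be sketched on in-memory data (in practice the file would be transposed before, or while, handing the 23 columns to Spark; the CSV shape here is assumed):

```scala
object ColumnsToStreams {
  // Turn row-major CSV lines into one sequence per column (one per
  // "stream") -- a sketch of the transpose step on in-memory data.
  def columns(lines: Seq[String]): Seq[Seq[Double]] =
    lines.map(_.split(',').map(_.trim.toDouble).toSeq).transpose
}
```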

Re: CDH 5.0 and Spark 0.9.0

2014-05-01 Thread Sean Owen
This codec does require native libraries to be installed, IIRC, but they are installed with CDH 5. The error you show does not look related, though. Are you sure your HA setup is working and that you have configured it correctly in whatever config Spark is seeing? -- Sean Owen | Director, Data Scie

Re: Multiple Streams with Spark Streaming

2014-05-01 Thread Mayur Rustagi
A file as a stream? I think you are confusing Spark Streaming with a buffered reader. Spark Streaming is meant to process batches of data (files, packets, messages) as they come in, in fact utilizing the time of packet reception as a way to create windows, etc. In your case you are better off reading the file

Re: Broadcast RDD Lookup

2014-05-01 Thread Mayur Rustagi
Most likely none of the items in the PairRDD match your input, hence the error. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, May 1, 2014 at 2:06 PM, vivek.ys wrote: > Hi All, > I am facing an issue while performing

Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-05-01 Thread Mayur Rustagi
A broadcast variable is meant to be shared across each node, not across map tasks. The process you are using should work; however, having a 6 GB broadcast variable could be an issue. Does the broadcast variable finally move, or does it always stay stuck? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalyt

Re: Broadcast RDD Lookup

2014-05-01 Thread Vivek YS
No, I am sure the items match, because userCluster & productCluster are prepared from "data". The cross product of userCluster & productCluster is a superset of "data". On Thu, May 1, 2014 at 3:41 PM, Mayur Rustagi wrote: > Mostly none of the items in PairRDD match your input. Hence the error. >

Re: update of RDDs

2014-05-01 Thread Mayur Rustagi
RDDs are immutable, so they cannot be updated. You can create a new RDD containing the updated entries (often not what you want to do). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, May 1, 2014 at 4:42 AM, narayanabhatla Naras
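A sketch of "create a new RDD containing updated entries": the update is an ordinary function mapped over the data. It is shown on a plain Seq here; with an RDD the same function would go through `rdd.map(RddUpdate.bump)`. The `bump` function is a made-up example:

```scala
object RddUpdate {
  // An RDD cannot be mutated in place; instead you derive a new one by
  // mapping an update function over it. The function itself is plain
  // Scala, so it works identically on any collection.
  def bump(entry: (String, Int)): (String, Int) = (entry._1, entry._2 + 1)
}
```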

RE: update of RDDs

2014-05-01 Thread NN Murthy
Thanks a lot for the very prompt response. My next questions are the following. 1. Can we conclude that Spark is NOT the solution for our requirement? Or 2. Is there a design approach to meet such requirements using Spark? From: Mayur Rustagi [mailto:mayur.rust...@gmail.com] Sent:

Re: GraphX. How to remove vertex or edge?

2014-05-01 Thread Daniel Darabos
Graph.subgraph() allows you to apply a filter to edges and/or vertices. On Thu, May 1, 2014 at 8:52 AM, Николай Кинаш wrote: > Hello. > > How to remove vertex or edges from graph in GraphX? >
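Graph.subgraph() takes a vertex predicate and an edge predicate, and the predicates themselves are plain functions. A sketch of predicates that drop one vertex and its incident edges (the id 42L is illustrative; in GraphX, assuming a `graph: Graph[VD, ED]` in scope, they would be passed as `graph.subgraph(epred = t => keepEdge(t.srcId, t.dstId), vpred = (id, _) => keepVertex(id))`):

```scala
object SubgraphPredicates {
  // "Remove" vertex 42L by keeping everything except it, plus every
  // edge whose endpoints both survive. 42L is a made-up example id.
  val removed = 42L
  def keepVertex(id: Long): Boolean = id != removed
  def keepEdge(src: Long, dst: Long): Boolean =
    keepVertex(src) && keepVertex(dst)
}
```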

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Daniel Darabos
Cool intro, thanks! One question: on slide 23 it says "Standalone ("local" mode)". That sounds a bit confusing without hearing the talk. Standalone mode is not local; it just does not depend on cluster-management software. I think it's the best mode for EC2/GCE, because they provide a distributed filesyste

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Dean Wampler
Thanks for the clarification. I'll fix the slide. I've done a lot of Scalding/Cascading programming where the two concepts are synonymous, but clearly I was imposing my prejudices here ;) dean On Thu, May 1, 2014 at 8:18 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Cool intro

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread ZhangYi
Very useful material. Currently, I am trying to persuade my client to choose Spark instead of Hadoop MapReduce. Your slides give me more evidence to support my opinion. -- ZhangYi (张逸) Developer tel: 15023157626 blog: agiledon.github.com weibo: tw张逸 Sent with Sparrow (http://www.sparrowmailapp.

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Dean Wampler
That's great! Thanks. Let me know if it works ;) or what I could improve to make it work. dean On Thu, May 1, 2014 at 8:45 AM, ZhangYi wrote: > Very Useful material. Currently, I am trying to persuade my client choose > Spark instead of Hadoop MapReduce. Your slide give me more evidence to >

"sbt/sbt run" command returns a JVM problem

2014-05-01 Thread Carter
Hi, I have a very simple spark program written in Scala: /*** testApp.scala ***/ object testApp { def main(args: Array[String]) { println("Hello! World!") } } Then I use the following command to compile it: $ sbt/sbt package The compilation finished successfully and I got a JAR file. But wh

Re: "sbt/sbt run" command returns a JVM problem

2014-05-01 Thread Chester Chen
You might want to check the memory settings in sbt itself, which is a shell script that runs a java command. I don't have a computer at hand, but if you vim or cat sbt/sbt, you will see the memory settings; change them to fit your needs. You might also be able to override the settings by changing .sbto

Re: update of RDDs

2014-05-01 Thread Mayur Rustagi
If you are doing a lot of small updates on a huge amount of data and need a real-time response on the output, Spark is probably not a good fit. If you are doing small updates on your RDD but need to materialize the final RDD with all the changes every day or so, then Spark can probably fit with s

Re: "sbt/sbt run" command returns a JVM problem

2014-05-01 Thread Sean Owen
Here's how I configure SBT, which I think is the usual way: export SBT_OPTS="-XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256m -Xmx1g" See if that takes. But your error is that you're already asking for too much memory for your machine. So maybe you are setting the value successfully, but it's n

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread diplomatic Guru
Thanks Dean, very useful indeed! Best regards, Raj On 1 May 2014 14:46, Dean Wampler wrote: > That's great! Thanks. Let me know if it works ;) or what I could improve > to make it work. > > dean > > > On Thu, May 1, 2014 at 8:45 AM, ZhangYi wrote: > >> Very Useful material. Currently, I am

RE: What is Seq[V] in updateStateByKey?

2014-05-01 Thread Adrian Mocanu
So Seq[V] contains only "new" tuples. I initially thought that whenever a new tuple was found, it would add it to the Seq and call the update function immediately, so there wouldn't be more than one update to the Seq per function call. Say I want to sum tuples with the same key in an RDD using updateStateB

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Dean Wampler
I updated the uploads at both locations to fix slide 23. Thanks for the feedback. dean On Thu, May 1, 2014 at 9:25 AM, diplomatic Guru wrote: > Thanks Dean, very useful indeed! > > Best regards, > > Raj > > > On 1 May 2014 14:46, Dean Wampler wrote: > >> That's great! Thanks. Let me know if it

Re: "sbt/sbt run" command returns a JVM problem

2014-05-01 Thread Chester Chen
Here are the options defined in sbt/sbt:
- JAVA_OPTS: environment variable; if unset, uses "$java_opts"
- SBT_OPTS: environment variable; if unset, uses "$default_sbt_opts"
- .sbtopts: if this file exists in the current directory, it is prepended to the runner args
- /etc/sbt/sbtopts

Spark Training

2014-05-01 Thread Nicholas Chammas
There are many freely available resources for the enterprising individual to use if they want to Spark up their life. For others, some structured training is in order. Say I want everyone from my department at my company to get something like the AMP Camp experience, p

Spark profiler

2014-05-01 Thread Punya Biswal
Hi all, I am thinking of starting work on a profiler for Spark clusters. The current idea is that it would collect jstacks from executor nodes and put them into a central index (either a database or elasticsearch), and it would present them to people in a UI that would let people slice and dice th

RE: Spark Training

2014-05-01 Thread Huang, Roger
If you're in the Bay Area, the Spark Summit would be a great source of information. http://spark-summit.org/2014 -Roger From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com] Sent: Thursday, May 01, 2014 10:12 AM To: u...@spark.incubator.apache.org Subject: Spark Training There are many free

Re: Spark Training

2014-05-01 Thread Mayur Rustagi
Hi Nicholas, We provide hands-on training on Spark and the associated ecosystem. We gave it recently at a conference in Santa Clara. Primarily it's targeted at novices in the Spark ecosystem, to introduce them hands-on, to get them to write simple code and also queries on Shark. I think Cloudera also has

Re: Spark profiler

2014-05-01 Thread Mayur Rustagi
Something like Twitter Ambrose would be lovely to integrate :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, May 1, 2014 at 8:44 PM, Punya Biswal wrote: > Hi all, > > I am thinking of starting work on a profile

Re: Spark Training

2014-05-01 Thread Denny Lee
You may also want to check out Paco Nathan's Introduction to Spark courses: http://liber118.com/pxn/ > On May 1, 2014, at 8:20 AM, Mayur Rustagi wrote: > > Hi Nicholas, > We provide training on spark, hands-on also associated ecosystem. > We gave it recently at a conference in Santa Clara. P

Equally weighted partitions in Spark

2014-05-01 Thread deenar.toraskar
Hi, I am using Spark to distribute computationally intensive tasks across the cluster. Currently I partition my RDD of tasks randomly. There is a large variation in how long each of the jobs takes to complete, leading to most partitions being processed quickly while a couple of partitions take forever

Re: Spark Training

2014-05-01 Thread Dean Wampler
I'm working on a 1-day workshop that I'm giving in Australia next week and a few other conferences later in the year. I'll post a link when it's ready. dean On Thu, May 1, 2014 at 10:30 AM, Denny Lee wrote: > You may also want to check out Paco Nathan's Introduction to Spark > courses: http://

Re: Efficient Aggregation over DB data

2014-05-01 Thread Andrea Esposito
Hi Sai, I honestly can't figure out where you are using the RDDs (because the split method isn't defined on them). By the way, you should use the map function instead of foreach, due to the fact that it is NOT idempotent and some partitions could be recomputed, executing the function multiple times. Wha

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-01 Thread Peter
Thank you Patrick. I took a quick stab at it: val s3Client = new AmazonS3Client(...) val copyObjectResult = s3Client.copyObject("upload", outputPrefix + "/part-0", "rolled-up-logs", "2014-04-28.csv") val objectListing = s3Client.listObjects("upload", outputPrefix) s3Client.d

Spark "streaming"

2014-05-01 Thread Mohit Singh
Hi, I guess Spark uses "streaming" in the context of streaming live data, but what I mean is something more along the lines of Hadoop Streaming, where one can code in any programming language. Or is something along those lines on the cards? Thanks -- Mohit "When you want success as badly as you wan

Re: Spark "streaming"

2014-05-01 Thread Tathagata Das
Take a look at the RDD.pipe() operation. It allows you to pipe the data in an RDD to any external shell command (just like a Unix shell pipe). On May 1, 2014 10:46 AM, "Mohit Singh" wrote: > Hi, > I guess Spark is using streaming in context of streaming live data but > what I mean is something m
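RDD.pipe() sends each element of a partition to the command's stdin, one per line, and yields the command's stdout lines; in Spark this is just `rdd.pipe(Seq("tr", "a-z", "A-Z"))`. The same contract can be sketched locally with scala.sys.process (assumes a Unix `tr` on the PATH):

```scala
import scala.sys.process._

object PipeSketch {
  // Feed lines to an external command's stdin and collect its stdout
  // lines -- the contract RDD.pipe applies to each partition.
  def pipeThrough(cmd: Seq[String], lines: Seq[String]): Seq[String] = {
    val in = new java.io.ByteArrayInputStream(
      (lines.mkString("\n") + "\n").getBytes("UTF-8"))
    (cmd #< in).lineStream.toList
  }
}
```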

permission problem

2014-05-01 Thread Livni, Dana
I'm working with Spark 0.9.0 on CDH5. I'm running a Spark application written in Java in yarn-client mode. Cause of the OP installed on the cluster, I need to run the application using the hdfs user; otherwise I have a permission problem and get the following error: org.apache.hadoop.ipc.Rem

Re: permission problem

2014-05-01 Thread Sean Owen
Yeah, actually it's hdfs that has superuser privileges on HDFS, not root. It looks like you're trying to access a nonexistent user directory like "/user/foo", and it fails because root can't create it, and you inherit the privileges of root since that is what your app runs as. I don't think you want t

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-01 Thread Nicholas Chammas
The fastest way to save to S3 should be to leave the RDD with many partitions, because all partitions will be written out in parallel. Then, once the various parts are in S3, somehow concatenate the files together into one file. If this can be done within S3 (I don't know if this is possible), th

ClassNotFoundException

2014-05-01 Thread Joe L
Hi, I am getting the following error. How could I fix this problem? Joe 14/05/02 03:51:48 WARN TaskSetManager: Lost TID 12 (task 2.0:1) 14/05/02 03:51:48 INFO TaskSetManager: Loss was due to java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4 [duplicate

Can't be built on Mac

2014-05-01 Thread Zhige Xin
Hi dear all, When I tried to build Spark 0.9.1 on my Mac OS X 10.9.2 with Java 8, I found the following errors: [error] error while loading CharSequence, class file '/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken [error] (ba

Re: Can't be built on Mac

2014-05-01 Thread Ian Ferreira
Hi Zhige, I had the same issue and reverted to using JDK 1.7.0_55. From: Zhige Xin Reply-To: Date: Thursday, May 1, 2014 at 12:32 PM To: Subject: Can't be built on Mac Hi dear all, When I tried to build Spark 0.9.1 on my Mac OS X 10.9.2 with Java 8, I found the following errors: [error] err

Re: Can't be built on Mac

2014-05-01 Thread Zhige Xin
Thank you! Ian. Zhige On Thu, May 1, 2014 at 12:35 PM, Ian Ferreira wrote: > HI Zhige, > I had the same issue and revert to using JDK 1.7.055 > > From: Zhige Xin > Reply-To: > Date: Thursday, May 1, 2014 at 12:32 PM > To: > Subject: Can't be built on MAC > > Hi dear all, > > When I tried to

updateStateByKey example not using correct input data?

2014-05-01 Thread Adrian Mocanu
I'm trying to understand updateStateByKey. Here's an example I'm testing with: Input data: DStream( RDD( ("a",2) ), RDD( ("a",3) ), RDD( ("a",4) ), RDD( ("a",5) ), RDD( ("a",6) ), RDD( ("a",7) ) ) Code: val updateFunc = (values: Seq[Int], state: Option[StateClass]) => { val previousState

Running Spark jobs via oozie

2014-05-01 Thread Shivani Rao
Hello Spark Fans, I am trying to run a spark job via oozie as a java action. The spark code is packaged as a MySparkJob.jar compiled using sbt assembly (excluding spark and hadoop dependencies). I am able to invoke the spark job from any client using java -cp lib/MySparkJob.jar:lib/spark-0.9-ass

Setting the Scala version in the EC2 script?

2014-05-01 Thread Ian Ferreira
Is this possible? It is very annoying to have such a great script but still have to manually update things afterwards.

Re: Equally weighted partitions in Spark

2014-05-01 Thread Andrew Ash
The problem is that equally-sized partitions take variable time to complete based on their contents? Sent from my mobile phone On May 1, 2014 8:31 AM, "deenar.toraskar" wrote: > Hi > > I am using Spark to distribute computationally intensive tasks across the > cluster. Currently I partition my R

range partitioner with updateStateByKey

2014-05-01 Thread Adrian Mocanu
If I use a range partitioner, will this make updateStateByKey take the tuples in order? Right now I see them not being taken in order (most of them are ordered but not all) -Adrian

java.lang.ClassNotFoundException

2014-05-01 Thread İbrahim Rıza HALLAÇ
Hello. I followed the "A Standalone App in Java" part of the tutorial https://spark.apache.org/docs/0.8.1/quick-start.html The Spark standalone cluster looks like it's running without a problem: http://i.stack.imgur.com/7bFv8.png I have built a fat jar for running this JavaApp on the cluster. Before maven

Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-05-01 Thread PengWeiPRC
Thanks, Rustagi. Yes, the global data is read-only and stays from the beginning to the end of the whole Spark task. Actually, it is not only identical for one map/reduce task, but used by a lot of my map/reduce tasks. That's why I intend to put the data on each node of my cluster, and hope t

Task not serializable: collect, take

2014-05-01 Thread SK
Hi, I have the following code structure. It compiles OK, but at runtime it aborts with the error: Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: I am running in local (standalone) mode. trait A{ def input(...): .

Re: Task not serializable: collect, take

2014-05-01 Thread Marcelo Vanzin
Have you tried making A extend Serializable? On Thu, May 1, 2014 at 3:47 PM, SK wrote: > Hi, > > I have the following code structure. I compiles ok, but at runtime it aborts > with the error: > Exception in thread "main" org.apache.spark.SparkException: Job aborted: > Task not serializable: java
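For reference, a sketch of what making the trait Serializable buys you: Spark's default closure serializer is Java serialization, so anything referenced by a task must survive a round trip like the one below (the class and names are illustrative, not from the thread):

```scala
object SerializableDemo {
  // A class whose instances end up inside a task closure must be
  // serializable; for a plain Scala class, mixing in Serializable
  // is usually enough.
  class Multiplier(val factor: Int) extends Serializable {
    def apply(x: Int): Int = x * factor
  }

  // Java-serialization round trip -- effectively what Spark does
  // when shipping a task closure to an executor.
  def roundTrip[T](t: T): T = {
    val bos = new java.io.ByteArrayOutputStream()
    val oos = new java.io.ObjectOutputStream(bos)
    oos.writeObject(t); oos.close()
    val ois = new java.io.ObjectInputStream(
      new java.io.ByteArrayInputStream(bos.toByteArray))
    ois.readObject().asInstanceOf[T]
  }
}
```

If the round trip throws NotSerializableException, the same object would fail inside a Spark job.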

Re: Opinions stratosphere

2014-05-01 Thread Christopher Nguyen
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere is different from Spark in not having an explicit concept of a

Question regarding doing aggregation over custom partitions

2014-05-01 Thread Arun Swami
Hi, I am a newbie to Spark. I looked for documentation or examples to answer my question but came up empty handed. I don't know whether I am using the right terminology but here goes. I have a file of records. Initially, I had the following Spark program (I am omitting all the surrounding code a

configure spark history server for running on Yarn

2014-05-01 Thread Jenny Zhao
Hi, I have installed spark 1.0 from the branch-1.0, build went fine, and I have tried running the example on Yarn client mode, here is my command: /home/hadoop/spark-branch-1.0/bin/spark-submit /home/hadoop/spark-branch-1.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.2.0.jar --master

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-01 Thread Shivani Rao
Hello Koert, That did not work. I specified it in my email already. But I figured a way around it by excluding akka dependencies Shivani On Tue, Apr 29, 2014 at 12:37 PM, Koert Kuipers wrote: > you need to merge reference.conf files and its no longer an issue. > > see the Build for for spark

Re: same partition id means same location?

2014-05-01 Thread wxhsdp
Can anyone comment on this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/same-partition-id-means-same-location-tp5136p5200.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

YARN issues with resourcemanager.scheduler.address

2014-05-01 Thread zsterone
Hi, I'm trying to connect to a YARN cluster by running these commands: export HADOOP_CONF_DIR=/hadoop/var/hadoop/conf/ export YARN_CONF_DIR=$HADOOP_CONF_DIR export SPARK_YARN_MODE=true export SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar export SPARK_YARN_APP_JAR

sbt package, NoClassDefFoundError

2014-05-01 Thread SK
When I run sbt/sbt package and then execute ./bin/run-example org.apache.spark.examples.SparkPi local I get a NoClassDefFoundError. However, when I run sbt/sbt assembly and create the fat jar and run the above command, I am able to run it correctly. Creating a fat jar each time takes a lot of ti

Re: java.lang.ClassNotFoundException

2014-05-01 Thread Joe L
Hi, You should include the jar file of your project, for example: conf.setJars(Seq("yourjarfilepath.jar")) Joe On Friday, May 2, 2014 7:39 AM, proofmoore [via Apache Spark User List] wrote: HelIo. I followed "A Standalone App in Java" part of the tutorial  https://spark.apache.org/docs/0.8.1/quick-sta

Re: What is Seq[V] in updateStateByKey?

2014-05-01 Thread Tathagata Das
Depends on your code. Referring to the earlier example, if you do words.map(x => (x,1)).updateStateByKey(), then for a particular word, if a batch contains 6 occurrences of that word, the Seq[V] will be [1, 1, 1, 1, 1, 1]. Instead, if you do words.map(x => (x,1)).reduceByKey(_ + _).update
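The update function itself is ordinary Scala and can be checked outside Spark; a sketch of the running-count version described above:

```scala
object RunningCount {
  // The function passed to updateStateByKey: Seq[V] holds the new
  // values for one key in this batch ([1,1,1,...] after map(x => (x,1)),
  // or a single pre-summed count after reduceByKey(_ + _)); the Option
  // holds the state carried over from previous batches.
  def update(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))
}
```

In Spark this would run per key as `pairs.updateStateByKey(RunningCount.update _)`.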

Re: range partitioner with updateStateByKey

2014-05-01 Thread Tathagata Das
Ordered by what? arrival order? sort order? TD On Thu, May 1, 2014 at 2:35 PM, Adrian Mocanu wrote: > If I use a range partitioner, will this make updateStateByKey take the > tuples in order? > > Right now I see them not being taken in order (most of them are ordered > but not all) > > > > -A

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-01 Thread Stephen Boesch
Hi Shivani, Your work would be helpful to others (well at least to me ;) Would you be willing to share your resultant sbt build files? 2014-05-01 17:45 GMT-07:00 Shivani Rao : > Hello Koert, > > That did not work. I specified it in my email already. But I figured a way > around it by exclu

Getting the following error using EC2 deployment

2014-05-01 Thread Ian Ferreira
I have a custom app that was compiled with scala 2.10.3 which I believe is what the latest spark-ec2 script installs. However running it on the master yields this cryptic error which according to the web implies incompatible jar versions. Exception in thread "main" java.lang.NoClassDefFoundError:

Re: Equally weighted partitions in Spark

2014-05-01 Thread deenar.toraskar
Yes. On a job I am currently running, 99% of the partitions finish within seconds and a couple of partitions take around an hour to finish. I am pricing some instruments, and complex instruments take far longer to price than plain vanilla ones. If I could distribute these complex instruments evenly
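One possible workaround (a sketch, not from the thread): estimate each instrument's cost up front, bin the tasks greedily with the heaviest first, then key the RDD by bin index and partition on that key. The cost values and names below are made up:

```scala
object BalancedBins {
  // Greedy longest-processing-time assignment: take tasks in
  // descending cost order and always place the next one in the
  // currently lightest bin. Returns the bin index for each task id.
  def assign(costs: Map[String, Double], bins: Int): Map[String, Int] = {
    val load = Array.fill(bins)(0.0)
    costs.toSeq.sortBy(-_._2).map { case (id, cost) =>
      val b = load.indexOf(load.min)
      load(b) += cost
      id -> b
    }.toMap
  }
}
```

The resulting bin index can then drive a custom Partitioner (or a plain `partitionBy` on the bin key) so each partition carries roughly equal work.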

Re: ClassNotFoundException

2014-05-01 Thread Joe L
Please help me -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-tp5182p5209.html Sent from the Apache Spark User List mailing list archive at Nabble.com.