SocketException when reading from S3 (s3n format)

2014-06-04 Thread yuzeh
Hi all, I've set up a 4-node spark cluster (the nodes are r3.large) with the spark-ec2 script. I've been trying to run a job on this cluster, and I'm trying to figure out why I get the following exception: java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInput

Re: SocketException when reading from S3 (s3n format)

2014-06-04 Thread yuzeh
I should add that I'm using spark 0.9.1. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SocketException-when-reading-from-S3-s3n-format-tp6889p6890.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: mounting SSD devices of EC2 r3.8xlarge instances

2014-06-04 Thread Han JU
For SSDs in r3, maybe it's better to mount with the `discard` option since they support TRIM. What I did for r3.large: echo '/dev/xvdb /mnt ext4 defaults,noatime,nodiratime,discard 0 0' >> /etc/fstab mkfs.ext4 /dev/xvdb mount /dev/xvdb 2014-06-03 19:15 GMT+02:00 Matei Zaharia : > Those insta

IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread bluejoe2008
what does this exception mean? 14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6 java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:221) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271) at org.apache.spark.mll

Problem understanding log message in SparkStreaming

2014-06-04 Thread nilmish
I wanted to know the meaning of the following log message when running a spark streaming job : [spark-akka.actor.default-dispatcher-18] INFO org.apache.spark.streaming.scheduler.JobScheduler - Total delay: 5.432 s for time 1401870454500 ms (execution: 0.593 s) According to my understanding, tota

How to change default storage levels

2014-06-04 Thread Salih Kardan
Hi, I'm using Spark 0.9.1 and Shark 0.9.1. My dataset does not fit into the memory I have in my cluster setup, so I also want to use disk for caching. I guess MEMORY_ONLY is the default storage level in Spark. If that's the case, how can I change the storage level to MEMORY_AND_DISK in Spark? thanks

executor idle during task schedule

2014-06-04 Thread wxhsdp
Hi all, I've observed that sometimes when the executor finishes one task, it waits about 5 seconds to get another task to work on; during those 5 seconds the executor does nothing: CPU idle, no disk access, no network transfer. Is that normal for Spark? thanks! -- View this message in c

compile spark 1.0.0 error

2014-06-04 Thread ch huang
hi,maillist: i try to compile spark ,but failed, here is my compile command and compile output # SPARK_HADOOP_VERSION=2.0.0-cdh4.4.0 SPARK_YARN=true sbt/sbt assembly [warn] 18 warnings found [info] Compiling 53 Scala sources and 1 Java source to /home/admserver/spark-1.0.0/sql/catalyst/

Re: IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread Xiangrui Meng
Could you check whether the vectors have the same size? -Xiangrui On Wed, Jun 4, 2014 at 1:43 AM, bluejoe2008 wrote: > what does this exception mean? > > 14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6 > java.lang.IllegalArgumentException: requirement failed > at scala.Predef$.r

Re: ZeroMQ Stream -> stack guard problem and no data

2014-06-04 Thread Sean Owen
It's complaining about the native library shipped with ZeroMQ, right? That message is the JVM complaining about how it was compiled. If so, I think it's a question for ZeroMQ? On Wed, Jun 4, 2014 at 7:10 AM, Tobias Pfeiffer wrote: > Hi, > > I am trying to use Spark Streaming (1.0.0) with ZeroMQ,

Re: RDD with a Map

2014-06-04 Thread Oleg Proudnikov
Just a thought... Are you trying to use the RDD as a Map? On 3 June 2014 23:14, Doris Xin wrote: > Hey Amit, > > You might want to check out PairRDDFunctions > . > For your use case in particula

Re: Spark not working with mesos

2014-06-04 Thread praveshjain1991
Thanks for the reply Akhil. I created a tar.gz using make-distribution.sh which is accessible from all the slaves (I checked it using hadoop fs -ls /path/). Also, there are no worker logs printed in the $SPARK_HOME/work/ directory on the workers (which are otherwise printed if I run without usin

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Sean Owen
I think Mayur meant that Spark doesn't necessarily clean the closure under Java 7 -- is that true though? I didn't know of an issue there. Some anonymous class in your (?) OptimisingSort class is getting serialized, which may be fine and intentional, but it is not serializable. You haven't posted

Re: compile spark 1.0.0 error

2014-06-04 Thread Sean Owen
I am not sure if it is exposed in the SBT build, but you may need the equivalent of the 'yarn-alpha' profile from the Maven build. This older build of CDH predates the newer YARN APIs. See also https://groups.google.com/forum/#!msg/spark-users/T1soH67C5M4/CmGYV8kfRkcJ Or, use a later CDH. In fac

Re: Spark not working with mesos

2014-06-04 Thread Akhil Das
http://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging If you are not able to find the logs in /var/log/mesos, do check in /tmp/mesos/ and you can see your application ids and all, just like in the $SPARK_HOME/work directory. Thanks Best Regards On Wed, Jun

Re: Error related to serialisation in spark streaming

2014-06-04 Thread nilmish
The error is resolved. I was using a comparator which was not serializable, which is why it was throwing the error. I have now switched to the kryo serializer as it is faster than the java serializer. I have set the required config conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerialize
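
A minimal sketch of the Kryo setup described above, assuming a Spark 0.9/1.0-style SparkConf (the app name and batch interval are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Switch the serializer to Kryo before creating the context.
    val conf = new SparkConf()
      .setAppName("MyStreamingJob")  // illustrative name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(conf, Seconds(2))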

Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
Hi, I am a new spark user. Pls let me know how to handle the following scenario: I have a data set with the following fields: 1. DeviceId 2. latitude 3. longitude 4. ip address 5. Datetime 6. Mobile application name With the above data, I would like to perform the following steps: 1. Collect all

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Mayur Rustagi
I had issues around embedded functions; here's what I have figured out. Every inner class actually contains a field referencing the outer class. The anonymous class actually has a this$0 field referencing the outer class, which is why Spark is trying to serialize the outer class. In the Scala API, the clos

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Sean Owen
static inner classes do not refer to the outer class. Often people declare them non-static by default when it's unnecessary -- a Comparator class is typically a great example. Anonymous inner classes declared inside a method are another example, but there again they can be refactored into named sta

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Oleg Proudnikov
It is possible if you use a cartesian product to produce all possible pairs for each IP address and 2 stages of map-reduce: - first by pairs of points to find the total of each pair and - second by IP address to find the pair for each IP address with the maximum count. Oleg On 4 June 2014 11
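
A rough Scala sketch of the two stages Oleg outlines, assuming the data has already been parsed into an RDD of (ip, (lat, lon)) records; the input shape and names are assumptions, not code from the thread, and the per-IP pairs are built inside each group rather than with RDD.cartesian:

    // points: RDD[(String, (Double, Double))], i.e. (ip, (lat, lon)) -- assumed input shape
    val byIp = points.groupByKey()

    // Stage 1: all pairs of points per IP, counted.
    val pairCounts = byIp.flatMap { case (ip, locs) =>
      for (a <- locs; b <- locs if a != b) yield ((ip, (a, b)), 1)
    }.reduceByKey(_ + _)

    // Stage 2: per IP, keep the pair with the maximum count.
    val maxPairPerIp = pairCounts
      .map { case ((ip, pair), count) => (ip, (pair, count)) }
      .reduceByKey((x, y) => if (x._2 >= y._2) x else y)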

Facing MetricsSystem error on Running Spark applications

2014-06-04 Thread Vibhor Banga
Hi, I am facing following error on running spark applications. What could be missing which is causing this issue. org.eclipse.jetty.server.AbstractConnector - Started SocketConnector@0.0.0.0:55046 3574 [main] ERROR org.apache.spark.metrics.MetricsSystem - Sink class org.apache.spark.metrics.sin

Join : Giving incorrect result

2014-06-04 Thread Ajay Srivastava
Hi, I am doing a join of two RDDs which gives different results (counting the number of records) each time I run this code on the same input. The input files are large enough to be divided in two splits. When the program runs on two workers with a single core assigned to each, the output is consistent and

Re: Facing MetricsSystem error on Running Spark applications

2014-06-04 Thread Sean Owen
You've got a conflict in the version of Jackson that is being used: Caused by: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.module.SimpleSerializers.<init>(Ljava/util/List;)V Looks like you are using Jackson 2.x somewhere, but AFAIK all of the Hadoop/Spark libs are still on 1.x. That's

Can't seem to link "external/twitter" classes from my own app

2014-06-04 Thread Jeremy Lee
Man, this has been hard going. Six days, and I finally got a "Hello World" App working that I wrote myself. Now I'm trying to make a minimal streaming app based on the twitter examples, (running standalone right now while learning) and when running it like this: bin/spark-submit --class "SimpleAp

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-04 Thread Sean Owen
Those aren't the names of the artifacts: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22 The name is "spark-streaming-twitter_2.10" On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee wrote: > Man, this has been hard going. Six days, and I finally got a "Hello World" >

Re: Re: IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread bluejoe2008
Thank you, 孟祥瑞! With your help I solved the problem. I constructed SparseVectors in the wrong way: the first parameter of the constructor SparseVector(int size, int[] indices, double[] values) is the vector size, but I mistook it for the size of values. 2014-06-04 bluejoe2008 From: Xiangrui Meng Date: 2014-06-04 17
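
For reference, the corrected construction looks roughly like this with the 1.0.0 MLlib vector API (the values below are illustrative):

    import org.apache.spark.mllib.linalg.Vectors

    // The first argument is the vector's dimension, not values.length:
    // a 10-dimensional vector with non-zeros at indices 1 and 4.
    val v = Vectors.sparse(10, Array(1, 4), Array(0.5, 2.0))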

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-04 Thread Nick Pentreath
@Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver cla
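
In sbt that would look roughly like this (the version should match the Spark version in use):

    // build.sbt -- %% appends the Scala binary version, resolving to spark-streaming-twitter_2.10
    libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"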

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-04 Thread Sean Owen
Ah sorry, this may be the thing I learned for the day. The issue is that classes from that particular artifact are missing though. Worth interrogating the resulting .jar file with "jar tf" to see if it made it in? On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath wrote: > @Sean, the %% syntax in SBT

is there any easier way to define a custom RDD in Java

2014-06-04 Thread bluejoe2008
hi, folks, is there any easier way to define a custom RDD in Java? I am wondering if I have to define a new java class which extends RDD from scratch? It is really a hard job for developers! 2014-06-04 bluejoe2008

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-04 Thread Jeremy Lee
On Wed, Jun 4, 2014 at 12:31 PM, Matei Zaharia wrote: > Ah, sorry to hear you had more problems. Some thoughts on them: > There will always be more problems, 'tis the nature of coding. :-) I try not to bother the list until I've smacked my head against them for a few hours, so it's only the "mos

Re: spark on yarn fail with IOException

2014-06-04 Thread sam
I get a very similar stack trace and have no idea what could be causing it (see below). I've created a SO: http://stackoverflow.com/questions/24038908/spark-fails-on-big-jobs-with-java-io-ioexception-filesystem-closed 14/06/02 20:44:04 INFO client.AppClient$ClientActor: Executor updated: app-2014

Re: ZeroMQ Stream -> stack guard problem and no data

2014-06-04 Thread Tobias Pfeiffer
Hi, thanks for your messages. I'm not pursuing this further, though. People in #zeromq IRC advised strongly against using libraries based on version 2 of libzmq (for example, the Akka ZeroMQ library) due to a number of issues in that version. In fact, https://github.com/zeromq/jeromq seems the way to

Re: RDD with a Map

2014-06-04 Thread Cheng Lian
On Wed, Jun 4, 2014 at 5:56 AM, Amit Kumar wrote: Hi Folks, > > I am new to spark -and this is probably a basic question. > > I have a file on the hdfs > > 1, one > 1, uno > 2, two > 2, dos > > I want to create a multi Map RDD RDD[Map[String,List[String]]] > > {"1"->["one","uno"], "2"->["two","d
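
A small Scala sketch of one way to build the requested structure from that file; the path is a placeholder, and collectAsMap (which the follow-up below settles on) brings the grouped result back to the driver:

    // Parse "1, one"-style lines into (key, value) pairs.
    val pairs = sc.textFile("hdfs:///path/to/file").map { line =>  // placeholder path
      val Array(k, v) = line.split(",", 2)
      (k.trim, v.trim)
    }

    // Map("1" -> List("one", "uno"), "2" -> List("two", "dos")) on the driver.
    val multiMap: Map[String, List[String]] =
      pairs.groupByKey().mapValues(_.toList).collectAsMap().toMap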

Spark Usecase

2014-06-04 Thread Shahab Yunus
Hello All. I have a newbie question. We have a use case where huge amount of data will be coming in streams or micro-batches of streams and we want to process these streams according to some business logic. We don't have to provide extremely low latency guarantees but batch M/R will still be slow

Re: Join : Giving incorrect result

2014-06-04 Thread Cheng Lian
Hi Ajay, would you mind synthesising a minimal code snippet that can reproduce this issue and pasting it here? On Wed, Jun 4, 2014 at 8:32 PM, Ajay Srivastava wrote: > Hi, > > I am doing join of two RDDs which giving different results ( counting > number of records ) each time I run this code on

error with cdh 5 spark installation

2014-06-04 Thread chirag lakhani
I recently spun up an AWS cluster with cdh 5 using Cloudera Manager. I am trying to install spark and simply used the install command, as stated in the CDH 5 documentation. sudo apt-get install spark-core spark-master spark-worker spark-python I get the following error Setting up spark-master

Java IO Stream Corrupted - Invalid Type AC?

2014-06-04 Thread Matt Kielo
Hi, I'm trying to run some Spark code on a cluster but I keep running into a "java.io.StreamCorruptedException: invalid type code: AC" error. My task involves analyzing ~50GB of data (some operations involve sorting) then writing it out to a JSON file. I'm running the analysis on each of the data's ~1

Re: SocketException when reading from S3 (s3n format)

2014-06-04 Thread Nicholas Chammas
I think by default a thread can die up to 4 times before Spark considers it a failure. Are you seeing that happen? I believe that is a configurable thing, but don't know off the top of my head how to change it. I've seen this error before when reading data from a large amount of files on S3, and i
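
If it is the retry limit you want to change, the setting in question is likely spark.task.maxFailures (default 4); a minimal sketch of raising it:

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow each task to fail up to 8 times before the job is aborted (default is 4).
    val conf = new SparkConf().set("spark.task.maxFailures", "8")
    val sc = new SparkContext(conf)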

Re: Spark not working with mesos

2014-06-04 Thread praveshjain1991
Thanks for the reply Akhil. I saw the logs in /tmp/mesos and found that my tar.gz was not properly created. I corrected that but now got another error which I can't find an answer for on Google. The error is pretty much the same "org.apache.spark.SparkException: Job aborted: Task 0.0:6 failed 4 ti

Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Sam Taylor Steyer
Hi, I am trying to launch an EC2 cluster from spark using the following command: ./spark-ec2 -k HackerPair -i [path]/HackerPair.pem -s 2 launch HackerCluster I set my access key id and secret access key. I have been getting an error in the "setting up security groups..." phase: Invalid value '

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-04 Thread Nicholas Chammas
On Wed, Jun 4, 2014 at 9:35 AM, Jeremy Lee wrote: > Oh, I went back to m1.large while those issues get sorted out. Random side note, Amazon is deprecating the m1 instances in favor of m3 instances, which have SSDs and more ECUs than their m1 counterparts. m3.2xlarge has 30GB of RAM and may be a

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Andrew Ash
Just curious, what do you want your custom RDD to do that the normal ones don't? On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 wrote: > hi, folks, > is there any easier way to define a custom RDD in Java? > I am wondering if I have to define a new java class which extends RDD > from scra

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-04 Thread Sean Owen
On Wed, Jun 4, 2014 at 3:33 PM, Matt Kielo wrote: > Im trying run some spark code on a cluster but I keep running into a > "java.io.StreamCorruptedException: invalid type code: AC" error. My task > involves analyzing ~50GB of data (some operations involve sorting) then > writing them out to a JSON

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Andrew Ash
nilmish, To confirm your code is using kryo, go to the web ui of your application (defaults to :4040) and look at the environment tab. If your serializer settings are there then things should be working properly. I'm not sure how to confirm that it works against typos in the setting, but you can

Re: How to change default storage levels

2014-06-04 Thread Andrew Ash
You can change storage level on an individual RDD with .persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change what the default persistency level is for RDDs. Andrew On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan wrote: > Hi > > I'm using Spark 0.9.1 and Shark 0.9.1. My dataset
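
A minimal example of the per-RDD approach Andrew describes (the input path is illustrative):

    import org.apache.spark.storage.StorageLevel

    // Cache in memory, spilling partitions that don't fit to disk.
    val data = sc.textFile("hdfs:///path/to/input")
    data.persist(StorageLevel.MEMORY_AND_DISK)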

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Gianluca Privitera
Hi, if you say you correctly set your access key id and secret access key then probably it's a problem related to the key.pem file. Try generating a new one, and be sure to be the only one with the right to read it or it won't work. Gianluca On 04/06/2014 09:45, Sam Taylor Steyer wrote: Hi,

Re: RDD with a Map

2014-06-04 Thread Amit
Thanks folks. I was trying to get the RDD[multimap] so the collectAsMap is what I needed. Best, Amit On Jun 4, 2014, at 6:53, Cheng Lian wrote: > On Wed, Jun 4, 2014 at 5:56 AM, Amit Kumar wrote: > > Hi Folks, > > I am new to spark -and this is probably a basic question. > > I have a file

Re: RDD with a Map

2014-06-04 Thread Amit
Yes, RDD as a map of String keys and List of string as values. Amit On Jun 4, 2014, at 2:46, Oleg Proudnikov wrote: > Just a thought... Are you trying to use the RDD as a Map? > > > > On 3 June 2014 23:14, Doris Xin wrote: Hey Amit, > > You might want to check out PairRDDFunctions. F

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
When you group by IP address in step 1 to this: (ip1,(lat1,lon1),(lat2,lon2)) (ip2,(lat3,lon3),(lat4,lat5)) How many lat/lon locations do you expect for each IP address? avg and max are interesting. Andrew On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov wrote: > It is possi

pyspark join crash

2014-06-04 Thread Brad Miller
Hi All, I have experienced some crashing behavior with join in pyspark. When I attempt a join with 2000 partitions in the result, the join succeeds, but when I use only 200 partitions in the result, the join fails with the message "Job aborted due to stage failure: Master removed our application:

Re: Spark not working with mesos

2014-06-04 Thread ajatix
Since $HADOOP_HOME is deprecated, try adding it to the Mesos configuration file. Add `export MESOS_HADOOP_HOME=$HADOOP_HOME` to ~/.bashrc and that should solve your error -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-working-with-mesos-tp6806p69

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-04 Thread Daniel Darabos
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka wrote: > Hi All, > I've been experiencing a very strange error after upgrade from Spark 0.9 > to 1.0 - it seems that the saveAsTextFile function is throwing > java.lang.UnsupportedOperationException that I have never seen before. > In the stack trace y

Re: Better line number hints for logging?

2014-06-04 Thread Daniel Darabos
Oh, this would be super useful for us too! Actually wouldn't it be best if you could see the whole call stack on the UI, rather than just one line? (Of course you would have to click to expand it.) On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier wrote: > Ok, I will probably open a Jira. > > > O

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Krishna Sankar
One reason could be that the keys are in a different region. Need to create the keys in us-east-1 (North Virginia). Cheers On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer wrote: > Hi, > > I am trying to launch an EC2 cluster from spark using the following > command: > > ./spark-ec2 -k HackerPa

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread ajatix
I am also getting the exact error, with the exact logs when I run Spark 1.0.0 in coarse-grained mode. Coarse grained mode works perfectly with earlier versions that I tested - 0.9.1 and 0.9.0, but seems to have undergone some modification in spark 1.0.0 -- View this message in context: http://

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Mark Hamstra
Are you using spark-submit to run your application? On Wed, Jun 4, 2014 at 8:49 AM, ajatix wrote: > I am also getting the exact error, with the exact logs when I run Spark > 1.0.0 > in coarse-grained mode. > Coarse grained mode works perfectly with earlier versions that I tested - > 0.9.1 and 0

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread ajatix
I'm running a manually built cluster on EC2. I have mesos (0.18.2) and hdfs (2.0.0-cdh4.5.0) installed on all slaves (3) and masters (3). I have spark-1.0.0 on one master and the executor file is on hdfs for the slaves. Whenever I try to launch a spark application on the cluster, it starts a task

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-04 Thread Mark Hamstra
Actually, what the stack trace is showing is the result of an exception being thrown by the DAGScheduler's event processing actor. What happens is that the Supervisor tries to shut down Spark when an exception is thrown by that actor. As part of the shutdown procedure, the DAGScheduler tries to c

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Marek Wiewiorka
Exactly the same story - it used to work with 0.9.1 and does not work anymore with 1.0.0. I ran tests using spark-shell as well as my application(so tested turning on coarse mode via env variable and SparkContext properties explicitly) M. 2014-06-04 18:12 GMT+02:00 ajatix : > I'm running a man

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-04 Thread Marek Wiewiorka
No, it's a Scala application. Unfortunately, after I came across problems with running using mesos coarse mode and this issue, I decided to downgrade to Spark 0.9.1 and purged the logs. But as far as I can remember, I tried to run my app using Spark standalone mode and there was also the same ClassNot

Re: SocketException when reading from S3 (s3n format)

2014-06-04 Thread Dan
Thanks Nicholas. The obvious fix for this issue, in my case, was to cache the input since it's only 35 megabytes. Dan On Wed, Jun 4, 2014 at 7:34 AM, Nicholas Chammas wrote: > I think by default a thread can die up to 4 times before Spark considers > it a failure. Are you seeing that happen?

Re: using Log4j to log INFO level messages on workers

2014-06-04 Thread Shivani Rao
Hello Alex, Thanks for the link. Yes, creating a singleton object for logging outside the code that gets executed on the workers definitely works. The problem that I am facing, though, is related to configuration of the logger. I don't see any log messages in the worker logs of the application. a) wh

Re: Using mongo with PySpark

2014-06-04 Thread Samarth Mailinglist
Thanks a lot, sorry for the really late reply! (Didn't have my laptop) This is working, but it's dreadfully slow and seems to not run in parallel? On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath wrote: > You need to use mapPartitions (or foreachPartition) to instantiate your > client in each p

Re: Spark streaming on load run - How to increase single node capacity?

2014-06-04 Thread Wayne Adams
Hi Rod: Not sure about the 2nd item on your list, but for the first one, try raising the thread limit. Your machine might be set to 1024 or some other low number (ulimit -n). -- Wayne -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-on-loa

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Patrick Wendell
Hey, thanks a lot for reporting this. Do you mind making a JIRA with the details so we can track it? - Patrick On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka wrote: > Exactly the same story - it used to work with 0.9.1 and does not work > anymore with 1.0.0. > I ran tests using spark-shell as w

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Patrick Wendell
Hey There, This is only possible in Scala right now. However, this is almost never needed since the core API is fairly flexible. I have the same question as Andrew... what are you trying to do with your RDD? - Patrick On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash wrote: > Just curious, what do you

Re: error with cdh 5 spark installation

2014-06-04 Thread Patrick Wendell
Hey Chirag, Those init scripts are part of the Cloudera Spark package (they are not in the Spark project itself) so you might try e-mailing their support lists directly. - Patrick On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani wrote: > I recently spun up an AWS cluster with cdh 5 using Cloudera

Re: error with cdh 5 spark installation

2014-06-04 Thread Sean Owen
Spark is already part of the distribution, and the core CDH5 parcel. You shouldn't need extra steps unless you're doing something special. It may be that this is the very cause of the error when trying to install over the existing services. On Wed, Jun 4, 2014 at 3:19 PM, chirag lakhani wrote: >

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-04 Thread Patrick Wendell
Hey Jeremy, The issue is that you are using one of the external libraries and these aren't actually packaged with Spark on the cluster, so you need to create an uber jar that includes them. You can look at the example here (I recently did this for a kafka project and the idea is the same): https

Re: Invalid Class Exception

2014-06-04 Thread Suman Somasundar
I am building Spark by myself and I am using Java 7 to both build and run. I will try with Java 6. Thanks, Suman. On 6/3/2014 7:18 PM, Matei Zaharia wrote: What Java version do you have, and how did you get Spark (did you build it yourself by any chance or download a pre-built one)? If you bu

RDD[(K,V)] for a Map File on HDFS

2014-06-04 Thread Amit Kumar
Hey guys, What is the best way for me to get an RDD[(K,V)] for a MapFile created by MapFile.Writer? The MapFile has a Text key and MyArrayWritable as the value. Something akin to sc.textFile($path). So far I have tried two approaches - sc.hadoopFile and sc.sequenceFile #1 val rdd= sc.hadoopFile[
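
One hedged sketch of the sequenceFile route: a MapFile directory holds a SequenceFile named "data", which SequenceFileInputFormat can usually read when pointed at the directory. The path is a placeholder and MyArrayWritable is the custom Writable from the question, assumed to be on the classpath:

    import org.apache.hadoop.io.Text

    // Point sequenceFile at the MapFile directory (placeholder path).
    val rdd = sc.sequenceFile("/path/to/mapfile", classOf[Text], classOf[MyArrayWritable])

    // Hadoop reuses Writable instances, so convert to plain types before caching.
    val pairs = rdd.map { case (k, v) => (k.toString, v) }  // convert v as appropriate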

Re: custom receiver in java

2014-06-04 Thread lbustelo
Note that what TD was referring to above is already in 1.0.0 http://spark.apache.org/docs/1.0.0/streaming-custom-receivers.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/custom-receiver-in-java-tp3575p6962.html Sent from the Apache Spark User List mailin

Re: Invalid Class Exception

2014-06-04 Thread Suman Somasundar
I tried building with Java 6 and also tried the pre-built packages. I am still getting the same error. It works fine when I run it on a machine with Solaris OS and X-86 architecture. But, it does not work with Solaris OS and Sparc architecture. Any ideas, why this would happen? Thanks, Su

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Sam Taylor Steyer
Thank you! The regions advice solved the problem for my friend who was getting the "key pair does not exist" problem. I am still getting the error: ERROR:boto:400 Bad Request ERROR:boto: InvalidParameterValue: Invalid value 'null' for protocol. VPC security group rules must specify protocols expli

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Sam Taylor Steyer
Also, once my friend logged in to his cluster he received the error "Permissions 0644 for 'FinalKey.pem' are too open." This sounds like the other problem described. How do we make the permissions more private? Thanks very much, Sam - Original Message - From: "Sam Taylor Steyer" To: us

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Krishna Sankar
chmod 600 /FinalKey.pem Cheers On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer wrote: > Also, once my friend logged in to his cluster he received the error > "Permissions 0644 for 'FinalKey.pem' are too open." This sounds like the > other problem described. How do we make the permissions

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Sam Taylor Steyer
Awesome, that worked. Thank you! - Original Message - From: "Krishna Sankar" To: user@spark.apache.org Sent: Wednesday, June 4, 2014 12:52:00 PM Subject: Re: Trouble launching EC2 Cluster with Spark chmod 600 /FinalKey.pem Cheers On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer w

Re: Join : Giving incorrect result

2014-06-04 Thread Xu (Simon) Chen
Maybe your two workers have different assembly jar files? I just ran into a similar problem that my spark-shell is using a different jar file than my workers - got really confusing results. On Jun 4, 2014 8:33 AM, "Ajay Srivastava" wrote: > Hi, > > I am doing join of two RDDs which giving differ

Re: access hdfs file name in map()

2014-06-04 Thread Xu (Simon) Chen
N/M.. I wrote a HadoopRDD subclass and append one env field of the HadoopPartition to the value in compute function. It worked pretty well. Thanks! On Jun 4, 2014 12:22 AM, "Xu (Simon) Chen" wrote: > I don't quite get it.. > > mapPartitionWithIndex takes a function that maps an integer index and

Re: Join : Giving incorrect result

2014-06-04 Thread Matei Zaharia
If this isn’t the problem, it would be great if you can post the code for the program. Matei On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen wrote: > Maybe your two workers have different assembly jar files? > > I just ran into a similar problem that my spark-shell is using a different > jar fi

reuse hadoop code in Spark

2014-06-04 Thread Wei Tan
Hello, I am trying to use spark in such a scenario: I have code written in Hadoop and now I try to migrate to Spark. The mappers and reducers are fairly complex. So I wonder if I can reuse the map() functions I already wrote in Hadoop (Java), and use Spark to chain them, mixing the Java ma

Re: reuse hadoop code in Spark

2014-06-04 Thread Matei Zaharia
Yes, you can write some glue in Spark to call these. Some functions to look at: - SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf configured by Hadoop (including InputFormat, paths, etc) - RDD.mapPartitions lets you operate in all the values on one partition (block)
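
A rough sketch of that glue, reusing an existing mapred InputFormat through a JobConf; the path and the per-record logic are placeholders for the existing Hadoop code:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    // Configure the input the same way the old Hadoop job did.
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "/input/path")  // placeholder

    val input = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])

    // Run existing map logic once per partition (block), e.g. by wrapping the old Mapper.
    val mapped = input.mapPartitions { iter =>
      iter.map { case (_, line) => line.toString /* call existing map() here */ }
    }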

Re: Better line number hints for logging?

2014-06-04 Thread Matei Zaharia
That’s a good idea too, maybe we can change CallSiteInfo to do that. Matei On Jun 4, 2014, at 8:44 AM, Daniel Darabos wrote: > Oh, this would be super useful for us too! > > Actually wouldn't it be best if you could see the whole call stack on the UI, > rather than just one line? (Of course

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
In PySpark, the data processed by each reduce task needs to fit in memory within the Python process, so you should use more tasks to process this dataset. Data is spilled to disk across tasks. I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this — it’s something we’ve bee

Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread ssb61
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-and-HiveContext-Query-Performance-tp6948p6976.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How can I dispose an Accumulator?

2014-06-04 Thread Daniel Siegmann
Will the broadcast variables be disposed automatically if the context is stopped, or do I still need to unpersist()? On Sat, May 31, 2014 at 1:20 PM, Patrick Wendell wrote: > Hey There, > > You can remove an accumulator by just letting it go out of scope and > it will be garbage collected. For

Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread Zongheng Yang
Hi, Just wondering if you can try this: val obj = sql("select manufacturer, count(*) as examcount from pft group by manufacturer order by examcount desc") obj.collect() obj.queryExecution.executedPlan.executeCollect() and time the third line alone. It could be that Spark SQL is taking some time to

Re: How can I dispose an Accumulator?

2014-06-04 Thread Matei Zaharia
All of these are disposed of automatically if you stop the context or exit the program. Matei On Jun 4, 2014, at 2:22 PM, Daniel Siegmann wrote: > Will the broadcast variables be disposed automatically if the context is > stopped, or do I still need to unpersist()? > > > On Sat, May 31, 20

Re: pyspark join crash

2014-06-04 Thread Brad Miller
Hi Matei, Thanks for the reply and creating the JIRA. I hear what you're saying, although to be clear I want to still state that it seems like each reduce task is loading significantly more data than just the records needed for that task. The workers seem to load all data from each block containi

Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread ssb61
I timed the third line and here are the stage timings: collect at SparkPlan.scala:52 - 0.5 s, mapPartitions at Exchange.scala:58 - 0.7 s, RangePartitioner at Exchange.scala:62 - 0.7 s, RangePartitioner at Exchange.scala:62 - 0.5 s, m

Cassandra examples don't work for me

2014-06-04 Thread Tim Kellogg
Hi, I’m following the directions to run the cassandra example “org.apache.spark.examples.CassandraTest” and I get this error Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected at org.apache.cassa

Re: Running a spark-submit compatible app in spark-shell

2014-06-04 Thread Roger Hoover
It took me a little while to get back to this but it works now!! I'm invoking the shell like this: spark-shell --jars target/scala-2.10/spark-etl_2.10-1.0.jar Once inside, I can invoke a method in my package to run the job. > val result = etl.IP2IncomeJob.job(sc) On Tue, May 27, 2014 at 8:42

Re: error loading large files in PySpark 0.9.0

2014-06-04 Thread Jeremy Freeman
Hey Matei, Wanted to let you know this issue appears to be fixed in 1.0.0. Great work! -- Jeremy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049p6985.html Sent from the Apache Spark User List mailing list a

Re: error loading large files in PySpark 0.9.0

2014-06-04 Thread Matei Zaharia
Ah, good to know! By the way in master we now have saveAsPickleFile (https://github.com/apache/spark/pull/755), and Nick Pentreath has been working on Hadoop InputFormats: https://github.com/apache/spark/pull/455. Would be good to have your input on both of those if you have a chance to try the

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-04 Thread Jeremy Lee
Thanks Patrick! Uberjars. Cool. I'd actually heard of them. And thanks for the link to the example! I shall work through that today. I'm still learning sbt and its many options... the last new framework I learned was node.js, and I think I've been rather spoiled by "npm". At least it's not mave

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
I think the problem is that once unpacked in Python, the objects take considerably more space, as they are stored as Python objects in a Python dictionary. Take a look at python/pyspark/join.py and combineByKey in python/pyspark/rdd.py. We should probably try to store these in serialized form.

Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
When I run sbt/sbt assembly, I get the following exception. Is anyone else experiencing a similar problem? .. [info] Resolving org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016 ... [info] Updating {file:/Users/Sung/Projects/spark_06_04_14/}assembly... [info] Resolving org.fuses

Re: custom receiver in java

2014-06-04 Thread Tathagata Das
Yes, thanks for updating this old thread! We heard our community's demands and added support for Java receivers! TD On Wed, Jun 4, 2014 at 12:15 PM, lbustelo wrote: > Note that what TD was referring to above is already in 1.0.0 > > http://spark.apache.org/docs/1.0.0/streaming-custom-receivers.html > >
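
For reference, the 1.0.0 API linked in the earlier reply boils down to extending Receiver; a minimal Scala sketch (the same class can be subclassed from Java), with the source logic left as a stub:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Minimal custom receiver: push a single record and do nothing else.
    class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      def onStart(): Unit = {
        new Thread("dummy-receiver") {
          override def run(): Unit = store("hello from a custom receiver")
        }.start()
      }
      def onStop(): Unit = { /* stop threads / close connections here */ }
    }

    // Usage: val stream = ssc.receiverStream(new DummyReceiver)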

Re: Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
Nevermind, it turns out that this is a problem for the Pivotal Hadoop that we are trying to compile against. On Wed, Jun 4, 2014 at 4:16 PM, Sung Hwan Chung wrote: > When I run sbt/sbt assembly, I get the following exception. Is anyone else > experiencing a similar problem? > > > .. > >

Re: Why Scala?

2014-06-04 Thread John Omernik
So Python is used in many of the Spark Ecosystem products, but not Streaming at this point. Is there a roadmap to include Python APIs in Spark Streaming? Anytime frame on this? Thanks! John On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia wrote: > Quite a few people ask this question and the an
