socketTextStream() call on Cluster stream no records

2014-04-21 Thread Kulkarni, Vikram
Hello Spark-users, I modified the org.apache.spark.streaming.examples.JavaNetworkWordCount example, which uses a Netcat server, to instead read data from a SocketServer implementation. The SocketServer Java program accepts connections on a port, simulates 1 million records, and streams them to the socket
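
For reference, a minimal sketch of the receiving side (host and port are hypothetical; the poster's actual SocketServer details are not shown). socketTextStream expects newline-delimited UTF-8 text, so a common cause of "no records" is a custom server that never writes '\n' or never flushes its output stream:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2] so one thread can run the receiver and another the processing
val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Pull newline-delimited text from the custom server (host/port hypothetical)
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```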

Re: Task splitting among workers

2014-04-21 Thread Arpit Tak
1.) What if the data is in S3 and we cache it in memory instead of HDFS? 2.) How is the number of reducers determined in each case? Even if I specify set mapred.reduce.tasks=50, only 2 reducers are allocated instead of 50, although the query/tasks do complete. Regards, Arpit

Re: Java heap space and spark.akka.frameSize Inbox x

2014-04-21 Thread Arpit Tak
Also check out this post http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html On Mon, Apr 21, 2014 at 11:49 AM, Akhil Das wrote: > Hi Chieh, > > You can increase the heap size by exporting the java options (See below, > will increase the heap size
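
A sketch of the settings under discussion (the values here are illustrative, not prescriptive): both the executor heap and the Akka frame size can be raised through SparkConf before the context is created:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: raise the executor heap and the maximum
// Akka message size (spark.akka.frameSize is in MB; the default was 10).
val conf = new SparkConf()
  .setAppName("BigHeapJob")
  .set("spark.executor.memory", "4g")
  .set("spark.akka.frameSize", "64")
val sc = new SparkContext(conf)
```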

Re: Are there any plans to develop Graphx Streaming?

2014-04-21 Thread Ankur Dave
On Sun, Apr 20, 2014 at 6:27 PM, Qi Song wrote: > I wonder if there exists some > documentation about how to choose partition methods, based on the graph's > structure or some other properties? > The best option is to try all the partition strategies (as well as the default, which is to leave ed

Re: Long running time for GraphX pagerank in dataset com-Friendster

2014-04-21 Thread Ankur Dave
On Sun, Apr 20, 2014 at 6:18 PM, Qi Song wrote: > I was running some pagerank tests of GraphX in my 8 nodes cluster. I > allocated each worker 32G memory and 8 CPU cores. The LiveJournal dataset > used 370s, which in my mind is reasonable. But when I tried the > com-Friendster data ( http://snap.

Re: [GraphX] Cast error when comparing a vertex attribute after its type has changed

2014-04-21 Thread Ankur Dave
On Fri, Apr 11, 2014 at 4:42 AM, Pierre-Alexandre Fonta < pierre.alexandre.fonta+sp...@gmail.com> wrote: > Testing in mapTriplets if a vertex attribute, which is defined as Integer > in > first VertexRDD but has been changed after to Double by mapVertices, is > greater than a number throws "java.l

Spark running slow for small hadoop files of 10 mb size

2014-04-21 Thread neeravsalaria
Hi, I have been using MapReduce to analyze multiple files whose size can range from 10 MB to 200 MB per file. Recently I planned to move to Spark, but my Spark job is taking too much time executing a single file when the file size is 10 MB and the HDFS block size is 64 MB. It is executing on a single
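
One common fix for this situation is to request more partitions than the file's single HDFS block would yield, via the second argument to textFile (a sketch; the path and split count are hypothetical):

```scala
// A 10 MB file in a 64 MB-block HDFS occupies one block, hence one
// partition and one task by default. Asking for more splits lets the
// work spread across cores:
val data = sc.textFile("hdfs:///data/input-10mb.txt", 8)
```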

Re: Hung inserts?

2014-04-21 Thread Mayur Rustagi
Clustering is not supported. Can you remove that & give it a go. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Apr 21, 2014 at 3:20 AM, Brad Heller wrote: > Hey list, > > I've got some CSV data I'm importing from

Do I need to learn Scala for spark ?

2014-04-21 Thread arpan57
Hi guys, I read that Spark is much faster than Hadoop, and that inspires me to learn it. I have hands-on experience with Hadoop (MR-1) and am pretty good with Java programming. Do I need to learn Scala in order to learn Spark? Can I go ahead and write my jobs in Java and run them on Spark? How much dependenc

Re: Do I need to learn Scala for spark ?

2014-04-21 Thread Pulasthi Supun Wickramasinghe
Hi, I think you can do just fine with your Java knowledge. There is a Java API that you can use [1]. I am also new to Spark and have gotten around with just my Java knowledge. And Scala is easy to learn if you are good with Java. [1] http://spark.apache.org/docs/latest/java-programming-guide.html

Re: Do I need to learn Scala for spark ?

2014-04-21 Thread Dean Wampler
I'm doing a talk this week at the Philly ETE conference on Spark. I'll compare the Hadoop Java API and Spark Scala API for implementing the *inverted index* algorithm. I'm going to make the case that the Spark API, like many functional-programming APIs, is so powerful that it's well worth your time
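
For flavor, here is a minimal inverted index in the Spark Scala API (a sketch of the general technique, not the code from the talk; the input pairs are made up):

```scala
// (documentId, text) pairs stand in for real input
val docs = sc.parallelize(Seq(
  ("doc1", "spark is fast"),
  ("doc2", "hadoop and spark")))

// word -> all documents containing it
val index = docs
  .flatMap { case (id, text) => text.split("""\s+""").map(w => (w, id)) }
  .groupByKey()

index.collect().foreach(println)  // e.g. (spark, Seq(doc1, doc2))
```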

custom kryoserializer class under mesos

2014-04-21 Thread Soren Macbeth
Hello, Is it possible to use a custom class as my spark's KryoSerializer running under Mesos? I've tried adding my jar containing the class to my spark context (via SparkConf.addJars), but I always get: java.lang.ClassNotFoundException: flambo.kryo.FlamboKryoSerializer at java.net.URLCla
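
The likely issue is that the serializer class is instantiated when each executor starts, before jars added via SparkConf.addJars are shipped, so the class must already be on the executor classpath at launch (for example via SPARK_CLASSPATH, or by baking it into the assembly). A hedged sketch of the configuration side, using the class name from the poster's error:

```scala
import org.apache.spark.SparkConf

// The jar carrying this class has to be on every executor's classpath
// when the executor JVM starts, not just added to the context later.
val conf = new SparkConf()
  .set("spark.serializer", "flambo.kryo.FlamboKryoSerializer")
```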

Re: Long running time for GraphX pagerank in dataset com-Friendster

2014-04-21 Thread Qi Song
Thanks Ankurdave~ The cause was actually running out of memory. Bests~ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-running-time-for-GraphX-pagerank-in-dataset-com-Friendster-tp4511p4533.html

Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-21 Thread Sandy Ryza
Hi Christophe, Adding the jars to both SPARK_CLASSPATH and ADD_JARS is required. The former makes them available to the spark-shell driver process, and the latter tells Spark to make them available to the executor processes running on the cluster. -Sandy On Wed, Apr 16, 2014 at 9:27 AM, Christ

Re: PySpark still reading only text?

2014-04-21 Thread Nick Pentreath
Also see: https://github.com/apache/spark/pull/455 This will add support for reading SequenceFiles and other InputFormats in PySpark, as long as the Writables are either simple (primitives, maps and arrays of the same) or reasonably simple Java objects. I'm about to push a change from MsgPack to

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Sung Hwan Chung
Thanks for the answer Marcelo, The goal is to keep an intermediate value per row in memory, which would allow faster subsequent computations. I.e., computeSomething would depend on the previous value from the previous computation. If we replicate the entire row (other than the last changed elemen

stdout in workers

2014-04-21 Thread Jim Carroll
I'm experimenting with a few things, trying to understand how it's working. I took the JavaSparkPi example as a starting point and added a few System.out lines. I added a System.out to the main body of the driver program (not inside any Functions). I added another to the mapper. I added another

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Marcelo Vanzin
Hi Sung, On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung wrote: > The goal is to keep an intermediate value per row in memory, which would > allow faster subsequent computations. I.e., computeSomething would depend on > the previous value from the previous computation. I think the fundamental

Spark is slow

2014-04-21 Thread Joe L
It is claimed that Spark is 10x or 100x faster than MapReduce and Hive, but since I started using it I haven't seen any faster performance. It is taking 2 minutes to run map and join tasks over just 2 GB of data, whereas Hive was taking just a few seconds to join 2 tables over the same data. And,

spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Hi, all I'm writing a Spark application to load S3 data into HDFS. The HDFS version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0. After I execute val allfiles = sc.textFile("s3n://abc/*.txt") val output = allfiles.saveAsTextFile("hdfs://x.x.x.x:9000/dataset") Spark throws an exception

checkpointing without streaming?

2014-04-21 Thread Diana Carroll
I'm trying to understand when I would want to checkpoint an RDD rather than just persist to disk. Every reference I can find to checkpointing relates to Spark Streaming. But the method is defined in the core Spark library, not Streaming. Does it exist solely for streaming, or are there circumstance

Re: checkpointing without streaming?

2014-04-21 Thread Xiangrui Meng
Checkpoint clears dependencies. You might need checkpoint to cut a long lineage in iterative algorithms. -Xiangrui On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll wrote: > I'm trying to understand when I would want to checkpoint an RDD rather than > just persist to disk. > > Every reference I can
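
A sketch of the pattern Xiangrui describes, with a hypothetical checkpoint directory and a toy iterative computation: checkpointing every few iterations keeps the lineage from growing without bound:

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical location

var rdd = sc.parallelize(1 to 1000000)
for (i <- 1 to 100) {
  rdd = rdd.map(_ + 1).persist()
  if (i % 10 == 0) {
    rdd.checkpoint()  // marks the RDD; the lineage is actually cut...
    rdd.count()       // ...when an action materializes it
  }
}
```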

Re: Spark is slow

2014-04-21 Thread Marcelo Vanzin
Hi Joe, On Mon, Apr 21, 2014 at 11:23 AM, Joe L wrote: > And, I haven't gotten any answers to my questions. One thing that might explain that is that, at least for me, all (and I mean *all*) of your messages are ending up in my GMail spam folder, complaining that GMail can't verify that it real

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Sung Hwan Chung
I would probably agree that it's typically not a good idea to add state to distributed systems. Additionally, from a purist's perspective, this would be a bit of a hack on the paradigm. However, from a practical point of view, I think that it's a reasonable trade-off between efficiency and compl

Re: Spark is slow

2014-04-21 Thread John Meagher
Yahoo made some changes that drive mailing list posts into spam folders: http://www.virusbtn.com/blog/2014/04_15.xml On Mon, Apr 21, 2014 at 2:50 PM, Marcelo Vanzin wrote: > Hi Joe, > > On Mon, Apr 21, 2014 at 11:23 AM, Joe L wrote: >> And, I haven't gotten any answers to my questions. > > One

Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
I'm trying to get my feet wet with Spark. I've done some simple stuff in the shell in standalone mode, and now I'm trying to connect to HDFS resources, but I'm running into a problem. I synced to git's master branch (c399baa - "SPARK-1456 Remove view bounds on Ordered in favor of a context bou

Re: checkpointing without streaming?

2014-04-21 Thread Diana Carroll
When might that be necessary or useful? Presumably I can persist and replicate my RDD to avoid re-computation, if that's my goal. What advantage does checkpointing provide over disk persistence with replication? On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng wrote: > Checkpoint clears depend

Re: Spark is slow

2014-04-21 Thread Nicholas Chammas
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my Spam folder. :( With regards to your questions, I would suggest in general adding some more technical detail to them. It will be difficult for people to give you suggestions if all they are told is "Spark is slow". How does yo

Re: Spark is slow

2014-04-21 Thread Sam Bessalah
Why don't you start by explaining what kind of operation you're running on Spark that's slower than Hadoop MapReduce? Maybe we could start there. And yes, this mailing list is very busy; since many people are getting into Spark, it's hard to answer everyone. On 21 Apr 2014 20:23, "Joe L" wrote: > It is claime

Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Nicholas Chammas
I'm looking to start experimenting with Spark Streaming, and I'd like to use Amazon Kinesis as my data source. Looking at the list of supported Spark Streaming sources, I don't see any me

Re: Spark-ec2 asks for password

2014-04-21 Thread Mayur Rustagi
Hi, we have a deployment tool for GCE that we use internally for Spark. Let me know if you want access to it. It's not really clean enough to open-source, though :). Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: BFS implemented

2014-04-21 Thread Mayur Rustagi
It would be good if you could contribute this as an example; BFS is a common enough algo. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Sat, Apr 19, 2014 at 4:16 AM, Ghufran Malik wrote: > Ahh nvm I found the solution :) >

Re: Using google cloud storage for spark big data

2014-04-21 Thread Mayur Rustagi
Okay just commented on another thread :) I have one that I use internally. Can give it out but will need some support from you to fix bugs etc. Let me know if you are interested. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: checkpointing without streaming?

2014-04-21 Thread Xiangrui Meng
Persist doesn't cut the lineage. You might run into a stack overflow with a long lineage. See https://spark-project.atlassian.net/browse/SPARK-1006 for an example. On Mon, Apr 21, 2014 at 12:11 PM, Diana Carroll wrote: > When might that be necessary or useful? Presumably I can persist and > replic

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Matei Zaharia
There was a patch posted a few weeks ago (https://github.com/apache/spark/pull/223), but it needs a few changes in packaging because it uses a license that isn’t fully compatible with Apache. I’d like to get this merged when the changes are made though — it would be a good input source to suppo

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-21 Thread Tathagata Das
Are you by any chance starting two StreamingContexts in the same JVM? That could explain a lot of the weird mixing of data that you are seeing. It's not a supported usage scenario to start multiple StreamingContexts simultaneously in the same JVM. TD On Thu, Apr 17, 2014 at 10:58 PM, gaganbm wro

[ann] Spark-NYC Meetup

2014-04-21 Thread François Le Lay
Hi everyone, This is a quick email to announce the creation of a Spark-NYC Meetup. We have 2 upcoming events, one at PlaceIQ, another at Spotify where Reynold Xin (Databricks) and Christopher Johnson (Spotify) have talks scheduled. More info : http://www.meetup.com/Spark-NYC/ -- François Le Lay

Re: question about the SocketReceiver

2014-04-21 Thread Tathagata Das
As long as the socket server sends data through the same connection, the existing code is going to work. The socket.getInputStream returns an input stream which will continuously allow you to pull data sent over the connection. The bytesToObject function continuously reads data from the input stream

RE: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
I figured it out - I should be using textFile(...), not hadoopFile(...). And my HDFS URL should include the host: hdfs://host/user/kwilliams/corTable2/part-m-0 I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoo

Re: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Marcelo Vanzin
Hi Ken, On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken wrote: > I haven't figured out how to let the hostname default to the host mentioned > in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, > but that's not so important. Try adding "/etc/hadoop/conf" to SPARK_CLASS
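
If that works, the Hadoop configuration (fs.default.name / fs.defaultFS from core-site.xml) is picked up and the host can be omitted from the URL. A sketch with a hypothetical path:

```scala
// With /etc/hadoop/conf on the classpath, the namenode host comes from
// the Hadoop config, so a host-less URL resolves correctly:
val parts = sc.textFile("hdfs:///user/kwilliams/corTable2/")
```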

Re: checkpointing without streaming?

2014-04-21 Thread Tathagata Das
Diana, that is a good question. When you persist an RDD, the system still remembers the whole lineage of parent RDDs that created that RDD. If one of the executors fails, and the persisted data is lost (both local disk and memory data will get lost), then the lineage is used to recreate the RDD. The

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Nicholas Chammas
Thanks for the link! I'll follow the discussion there. On Mon, Apr 21, 2014 at 4:10 PM, Matei Zaharia wrote: > There was a patch posted a few weeks ago ( > https://github.com/apache/spark/pull/223), but it needs a few changes in > packaging because it uses a license that isn’t fully compatible w

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Parviz Deyhim
It is possible, Nick. Please take a look here: https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 The source code is here as a pull request: https://github.com/apache/spark/pull/223 Let me know if you have any questions. On Mon, Apr 21, 2014 at 1:00 PM, Nicholas Chammas < nichola

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Parviz Deyhim
sorry Matei. Will definitely start working on making the changes soon :) On Mon, Apr 21, 2014 at 1:10 PM, Matei Zaharia wrote: > There was a patch posted a few weeks ago ( > https://github.com/apache/spark/pull/223), but it needs a few changes in > packaging because it uses a license that isn’t

RE: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
> -Original Message- > From: Marcelo Vanzin [mailto:van...@cloudera.com] > Hi Ken, > > On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken > wrote: > > I haven't figured out how to let the hostname default to the host > mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop > command-l

ERROR TaskSchedulerImpl: Lost an executor

2014-04-21 Thread jaeholee
Hi, I am trying to set up my own standalone Spark, and I started the master node and worker nodes. Then I ran ./bin/spark-shell, and I get this message: 14/04/21 16:31:51 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): remote Akka client disassociated 14/04/21 16:31:51 ERROR TaskSch

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Matei Zaharia
No worries, looking forward to it! Matei On Apr 21, 2014, at 1:59 PM, Parviz Deyhim wrote: > sorry Matei. Will definitely start working on making the changes soon :) > > > On Mon, Apr 21, 2014 at 1:10 PM, Matei Zaharia > wrote: > There was a patch posted a few weeks ago > (https://github.c

Re: Hung inserts?

2014-04-21 Thread Brad Heller
I tried removing the CLUSTERED directive and get the same results :( I also removed SORTED, same deal. I'm going to try removing partitioning altogether for now. On Mon, Apr 21, 2014 at 4:58 AM, Mayur Rustagi wrote: > Clustering is not supported. Can you remove that & give it a go. > > Mayur

Re: Hung inserts?

2014-04-21 Thread Brad Heller
So after a little more investigation, it turns out this issue happens specifically when I interact with the Shark server. If I log in to the master and start a shark session (./bin/shark), everything works as expected. I'm starting the Shark server with the following upstart script; am I doing something wr

Re: [ann] Spark-NYC Meetup

2014-04-21 Thread Sam Bessalah
Sounds great François. On 21 Apr 2014 22:31, "François Le Lay" wrote: > Hi everyone, > > This is a quick email to announce the creation of a Spark-NYC Meetup. > We have 2 upcoming events, one at PlaceIQ, another at Spotify where > Reynold Xin (Databricks) and Christopher Johnson (Spotify) have t

Re: spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Parviz Deyhim
I ran into the same issue. The problem seems to be with the jets3t library that Spark uses in project/SparkBuild.scala. Change this: "net.java.dev.jets3t" % "jets3t" % "0.7.1" to: "net.java.dev.jets3t" % "jets3t" % "0.9.0". "0.7.1" is not the right version of jets3t

Re: spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Yes, I fixed it in the same way, but didn't get a chance to get back here. I also made a PR: https://github.com/apache/spark/pull/468 Best, -- Nan Zhu On Monday, April 21, 2014 at 8:19 PM, Parviz Deyhim wrote: > I ran into the same issue. The problem seems to be with the jets3t library

Adding to an RDD

2014-04-21 Thread Ian Ferreira
Feels like a silly question, but what if I wanted to apply a map to each element in an RDD, and instead of replacing each element, add new columns derived from the manipulated value? I.e. res0: Array[String] = Array(1 2, 1 3, 1 4, 2 1, 3 1, 4 1) becomes res0: Array[String] = Array(1 2 2 4, 1 3 1 6,

Re: Spark recovery from bad nodes

2014-04-21 Thread rama0120
Hi, I couldn't find any details regarding this recovery mechanism - could someone please shed some light on this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-recovery-from-bad-nodes-tp4505p4576.html

Re: Spark is slow

2014-04-21 Thread Joe L
g1 = pairs1.groupByKey().count() pairs1 = pairs1.groupByKey(g1).cache() g2 = triples.groupByKey().count() pairs2 = pairs2.groupByKey(g2) pairs = pairs2.join(pairs1) Hi, I want to implement hash-partitioned joining as shown above. But somehow, it is taking so long to perform. As I understand,
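
For what it's worth, the usual way to get a hash-partitioned join in Scala is to co-partition both RDDs with the same partitioner before joining (a sketch; the stand-in data and the partition count are illustrative):

```scala
import org.apache.spark.HashPartitioner

// Stand-ins for the poster's pairs1/pairs2
val pairs1 = sc.parallelize(Seq((1, "a"), (2, "b")))
val pairs2 = sc.parallelize(Seq((1, "x"), (2, "y")))

val p = new HashPartitioner(50)
val left  = pairs1.partitionBy(p).cache()  // shuffled once, then reused
val right = pairs2.partitionBy(p)
val joined = left.join(right)  // co-partitioned: no extra shuffle of left
```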

two calls of saveAsTextFile() have different results on the same RDD

2014-04-21 Thread randylu
I just call saveAsTextFile() twice. 'doc_topic_dist' is of type RDD[(Long, Array[Int])]; each element is a pair of (doc, topic_arr). For the same doc, the two files contain different topic_arr values. ... doc_topic_dist.coalesce(1, true).saveAsTextFile(save_path) doc_topic_dist.coalesce

how to solve this problem?

2014-04-21 Thread gogototo
14/04/22 10:43:45 WARN scheduler.TaskSetManager: Loss was due to java.util.NoSuchElementException java.util.NoSuchElementException: End of stream at org.apache.spark.util.NextIterator.next(NextIterator.scala:83) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.sc

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-21 Thread randylu
It's OK when I call doc_topic_dist.cache() first. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
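
That makes sense if the lineage contains nondeterminism (random topic assignment, for instance): without cache(), each saveAsTextFile recomputes the RDD from scratch and can draw different values. A sketch of the fixed pattern, with a stand-in RDD and hypothetical output paths:

```scala
import scala.util.Random

// Stand-in for the poster's RDD: the map is nondeterministic, so every
// recomputation yields different values unless the RDD is cached first.
val doc_topic_dist = sc.parallelize(1L to 100L)
  .map(doc => (doc, Array.fill(3)(Random.nextInt(10))))

doc_topic_dist.cache()
doc_topic_dist.count()  // materialize once

doc_topic_dist.coalesce(1, true).saveAsTextFile("out/run1")  // hypothetical paths
doc_topic_dist.coalesce(1, true).saveAsTextFile("out/run2")
```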

Re: Adding to an RDD

2014-04-21 Thread Mark Hamstra
As long as the function that you are mapping over the RDD is pure, preserving referential transparency so that anytime you map the same function over the same initial RDD elements you get the same result elements, then there is no problem in doing what you suggest. In fact, it's common practice.
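
A sketch of that practice against the poster's example input (the derived columns here are made up, since the original message truncates before defining them):

```scala
val rows = sc.parallelize(Seq("1 2", "1 3", "1 4", "2 1", "3 1", "4 1"))

// Append new columns derived from the existing ones (here: product and sum)
val extended = rows.map { s =>
  val Array(a, b) = s.split(" ").map(_.toInt)
  s"$a $b ${a * b} ${a + b}"
}
extended.collect()  // e.g. Array("1 2 2 3", "1 3 3 4", ...)
```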

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-21 Thread gaganbm
Yes. I am running this in local mode, and the SSCs run in the same JVM. So if I deploy this on a cluster, would such behavior be gone? Also, is there any way I can start the SSCs on a local machine but in different JVMs? I couldn't find anything about this in the documentation. The inter-minglin

Need clarification of joining streams

2014-04-21 Thread gaganbm
I wanted some clarification on the behavior of joining streams. As I understand it, the join works per batch. I am reading data from two Kafka streams and then joining them based on some keys. But what happens if one stream hasn't produced any data in that batch duration and the other has some? Or let's s
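
For reference, per-batch stream joins behave like RDD joins within each batch: under an inner join, a key that appears in only one stream during a batch produces no output for that batch. A sketch (stream1 and stream2 stand for the poster's Kafka-backed DStreams of key-value pairs; they are not defined here):

```scala
// Inner join: emits only keys present in BOTH streams during a batch
val joined = stream1.join(stream2)

// cogroup keeps every key seen in EITHER stream, with a possibly-empty
// side, which is one way to notice that a source was silent this batch
val grouped = stream1.cogroup(stream2)
```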

Re: Spark recovery from bad nodes

2014-04-21 Thread Praveen R
Please check my comment on the shark-users thread. On Tue, Apr 22, 2014 at 8:06 AM, rama0120 wrote: > Hi, > > I couldn't find any details regarding th

Re: how to solve this problem?

2014-04-21 Thread Akhil Das
Hi, Would you mind sharing the piece of code that caused this exception? As per the Javadoc, a NoSuchElementException is thrown if you call the nextElement() method of an Enumeration that has no more elements. Thanks Best Regards. On Tue, Apr 22, 2014 at 8:50 AM, gogototo wrote: > 14/04

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-21 Thread Praveen R
Do you have the cluster deployed on AWS? Could you try checking whether port 7077 is accessible from the worker nodes? On Tue, Apr 22, 2014 at 2:56 AM, jaeholee wrote: > Hi, I am trying to set up my own standalone Spark, and I started the master > node and worker nodes. Then I ran ./bin/spark-shell, and I get t