Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp: ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Frank Austin Nothaft
Tomer, To use distcp, you need to have a Hadoop compute cluster up. start-dfs just restarts HDFS. I don’t have a Spark 1.0.2 cluster up right now, but there should be a start-mapred*.sh or start-all.sh script that will launch the Hadoop MapReduce cluster that you will need for distcp. Regards,

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-08 Thread Sean Owen
That should be OK, since the iterator is definitely consumed, and therefore the connection actually done with, at the end of a 'foreach' method. You might put the close in a finally block. On Mon, Sep 8, 2014 at 12:29 AM, Soumitra Kumar wrote: > I have the following code: > > stream foreachRDD {
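A minimal sketch of the per-partition connection pattern being discussed (the JDBC URL, table, and record handling are placeholders, not details from the thread):

    import java.sql.DriverManager

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { iterator =>
        // one connection per partition, opened on the worker that processes it
        val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
        try {
          val stmt = conn.prepareStatement("INSERT INTO events (value) VALUES (?)")
          iterator.foreach { record =>
            stmt.setString(1, record.toString)
            stmt.executeUpdate()
          }
        } finally {
          conn.close()  // close in a finally block, as suggested above
        }
      }
    }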

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-08 Thread Tobias Pfeiffer
Hi, On Mon, Sep 8, 2014 at 4:39 PM, Sean Owen wrote: > > > if (rdd.take (1).size == 1) { > > rdd foreachPartition { iterator => > I was wondering: Since take() is an output operation, isn't it computed twice (once for the take(1), once during the iteration)? O

sharing off_heap rdds

2014-09-08 Thread Manku Timma
I see that the tachyon url constructed for an rdd partition has executor id in it. So if the same partition is being processed by a different executor on a reexecution of the same computation, it cannot really use the earlier result. Is this a correct assessment? Will removing the executor id from

How to profile a spark application

2014-09-08 Thread rapelly kartheek
Hi, Can someone tell me how to profile a spark application. -Karthik

Re: Spark SQL check if query is completed (pyspark)

2014-09-08 Thread jamborta
thank you for the replies. I am running an insert on a join (INSERT OVERWRITE TABLE new_table select * from table1 as a join table2 as b on (a.key = b.key), The process does not have the right permission to write to that folder, so I get the following error printed: chgrp: `/user/x/y': No such f

Re: How to profile a spark application

2014-09-08 Thread Ted Yu
See https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Sep 8, 2014, at 2:48 AM, rapelly kartheek wrote: > Hi, > > Can someone tell me how to profile a spark application. > > -Karthik

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-08 Thread Ognen Duzlevski
Solved. The problem is the following: the underlying Akka driver uses the INTERNAL interface address on the Amazon instance (the ones that start with 10.x.y.z) to present itself to the master, it does not use the external (public) IP! Ognen On 9/7/2014 3:21 PM, Sean Owen wrote: Also keep i
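One possible workaround (an assumption on my part, not something stated in the thread) is to tell the driver which address to advertise via spark.driver.host:

    import org.apache.spark.SparkConf

    // force the driver to advertise a reachable address instead of the internal 10.x.y.z one
    val conf = new SparkConf()
      .setMaster("spark://<master-public-dns>:7077")          // placeholder
      .set("spark.driver.host", "<driver-public-dns>")        // placeholder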

Re: Standalone spark cluster. Can't submit job programmatically -> java.io.InvalidClassException

2014-09-08 Thread DrKhu
After wasting a lot of time, I've found the problem. Even though I haven't used hadoop/hdfs in my application, the hadoop client matters. The problem was the hadoop-client version: it was different from the version of Hadoop that Spark was built for. Spark's Hadoop version is 1.2.1, but in my application that was
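A sketch of the corresponding dependency fix, assuming an sbt build; the versions are the ones mentioned in the post:

    // build.sbt: make hadoop-client match the Hadoop version Spark was built against
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.0.2",
      "org.apache.hadoop" %  "hadoop-client" % "1.2.1"  // must match Spark's Hadoop version
    )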

spark application in cluster mode doesn't run correctly

2014-09-08 Thread 남윤민
Hello, I tried to execute a simple Spark application using sparkSQL. On the first try it worked as I expected, but since then it doesn't run and shows stderr like below: Spark Executor Command: "java" "-cp" "::/opt/spark-1.0.2-bin-hadoop2/conf:/opt/spark-1.0.2-bin-hadoop2/lib/spark-assembly-1.0

Error while running sparkSQL application in the cluster-mode environment

2014-09-08 Thread 남윤민
Hello, I tried to execute a simple Spark application using sparkSQL. On the first try it worked as I expected, but since then it doesn't run and shows stderr like below: Spark Executor Command: "java" "-cp" "::/opt/spark-1.0.2-bin-hadoop2/conf:/opt/spark-1.0.2-bin-hadoop2/lib/spark-asse

How to scale large kafka topic

2014-09-08 Thread richiesgr
Hi, I'm building an application that reads from a Kafka event stream. In production we have 5 consumers that share 10 partitions. But with Spark Streaming's Kafka receiver, the master acts as a consumer and then distributes the tasks to workers, so I can have only 1 master acting as a consumer, but I need more because only 1 co
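The usual pattern for this (a sketch with placeholder ZooKeeper, group, and topic values) is to create several receivers and union them so more than one consumer pulls from the topic's partitions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // one receiver per consumer; 5 receivers share the 10 topic partitions
    val streams = (1 to 5).map { _ =>
      KafkaUtils.createStream(ssc, "zkQuorum:2181", "myConsumerGroup", Map("myTopic" -> 2))
    }
    val unified = ssc.union(streams).repartition(10)  // spread work before heavy processing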

clarification for some spark on yarn configuration options

2014-09-08 Thread Greg Hill
Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the workers per slave node? Is spark.executor.instances an actual config option? I found that in a commit, but it's not in the docs. What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.m

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Durin, I have integrated ECOS with Spark, which uses SuiteSparse under the hood for linear equation solves... I have exposed only the QP solver API in Spark since I was comparing IP with proximal algorithms, but we can expose the SuiteSparse API as well... JNI is used to load up the ldl, amd and ecos librarie

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Xiangrui, Should I open up a JIRA for this ? Distributed lp/socp solver through ecos/ldl/amd ? I can open source it with gpl license in spark code as that's what our legal cleared (apache + gpl becomes gpl) and figure out the right way to call it...ecos is gpl but we can definitely use the jni v

Cannot run SimpleApp as regular Java app

2014-09-08 Thread ericacm
Dear all: I am a brand new Spark user trying out the SimpleApp from the Quick Start page. Here is the code: object SimpleApp { def main(args: Array[String]) { val logFile = "/dev/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system val conf = new SparkConf()
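For reference, a self-contained version of the Quick Start SimpleApp (the log file path is the one from the post; the rest follows the standard example):

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "/dev/spark-1.0.2-bin-hadoop2/README.md" // should be some file on your system
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }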

Spark SQL on Cassandra

2014-09-08 Thread gtinside
Hi , I am reading data from Cassandra through datastax spark-cassandra connector converting it into JSON and then running spark-sql on it. Refer to the code snippet below : step 1 > val o_rdd = sc.cassandraTable[CassandraRDDWrapper]( '', '') step 2 > val tempObjectRDD = sc.parallelize(o_r

A problem for running MLLIB in amazon clound

2014-09-08 Thread Hui Li
I am running a very simple example using SVMWithSGD on Amazon EMR. I haven't gotten any result after one hour. My instance type is m3.large, the instance count is 3, and the dataset is the sample_svm_data file provided with MLlib. The number of iterations is 2 and all other options ar
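For comparison, the minimal MLlib run for this dataset looks roughly like the sketch below (the input path is a placeholder); if even this does not finish, the problem is more likely cluster setup than the algorithm:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val data = sc.textFile("s3n://my-bucket/sample_svm_data.txt")  // placeholder path
    val parsed = data.map { line =>
      val parts = line.split(' ').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }.cache()

    val model = SVMWithSGD.train(parsed, 2)  // 2 iterations, as in the post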

groupBy gives non deterministic results

2014-09-08 Thread redocpot
Hi, I have a key-value RDD called rdd below. After a groupBy, I tried to count rows. But the result is not unique, somehow non deterministic. Here is the test code: val step1 = ligneReceipt_cleTable.persist val step2 = step1.groupByKey val s1size = step1.count val s2size = step2.count

Re: How to profile a spark application

2014-09-08 Thread rapelly kartheek
Thank you Ted. regards Karthik On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu wrote: > See > https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit > > On Sep 8, 2014, at 2:48 AM, rapelly kartheek > wrote: > > Hi, > > Can someone tell me how to profile a spark app

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Nicholas Chammas
Tomer, Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too. Nick On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini wrote: > ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
Still no luck, even when running stop-all.sh followed by start-all.sh. On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas wrote: > Tomer, > > Did you try start-all.sh? It worked for me the last time I tried using > distcp, and it worked for this guy too. > > Nick > > > On Mon, Sep 8, 2014 at 3:28 A

Re: groupBy gives non deterministic results

2014-09-08 Thread redocpot
Update: Just tested with HashPartitioner(8) and counted each partition: List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658327), (7,658434)), List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657594), (6,658326), (7,658434)), List((0,65

How do you perform blocking IO in apache spark job?

2014-09-08 Thread DrKhu
What if, when I traverse an RDD, I need to calculate values in the dataset by calling an external (blocking) service? How do you think that could be achieved? val values: Future[RDD[Double]] = Future sequence tasks I've tried to create a list of Futures, but as RDD is not Traversable, Future.sequence is no

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread Jörn Franke
Hi, What does the external service provide? Data? Calculations? Can the service push data to you via Kafka and Spark Streaming? Can you fetch the necessary data beforehand from the service? The solution to your question depends on your answers. I would not recommend connecting to a blocking se

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
what did you see in the log? was there anything related to mapreduce? can you log into your hdfs (data) node, use jps to list all java process and confirm whether there is a tasktracker process (or nodemanager) running with datanode process -- Ye Xianjin Sent with Sparrow (http://www.sparrowma

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread DrKhu
Hi Jörn, first of all, thanks for your intent to help. This external service is a native component that is stateless and performs the calculation based on the data I provide. The data is in an RDD. I have that component on each worker node and I would like to get as much parallelism as

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread Sean Owen
What is the driver-side Future for? Are you trying to make the remote Spark workers execute more requests to your service concurrently? it's not clear from your messages whether it's something like a web service, or just local native code. So the time spent in your processing -- whatever returns D
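A common alternative to driver-side Futures (an assumption on my part, not something stated in the thread) is to make the blocking call inside the tasks themselves and get concurrency from partitions rather than threads; callNativeService and the sizing values below are placeholders:

    // run the blocking/native call on the executors, one invocation per record;
    // parallelism comes from the number of partitions, not from Futures on the driver
    val results: org.apache.spark.rdd.RDD[Double] =
      inputRdd
        .repartition(numWorkers * coresPerWorker)            // placeholder sizing
        .mapPartitions { records =>
          records.map(record => callNativeService(record))   // hypothetical blocking JNI call
        }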

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
No tasktracker or nodemanager. This is what I see: On the master: org.apache.hadoop.yarn.server.resourcemanager.ResourceManager org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode org.apache.hadoop.hdfs.server.namenode.NameNode On the data node (slave): org.apache.hadoop.hdfs.server.datano

If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Dimension Data, LLC.
Hello friends: It was mentioned in another (Y.A.R.N.-centric) email thread that 'SPARK_JAR' was deprecated, and to use the 'spark.yarn.jar' property instead for YARN submission. For example: user$ pyspark [some-options] --driver-java-options spark.yarn.jar=hdfs://namenode:8020/path/to/spa

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC. wrote: >user$ pyspark [some-options] --driver-java-options > spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar This command line does not look correct. "spark.yarn.jar" is not a JVM command line option. You most probably need

Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Hemanth Yamijala
Hi, I am using Spark 0.8.1 with Kafka 0.7. I am trying to set the parameter fetch.message.max.bytes when creating the Kafka DStream. The only API that seems to allow this is the following: kafkaStream[T, D <: kafka.serializer.Decoder[_]](typeClass: Class[T], decoderClass: Class[D], kafkaParams: M
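For reference, in newer Spark versions with the Kafka 0.8 connector the same setting is passed through a kafkaParams map (sketch with placeholder values); as the reply further down notes, this route is not available for Kafka 0.7:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect"       -> "zkhost:2181",                    // placeholder
      "group.id"                -> "myGroup",                        // placeholder
      "fetch.message.max.bytes" -> (10 * 1024 * 1024).toString       // raise the max message size
    )

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("myTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)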

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread DrKhu
Thanks, Sean, I'll try to explain what I'm trying to do. The native component that I'm talking about is native code that I call using JNI. I've written a small test. Here, I traverse the collection to call the native component N (1000) times. Then I have a result; it means that I'm

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-09-08 Thread Matt Narrell
I came across this: https://github.com/xerial/sbt-pack Until i found this, I was simply using the sbt-assembly plugin (sbt clean assembly) mn On Sep 4, 2014, at 2:46 PM, Aris wrote: > Thanks for answering Daniil - > > I have SBT version 0.13.5, is that an old version? Seems pretty up-to-da

Spark-submit ClassNotFoundException with JAR!

2014-09-08 Thread Peter Aberline
Hi, I'm having problems with a ClassNotFoundException using this simple example: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import java.net.URLClassLoader import scala.util.Marshal class ClassToRoundTrip(val id: Int) extends s

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC. < subscripti...@didata.us> wrote: > user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads. > user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar=' > /usr/lib/spark/assembly/lib/spark-assembly-*.jar' > > My questi

RE: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Penny Espinoza
I don't understand what you mean. Can you be more specific? From: Victor Tso-Guillen Sent: Saturday, September 06, 2014 5:13 PM To: Penny Espinoza Cc: Spark Subject: Re: prepending jars to the driver class path for spark-submit on YARN I ran into the same issue

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread Jörn Franke
Hi, So the external service itself creates threads and blocks until they finish execution? In this case you should not do threading but include it via JNI directly in Spark - it will take care of threading for you. Best regards Hi Jörn, first of all, thanks for your intent to help. This one

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
well, this means you didn't start a compute cluster. Most likely because the wrong value of mapreduce.jobtracker.address cause the slave node cannot start the node manager. ( I am not familiar with the ec2 script, so I don't know whether the slave node has node manager installed or not.) Can yo

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
When you submit the job to yarn with spark-submit, set --conf spark.yarn.user.classpath.first=true . On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza wrote: > I don't understand what you mean. Can you be more specific? > > > > From: Victor Tso-Guillen > Sent: Sat

Re: A problem for running MLLIB in amazon clound

2014-09-08 Thread Xiangrui Meng
Could you attach the driver log? -Xiangrui On Mon, Sep 8, 2014 at 7:23 AM, Hui Li wrote: > I am running a very simple example using the SVMWithSGD on Amazon EMR. I > haven't got any result after one hour long. > > My instance-type is: m3.large > instance-count is: 3 > Dataset is the data pr

RE: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Penny Espinoza
I have tried using the spark.files.userClassPathFirst option (which, incidentally, is documented now, but marked as experimental), but it just causes different errors. I am using spark-streaming-kafka. If I mark spark-core and spark-streaming as provided and also exclude them from the spark-s

Re: Crawler and Scraper with different priorities

2014-09-08 Thread Daniil Osipov
Depending on what you want to do with the result of the scraping, Spark may not be the best framework for your use case. Take a look at a general Akka application. On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh wrote: > Hi all, > > I am Implementing a Crawler, Scraper. The It should be able to p

Input Field in Spark 1.1 Web UI

2014-09-08 Thread Arun Ahuja
Is there more information on what the "Input" column on the Spark UI means? How is this computed? I am processing a fairly small (but zipped) file and see the value as [image: Inline image 1] This does not seem correct? Thanks, Arun

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
I asked Tim whether he would change the license of SuiteSparse to an Apache-friendly license couple months ago, but the answer was no. So I don't think we can use SuiteSparse in MLlib through JNI. Please feel free to create JIRAs for distributed linear programming and SOCP solvers and run the discu

RE: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Penny Espinoza
Victor - Not sure what you mean. Can you provide more detail about what you did? From: Victor Tso-Guillen Sent: Saturday, September 06, 2014 5:13 PM To: Penny Espinoza Cc: Spark Subject: Re: prepending jars to the driver class path for spark-submit on YARN I r

saveAsHadoopFile into avro format

2014-09-08 Thread Dariusz Kobylarz
What is the right way of saving a PairRDD into Avro output format? GraphArray extends SpecificRecord etc. I have the following Java RDD: JavaPairRDD pairRDD = ... and want to save it to Avro format: org.apache.hadoop.mapred.JobConf jc = new org.apache.hadoop.mapred.JobConf(); org.apache.avro.m
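One way to do this with the old mapred Avro API is sketched below in Scala (the output path is a placeholder; GraphArray is assumed to be the Avro-generated SpecificRecord class from the post, with its getClassSchema() accessor):

    import org.apache.avro.mapred.{AvroJob, AvroOutputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.JobConf

    val jobConf = new JobConf(sc.hadoopConfiguration)
    AvroJob.setOutputSchema(jobConf, GraphArray.getClassSchema())

    // assumes an RDD[(K, GraphArray)]; drop the key and wrap each record for Avro output
    pairRDD
      .map { case (_, record) => (new AvroWrapper(record), NullWritable.get()) }
      .saveAsHadoopFile(
        "/path/to/output",                        // placeholder
        classOf[AvroWrapper[GraphArray]],
        classOf[NullWritable],
        classOf[AvroOutputFormat[GraphArray]],
        jobConf)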

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC. < subscripti...@didata.us> wrote: > So just to clarify for me: When specifying 'spark.yarn.jar' as I did > above, even if I don't use HDFS to create a > RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is > still necessary

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Yup...this can be a spark community project...I saw a PR for that...interested users fine with lgpl/gpl code can make use of it... On Mon, Sep 8, 2014 at 12:37 PM, Xiangrui Meng wrote: > I asked Tim whether he would change the license of SuiteSparse to an > Apache-friendly license couple months

Records - Input Byte

2014-09-08 Thread danilopds
Hi, I was reading the paper of Spark Streaming: "Discretized Streams: Fault-Tolerant Streaming Computation at Scale" So, I read that performance evaluation used 100-byte input records in test Grep and WordCount. I don't have much experience and I'd like to know how can I control this value in my

Recommendations for performance

2014-09-08 Thread Manu Mukerji
Hi, Let me start with: I am new to Spark (be gentle). I have a large data set in Parquet (~1.5B rows, 900 columns). Currently Impala takes ~1-2 seconds for the queries while SparkSQL is taking ~30 seconds. Here is what I am currently doing: I launch with SPARK_MEM=6g spark-shell val sqlContex
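One thing worth trying (a sketch, assuming Spark SQL 1.0/1.1 and a placeholder path) is registering the Parquet file as a table and caching it in memory before querying:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val parquetData = sqlContext.parquetFile("hdfs:///path/to/data.parquet")  // placeholder
    parquetData.registerTempTable("events")   // registerAsTable on Spark 1.0.x
    sqlContext.cacheTable("events")           // columnar in-memory cache

    sqlContext.sql("SELECT count(*) FROM events").collect()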

Re: Spark SQL check if query is completed (pyspark)

2014-09-08 Thread Michael Armbrust
You are probably not getting an error because the exception is happening inside of Hive. I'd still consider this a bug if you'd like to open a JIRA. On Mon, Sep 8, 2014 at 3:02 AM, jamborta wrote: > thank you for the replies. > > I am running an insert on a join (INSERT OVERWRITE TABLE new_tabl

Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Jim Carroll
Hello all, I've been wrestling with this problem all day and any suggestions would be greatly appreciated. I'm trying to test reading a parquet file that's stored in s3 using a spark cluster deployed on ec2. The following works in the spark shell when run completely locally on my own machine (i.e

Re: Spark SQL on Cassandra

2014-09-08 Thread Michael Armbrust
I believe DataStax is working on better integration here, but until that is ready you can use the applySchema API. Basically you will convert the CassandraTable into an RDD of Row objects using a .map() and then you can call applySchema (provided by SQLContext) to get a SchemaRDD. More details w
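A rough sketch of that applySchema approach (column names and types are placeholders for whatever the Cassandra table holds):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)

    // convert each Cassandra row wrapper into a Spark SQL Row
    val rowRDD = o_rdd.map(c => Row(c.id, c.name))   // placeholder fields

    val schema = StructType(Seq(
      StructField("id",   IntegerType, nullable = false),
      StructField("name", StringType,  nullable = true)))

    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("my_table")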

Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Manu Mukerji
How big is the data set? Does it work when you copy it to hdfs? -Manu On Mon, Sep 8, 2014 at 2:58 PM, Jim Carroll wrote: > Hello all, > > I've been wrestling with this problem all day and any suggestions would be > greatly appreciated. > > I'm trying to test reading a parquet file that's store

RE: SchemaRDD - Parquet - "insertInto" makes many files

2014-09-08 Thread chutium
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks it will be great, if something like hive.exec.reducers.bytes.per.reducer could be implemented. one idea is, get total size of all target blocks, then set number of partitions -- View this message in cont

Re: Low Level Kafka Consumer for Spark

2014-09-08 Thread Tim Smith
Thanks TD. Someone already pointed out to me that repartition(...) isn't the right way. You have to val partedStream = repartition(...). Would be nice to have it fixed in the docs. On Fri, Sep 5, 2014 at 10:44 AM, Tathagata Das wrote: > Some thoughts on this thread to clarify the doubts.
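In other words, a small sketch (the stream name, partition count, and downstream processing are placeholders):

    // repartition returns a new DStream; the original stream is not modified in place
    val partedStream = kafkaStream.repartition(numPartitions)
    partedStream.map(parse).foreachRDD(rdd => process(rdd))  // use the repartitioned stream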

Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Ian O'Connell
Mmm, how many days' worth of data / how deep is your data nesting? I suspect you're running into a current issue with Parquet (a fix is in master but I don't believe it has been released yet). It reads all the metadata on the submitter node as part of scheduling the job. This can cause long start times (timeouts t

[Spark Streaming] java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-09-08 Thread Yan Fang
Hi guys, My Spark Streaming application has this "java.lang.OutOfMemoryError: GC overhead limit exceeded" error in the Spark Streaming driver program. I have done the following to debug it: 1. increased the driver memory from 1GB to 2GB; this error came after 22 hrs. When the memory was 1GB, it c

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC. < subscripti...@didata.us> wrote: > You're probably right about the above because, as seen *below* for > pyspark (but probably for other Spark > applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is > specified, the app invocat

Spark Web UI in Mesos mode

2014-09-08 Thread SK
Hi, I am running Spark 1.0.2 on a cluster in Mesos mode. I am not able to access the Spark master Web UI at port 8080 but am able to access it at port 5050. Is 5050 the standard port? Also, in the the standalone mode, there is a link to the Application detail UI directly from the master UI. I do

Re: Spark Web UI in Mesos mode

2014-09-08 Thread Wonha Ryu
Hi, Spark master web UI is only for "standalone" clusters, where cluster resources are managed by Spark, not other resource managers. Mesos master's default port is 5050. Within Mesos, a Spark application is considered as one of many frameworks, so there's no Spark-specific support like accessing

Executor address issue: "CANNOT FIND ADDRESS" (Spark 0.9.1)

2014-09-08 Thread Nicolas Mai
Hi, One of the executors in my Spark cluster shows a "CANNOT FIND ADDRESS" address for one of the stages, which failed. After that stage, I got cascading failures for all my stages :/ (stages that seem complete but still appear as active stages in the dashboard; incomplete or failed stages that ar

Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-08 Thread Steve Lewis
In a Hadoop jar there is a directory called lib, and all non-provided third-party jars go there and are included in the classpath of the code. Do jars for Spark have the same structure? Another way to ask the question: if I have code to execute Spark and a jar built for Hadoop, can I simply use

Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-08 Thread Tobias Pfeiffer
Hi, On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das wrote: > In the current state of Spark Streaming, creating separate Java processes > each having a streaming context is probably the best approach to > dynamically adding and removing of input sources. All of these should be > able to to use a Y

Spark streaming for synchronous API

2014-09-08 Thread Ron's Yahoo!
Hi, I’m trying to figure out how I can run Spark Streaming like an API. The goal is to have a synchronous REST API that runs the spark data flow on YARN. Has anyone done something like this? Can you share your architecture? To begin with, is it even possible to have Spark Streaming run as a

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Ron, On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! wrote: > > I’m trying to figure out how I can run Spark Streaming like an API. > The goal is to have a synchronous REST API that runs the spark data flow > on YARN. I guess I *may* develop something similar in the future. By "a synchronou

RE: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Shao, Saisai
Hi Hemanth, I think there is a bug in this API in Spark 0.8.1, so you will meet this exception when using Java code with this API, this bug is fixed in latest version, as you can see the patch (https://github.com/apache/spark/pull/1508). But it’s only for Kafka 0.8+, as you still use kafka 0.7,

Re: Executor address issue: "CANNOT FIND ADDRESS" (Spark 0.9.1)

2014-09-08 Thread Burak Yavuz
Hi Nicolas, It seems that you are starting to lose executors and then the job starts to fail. Can you please share more information about your application so that we can help you debug it, such as what you're trying to do, and your driver logs please? Best, Burak - Original Message - F

Re: Spark streaming for synchronous API

2014-09-08 Thread Ron's Yahoo!
Tobias, Let me explain a little more. I want to create a synchronous REST API that will process some data that is passed in as some request. I would imagine that the Spark Streaming Job on YARN is a long running job that waits on requests from something. What that something is is still not

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi, On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! wrote: > > I want to create a synchronous REST API that will process some data that > is passed in as some request. > I would imagine that the Spark Streaming Job on YARN is a long > running job that waits on requests from something. What that s

Re: How to profile a spark application

2014-09-08 Thread rapelly kartheek
hi Ted, Where do I find the licence keys that I need to copy to the licences directory. Thank you!! On Mon, Sep 8, 2014 at 8:25 PM, rapelly kartheek wrote: > Thank you Ted. > > regards > Karthik > > On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu wrote: > >> See >> https://cwiki.apache.org/confluence

Re: Spark streaming for synchronous API

2014-09-08 Thread Ron's Yahoo!
Hi Tobias, So I guess where I was coming from was the assumption that starting up a new job to be listening on a particular queue topic could be done asynchronously. For example, let’s say there’s a particular topic T1 in a Kafka queue. If I have a new set of requests coming from a particular

Re: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Hemanth Yamijala
Thanks, Shao, for providing the necessary information. Hemanth On Tue, Sep 9, 2014 at 8:21 AM, Shao, Saisai wrote: > Hi Hemanth, > > > > I think there is a bug in this API in Spark 0.8.1, so you will meet this > exception when using Java code with this API, this bug is fixed in latest > versio

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi, On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! wrote: > So I guess where I was coming from was the assumption that starting up a > new job to be listening on a particular queue topic could be done > asynchronously. > No, with the current state of Spark Streaming, all data sources and the pr

Re: Crawler and Scraper with different priorities

2014-09-08 Thread Sandeep Singh
Hi Daniil, I have to do some processing of the results, as well as pushing the data to the front end. Currently I'm using akka for this application, but I was thinking maybe spark streaming would be a better thing to do. as well as i can use mllib for processing the results. Any specific reason's

RE: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Shao, Saisai
As you mentioned in another mail that you hope to port the latest version of Spark onto Kafka 0.7, there are some notes you should take care of: 1. Kafka 0.7+ can only be compiled with Scala 2.8, while Spark is now compiled with Scala 2.10; there is no binary compatibility between these two Scala

Iterable of Strings

2014-09-08 Thread Deep Pradhan
Hi, I have "s" as an Iterable of String. I also have "arr" as an array of bytes. I want to set the 's' position of the array 'arr' to 1. In short, I want to do arr(s) = 1 // algorithmic notation I tried the above but I am getting type mismatch error How should I do this? Thank You

Re: Iterable of Strings

2014-09-08 Thread Sean Owen
These questions have been Scala questions, not Spark questions. It's better to look for answers on the internet or on discussion groups devoted to Scala. StackOverflow is good, for example. An array is indexed by integers, not strings. It's not even clear what you intend here. On Tue, Sep 9, 2014
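If the intent was to map each string to a slot in the byte array, one possible reading (a sketch; whether this matches the original goal is a guess) is to build an explicit index first:

    val s: Iterable[String] = Seq("a", "b", "c")
    val arr = new Array[Byte](s.size)

    // give each string a stable integer position, then flag those positions
    val index: Map[String, Int] = s.zipWithIndex.toMap
    s.foreach(str => arr(index(str)) = 1.toByte)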

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-08 Thread Sean Owen
This structure is not specific to Hadoop, but in theory works in any JAR file. You can put JARs in JARs and refer to them with Class-Path entries in META-INF/MANIFEST.MF. It works but I have found it can cause trouble with programs that query the JARs on the classpath to find other classes. When t

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-08 Thread Ankur Dave
At 2014-09-05 12:13:18 +0200, Yifan LI wrote: > But how to assign the storage level to a new vertices RDD that mapped from > an existing vertices RDD, > e.g. > *val newVertexRDD = > graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, > a:Array[VertexId]) => (id, initialHashMap(a))}*
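The reply is truncated above; the general pattern (a sketch, with the storage level chosen as an assumption and initialHashMap taken from the quoted snippet) is to persist the derived RDD before using it:

    import org.apache.spark.graphx.{EdgeDirection, VertexId}
    import org.apache.spark.storage.StorageLevel

    val newVertexRDD =
      graph.collectNeighborIds(EdgeDirection.Out)
        .map { case (id: VertexId, a: Array[VertexId]) => (id, initialHashMap(a)) }
        .persist(StorageLevel.MEMORY_AND_DISK_SER)  // pick a level that fits the memory budget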

Spark streaming: size of DStream

2014-09-08 Thread julyfire
I want to implement the following logic: val stream = getFlumeStream() // a DStream if(size_of_stream > 0) // if the DStream contains some RDD stream.someTransformation stream.count() can figure out the number of records in a DStream, but it returns a DStream[Long] and can't be compared with a number
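The usual workaround (a sketch; the transformation and output path are placeholders) is to move the check inside foreachRDD, where each batch's RDD can be tested directly:

    stream.foreachRDD { rdd =>
      // take(1) is cheap compared to count() and tells us whether this batch is empty
      if (rdd.take(1).nonEmpty) {
        rdd.map(someTransformation).saveAsTextFile("hdfs:///out/" + System.currentTimeMillis)
      }
    }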

Re: How to profile a spark application

2014-09-08 Thread julyfire
VisualVM is free and is enough in most situations -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-profile-a-spark-application-tp13684p13770.html Sent from the Apache Spark User List mailing list archive at Nabble.com.