Spark won't listen on that port, mate. It basically means you have a Flume source
running at a port on your localhost. And when you submit your
application in standalone mode, workers will consume data from that port.
Thanks
Best Regards
On Sat, Feb 21, 2015 at 9:22 AM, bit1...@163.com wrote:
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair
being the key) would work - it looks like a "mapReduce with combiners"
problem. I think reduceByKey would use combiners while aggregateByKey
wouldn't.
- Could we optimize this further by using combineByKey directly? (A rough
sketch follows below.)
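For concreteness, a rough Scala sketch of that idea (the input shape is assumed here,
not taken from the thread): reduceByKey with the pair as the key, and the same
aggregation written with combineByKey so the map-side combine is explicit.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object PairCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PairCounts"))
    // Assumed input shape: ((itemA, itemB), count), i.e. each pair is the key.
    val pairs = sc.parallelize(Seq((("a", "b"), 1L), (("a", "b"), 1L), (("b", "c"), 1L)))

    // reduceByKey combines map-side before the shuffle, like MapReduce combiners.
    val counts = pairs.reduceByKey(_ + _)

    // The same aggregation via combineByKey, with the pieces spelled out.
    val counts2 = pairs.combineByKey[Long](
      (v: Long) => v,                   // createCombiner
      (acc: Long, v: Long) => acc + v,  // mergeValue (within a partition)
      (a: Long, b: Long) => a + b       // mergeCombiners (across partitions)
    )

    counts.collect().foreach(println)
    counts2.collect().foreach(println)
    sc.stop()
  }
}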
How many executors do you have per machine? It would be helpful if you
could list all the configs.
Could you also try to run it without persist? Caching can hurt more than
help if you don't have enough memory.
On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman wrote:
> Thanks for the suggestions.
> I'm experim
I think the cheapest possible way to force materialization is something like
rdd.foreachPartition(i => None)
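A minimal self-contained sketch of that approach (names and data are placeholders):
cache the RDD, run a no-op action to force every partition to be evaluated, and
measure the elapsed time on the driver.

import org.apache.spark.{SparkConf, SparkContext}

object TimeMaterialization {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TimeMaterialization"))
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2).cache()

    val start = System.nanoTime()
    rdd.foreachPartition(_ => ())   // cheap action: evaluates (and caches) every partition
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(s"Materialization took $elapsedMs ms")
    sc.stop()
  }
}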
I get the use case, but as you can see there is a cost: you are forced
to materialize an RDD and cache it just to measure the computation
time. In principle this could be taking significan
I agree with your assessment as to why it *doesn't* just work. I don't
think a small batch duration helps, as all files it sees at the outset
are processed in one batch. Your timestamps are a user-space concept,
not a framework concept.
However, there ought to be a great deal of reusability between
Hi.
My use case is building a realtime monitoring system over multi-dimensional
data.
The way I'm planning to go about it is to use Spark Streaming to store
aggregated counts over all dimensions in 10-second intervals.
Then, from a dashboard, I would be able to specify a query over some
dimensions, w
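A hedged sketch of the aggregation described above, assuming a 10-second batch
interval and a placeholder socket source (the real dimensions and sink would differ):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object DimensionCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DimensionCounts")
    val ssc = new StreamingContext(conf, Seconds(10))        // 10-second batches

    val events = ssc.socketTextStream("localhost", 9999)     // placeholder source
    val counts = events
      .map(line => (line.split(",").toList, 1L))             // key = the dimension values
      .reduceByKey(_ + _)                                    // count per key, per batch

    counts.foreachRDD(rdd => rdd.take(10).foreach(println))  // replace with a real store
    ssc.start()
    ssc.awaitTermination()
  }
}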
Hi,
I have been running some jobs on my local single-node standalone cluster.
I am varying the number of worker instances for the same job, and the time taken
for the job to complete increases as the number of workers increases. I
repeated some experiments varying the number of nodes in a cluster too an
Hi,
Has any performance prediction work been done on Spark?
Thank You
Hi,
I have experienced the same behavior. You are talking about standalone
cluster mode, right?
BR
On 21 February 2015 at 14:37, Deep Pradhan
wrote:
> Hi,
> I have been running some jobs in my local single node stand alone cluster.
> I am varying the worker instances for the same job, and the t
I can imagine a few reasons. Adding workers might cause fewer tasks to
execute locally (?), so you may be executing more remotely.
Are you increasing parallelism? For trivial jobs, chopping them up
further may cause you to pay more overhead managing so many small
tasks, for no speed-up in executio
Yes, I am talking about a standalone single-node cluster.
No, I am not increasing parallelism. I just wanted to know if it is
natural. Does message passing across the workers account for what is happening?
I am running SparkKMeans, just to validate one prediction model. I am using
several data sets. I
Can you be a bit more specific?
Are you asking about performance across Spark releases?
Cheers
On Sat, Feb 21, 2015 at 6:38 AM, Deep Pradhan
wrote:
> Hi,
> Has some performance prediction work been done on Spark?
>
> Thank You
>
>
No, I am talking about work parallel to the prediction work that has been done
for GPUs. Say, given the data for a smaller number of nodes in a Spark
cluster, a prediction needs to be made about the time that the
application would take when we have a larger number of nodes.
On Sat, Feb 21, 2015 at
What's your storage like? Are you adding worker machines that are
remote from where the data lives? I wonder if it just means you are
spending more and more time sending the data over the network as you
try to ship more of it to more remote workers.
To answer your question, no, in general more work
In this case, I just wanted to know whether a single-node cluster with various
workers acts like a simulator of a multi-node cluster with various nodes.
Like, if we have a single-node cluster with, say, 10 workers, then can we
tell that the same behavior will take place with a cluster of 10 nodes?
It is lik
Have you looked at
http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
?
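For what it's worth, a sketch of that suggestion against the Spark 1.2 API (table and
column names are made up): register pre-aggregated counts as a temp table and run SQL
strings chosen at runtime.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Count(dim1: String, dim2: String, ts: Long, n: Long)

object DashboardQueries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DashboardQueries"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD (1.2 API)

    val counts = sc.parallelize(Seq(Count("US", "web", 1424500000L, 42L)))
    counts.registerTempTable("counts")

    // The query string can come from the dashboard at runtime.
    val result = sqlContext.sql("SELECT dim1, SUM(n) FROM counts GROUP BY dim1")
    result.collect().foreach(println)
    sc.stop()
  }
}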
Cheers
On Sat, Feb 21, 2015 at 4:24 AM, Nikhil Bafna
wrote:
>
> Hi.
>
> My use case is building a realtime monitoring system over
> multi-dimensional data.
>
> The way I'm planning to go
No, I just have a single-node standalone cluster.
I am not tweaking the code to increase parallelism. I am just
running the SparkKMeans example that ships with Spark 1.0.0.
I just wanted to know if this behavior is natural, and if so, what causes
it?
Thank you
On Sat, Feb 21, 2015 at 8:32 PM,
"Workers" has a specific meaning in Spark. You are running many on one
machine? that's possible but not usual.
Each worker's executors have access to a fraction of your machine's
resources then. If you're not increasing parallelism, maybe you're not
actually using additional workers, so are using
There could be many different things causing this. For example, if you only
have a single partition of data, increasing the number of tasks will only
increase execution time due to higher scheduling overhead. Additionally, how
large is a single partition in your application relative to the amoun
Yes, I have decreased the executor memory.
But, if I have to do this, then I have to tweak the code
corresponding to each configuration, right?
On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen wrote:
> "Workers" has a specific meaning in Spark. You are running many on one
> machine? that's p
So, with the increase in the number of worker instances, if I also increase
the degree of parallelism, will it make any difference?
I can use this model the other way round too, right? I can always predict
the performance of an app as the number of worker instances increases,
the deterioration
Are you replicating any RDDs?
So, if I keep the number of instances constant and increase the degree of
parallelism in steps, can I expect the performance to increase?
Thank You
On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan
wrote:
> So, with the increase in the number of worker instances, if I also
> increase the degree of
For large jobs, the following error message is shown, which seems to indicate
that shuffle files are missing for some reason. It's a rather large job
with many partitions. If the data size is reduced, the problem disappears.
I'm running a build from Spark master post 1.2 (build at 2015-01-16) and
run
I'm experiencing the same issue. Upon closer inspection I'm noticing that
executors are being lost as well. The thing is, I can't figure out how they are
dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3 TB of memory
allocated for the application. I was thinking perhaps it was possible that
a s
Can someone share some ideas about how to tune the GC time?
Thanks
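Not a complete answer, but a common first step (the flags below come from the Spark
tuning guide; adjust for your JVM) is to surface GC activity in the executor logs and
work from there:

import org.apache.spark.{SparkConf, SparkContext}

object GcLogging {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("GcLogging")
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)
    // ... run the job, then read the GC lines in each executor's stdout/stderr ...
    sc.stop()
  }
}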
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Spark performance tuning
Date: Fri, 20 Feb 2015 16:04:23 -0500
Hi,
I am new to Spark, and I am trying to test Spark SQL performance vs. Hive. I
set up a standalo
Could you try to turn on the external shuffle service?
spark.shuffle.service.enabled = true
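A minimal sketch of wiring that in on the application side (the shuffle service itself
must also be running on each node):

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleServiceExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ShuffleServiceExample")
      .set("spark.shuffle.service.enabled", "true")   // shuffle files outlive executors
    val sc = new SparkContext(conf)
    // ... job with heavy shuffles ...
    sc.stop()
  }
}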
On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing
that executors are being lost as well. Thing is, I can't figure out
how they are dying. I'm u
Josh, is that class something you guys would consider open sourcing, or
would you rather the community step up and create an OutputCommitter
implementation optimized for S3?
On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen wrote:
> We (Databricks) use our own DirectOutputCommitter implementation, whic
Hi Spark users.
Does anybody know what steps are required to be able to post to this
list by sending an email to user@spark.apache.org? I just sent a reply to
Corey Nolet's mail "Missing shuffle files" but I don't think it was accepted
by the engine.
If I look at the Spark user list, I don't
The message went through after all. Sorry for spamming.
On 21.2.2015. 21:27, pzecevic wrote:
Hi Spark users.
Does anybody know what are the steps required to be able to post to this
list by sending an email to user@spark.apache.org? I just sent a reply to
Corey Nolet's mail "Missing shuffle f
Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec
You can use it by setting "mapred.output.committer.class" in the Hadoop
configuration (or "spark.hadoop.mapred.output.committer.class" in the Spark
configuration). Note that this only works for the old Hadoop APIs, I
believe
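For reference, a sketch of setting that through the Spark configuration (the committer
class name here is a placeholder for whatever implementation ends up on the classpath):

import org.apache.spark.{SparkConf, SparkContext}

object DirectCommitterExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DirectCommitterExample")
      .set("spark.hadoop.mapred.output.committer.class",
           "com.example.DirectOutputCommitter")       // hypothetical class name
    val sc = new SparkContext(conf)
    // Saves that go through the old "mapred" API (e.g. saveAsTextFile,
    // saveAsHadoopFile) should pick up this committer.
    sc.stop()
  }
}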
Note that the parallelism (i.e., number of partitions) is just an upper
bound on how much of the work can be done in parallel. If you have 200
partitions, then you can divide the work among anywhere from 1 to 200 cores and
all resources will remain utilized. If you have more than 200 cores,
though, then
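To make that point concrete, a small sketch (input path and target count are
placeholders) that checks and then raises the partition count so it at least matches
the cores you want to keep busy:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionsVsCores {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionsVsCores"))
    val data = sc.textFile("hdfs:///path/to/input")      // placeholder path

    println(s"partitions before: ${data.partitions.length}")
    val widened = data.repartition(200)                  // upper bound on parallel tasks
    println(s"partitions after: ${widened.partitions.length}")

    println(widened.count())                             // some action to run the job
    sc.stop()
  }
}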
Hi all,
I had a streaming application and midway through decided to up the
executor memory. I spent a long time launching it like this:
~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest
--executor-memory 2G --master...
and observing that the executor memory was still at the old 512 setting
On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan
wrote:
> No, I am talking about some work parallel to prediction works that are
> done on GPUs. Like say, given the data for smaller number of nodes in a
> Spark cluster, the prediction needs to be done about the time that the
> application would take
Yes, exactly.
On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski
wrote:
> On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan
> wrote:
>
>> No, I am talking about some work parallel to prediction works that are
>> done on GPUs. Like say, given the data for smaller number of nodes in a
>> Spark cluster,
>> So increasing Executors without increasing physical resources
If I have a 16 GB RAM system and I allocate 1 GB for each executor,
with the number of executors set to 8, then I am increasing the resources, right?
In this case, how do you explain it?
Thank You
On Sun, Feb 22, 2015 at 6:12 AM, Aaron
Also, if I take SparkPageRank for example (org.apache.spark.examples),
there are various RDDs that are created and transformed in the code that is
written. If I want to increase the number of partitions and test what
the optimum number of partitions is that gives me the best performance, I
hav
Has anyone done any work on that?
On Sun, Feb 22, 2015 at 9:57 AM, Deep Pradhan
wrote:
> Yes, exactly.
>
> On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski <
> ognen.duzlev...@gmail.com> wrote:
>
>> On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan
>> wrote:
>>
>>> No, I am talking about some work
Yes. As I understand it, it would allow me to write SQL to query a Spark
context. But the query needs to be specified within a job and deployed.
What I want is to be able to run multiple dynamic queries specified at
runtime from a dashboard.
--
Nikhil Bafna
On Sat, Feb 21, 2015 at 8:37 PM, Te