Announcing Spark 1.4.1!

2015-07-15 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.4.1 maintenance release. We recommend all users on the 1.4 branch upgrade to this release, which contains several important bug fixes. Download Spark 1.4.1 - http://spark.apache.org/downloads.html Release notes - http://spark.apache.org/releases/spark-rele

SparkHub: a new community site for Apache Spark

2015-07-10 Thread Patrick Wendell
Hi All, Today, I'm happy to announce SparkHub (http://sparkhub.databricks.com), a service for the Apache Spark community to easily find the most relevant Spark resources on the web. SparkHub is a curated list of Spark news, videos and talks, package releases, upcoming events around the world, and

Re: Fully in-memory shuffles

2015-06-11 Thread Patrick Wendell
…wrote: >> Ok so it is the case that small shuffles can be done without hitting any >> disk. Is this the same case for the aux shuffle service in YARN? Can that be >> done without hitting disk? >> >> On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell wrote: >>>

[ANNOUNCE] Announcing Spark 1.4

2015-06-11 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is the fifth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 210 developers and more than 1,000 commits! A huge thanks go to all of the individuals and organizations invo

Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote: > So with this... to help my understanding of Spark under the hood- > > Is this

Re: Spark timeout issue

2015-04-26 Thread Patrick Wendell
Hi Deepak - please direct this to the user@ list. This list is for development of Spark itself. On Sun, Apr 26, 2015 at 12:42 PM, Deepak Gopalakrishnan wrote: > Hello All, > > I'm trying to process a 3.5GB file in standalone mode using Spark. I could > run my spark job successfully on a 100MB file

Announcing Spark 1.3.1 and 1.2.2

2015-04-17 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.3.1 and 1.2.2 maintenance releases. We recommend all users on the 1.3 and 1.2 Spark branches upgrade to these releases, which contain several important bug fixes. Download Spark 1.3.1 or 1.2.2: http://spark.apache.org/downloads.html Release notes: 1.3.1:

Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Patrick Wendell
Hey Jonathan, Are you referring to disk space used for storing persisted RDD's? For that, Spark does not bound the amount of data persisted to disk. It's a similar story to how Spark's shuffle disk output works (and also Hadoop and other frameworks make this assumption as well for their shuffle da

Re: RDD resiliency -- does it keep state?

2015-03-27 Thread Patrick Wendell
If you invoke this, you will get at-least-once semantics on failure. For instance, if a machine dies in the middle of executing the foreach for a single partition, that will be re-executed on another machine. It could even fully complete on one machine, but the machine dies immediately before repor
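
A minimal sketch of making such a foreach safe under at-least-once re-execution (assuming an existing SparkContext `sc`; `KeyValueStore` is a hypothetical client whose put is an idempotent upsert):

  val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
  rdd.foreachPartition { iter =>
    val store = KeyValueStore.connect()              // hypothetical external store
    iter.foreach { case (k, v) => store.put(k, v) }  // idempotent upsert: safe if re-run
    store.close()
  }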

Re: Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Patrick Wendell
The source code should match the Spark commit 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences? On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel wrote: > While looking into an issue, I noticed that the source displayed on the Github > site does not match the downloaded tar for 1.3 >

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639 We could also add a map function that does the same. Or you can just write your map using an iter
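
A minimal sketch of that mapPartitions variant (assuming an existing SparkContext `sc`):

  import org.apache.spark.HashPartitioner
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(4))
  val mapped = pairs.mapPartitions(
    iter => iter.map { case (k, v) => (k, v.toString) },  // keys are unchanged,
    preservesPartitioning = true)                         // so keep the partitioner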

Re: 1.3 Hadoop File System problem

2015-03-24 Thread Patrick Wendell
Hey Jim, Thanks for reporting this. Can you give a small end-to-end code example that reproduces it? If so, we can definitely fix it. - Patrick On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll wrote: > > I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to > find the s3 hadoo

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Patrick Wendell
Hey Yiannis, If you just perform a count on each "name", "date" pair... can it succeed? If so, can you do a count and then order by to find the largest one? I'm wondering if there is a single pathologically large group here that is somehow causing OOM. Also, to be clear, you are getting GC limit
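
A hedged sketch of that diagnostic with the Spark 1.3 DataFrame API (assuming `df` has "name" and "date" columns):

  val counts = df.groupBy("name", "date").count()
  counts.orderBy(counts("count").desc).show(10)  // surfaces any pathologically large group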

[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is the fourth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! Visit the release notes [1] to read about the new features, or d

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Patrick Wendell
We don't support expressions or wildcards in that configuration. For each application, the local directories need to be constant. If you have users submitting different Spark applications, those can each set spark.local.dir. - Patrick On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang wrote: > Hi,
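
A minimal sketch of setting per-application local directories (paths and names are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}
  val conf = new SparkConf()
    .setAppName("per-user-app")
    .setMaster("local[2]")                          // illustrative; spark-submit normally supplies this
    .set("spark.local.dir", "/data/scratch/alice")  // comma-separate multiple dirs
  val sc = new SparkContext(conf)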

Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Patrick Wendell
You may need to add the -Phadoop-2.4 profile. When building our release packages for Hadoop 2.4 we use the following flags: -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn - Patrick On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan wrote: > I confirmed that this has nothing to do with BigTop by ru

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-27 Thread Patrick Wendell
I think we need to just update the docs, it is a bit unclear right now. At the time, we worded it fairly sternly because we really wanted people to use --jars when we deprecated SPARK_CLASSPATH. But there are other types of deployments where there is a legitimate need to augment the classpath

Re: Add PredictionIO to Powered by Spark

2015-02-24 Thread Patrick Wendell
Added - thanks! I trimmed it down a bit to fit our normal description length. On Mon, Jan 5, 2015 at 8:24 AM, Thomas Stone wrote: > Please can we add PredictionIO to > https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark > > PredictionIO > http://prediction.io/ > > PredictionIO is a

Re: Can you add Big Industries to the Powered by Spark page?

2015-02-24 Thread Patrick Wendell
I've added it, thanks! On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc wrote: > > Hello, > > Could you please add Big Industries to the Powered by Spark page at > https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? > > > Company Name: Big Industries > > URL: http://www.bigi

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Patrick Wendell
… > > I think that what you are saying is exactly the issue: on my master node UI > at the bottom I can see the list of "Completed Drivers" all with ERROR > state... > > Thanks, > Oleg > > -Original Message- > From: Patrick Wendell [mailto:pwend

Re: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Patrick Wendell
…BlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > at > org.apache.spark.deploy.Sp

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
The map will start with a capacity of 64, but will grow to accommodate new data. Are you using the groupBy operator in Spark or are you using Spark SQL's group by? This usually happens if you are grouping or aggregating in a way that doesn't sufficiently condense the data created from each input pa

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Patrick Wendell
I think there is a minor error here in that the first example needs a "tail" after the seq: df.map { row => (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame("label", "features") On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust wrote: > It sounds like you probably w

[ANNOUNCE] Apache Spark 1.2.1 Released

2015-02-09 Thread Patrick Wendell
Hi All, I've just posted the 1.2.1 maintenance release of Apache Spark. We recommend all 1.2.0 users upgrade to this release, as this release includes stability fixes across all components of Spark. - Download this release: http://spark.apache.org/downloads.html - View the release notes: http://s

Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry, I think standalone mode will still add more features over time, but the goal isn't really for it to become equivalent to what Mesos/YARN are today. Or at least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks outside of Spark and become a general purpose resource

Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil, Those are handled by ASF infrastructure, not anyone in the Spark project. So this list is not the appropriate place to ask for help. - Patrick On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das wrote: > My mails to the mailing list are getting rejected, have opened a Jira issue, > can someone t

Re: Accumulator value in Spark UI

2015-01-14 Thread Patrick Wendell
It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip wrote: > Hello, > > From accumulator documentation, it says that if the accumulator is named, it > will be displayed in the WebUI. However, I cannot find it anywhere. > > Do I

Re: action progress in ipython notebook?

2014-12-28 Thread Patrick Wendell
Hey Eric, I'm just curious - which specific features in 1.2 do you find most help with usability? This is a theme we're focusing on for 1.3 as well, so it's helpful to hear what makes a difference. - Patrick On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman wrote: > Hi Josh, > > Thanks for the inf

Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say "the overhead of spark shuffles start to accumulate"? Could you elaborate more? In newer versions of Spark shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be referenc

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
…ble as any alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao wrote: > I am wondering if we can provide a more friendly API, other than configuration, > for this purpose. What do you think Patrick? > > Cheng Hao > > -Original

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai wrote: > Hi, > > > > We have such requirements to save RDD output to HDFS with saveAsTextFile > like API, but need
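
A minimal sketch of that setting (the output path is illustrative):

  import org.apache.spark.{SparkConf, SparkContext}
  val conf = new SparkConf().setAppName("overwrite-example").setMaster("local[2]")
    .set("spark.hadoop.validateOutputSpecs", "false")
  val sc = new SparkContext(conf)
  sc.parallelize(1 to 100).saveAsTextFile("hdfs:///out/path")  // no failure if the path exists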

Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Hey Nick, I think Hitesh was just trying to be helpful and point out the policy - not necessarily saying there was an issue. We've taken a close look at this and I think we're in good shape here vis-à-vis this policy. - Patrick On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas wrote: > Hitesh, >

Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Xiangrui asked me to report that it's back and running :) On Mon, Dec 22, 2014 at 3:21 PM, peng wrote: > Me 2 :) > > > On 12/22/2014 06:14 PM, Andrew Ash wrote: > > Hi Xiangrui, > > That link is currently returning a 503 Over Quota error message. Would you > mind pinging back out when the page i

Re: Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
…v1.2.0 and v1.2.0-rc2 are pointed to different commits in >> https://github.com/apache/spark/releases >> >> Best Regards, >> >> Shixiong Zhu >> >> 2014-12-19 16:52 GMT+08:00 Patrick Wendell : >>> >>> I'm happy to announce the availability of S

Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! This release brings operational and performance improvements in Spark core

Re: spark streaming kafa best practices ?

2014-12-17 Thread Patrick Wendell
…rd. > > On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell wrote: >> >> The second choice is better. Once you call collect() you are pulling >> all of the data onto a single node, you want to do most of the >> processing in parallel on the cluster, which is what map() w

Re: Spark Server - How to implement

2014-12-12 Thread Patrick Wendell
Hey Manoj, One proposal potentially of interest is the Spark Kernel project from IBM - you should look for that. The interface in that project is more of a "remote REPL" interface, i.e. you submit commands (as strings) and get back results (as strings), but you don't have direct programmatic acce

Re: Stateful mapPartitions

2014-12-05 Thread Patrick Wendell
Yeah the main way to do this would be to have your own static cache of connections. These could be using an object in Scala or just a static variable in Java (for instance a set of connections that you can borrow from). - Patrick On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer wrote: > Hi, > > O
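
A minimal sketch of that static cache (assuming a DStream `dstream`; `Database` is a hypothetical client). A Scala object is initialized once per executor JVM, so the connection is reused across tasks:

  object ConnectionPool {
    lazy val conn = Database.connect("jdbc:...")  // hypothetical; created once per JVM
  }
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { iter =>
      iter.foreach(record => ConnectionPool.conn.insert(record))  // borrow, don't recreate
    }
  }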

Re: spark streaming kafa best practices ?

2014-12-05 Thread Patrick Wendell
The second choice is better. Once you call collect() you are pulling all of the data onto a single node, you want to do most of the processing in parallel on the cluster, which is what map() will do. Ideally you'd try to summarize the data or reduce it before calling collect(). On Fri, Dec 5, 201
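
A minimal sketch of the recommended shape (assuming an existing SparkContext `sc`; the path and parsing are illustrative):

  val rdd    = sc.textFile("hdfs:///events")
  val parsed = rdd.map(line => line.split(",")(1).toDouble)  // runs in parallel on the cluster
  val total  = parsed.reduce(_ + _)                          // distributed reduce
  // only the single summary value reaches the driver, not the raw records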

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in the Spark 1.2 release. We can try to debug this in master. On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang wrote: > I created a ticket for this: > > https://issues.apache.org/jira/browse/SPARK-4757 > > > Jianshi > > On Fri, Dec 5, 2014 at

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Patrick Wendell
…error. > This was caused by a local change, so no impact on the 1.2 release. > -Original Message- > From: Patrick Wendell [mailto:pwend...@gmail.com] > Sent: Wednesday, November 26, 2014 8:17 AM > To: Judy Nash > Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org >

Re: Opening Spark on IntelliJ IDEA

2014-11-29 Thread Patrick Wendell
I recently posted instructions on loading Spark in Intellij from scratch: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA You need to do a few extra steps for the YARN project to work. Also, for questions like this that re

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
…"org/spark-project/guava/common/base/Preconditions".checkArgument:(ZLjava/lang/Object;)V 50: invokestatic #502 // Method "org/spark-project/guava/common/base/Preconditions".checkArgument:(ZLjava/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patri

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should n

Re: toLocalIterator in Spark 1.0.0

2014-11-13 Thread Patrick Wendell
It looks like you are trying to directly import the toLocalIterator function. You can't import functions, it should just appear as a method of an existing RDD if you have one. - Patrick On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan wrote: > Hi, > > I am using Spark 1.0.0 and Scala 2.10.3. > > I
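
A minimal sketch of calling it as a method rather than importing it (assuming an existing SparkContext `sc`):

  val rdd = sc.parallelize(1 to 1000, 10)
  val it: Iterator[Int] = rdd.toLocalIterator  // streams one partition at a time to the driver
  it.take(5).foreach(println)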

Re: Spark and Play

2014-11-11 Thread Patrick Wendell
Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya wrote: > Hi, > > Sorry if this has been asked before; I didn't find a satisfactory answer > when sear

Re: Still struggling with building documentation

2014-11-11 Thread Patrick Wendell
The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta wrote: > Nichols and Patrick, > > Thanks for your help, but, no, it still does not work. The latest mast

Re: Ending a job early

2014-10-28 Thread Patrick Wendell
Hey Jim, There are some experimental (unstable) API's that support running jobs which might short-circuit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126 This can be used for doing online aggregations like you are describing. And in one
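
A hedged sketch of running a job on only a subset of partitions with that API, using the Spark 1.x-era signature (the allowLocal parameter was removed in later releases; assuming an existing SparkContext `sc`):

  val rdd = sc.parallelize(1 to 1000, 10)
  // compute sums for just the first two partitions; inspect before deciding to continue
  val partialSums = sc.runJob(rdd, (it: Iterator[Int]) => it.sum,
                              0 until 2, allowLocal = false)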

Re: Support Hive 0.13 .1 in Spark SQL

2014-10-27 Thread Patrick Wendell
Hey Cheng, Right now we aren't using stable API's to communicate with the Hive Metastore. We didn't want to drop support for Hive 0.12 so right now we are using a shim layer to support compiling for 0.12 and 0.13. This is very costly to maintain. If Hive has a stable meta-data API for talking to

Re: About "Memory usage" in the Spark UI

2014-10-22 Thread Patrick Wendell
It shows the amount of memory used to store RDD blocks, which are created when you run .cache()/.persist() on an RDD. On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang wrote: > Hi, please take a look at the attached screen-shot. I wonders what's the > "Memory Used" column mean. > > > > I give 2GB me

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Patrick Wendell
Hey Ryan, I've found that filing issues with the Scala/Typesafe JIRA is pretty helpful if the issue can be fully reproduced, and even sometimes helpful if it can't. You can file bugs here: https://issues.scala-lang.org/secure/Dashboard.jspa The Spark SQL code in particular is typically the sourc

Re: sparksql connect remote hive cluster

2014-10-08 Thread Patrick Wendell
Spark will need to connect both to the hive metastore and to all HDFS nodes (NN and DN's). If that is all in place then it should work. In this case it looks like maybe it can't connect to a datanode in HDFS to get the raw data. Keep in mind that the performance might not be very good if you are tr

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Patrick Wendell
IIRC - the random is seeded with the index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general repartition() will not preserve the orde

Re: Spot instances on Amazon EMR

2014-09-18 Thread Patrick Wendell
Hey Grzegorz, EMR is a service that is not maintained by the Spark community. So this list isn't the right place to ask EMR questions. - Patrick On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek wrote: > Hi, > I would like to run Spark application on Amazon EMR. I have some questions > about tha

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Patrick Wendell
I agree, that's a good idea Marcelo. There isn't AFAIK any reason the client needs to hang there for correct operation. On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin wrote: > Yes, what Sandy said. > > On top of that, I would suggest filing a bug for a new command line > argument for spark-submi

Re: partitioned groupBy

2014-09-17 Thread Patrick Wendell
…MapPartitionsRDD.compute(). I guess I would need to > override mapPartitions() directly within my RDD. Right? > > On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell wrote: >> >> If each partition can fit in memory, you can do this using >> mapPartitions and then building an inverse

Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya wrote: > I have a use case where my RDD is set up
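
A minimal sketch of that per-partition inverse mapping (assuming an existing SparkContext `sc`):

  import scala.collection.mutable
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
  val grouped = pairs.mapPartitions { iter =>
    val m = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]
    iter.foreach { case (k, v) => m.getOrElseUpdate(k, mutable.ArrayBuffer()) += v }
    m.iterator  // (key, values seen in this partition) with no shuffle
  }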

Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Patrick Wendell
Yeah that issue has been fixed by adding better docs, it just didn't make it in time for the release: https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54 On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo wrote: > resolved: > > ./make-distribution.sh --name spark-hadoop-2.3.0

Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread Patrick Wendell
Hey SK, Yeah, the documented format is the same (we expect users to add the jar at the end) but the old spark-submit had a bug where it would actually accept inputs that did not match the documented format. Sorry if this was difficult to find! - Patrick On Fri, Sep 12, 2014 at 1:50 PM, SK wrote

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@] This would typically be accomplished with a union() operation. You can't mutate an RDD in-place, but you can create a new RDD with a union() which is an inexpensive operator. On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur wrote: > Hi, > > We have a use case where we are plannin
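
A minimal sketch of deriving a new RDD instead of mutating in place (assuming an existing SparkContext `sc`):

  val base    = sc.parallelize(Seq(1, 2, 3))
  val updates = sc.parallelize(Seq(4, 5))
  val merged  = base.union(updates)  // cheap: no data moves until an action runs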

Announcing Spark 1.1.0!

2014-09-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implement

Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Patrick Wendell
I would say that the first three are all used pretty heavily. Mesos was the first one supported (long ago), the standalone is the simplest and most popular today, and YARN is newer but growing a lot in activity. SIMR is not used as much... it was designed mostly for environments where users had a

Re: memory size for caching RDD

2014-09-03 Thread Patrick Wendell
Changing this is not supported; it is immutable, similar to other Spark configuration settings. On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 wrote: > Dear all: > > Spark uses memory to cache RDDs and the memory size is specified by > "spark.storage.memoryFraction". > > Once the Executor starts, does Spark su

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD. On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar wrote: > Thanks. > > Just to double check, rdd.id would be unique for a batch in a DStream? > > > On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng wrote: >> >> You can use RDD id as the seed, which is unique
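
A minimal sketch (assuming a DStream `dstream`); each batch interval hands transform() a fresh RDD whose id is unique:

  val indexed = dstream.transform { rdd =>
    // rdd.id is unique to this batch, so it can serve as a per-batch seed
    rdd.zipWithIndex()
  }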

Submit to the "Powered By Spark" Page!

2014-08-26 Thread Patrick Wendell
Hi All, I want to invite users to submit to the Spark "Powered By" page. This page is a great way for people to learn about Spark use cases. Since Spark activity has increased a lot in the higher level libraries and people often ask who uses each one, we'll include information about which componen

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Patrick Wendell
…you mention might land in Spark 1.2, > is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 > or is there another ticket I should be following? > > Thanks! > Andrew > > > On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell > wrote: > >> Hi Jens,

Re: Advantage of using cache()

2014-08-23 Thread Patrick Wendell
Yep - that's correct. As an optimization we save the shuffle output and re-use it if you execute a stage twice. So this can make A/B tests like this a bit confusing. - Patrick On Friday, August 22, 2014, Nieyuan wrote: > Because map-reduce tasks like join will save shuffle data to disk. So the

Re: Web UI doesn't show some stages

2014-08-20 Thread Patrick Wendell
The reason is that some operators get pipelined into a single stage. rdd.map(XX).filter(YY) - this executes in a single stage since there is no data movement needed in between these operations. If you call toDebugString on the final RDD it will give you some information about the exact lineage. In
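
A minimal sketch of pipelined operators and that lineage inspection (assuming an existing SparkContext `sc`; the path is illustrative):

  val errors = sc.textFile("hdfs:///logs")
    .map(_.toUpperCase)
    .filter(_.contains("ERROR"))   // fused with map into a single stage: no data movement
  println(errors.toDebugString)    // prints the RDD lineage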

Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast it. If your array is small it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference it directly. On Wed, Aug
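
A minimal sketch of broadcasting the centers (assuming an existing SparkContext `sc` and an RDD `points`; loadCenters and nearestCenter are hypothetical helpers):

  val centers: Array[Array[Double]] = loadCenters()  // hypothetical
  val bcCenters = sc.broadcast(centers)              // shipped once per executor, not per task
  val assigned  = points.map(p => nearestCenter(p, bcCenters.value))  // hypothetical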

Re: Advantage of using cache()

2014-08-20 Thread Patrick Wendell
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact effect of caching. In rdd3, in addition to the fact that rdd will be cached, you are also doing a bunch of extra random number generation. So it will be hard to isolate the effect of caching. On Wed, Aug 20, 2014 at 7:48 AM, Gr

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Patrick Wendell
…lay each group out sequentially on disk in one big file, you can call `sortByKey` with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release). - Patrick On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti wrote: > Patrick Wendell wrote > > In

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Patrick Wendell
Hey Jens, If you are trying to do an aggregation such as a count, the best way to do it is using reduceByKey or (in newer versions of Spark) aggregateByKey. keyVals.reduceByKey(_ + _) The groupBy operator in Spark is not an aggregation operator (e.g. in SQL where you do select sum(salary) group
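
A minimal sketch contrasting the two (assuming an existing SparkContext `sc`):

  val keyVals = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  val sums = keyVals.reduceByKey(_ + _)  // combines map-side; groups never materialize
  // OOM-prone alternative: keyVals.groupByKey().mapValues(_.sum)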

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
…> > Thanks, > Ron > > On Aug 4, 2014, at 10:01 AM, Ron's Yahoo! wrote: > > That failed since it defaulted the versions for yarn and hadoop > I'll give it a try with just 2.4.0 for both yarn and hadoop... > > Thanks, > Ron > > On Aug 4, 2014, at 9:44

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
…-Phadoop-2.4 -Dhadoop.version=2.4.0.2.1.1.0-385 > -DskipTests clean package > > I haven't tried building a distro, but it should be similar. > > > - SteveN > > On 8/4/14, 1:25, "Sean Owen" wrote: > > For any Hadoop 2.4 distro, yes, set hadoop.version but also set > -Phadoop

Re: Bad Digest error while doing aws s3 put

2014-08-04 Thread Patrick Wendell
You are hitting this issue: https://issues.apache.org/jira/browse/SPARK-2075 On Mon, Jul 28, 2014 at 5:40 AM, lmk wrote: > Hi > I was using saveAsTextFile earlier. It was working fine. When we migrated > to > spark-1.0, I started getting the following error: > java.lang.ClassNotFoundException:

Re: Cached RDD Block Size - Uneven Distribution

2014-08-04 Thread Patrick Wendell
Are you directly caching files from Hadoop or are you doing some transformation on them first? If you are doing a groupBy or some type of transformation, then you could be causing data skew that way. On Sun, Aug 3, 2014 at 1:19 PM, iramaraju wrote: > I am running spark 1.0.0, Tachyon 0.5 and Ha

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
For Hortonworks, I believe it should work to just link against the corresponding upstream version, i.e. just set the Hadoop version to "2.4.0". Does that work? - Patrick On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! wrote: > Hi, > Not sure whose issue this is, but if I run make-distribution

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
BTW - the reason why the workaround could help is because when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell wrote: > It seems possible that you are running out of mem
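
A minimal sketch of that workaround (assuming an RDD `rdd`):

  import org.apache.spark.storage.StorageLevel
  rdd.persist(StorageLevel.DISK_ONLY)  // partitions stream to disk rather than unrolling in memory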

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
It seems possible that you are running out of memory unrolling a single partition of the RDD. This is something that can cause your executor to OOM, especially if the cache is close to being full so the executor doesn't have much free memory left. How large are your executors? At the time of failur

Re: Low Level Kafka Consumer for Spark

2014-08-03 Thread Patrick Wendell
I'll let TD chime in on this one, but I'm guessing this would be a welcome addition. It's great to see community effort on adding new streams/receivers; adding a Java API for receivers was something we did specifically to allow this :) - Patrick On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattach

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-03 Thread Patrick Wendell
…Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Fri, Aug 1, 2014 at 3:31 PM, Patrick Wendell > wrote: > > I've had intermittent access to the artifacts themselves, but for me the > > directory listing always 404's. > >

Re: disable log4j for spark-shell

2014-08-03 Thread Patrick Wendell
If you want to customize the logging behavior, the simplest way is to copy conf/log4j.properties.template to conf/log4j.properties. Then you can go and modify the log level in there. The Spark shells should pick this up. On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen wrote: > That's just a templat
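
Besides editing the properties file, a hedged in-code alternative for the shell is to raise the root logger level directly:

  import org.apache.log4j.{Level, Logger}
  Logger.getRootLogger.setLevel(Level.WARN)  // quiets INFO-level chatter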

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
I've had intermittent access to the artifacts themselves, but for me the directory listing always 404's. I think if sbt hits a 404 on the directory, it sends a somewhat confusing error message that it can't download the artifact. - Patrick On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman <

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
This is a Scala bug - I filed something upstream, hopefully they can fix it soon and/or we can provide a work around: https://issues.scala-lang.org/browse/SI-8772 - Patrick On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau wrote: > Currently scala 2.10.2 can't be pulled in from maven central it se

Re: how to publish spark inhouse?

2014-07-28 Thread Patrick Wendell
All of the scripts we use to publish Spark releases are in the Spark repo itself, so you could follow these as a guideline. The publishing process in Maven is similar to in SBT: https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65 On Mon, Jul 28, 2014 at 12:39 PM,

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia wrote: > Yeah, I'd just add a spark-util that has

Re: Announcing Spark 1.0.1

2014-07-12 Thread Patrick Wendell
> -Brad > > On Fri, Jul 11, 2014 at 8:44 PM, Henry Saputra > wrote: >> Congrats to the Spark community ! >> >> On Friday, July 11, 2014, Patrick Wendell wrote: >>> >>> I am happy to announce the availability of Spark 1.0.1! This release >>

Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for J

Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Patrick Wendell
Hey Mikhail, I think (hope?) the -em and -dm options were never in an official Spark release. They were just in the master branch at some point. Did you use these during a previous Spark release or were you just on master? - Patrick On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov wrote: > Than

Re: Purpose of spark-submit?

2014-07-09 Thread Patrick Wendell
It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of c

Re: How to clear the list of Completed Appliations in Spark web UI?

2014-07-09 Thread Patrick Wendell
There isn't currently a way to do this, but it will start dropping older applications once more than 200 are stored. On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang wrote: > Besides restarting the Master, is there any other way to clear the > Completed Applications in Master web UI?

Re: hadoop + yarn + spark

2014-06-27 Thread Patrick Wendell
Hi There, There is an issue with PySpark-on-YARN that requires users build with Java 6. The issue has to do with how Java 6 and 7 package jar files differently. Can you try building spark with Java 6 and trying again? - Patrick On Fri, Jun 27, 2014 at 5:00 PM, sdeb wrote: > Hello, > > I have i

Re: 1.0.1 release plan

2014-06-20 Thread Patrick Wendell
Hey There, I'd like to start voting on this release shortly because there are a few important fixes that have queued up. We're just waiting to fix an akka issue. I'd guess we'll cut a vote in the next few days. - Patrick On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim wrote: > Hi all, > > Is there

Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Patrick Wendell
I'll make a comment on the JIRA - thanks for reporting this, let's get to the bottom of it. On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman wrote: > I've created an issue for this but if anyone has any advice, please let me > know. > > Basically, on about 10 GBs of data, saveAsTextFile()

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-17 Thread Patrick Wendell
Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman wrote: > Matt/Ryan, > > Did you make any headway on this? My team is

Re: Wildcard support in input path

2014-06-17 Thread Patrick Wendell
These paths get passed directly to the Hadoop FileSystem API, and I think they support globbing out of the box. So AFAIK it should just work. On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW wrote: > Hi Jianshi, > > I have used wild card characters (*) in my program and it worked. > My code was like
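
A minimal sketch of a globbed input path (assuming an existing SparkContext `sc`; the path is illustrative):

  val logs = sc.textFile("hdfs:///data/2014-*/part-*")  // Hadoop glob syntax passes straight through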

Re: Enormous EC2 price jump makes "r3.large" patch more important

2014-06-17 Thread Patrick Wendell
…know about? I'm really beginning to lean in the > direction of Cassandra as the distributed data store... > > > On Wed, Jun 18, 2014 at 1:46 PM, Patrick Wendell wrote: >> >> By the way, in case it's not clear, I mean our maintenance branches: >> >> https://gith

Re: Enormous EC2 price jump makes "r3.large" patch more important

2014-06-17 Thread Patrick Wendell
By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell wrote: > Hey Jeremy, > > This is patched in the 1.0 and 0.9 branches of Spark. We're likely to > make a 1.

Re: Enormous EC2 price jump makes "r3.large" patch more important

2014-06-17 Thread Patrick Wendell
Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just checkout the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick On

Re: Setting spark memory limit

2014-06-09 Thread Patrick Wendell
If you run locally then Spark doesn't launch remote executors. However, in this case you can set the memory with the --driver-memory flag to spark-submit. Does that work? - Patrick On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui wrote: > Hi, > > I'm trying to run the SimpleApp example > (http://sp
