Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Mark Hamstra
This discussion belongs on the dev list. Please post any replies there. On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park wrote: > Hi, > > I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to > confirm whether these are bugs or not before opening a jira. > > *1)* I can no longer

Re: Spark error "value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]"

2015-06-08 Thread Mark Hamstra
Correct; and PairRDDFunctions#join does still exist in versions of Spark that do have DataFrame, so you don't necessarily have to use DataFrame to do this even then (although there are advantages to using the DataFrame approach.) Your basic problem is that you have an RDD of tuples, where each tup
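
A minimal sketch of the reshaping being described, with hypothetical values; `join` is only defined (via PairRDDFunctions) on RDDs of two-element (key, value) tuples, so the extra fields must be packed into the value:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[*]", "join-sketch")
    // A 3-element tuple RDD is not a key-value pair RDD, so it has no join.
    val raw: RDD[((String, String), String, String)] =
      sc.parallelize(Seq((("k1", "k2"), "v1", "v2")))
    // Re-key it: (key, value) where the value packs the remaining fields.
    val pairs: RDD[((String, String), (String, String))] =
      raw.map { case (k, a, b) => (k, (a, b)) }
    val other: RDD[((String, String), Int)] =
      sc.parallelize(Seq((("k1", "k2"), 42)))
    val joined = pairs.join(other) // PairRDDFunctions#join now resolves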

Re: RDD of RDDs

2015-06-09 Thread Mark Hamstra
That would constitute a major change in Spark's architecture. It's not happening anytime soon. On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar wrote: > Possibly in future, if and when spark architecture allows workers to > launch spark jobs (the functions passed to transformation or action APIs o

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
Correct. Trading away scalability for increased performance is not an option for the standard Spark API. On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > It would be even faster to load the data on the driver and sort it there > without using Spark :).

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
> Collect() would involve gathering all the data on a single machine as well. > > Thanks, > Raghav > > On Tuesday, June 9, 2015, Mark Hamstra wrote: > >> Correct. Trading away scalability for increased performance is not an >> option for the standard Spark API. >

Re: Fully in-memory shuffles

2015-06-11 Thread Mark Hamstra
> > I would guess in such shuffles the bottleneck is serializing the data > rather than raw IO, so I'm not sure explicitly buffering the data in the > JVM process would yield a large improvement. Good guess! It is very hard to beat the performance of retrieving shuffle outputs from the OS buffe

Re: Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Mark Hamstra
Do you want to transform the RDD, or just produce some side effect with its contents? If the latter, you want foreachPartition, not mapPartitions. On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > In rdd.mapPartition(…) if I try to iterate through
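
A hedged sketch of the side-effect case, with hypothetical data; foreachPartition is an action that returns Unit and exists purely for its effects:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "foreach-sketch")
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    // Side effects only: runs on the executors, returns nothing to the driver.
    rdd.foreachPartition { iter =>
      // e.g. open one connection per partition, write records, close it
      iter.foreach(record => println(record)) // output lands in executor stdout
    }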

Re: Spark shell crumbles after memory is full

2015-06-29 Thread Mark Hamstra
No. He is collecting the results of the SQL query, not the whole dataset. The REPL does retain references to prior results, so it's not really the best tool to be using when you want no-longer-needed results to be automatically garbage collected. On Mon, Jun 29, 2015 at 9:13 AM, ayan guha wrote:

Re: Fair scheduler pool details

2016-03-02 Thread Mark Hamstra
standalone deployment (it is slightly mentioned in SPARK-9882, but it seems > to be abandoned). Do you know if there is such an activity? > > -- > Be well! > Jean Morozov > > On Sun, Feb 21, 2016 at 4:32 AM, Mark Hamstra > wrote: > >> It's 2 -- and it's

Re: Understanding the Web_UI 4040

2016-03-07 Thread Mark Hamstra
There's probably nothing wrong other than a glitch in the reporting of Executor state transitions to the UI -- one of those low-priority items I've been meaning to look at for a while On Mon, Mar 7, 2016 at 12:15 AM, Sonal Goyal wrote: > Maybe check the worker logs to see what's going wrong w

Re: Spark on RAID

2016-03-08 Thread Mark Hamstra
One issue is that RAID levels providing data replication are not necessary since HDFS already replicates blocks on multiple nodes. On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov wrote: > Parallel disk IO? But the effect should be less noticeable compared to > Hadoop which reads/writes a lot. Much

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
> > For example, if you're looking to scale out to 1000 concurrent requests, > this is 1000 concurrent Spark jobs. This would require a cluster with 1000 > cores. This doesn't make sense. A Spark Job is a driver/DAGScheduler concept without any 1:1 correspondence between Worker cores and Jobs.

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
uction. i > would love for someone to prove otherwise. > > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra > wrote: > >> For example, if you're looking to scale out to 1000 concurrent requests, >>> this is 1000 concurrent Spark jobs. This would require a cluster

Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
It's not just if the RDD is explicitly cached, but also if the map outputs for stages have been materialized into shuffle files and are still accessible through the map output tracker. Because of that, explicitly caching RDDs often gains you little or nothing, since even without a call to c

Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
g may not be 100% accurate and bug free. On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph wrote: > Okay, so out of 164 stages, is 163 are skipped. And how 41405 tasks are > skipped if the total is only 19788. > > On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra > wrote: > >>

Re: Error using collectAsMap() in scala

2016-03-20 Thread Mark Hamstra
You're not getting what Ted is telling you. Your `dict` is an RDD[String] -- i.e. it is a collection of a single value type, String. But `collectAsMap` is only defined for PairRDDs that have key-value pairs for their data elements. Both a key and a value are needed to collect into a Map[K, V].
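
A small sketch of the fix, with hypothetical values; the point is that both a key and a value must be supplied before collectAsMap applies:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "collectAsMap-sketch")
    val dict = sc.parallelize(Seq("apple", "banana")) // RDD[String]: no collectAsMap
    // Map each element to a (key, value) pair first; then collect into a Map.
    val asMap = dict.map(word => (word, word.length)).collectAsMap()
    // asMap: Map(apple -> 5, banana -> 6)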

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
Neither of you is making any sense to me. If you just have an RDD for which you have specified a series of transformations but you haven't run any actions, then neither checkpointing nor saving makes sense -- you haven't computed anything yet, you've only written out the recipe for how the computa

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu wrote: > bq. when I get the last RDD > If I read Todd's first email correctly, the computation has been done. > I could be wrong. > > On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra > wrote: > >> Neither of you is making a

Re: No active SparkContext

2016-03-24 Thread Mark Hamstra
You seem to be confusing the concepts of Job and Application. A Spark Application has a SparkContext. A Spark Application is capable of running multiple Jobs, each with its own ID, visible in the webUI. On Thu, Mar 24, 2016 at 6:11 AM, Max Schmidt wrote: > Am 24.03.2016 um 10:34 schrieb Simon

Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
Yes and no. The idea of n-tier architecture is about 20 years older than Spark and doesn't really apply to Spark as n-tier was originally conceived. If the n-tier model helps you make sense of some things related to Spark, then use it; but don't get hung up on trying to force a Spark architecture in

Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
file/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 30 March 2016 at 00:22, Mark Hamstra wrote: > >> Yes and no. The i

Re: Executor shutdown hooks?

2016-04-06 Thread Mark Hamstra
Why would the Executors shutdown when the Job is terminated? Executors are bound to Applications, not Jobs. Furthermore, unless spark.job.interruptOnCancel is set to true, canceling the Job at the Application and DAGScheduler level won't actually interrupt the Tasks running on the Executors. If
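
A sketch of the opt-in behavior being described, using the setJobGroup API with hypothetical group names; interruptOnCancel defaults to false, so cancellation alone does not interrupt running task threads:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "cancel-sketch")
    // Jobs submitted from this thread belong to the group, and cancellation
    // will actually interrupt their task threads because of the true flag.
    sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel = true)
    // ... submit actions from this thread ...
    // Then, from another thread:
    sc.cancelJobGroup("nightly-etl")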

Re: Spark GUI, Workers and Executors

2016-04-09 Thread Mark Hamstra
https://spark.apache.org/docs/latest/cluster-overview.html On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar wrote: > On Spark GUI I can see the list of Workers. > > I always understood that workers are used by executors. > > What is the relationship between workers and executors please. Is it one >

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mark Hamstra
That's also available in standalone. On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov wrote: > Spark on Yarn supports dynamic resource allocation > > So, you can run several spark-shells / spark-submits / spark-jobserver / > zeppelin on one cluster without defining upfront how many executor

Re: Apache Flink

2016-04-17 Thread Mark Hamstra
To be fair, the Stratosphere project from which Flink springs was started as a collaborative university research project in Germany about the same time that Spark was first released as Open Source, so they are near contemporaries rather than Flink having been started only well after Spark was an es

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
You appear to be misunderstanding the nature of a Stage. Individual transformation steps such as `map` do not define the boundaries of Stages. Rather, a sequence of transformations in which there is only a NarrowDependency between each of the transformations will be pipelined into a single Stage.
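
A sketch that makes the pipelining visible, with hypothetical data; map and filter carry only NarrowDependencies and stay in one Stage, while reduceByKey introduces a ShuffleDependency and thus a Stage boundary:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "stage-sketch")
    val counts = sc.parallelize(1 to 100)
      .map(i => (i % 10, i)) // narrow: pipelined
      .filter(_._2 > 5)      // narrow: same Stage
      .reduceByKey(_ + _)    // shuffle: new Stage
    // The indentation shift in the printed lineage marks the Stage boundary.
    println(counts.toDebugString)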

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
box then it's fine but when you > have large number of people on this site complaining about OOM and shuffle > error all the time you need to start providing some transparency to > address that. > > Thanks > > > On Wed, May 25, 2016 at 6:41 PM, Mark Hamstra > wrote: >

Re: What is the interpretation of Cores in Spark doc

2016-06-13 Thread Mark Hamstra
I don't know what documentation you were referring to, but this is clearly an erroneous statement: "Threads are virtual cores." At best it is terminology abuse by a hardware manufacturer. Regardless, Spark can't get too concerned about how any particular hardware vendor wants to refer to the spec

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
d >> applications. >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABU

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
‘logical’ processors vs. cores and POSIX threaded >> applications. >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
kedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 16 June 2016 at 19:07, Mark Hamstra wrot

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-15 Thread Mark Hamstra
Nothing has changed in that regard, nor is there likely to be "progress", since more sophisticated or capable resource scheduling at the Application level is really beyond the design goals for standalone mode. If you want more in the way of multi-Application resource scheduling, then you should be

Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Mark Hamstra
Don't use Spark 2.0.0-preview. That was a preview release with known issues, and was intended to be used only for early, pre-release testing purposes. Spark 2.0.0 is now released, and you should be using that. On Thu, Jul 28, 2016 at 3:48 AM, Carlo.Allocca wrote: > and, of course I am using > >

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
What are you expecting to find? There currently are no releases beyond Spark 2.0.0. On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote: > If we want to use versions of Spark beyond the official 2.0.0 release, > specifically on Maven + Java, what steps should we take to upgrade? I can't > find the

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
imer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damag

Re: RE: Spark assembly in Maven repo?

2015-12-10 Thread Mark Hamstra
No, publishing a spark assembly jar is not fine. See the doc attached to https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a likely goal of Spark 2.0 will be the elimination of assemblies. On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com wrote: > Using maven to download the

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Mark Hamstra
It can be used, and is used in user code, but it isn't always as straightforward as you might think. This is mostly because a Job often isn't a Job -- or rather it is more than one Job. There are several RDD transformations that aren't lazy, so they end up launching "hidden" Jobs that you may not

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-17 Thread Mark Hamstra
g-apache-spark/ > Follow me at https://twitter.com/jaceklaskowski > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski > > > On Wed, Dec 16, 2015 at 10:55 AM, Mark Hamstra > wrote: > > It can be used, and is used in user code, but it isn't always as >

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
I don't understand. If you're using fair scheduling and don't set a pool, the default pool will be used. On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote: > > It seems currently spark.scheduler.pool must be set as localProperties > (associate with thread). Any reason why spark.scheduler.pool ca

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
ant is the default pool is fair > scheduling. But seems if I want to use fair scheduling now, I have to set > spark.scheduler.pool explicitly. > > On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra > wrote: > >> I don't understand. If you're using fair scheduling an

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
I can override the root pool in configuration file, Thanks Mark. > > On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra > wrote: > >> Just configure with >> FAIR in fairscheduler.xml (or >> in spark.scheduler.allocation.file if you have over-riden the default name >>
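
A sketch of how a job lands in a named pool once fair scheduling is enabled; the pool name here is hypothetical, and jobs from threads with no pool set fall back to the default pool:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[*]").setAppName("pool-sketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)
    // "production" is a hypothetical pool defined in fairscheduler.xml
    // (or in the file named by spark.scheduler.allocation.file).
    sc.setLocalProperty("spark.scheduler.pool", "production")
    // ... actions submitted from this thread run in that pool ...
    sc.setLocalProperty("spark.scheduler.pool", null) // back to the default pool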

Re: spark 1.6 Issue

2016-01-06 Thread Mark Hamstra
It's not a bug, but a larger heap is required with the new UnifiedMemoryManager: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172 On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > Hi All

Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
Same SparkContext means same pool of Workers. It's up to the Scheduler, not the SparkContext, whether the exact same Workers or Executors will be used to calculate simultaneous actions against the same RDD. It is likely that many of the same Workers and Executors will be used as the Scheduler tri
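
A sketch of simultaneous actions against one RDD through a shared SparkContext, with hypothetical sizes; each Future submits an independent Job, and the Scheduler decides where the resulting tasks run:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "simultaneous-sketch")
    val rdd = sc.parallelize(1 to 1000000).cache()
    // Two actions submitted concurrently from separate driver-side threads.
    val fCount = Future(rdd.count())
    val fSum = Future(rdd.map(_.toLong).sum())
    val count = Await.result(fCount, 10.minutes)
    val sum = Await.result(fSum, 10.minutes)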

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
-dev What do you mean by JobContext? That is a Hadoop mapreduce concept, not Spark. On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou wrote: > Dear all, > > Is there a way to reuse executor JVM across different JobContexts? Thanks. > > Best Regards, > Jia >

Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
ice surprise to me > > On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra > wrote: > >> Same SparkContext means same pool of Workers. It's up to the Scheduler, >> not the SparkContext, whether the exact same Workers or Executors will be >> used to calculate simultaneou

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
a Zou wrote: > Hi, Mark, sorry, I mean SparkContext. > I mean to change Spark into running all submitted jobs (SparkContexts) in > one executor JVM. > > Best Regards, > Jia > > On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra > wrote: > >> -dev >> >&

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
utor. After the application > runs to completion. The executor process will be killed. > But I hope that all applications submitted can run in the same executor, > can JobServer do that? If so, it’s really good news! > > Best Regards, > Jia > > On Jan 17, 2016, at 3:09 PM, Mark

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
ess JobServer can fundamentally solve my problem, > so that jobs can be submitted at different time and still share RDDs. > > Best Regards, > Jia > > > On Jan 17, 2016, at 3:44 PM, Mark Hamstra wrote: > > There is a 1-to-1 relationship between Spark Applications and > SparkC

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Mark Hamstra
What do you think is preventing you from optimizing your own RDD-level transformations and actions? AFAIK, nothing that has been added in Catalyst precludes you from doing that. The fact of the matter is, though, that there is less type and semantic information available to Spark from the raw RDD

Re: Spark 2.0.0 release plan

2016-01-29 Thread Mark Hamstra
https://github.com/apache/spark/pull/10608 On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote: > I'm not an authoritative source but I think it is indeed the plan to > move the default build to 2.11. > > See this discussion for more detail > > http://apache-spark-developers-list.1001551.n3.na

Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Mark Hamstra
[Spark shell welcome banner] version 2.0.0-SNAPSHOT Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_72) Type in expressions to have them evaluated. Type :he

Re: Fair scheduler pool details

2016-02-20 Thread Mark Hamstra
It's 2 -- and it's pretty hard to point to a line of code, a method, or even a class since the scheduling of Tasks involves a pretty complex interaction of several Spark components -- mostly the DAGScheduler, TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as well as the Sche

Re: SQL queries in Spark / YARN

2015-09-28 Thread Mark Hamstra
Yes. On Mon, Sep 28, 2015 at 12:46 PM, Robert Grandl wrote: > Hi guys, > > I was wondering if it's possible to submit SQL queries to Spark SQL, when > Spark is running atop YARN instead of standalone mode. > > Thanks, > Robert >

Re: foreachPartition

2015-10-30 Thread Mark Hamstra
The closure is sent to and executed on an Executor, so you need to be looking at the stdout of the Executors, not on the Driver. On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I'm just trying to do some operation inside foreachPartition, but I can't > even

Re: Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Mark Hamstra
Hah! No, that is not a "starter" issue. It touches on some fairly deep Spark architecture, and there have already been a few attempts to resolve the issue -- none entirely satisfactory, but you should definitely search out the work that has already been done. On Mon, Nov 2, 2015 at 5:51 AM, Jace

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In that case, as a > workar
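
A sketch of the difference, with hypothetical file names; SparkContext#union builds one UnionRDD over all inputs instead of a nested chain:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[*]", "union-sketch")
    val files = Seq("a.txt", "b.txt", "c.txt") // hypothetical inputs
    val rdds: Seq[RDD[String]] = files.map(sc.textFile(_))
    val flat: RDD[String] = sc.union(rdds)           // one UnionRDD, flat lineage
    val nested: RDD[String] = rdds.reduce(_ union _) // lineage grows a level per file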

Re: Slow stage?

2015-11-11 Thread Mark Hamstra
Those are from the Application Web UI -- look for the "DAG Visualization" and "Event Timeline" elements on Job and Stage pages. On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky wrote: > Hi Simone, > I'm afraid I don't have an answer to your question. However I noticed the > DAG figures in the att

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Mark Hamstra
> > In the near future, I guess GUI interfaces of Spark will be available > soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all. > They can analyze their data by clicking a few buttons, instead of writing > the programs. : ) That's not in the future. :) On Mon, Nov 23, 201

Re: Spark on yarn vs spark standalone

2015-11-30 Thread Mark Hamstra
Standalone mode also supports running the driver on a cluster node. See "cluster" mode in http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications . Also, http://spark.apache.org/docs/latest/spark-standalone.html#high-availability On Mon, Nov 30, 2015 at 9:47 AM, Ja

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
> > It is not designed for interactive queries. You might want to ask the designers of Spark, Spark SQL, and particularly some things built on top of Spark (such as BlinkDB) about their intent with regard to interactive queries. Interactive queries are not the only designed use of Spark, but it

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
ncy it may > not be a good fit. > > M > > On Dec 1, 2015, at 7:23 PM, Andrés Ivaldi wrote: > > Ok, so latency problem is being generated because I'm using SQL as source? > how about csv, hive, or another source? > > On Tue, Dec 1, 2015 at 9:18 PM, Mark

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
some other solution. On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi wrote: > Ok, so latency problem is being generated because I'm using SQL as source? > how about csv, hive, or another source? > > On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra > wrote: > >> It is no

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Mark Hamstra
Where it could start to make some sense is if you wanted a single application to be able to work with more than one Spark cluster -- but that's a pretty weird or unusual thing to do, and I'm pretty sure it wouldn't work correctly at present. On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust wrote

Re: TCP/IP speedup

2015-08-01 Thread Mark Hamstra
https://spark-summit.org/2015/events/making-sense-of-spark-performance/ On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote: > Hi All! > > How important would be a significant performance improvement to TCP/IP > itself, in terms of > overall job performance improvement. Which part would be most

Re: Set Job Descriptions for Scala application

2015-08-05 Thread Mark Hamstra
SparkContext#setJobDescription or SparkContext#setJobGroup On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica wrote: > Hello, > > My Spark application is written in Scala and submitted to a Spark cluster > in standalone mode. The Spark Jobs for my application are listed in the > Spark UI like this:
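
A sketch of both calls with hypothetical labels; these are per-thread properties that show up against the corresponding Jobs in the web UI:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "label-sketch")
    sc.setJobGroup("reports", "Nightly reporting jobs")
    sc.setJobDescription("Count distinct users")
    val n = sc.parallelize(Seq("a", "b", "a")).distinct().count()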

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Mark Hamstra
There is the Async API ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala), which makes use of FutureAction ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala). You could als
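
A sketch of the timeout pattern using the Async API, with a hypothetical duration; a FutureAction is a scala.concurrent.Future, so the usual Await/cancel machinery applies:

    import scala.concurrent.Await
    import scala.concurrent.duration._
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "timeout-sketch")
    val fa = sc.parallelize(1 to 1000000).countAsync() // AsyncRDDActions
    try {
      val n = Await.result(fa, 30.seconds)
    } catch {
      case _: java.util.concurrent.TimeoutException =>
        fa.cancel() // cancel the underlying laggard job rather than wait it out
    }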

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Mark Hamstra
Where do you see a race in the DAGScheduler? On a quick look at your stack trace, this just looks to me like a Job where a Stage failed and then the DAGScheduler aborted the failed Job. On Thu, Sep 24, 2015 at 12:00 PM, robin_up wrote: > Hi > > After upgrade to 1.5, we found a possible racing c

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
One way you can start to make this make more sense, Sean, is if you exploit the code/data duality so that the non-distributed data that you are sending out from the driver is actually paying a role more like code (or at least parameters.) What is sent from the driver to an Executor is then used (t

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
s/paying a role/playing a role/ On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra wrote: > One way you can start to make this make more sense, Sean, is if you > exploit the code/data duality so that the non-distributed data that you are > sending out from the driver is actually paying a

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
vailability. It is used in Spark >>> Streaming with Kafka, it is also used with Hive for concurrency. It is also >>> a distributed locking system. >>> >>> HTH >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn *

Re: Spark scheduling mode

2016-09-01 Thread Mark Hamstra
Just because you've flipped spark.scheduler.mode to FAIR, that doesn't mean that Spark can magically configure and start multiple scheduling pools for you, nor can it know to which pools you want jobs assigned. Without doing any setup of additional scheduling pools or assigning of jobs to pools, y

Re: Spark scheduling mode

2016-09-01 Thread Mark Hamstra
mean, round robin for the jobs that belong to the default pool. > > Cheers, > -- > *From:* Mark Hamstra > *Sent:* Thursday, September 1, 2016 7:24:54 PM > *To:* enrico d'urso > *Cc:* user@spark.apache.org > *Subject:* Re: Spark schedul

Re: Spark scheduling mode

2016-09-01 Thread Mark Hamstra
scheduled in round robin way, > am I right? > > -- > *From:* Mark Hamstra > *Sent:* Thursday, September 1, 2016 8:19:44 PM > *To:* enrico d'urso > *Cc:* user@spark.apache.org > *Subject:* Re: Spark scheduling mode > > The default pool (``) can be configured like any > ot

Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
ht? > > Thank you > ---------- > *From:* Mark Hamstra > *Sent:* Thursday, September 1, 2016 8:44:10 PM > > *To:* enrico d'urso > *Cc:* user@spark.apache.org > *Subject:* Re: Spark scheduling mode > > Spark's FairSchedulingAlgorithm is not round robin:

Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
And, no, Spark's scheduler will not preempt already running Tasks. In fact, just killing running Tasks for any reason is trickier than we'd like it to be, so it isn't done by default: https://issues.apache.org/jira/browse/SPARK-17064 On Fri, Sep 2, 2016 at 11:34 AM, Mark Hamstra

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Mark Hamstra
It sounds like you should be writing an application and not trying to force the spark-shell to do more than what it was intended for. On Tue, Sep 13, 2016 at 11:53 AM, Kevin Burton wrote: > I sort of agree but the problem is that some of this should be code. > > Some of our ES indexes have 100-2

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mark Hamstra
> > The best network results are achieved when Spark nodes share the same > hosts as Hadoop or they happen to be on the same subnet. > That's only true for those portions of a Spark execution pipeline that are actually reading from HDFS. If you're re-using an RDD for which the needed shuffle file

Re: How to stop a running job

2016-10-05 Thread Mark Hamstra
Yes and no. Something that you need to be aware of is that a Job as such exists in the DAGScheduler as part of the Application running on the Driver. When talking about stopping or killing a Job, however, what people often mean is not just stopping the DAGScheduler from telling the Executors to r

Re: previous stage results are not saved?

2016-10-17 Thread Mark Hamstra
There is no need to do that if 1) the stage that you are concerned with either made use of or produced MapOutputs/shuffle files; 2) reuse of those shuffle files (which may very well be in the OS buffer cache of the worker nodes) is sufficient for your needs; 3) the relevant Stage objects haven't go

Re: Infinite Loop in Spark

2016-10-27 Thread Mark Hamstra
Using a single SparkContext for an extended period of time is how long-running Spark Applications such as the Spark Job Server work ( https://github.com/spark-jobserver/spark-jobserver). It's an established pattern. On Thu, Oct 27, 2016 at 11:46 AM, Gervásio Santos wrote: > Hi guys! > > I'm dev

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Mark Hamstra
The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0 On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers wrote: > seems like the artifacts are on maven central but the website is not yet > updated. > > strangely the tag v2.1.0 is not yet available on github. i assume its > equal to v2

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Mark Hamstra
; > Wed > Dec 28 20:01:10 UTC 2016 > 2.2.0-SNAPSHOT/ > <https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.2.0-SNAPSHOT/> > Wed > Dec 28 19:12:38 UTC 2016 > > What's with 2.1.1-SNAPSHOT? Is that version about to be released as well? &g

Re: Spark and Kafka integration

2017-01-12 Thread Mark Hamstra
See "API compatibility" in http://spark.apache.org/versioning-policy.html While code that is annotated as Experimental is still a good faith effort to provide a stable and useful API, the fact is that we're not yet confident enough that we've got the public API in exactly the form that we want to

Re:

2017-01-21 Thread Mark Hamstra
I wouldn't say that Executors are dumb, but there are some pretty clear divisions of concepts and responsibilities across the different pieces of the Spark architecture. A Job is a concept that is completely unknown to an Executor, which deals instead with just the Tasks that it is given. So you a

Re: DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Mark Hamstra
Try selecting a particular Job instead of looking at the summary page for all Jobs. On Sat, Jan 28, 2017 at 4:25 PM, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Hi Jacek, > > I tried accessing Spark web UI on both Firefox and Google Chrome browsers > with ad blocker enabled. I do

Re: Having multiple spark context

2017-01-29 Thread Mark Hamstra
More than one Spark Context in a single Application is not supported. On Sun, Jan 29, 2017 at 9:08 PM, wrote: > Hi, > > > > I have a requirement in which, my application creates one Spark context in > Distributed mode whereas another Spark context in local mode. > > When I am creating this, my c

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread Mark Hamstra
yes On Fri, Feb 3, 2017 at 10:08 PM, kant kodali wrote: > can I use Spark Standalone with HDFS but no YARN? > > Thanks! >

Re: is dataframe thread safe?

2017-02-13 Thread Mark Hamstra
If you update the data, then you don't have the same DataFrame anymore. If you don't do like Assaf did, caching and forcing evaluation of the DataFrame before using that DataFrame concurrently, then you'll still get consistent and correct results, but not necessarily efficient results. If the fully
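
A sketch of the cache-and-force pattern being endorsed, with a hypothetical input path; forcing evaluation before sharing means concurrent readers hit the materialized data instead of racing to compute it:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]").appName("df-sketch").getOrCreate()
    val df = spark.read.parquet("/data/events") // hypothetical path
    df.cache()
    df.count() // force full evaluation so the cache is populated
    // df can now be used from multiple threads without redundant recomputation.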

Fwd: Will Spark ever run the same task at the same time

2017-02-20 Thread Mark Hamstra
First, the word you are looking for is "straggler", not "strangler" -- very different words. Second, "idempotent" doesn't mean "only happens once", but rather "if it does happen more than once, the effect is no different than if it only happened once". It is possible to insert a nearly limitless v

Re: Can't transform RDD for the second time

2017-02-28 Thread Mark Hamstra
foreachPartition is not a transformation; it is an action. If you want to transform an RDD using an iterator in each partition, then use mapPartitions. On Tue, Feb 28, 2017 at 8:17 PM, jeremycod wrote: > Hi, > > I'm trying to transform one RDD two times. I'm using foreachParition and > embedded
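
A sketch of the transformation counterpart, with hypothetical data; mapPartitions is lazy and yields a new RDD, and its iterator can only be traversed once:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "mapPartitions-sketch")
    val rdd = sc.parallelize(1 to 100, 4)
    val doubled = rdd.mapPartitions { iter =>
      iter.map(_ * 2) // lazy; do not consume iter twice
    }
    doubled.count() // an action triggers the actual computation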

Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
Shuffle files are cleaned when they are no longer referenced. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar < ashan...@netflix.com.invalid> wrote: > Hi! > > In spark on yarn, when are

Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
When the RDD using them goes out of scope. On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar wrote: > Thanks Mark! follow up question, do you know when shuffle files are > usually un-referenced? > > On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra > wrote: > >> Shuffl

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
Your mixing up different levels of scheduling. Spark's fair scheduler pools are about scheduling Jobs, not Applications; whereas YARN queues with Spark are about scheduling Applications, not Jobs. On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas wrote: > I'm having trouble understanding the differe

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
grrr... s/your/you're/ On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra wrote: > Your mixing up different levels of scheduling. Spark's fair scheduler > pools are about scheduling Jobs, not Applications; whereas YARN queues with > Spark are about scheduling Applications, not Jo

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
evant/useful in this context? > > On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra > wrote: > >> grrr... s/your/you're/ >> >> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra >> wrote: >> >> Your mixing up different levels of scheduling. Spark's

Re: Securing Spark Job on Cluster

2017-04-28 Thread Mark Hamstra
spark.local.dir http://spark.apache.org/docs/latest/configuration.html On Fri, Apr 28, 2017 at 8:51 AM, Shashi Vishwakarma < shashi.vish...@gmail.com> wrote: > Yes I am using HDFS .Just trying to understand couple of point. > > There would be two kind of encryption which would be required. > > 1

Re: scalastyle violation on mvn install but not on mvn package

2017-05-04 Thread Mark Hamstra
The check goal of the scalastyle plugin runs during the "verify" phase, which is between "package" and "install"; so running just "package" will not run scalastyle:check. On Thu, May 4, 2017 at 7:45 AM, yiskylee wrote: > ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean >

Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Mark Hamstra
This looks more like a matter for Databricks support than spark-user. On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com wrote: > df = spark.sqlContext.read.csv('out/df_in.csv') >> > > >> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in >> metastore. hive.metastore.schema.v

Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Mark Hamstra
ding which > lib to use. > > On 9 May 2017 at 14:30, Mark Hamstra wrote: > >> This looks more like a matter for Databricks support than spark-user. >> >> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com < >> lucas.g...@gmail.com> wrote: >> >>>

Re: scalastyle violation on mvn install but not on mvn package

2017-05-23 Thread Mark Hamstra
> two replies are not even in the same "email conversation". > I don't know the mechanics of why posts do or don't show up via Nabble, but Nabble is neither the canonical archive nor the system of record for Apache mailing lists. > On Thu, May 4, 2017 at 8:11 PM, Mark
