This discussion belongs on the dev list. Please post any replies there.
On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park
wrote:
> Hi,
>
> I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to
> confirm whether these are bugs or not before opening a jira.
>
> *1)* I can no longer
Correct; and PairRDDFunctions#join does still exist in versions of Spark
that do have DataFrame, so you don't necessarily have to use DataFrame to
do this even then (although there are advantages to using the DataFrame
approach.)
Your basic problem is that you have an RDD of tuples, where each tup
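For the join itself, a rough spark-shell sketch (the data and the (id, value) layout are made up; `sc` is the shell's SparkContext):

val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val right = sc.parallelize(Seq((1, 10.0), (3, 30.0)))

// Because the elements are 2-tuples, PairRDDFunctions is pulled in implicitly,
// so join is available directly on the RDDs -- no DataFrame required.
val joined = left.join(right)   // RDD[(Int, (String, Double))]
joined.collect().foreach(println)

The DataFrame route buys you the optimizer, which is the advantage alluded to above.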
That would constitute a major change in Spark's architecture. It's not
happening anytime soon.
On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar wrote:
> Possibly in future, if and when spark architecture allows workers to
> launch spark jobs (the functions passed to transformation or action APIs o
Correct. Trading away scalability for increased performance is not an
option for the standard Spark API.
On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> It would be even faster to load the data on the driver and sort it there
> without using Spark :).
> Collect() would involve gathering all the data on a single machine as well.
>
> Thanks,
> Raghav
>
> On Tuesday, June 9, 2015, Mark Hamstra wrote:
>
>> Correct. Trading away scalability for increased performance is not an
>> option for the standard Spark API.
>
>
> I would guess in such shuffles the bottleneck is serializing the data
> rather than raw IO, so I'm not sure explicitly buffering the data in the
> JVM process would yield a large improvement.
Good guess! It is very hard to beat the performance of retrieving shuffle
outputs from the OS buffer cache.
Do you want to transform the RDD, or just produce some side effect with its
contents? If the latter, you want foreachPartition, not mapPartitions.
On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
> In rdd.mapPartition(…) if I try to iterate through
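A minimal, untested sketch of the side-effect case (`sc` is the spark-shell SparkContext):

val rdd = sc.parallelize(1 to 10, 2)

// foreachPartition is an action that returns Unit -- use it when all you want is a
// side effect per partition. Any println output goes to the Executors' stdout
// (in local mode the driver and executor share a JVM, so it shows up in the shell).
rdd.foreachPartition { iter =>
  iter.foreach(x => println(s"saw $x"))
}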
No. He is collecting the results of the SQL query, not the whole dataset.
The REPL does retain references to prior results, so it's not really the
best tool to be using when you want no-longer-needed results to be
automatically garbage collected.
On Mon, Jun 29, 2015 at 9:13 AM, ayan guha wrote:
standalone deployment (it is slightly mentioned in SPARK-9882, but it seems
> to be abandoned). Do you know if there is such an activity?
>
> --
> Be well!
> Jean Morozov
>
> On Sun, Feb 21, 2016 at 4:32 AM, Mark Hamstra
> wrote:
>
>> It's 2 -- and it's
There's probably nothing wrong other than a glitch in the reporting of
Executor state transitions to the UI -- one of those low-priority items
I've been meaning to look at for a while.
On Mon, Mar 7, 2016 at 12:15 AM, Sonal Goyal wrote:
> Maybe check the worker logs to see what's going wrong w
One issue is that RAID levels providing data replication are not necessary
since HDFS already replicates blocks on multiple nodes.
On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov wrote:
> Parallel disk IO? But the effect should be less noticeable compared to
> Hadoop which reads/writes a lot. Much
>
> For example, if you're looking to scale out to 1000 concurrent requests,
> this is 1000 concurrent Spark jobs. This would require a cluster with 1000
> cores.
This doesn't make sense. A Spark Job is a driver/DAGScheduler concept
without any 1:1 correspondence between Worker cores and Jobs.
uction. i
> would love for someone to prove otherwise.
>
> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra
> wrote:
>
>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs. This would require a cluster
It's not just if the RDD is explicitly cached, but also if the map outputs
for stages have been materialized into shuffle files and are still
accessible through the map output tracker. Because of that, explicitly
caching an RDD often gains you little or nothing, since even without a
call to cache, the prior shuffle outputs can be reused and the
corresponding stages skipped.
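You can see that for yourself with a throwaway sketch like this (spark-shell, made-up data):

val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _)   // introduces a shuffle; map outputs are written to shuffle files

counts.count()          // first action: both stages run
counts.collect()        // second action: the map-side stage shows up as "skipped" in the UI,
                        // even though cache()/persist() were never called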
g may not be 100% accurate and bug
free.
On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph
wrote:
> Okay, so out of 164 stages, is 163 are skipped. And how 41405 tasks are
> skipped if the total is only 19788.
>
> On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra
> wrote:
>
>>
You're not getting what Ted is telling you. Your `dict` is an RDD[String]
-- i.e. it is a collection of a single value type, String. But
`collectAsMap` is only defined for PairRDDs that have key-value pairs for
their data elements. Both a key and a value are needed to collect into a
Map[K, V].
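Something along these lines should work, assuming (purely for illustration) that each line splits into a tab-separated key and value:

val dict = sc.textFile("dict.txt")      // RDD[String]: collectAsMap is NOT defined here

val asMap = dict
  .map { line =>
    val Array(k, v) = line.split("\t", 2)   // build (key, value) pairs
    (k, v)
  }
  .collectAsMap()                           // defined now, because this is a PairRDD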
Neither of you is making any sense to me. If you just have an RDD for
which you have specified a series of transformations but you haven't run
any actions, then neither checkpointing nor saving makes sense -- you
haven't computed anything yet; you've only written out the recipe for how
the computation is to be performed.
On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu wrote:
> bq. when I get the last RDD
> If I read Todd's first email correctly, the computation has been done.
> I could be wrong.
>
> On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra
> wrote:
>
>> Neither of you is making a
You seem to be confusing the concepts of Job and Application. A Spark
Application has a SparkContext. A Spark Application is capable of running
multiple Jobs, each with its own ID, visible in the webUI.
On Thu, Mar 24, 2016 at 6:11 AM, Max Schmidt wrote:
> Am 24.03.2016 um 10:34 schrieb Simon
Yes and no. The idea of n-tier architecture is about 20 years older than
Spark and doesn't really apply to Spark as n-tier was original conceived.
If the n-tier model helps you make sense of some things related to Spark,
then use it; but don't get hung up on trying to force a Spark architecture
in
Why would the Executors shutdown when the Job is terminated? Executors are
bound to Applications, not Jobs. Furthermore,
unless spark.job.interruptOnCancel is set to true, canceling the Job at the
Application and DAGScheduler level won't actually interrupt the Tasks
running on the Executors. If
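If you do want cancellation to reach the running Tasks, the usual pattern is a job group with interruptOnCancel -- a rough sketch (the group id and workload are made up; `sc` is your SparkContext):

def slow(i: Int): Int = { Thread.sleep(1); i }   // stand-in for real work

// Thread submitting the work:
sc.setJobGroup("nightly-report", "long-running report", interruptOnCancel = true)
val result = sc.parallelize(1 to 100000).map(slow).collect()

// Some other thread (a UI handler, a timeout watchdog, ...) can then do:
// sc.cancelJobGroup("nightly-report")
// which cancels every Job in the group and, because interruptOnCancel = true,
// also interrupts the threads running that group's Tasks.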
https://spark.apache.org/docs/latest/cluster-overview.html
On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar
wrote:
> On Spark GUI I can see the list of Workers.
>
> I always understood that workers are used by executors.
>
> What is the relationship between workers and executors please. Is it one
>
That's also available in standalone.
On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov
wrote:
> Spark on Yarn supports dynamic resource allocation
>
> So, you can run several spark-shells / spark-submits / spark-jobserver /
> zeppelin on one cluster without defining upfront how many executor
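In standalone mode the sketch below is roughly what's involved; it assumes the external shuffle service has been enabled on the workers, and the min/max values are made up:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dyn-alloc-sketch")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // required for dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")

val sc = new SparkContext(conf)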
To be fair, the Stratosphere project from which Flink springs was started
as a collaborative university research project in Germany about the same
time that Spark was first released as Open Source, so they are near
contemporaries rather than Flink having been started only well after Spark
was an es
You appear to be misunderstanding the nature of a Stage. Individual
transformation steps such as `map` do not define the boundaries of Stages.
Rather, a sequence of transformations in which there is only a
NarrowDependency between each of the transformations will be pipelined into
a single Stage.
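For example (spark-shell sketch, file name made up), all of the narrow transformations below end up in one Stage, and only the shuffle for reduceByKey starts a second:

val words = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))   // narrow dependency -> pipelined into the same Stage
  .filter(_.nonEmpty)         // narrow dependency -> still the same Stage
  .map(w => (w, 1))           // narrow dependency -> still the same Stage

val counts = words.reduceByKey(_ + _)   // wide dependency -> Stage boundary here
counts.count()                          // runs as 2 Stages, not one per transformation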
box then it's fine but when you
> have large number of people on this site complaining about OOM and shuffle
> error all the time you need to start providing some transparency to
> address that.
>
> Thanks
>
>
> On Wed, May 25, 2016 at 6:41 PM, Mark Hamstra
> wrote:
>
I don't know what documentation you were referring to, but this is clearly
an erroneous statement: "Threads are virtual cores." At best it is
terminology abuse by a hardware manufacturer. Regardless, Spark can't get
too concerned about how any particular hardware vendor wants to refer to
the spec
>> ‘logical’ processors vs. cores and POSIX threaded
>> applications.
Nothing has changed in that regard, nor is there likely to be "progress",
since more sophisticated or capable resource scheduling at the Application
level is really beyond the design goals for standalone mode. If you want
more in the way of multi-Application resource scheduling, then you should
be
Don't use Spark 2.0.0-preview. That was a preview release with known
issues, and it was intended to be used only for early, pre-release testing
purposes. Spark 2.0.0 is now released, and you should be using that.
On Thu, Jul 28, 2016 at 3:48 AM, Carlo.Allocca
wrote:
> and, of course I am using
>
>
What are you expecting to find? There currently are no releases beyond
Spark 2.0.0.
On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote:
> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I can't
> find the
No, publishing a spark assembly jar is not fine. See the doc attached to
https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a
likely goal of Spark 2.0 will be the elimination of assemblies.
On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com
wrote:
> Using maven to download the
It can be used, and is used in user code, but it isn't always as
straightforward as you might think. This is mostly because a Job often
isn't a Job -- or rather it is more than one Job. There are several RDD
transformations that aren't lazy, so they end up launching "hidden" Jobs
that you may not
g-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Wed, Dec 16, 2015 at 10:55 AM, Mark Hamstra
> wrote:
> > It can be used, and is used in user code, but it isn't always as
>
I don't understand. If you're using fair scheduling and don't set a pool,
the default pool will be used.
On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote:
>
> It seems currently spark.scheduler.pool must be set as localProperties
> (associate with thread). Any reason why spark.scheduler.pool ca
ant is the default pool is fair
> scheduling. But seems if I want to use fair scheduling now, I have to set
> spark.scheduler.pool explicitly.
>
> On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra
> wrote:
>
>> I don't understand. If you're using fair scheduling an
I can override the root pool in configuration file, Thanks Mark.
>
> On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra
> wrote:
>
>> Just configure the default pool with a schedulingMode of FAIR in
>> fairscheduler.xml (or in spark.scheduler.allocation.file if you have
>> overridden the default name
>>
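Putting the pieces together, a sketch of what the application side looks like (the pool name and file path are made up; the pool itself is assumed to be defined in that fairscheduler.xml):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-pools-sketch")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val sc = new SparkContext(conf)

// spark.scheduler.pool is a thread-local property: Jobs submitted from this thread
// go to the "production" pool; threads that never set it use the default pool.
sc.setLocalProperty("spark.scheduler.pool", "production")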
It's not a bug, but a larger heap is required with the new
UnifiedMemoryManager:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172
On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Hi All
Same SparkContext means same pool of Workers. It's up to the Scheduler,
not the SparkContext, whether the exact same Workers or Executors will be
used to calculate simultaneous actions against the same RDD. It is likely
that many of the same Workers and Executors will be used as the Scheduler
tri
-dev
What do you mean by JobContext? That is a Hadoop mapreduce concept, not
Spark.
On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou wrote:
> Dear all,
>
> Is there a way to reuse executor JVM across different JobContexts? Thanks.
>
> Best Regards,
> Jia
>
ice surprise to me
>
> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra
> wrote:
>
>> Same SparkContext means same pool of Workers. It's up to the Scheduler,
>> not the SparkContext, whether the exact same Workers or Executors will be
>> used to calculate simultaneou
a Zou wrote:
> Hi, Mark, sorry, I mean SparkContext.
> I mean to change Spark into running all submitted jobs (SparkContexts) in
> one executor JVM.
>
> Best Regards,
> Jia
>
> On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra
> wrote:
>
>> -dev
>>
>>
utor. After the application
> runs to completion. The executor process will be killed.
> But I hope that all applications submitted can run in the same executor,
> can JobServer do that? If so, it’s really good news!
>
> Best Regards,
> Jia
>
> On Jan 17, 2016, at 3:09 PM, Mark
ess JobServer can fundamentally solve my problem,
> so that jobs can be submitted at different time and still share RDDs.
>
> Best Regards,
> Jia
>
>
> On Jan 17, 2016, at 3:44 PM, Mark Hamstra wrote:
>
> There is a 1-to-1 relationship between Spark Applications and
> SparkContexts.
What do you think is preventing you from optimizing your own RDD-level
transformations and actions? AFAIK, nothing that has been added in
Catalyst precludes you from doing that. The fact of the matter is, though,
that there is less type and semantic information available to Spark from
the raw RDD
https://github.com/apache/spark/pull/10608
On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote:
> I'm not an authoritative source but I think it is indeed the plan to
> move the default build to 2.11.
>
> See this discussion for more detail
>
> http://apache-spark-developers-list.1001551.n3.na
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information.
It's 2 -- and it's pretty hard to point to a line of code, a method, or
even a class since the scheduling of Tasks involves a pretty complex
interaction of several Spark components -- mostly the DAGScheduler,
TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as
well as the Sche
Yes.
On Mon, Sep 28, 2015 at 12:46 PM, Robert Grandl
wrote:
> Hi guys,
>
> I was wondering if it's possible to submit SQL queries to Spark SQL, when
> Spark is running atop YARN instead of standalone mode.
>
> Thanks,
> Robert
>
The closure is sent to and executed on an Executor, so you need to be looking
at the stdout of the Executors, not of the Driver.
On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> I'm just trying to do some operation inside foreachPartition, but I can't
> even
Hah! No, that is not a "starter" issue. It touches on some fairly deep
Spark architecture, and there have already been a few attempts to resolve
the issue -- none entirely satisfactory, but you should definitely search
out the work that has already been done.
On Mon, Nov 2, 2015 at 5:51 AM, Jace
For more than a small number of files, you'd be better off using
SparkContext#union instead of RDD#union. That will avoid building up a
lengthy lineage.
On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote:
> Hey Jeff,
> Do you mean reading from multiple text files? In that case, as a
> workar
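Roughly (spark-shell sketch, paths made up):

val paths = (1 to 500).map(i => s"hdfs:///data/part-$i.txt")
val rdds  = paths.map(p => sc.textFile(p))

// RDD#union applied 500 times in a row builds a long lineage chain;
// SparkContext#union produces a single UnionRDD over all of the inputs instead.
val all = sc.union(rdds)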
Those are from the Application Web UI -- look for the "DAG Visualization"
and "Event Timeline" elements on Job and Stage pages.
On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky wrote:
> Hi Simone,
> I'm afraid I don't have an answer to your question. However I noticed the
> DAG figures in the att
>
> In the near future, I guess GUI interfaces of Spark will be available
> soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all.
> They can analyze their data by clicking a few buttons, instead of writing
> the programs. : )
That's not in the future. :)
On Mon, Nov 23, 201
Standalone mode also supports running the driver on a cluster node. See
"cluster" mode in
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
. Also,
http://spark.apache.org/docs/latest/spark-standalone.html#high-availability
On Mon, Nov 30, 2015 at 9:47 AM, Ja
>
> It is not designed for interactive queries.
You might want to ask the designers of Spark, Spark SQL, and particularly
some things built on top of Spark (such as BlinkDB) about their intent with
regard to interactive queries. Interactive queries are not the only
designed use of Spark, but it
ncy it may
> not be a good fit.
>
> M
>
> On Dec 1, 2015, at 7:23 PM, Andrés Ivaldi wrote:
>
> Ok, so latency problem is being generated because I'm using SQL as source?
> how about csv, hive, or another source?
>
> On Tue, Dec 1, 2015 at 9:18 PM, Mark
some other solution.
On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi wrote:
> Ok, so latency problem is being generated because I'm using SQL as source?
> how about csv, hive, or another source?
>
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra
> wrote:
>
>> It is no
Where it could start to make some sense is if you wanted a single
application to be able to work with more than one Spark cluster -- but
that's a pretty weird or unusual thing to do, and I'm pretty sure it
wouldn't work correctly at present.
On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust
wrote
https://spark-summit.org/2015/events/making-sense-of-spark-performance/
On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote:
> Hi All!
>
> How important would be a significant performance improvement to TCP/IP
> itself, in terms of
> overall job performance improvement. Which part would be most
SparkContext#setJobDescription or SparkContext#setJobGroup
On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica wrote:
> Hello,
>
> My Spark application is written in Scala and submitted to a Spark cluster
> in standalone mode. The Spark Jobs for my application are listed in the
> Spark UI like this:
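For example (the names and path are made up; `sc` is your SparkContext):

sc.setJobGroup("etl", "Nightly ETL: load and aggregate")
sc.setJobDescription("count distinct users")   // label shown for the next Job(s) in the UI
val n = sc.textFile("hdfs:///logs/users").distinct().count()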
There is the Async API (
https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala),
which makes use of FutureAction (
https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala).
You could als
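A small sketch of the async route (spark-shell; the RDD is made up):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

val rdd = sc.parallelize(1 to 1000000)

// countAsync returns a FutureAction: the calling thread is not blocked,
// and the underlying Job can be cancelled through the returned handle.
val f = rdd.countAsync()
f.onComplete {
  case Success(n)  => println(s"count = $n")
  case Failure(ex) => println(s"job failed: $ex")
}
// f.cancel()   // would cancel the underlying Job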
Where do you see a race in the DAGScheduler? On a quick look at your stack
trace, this just looks to me like a Job where a Stage failed and then the
DAGScheduler aborted the failed Job.
On Thu, Sep 24, 2015 at 12:00 PM, robin_up wrote:
> Hi
>
> After upgrade to 1.5, we found a possible racing c
One way you can start to make this make more sense, Sean, is if you exploit
the code/data duality so that the non-distributed data that you are sending
out from the driver is actually paying a role more like code (or at least
parameters.) What is sent from the driver to an Executor is then used
(t
s/paying a role/playing a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra
wrote:
> One way you can start to make this make more sense, Sean, is if you
> exploit the code/data duality so that the non-distributed data that you are
> sending out from the driver is actually paying a
vailability. It is used in Spark
>>> Streaming with Kafka, it is also used with Hive for concurrency. It is also
>>> a distributed locking system.
Just because you've flipped spark.scheduler.mode to FAIR, that doesn't mean
that Spark can magically configure and start multiple scheduling pools for
you, nor can it know to which pools you want jobs assigned. Without doing
any setup of additional scheduling pools or assigning of jobs to pools,
your Jobs will all just end up running in the single default pool.
mean, round robin for the jobs that belong to the default pool.
>
> Cheers,
> --
> *From:* Mark Hamstra
> *Sent:* Thursday, September 1, 2016 7:24:54 PM
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark schedul
scheduled in round robin way,
> am I right?
>
> --
> *From:* Mark Hamstra
> *Sent:* Thursday, September 1, 2016 8:19:44 PM
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> The default pool (``) can be configured like any
> ot
ht?
>
> Thank you
> ----------
> *From:* Mark Hamstra
> *Sent:* Thursday, September 1, 2016 8:44:10 PM
>
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> Spark's FairSchedulingAlgorithm is not round robin:
And, no, Spark's scheduler will not preempt already running Tasks. In
fact, just killing running Tasks for any reason is trickier than we'd like
it to be, so it isn't done by default:
https://issues.apache.org/jira/browse/SPARK-17064
On Fri, Sep 2, 2016 at 11:34 AM, Mark Hamstra
It sounds like you should be writing an application and not trying to force
the spark-shell to do more than what it was intended for.
On Tue, Sep 13, 2016 at 11:53 AM, Kevin Burton wrote:
> I sort of agree but the problem is that some of this should be code.
>
> Some of our ES indexes have 100-2
>
> The best network results are achieved when Spark nodes share the same
> hosts as Hadoop or they happen to be on the same subnet.
>
That's only true for those portions of a Spark execution pipeline that are
actually reading from HDFS. If you're re-using an RDD for which the needed
shuffle file
Yes and no. Something that you need to be aware of is that a Job as such
exists in the DAGScheduler as part of the Application running on the
Driver. When talking about stopping or killing a Job, however, what people
often mean is not just stopping the DAGScheduler from telling the Executors
to r
There is no need to do that if 1) the stage that you are concerned with
either made use of or produced MapOutputs/shuffle files; 2) reuse of those
shuffle files (which may very well be in the OS buffer cache of the worker
nodes) is sufficient for your needs; 3) the relevant Stage objects haven't
gone out of scope.
Using a single SparkContext for an extended period of time is how
long-running Spark Applications such as the Spark Job Server work (
https://github.com/spark-jobserver/spark-jobserver). It's an established
pattern.
On Thu, Oct 27, 2016 at 11:46 AM, Gervásio Santos wrote:
> Hi guys!
>
> I'm dev
The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0
On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers wrote:
> seems like the artifacts are on maven central but the website is not yet
> updated.
>
> strangely the tag v2.1.0 is not yet available on github. i assume its
> equal to v2
> Wed Dec 28 20:01:10 UTC 2016
> 2.2.0-SNAPSHOT/
> <https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.2.0-SNAPSHOT/>
> Wed Dec 28 19:12:38 UTC 2016
>
> What's with 2.1.1-SNAPSHOT? Is that version about to be released as well?
See "API compatibility" in http://spark.apache.org/versioning-policy.html
While code that is annotated as Experimental is still a good faith effort
to provide a stable and useful API, the fact is that we're not yet
confident enough that we've got the public API in exactly the form that we
want to
I wouldn't say that Executors are dumb, but there are some pretty clear
divisions of concepts and responsibilities across the different pieces of
the Spark architecture. A Job is a concept that is completely unknown to an
Executor, which deals instead with just the Tasks that it is given. So you
a
Try selecting a particular Job instead of looking at the summary page for
all Jobs.
On Sat, Jan 28, 2017 at 4:25 PM, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:
> Hi Jacek,
>
> I tried accessing Spark web UI on both Firefox and Google Chrome browsers
> with ad blocker enabled. I do
More than one Spark Context in a single Application is not supported.
On Sun, Jan 29, 2017 at 9:08 PM, wrote:
> Hi,
>
>
>
> I have a requirement in which, my application creates one Spark context in
> Distributed mode whereas another Spark context in local mode.
>
> When I am creating this, my c
yes
On Fri, Feb 3, 2017 at 10:08 PM, kant kodali wrote:
> can I use Spark Standalone with HDFS but no YARN?
>
> Thanks!
>
If you update the data, then you don't have the same DataFrame anymore. If
you don't do like Assaf did, caching and forcing evaluation of the
DataFrame before using that DataFrame concurrently, then you'll still get
consistent and correct results, but not necessarily efficient results. If
the fully
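What Assaf did amounts to something like this (a sketch; `spark` is a SparkSession and the path is made up):

val df = spark.read.parquet("/data/events")

df.cache()    // mark the DataFrame for caching
df.count()    // force evaluation once, up front, so the concurrent users that follow
              // hit the cached data instead of each re-running the full lineage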
First, the word you are looking for is "straggler", not "strangler" -- very
different words. Second, "idempotent" doesn't mean "only happens once", but
rather "if it does happen more than once, the effect is no different than
if it only happened once".
It is possible to insert a nearly limitless v
foreachPartition is not a transformation; it is an action. If you want to
transform an RDD using an iterator in each partition, then use
mapPartitions.
On Tue, Feb 28, 2017 at 8:17 PM, jeremycod wrote:
> Hi,
>
> I'm trying to transform one RDD two times. I'm using foreachParition and
> embedded
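The transformation version would look roughly like this (spark-shell sketch):

val rdd = sc.parallelize(1 to 10, 2)

// mapPartitions is a transformation: the function gets an Iterator and must return
// an Iterator, and the result is a new RDD -- nothing runs until an action is called.
val transformed = rdd.mapPartitions { iter =>
  iter.map(x => x * 2)
}
transformed.collect()   // the action that actually triggers the work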
Shuffle files are cleaned when they are no longer referenced. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala
On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar <
ashan...@netflix.com.invalid> wrote:
> Hi!
>
> In spark on yarn, when are
When the RDD using them goes out of scope.
On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar
wrote:
> Thanks Mark! follow up question, do you know when shuffle files are
> usually un-referenced?
>
> On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra
> wrote:
>
>> Shuffl
Your mixing up different levels of scheduling. Spark's fair scheduler pools
are about scheduling Jobs, not Applications; whereas YARN queues with Spark
are about scheduling Applications, not Jobs.
On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas
wrote:
> I'm having trouble understanding the differe
grrr... s/your/you're/
On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra
wrote:
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jo
evant/useful in this context?
>
> On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra
> wrote:
>
>> grrr... s/your/you're/
>>
>> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra
>> wrote:
>>
>> Your mixing up different levels of scheduling. Spark's
spark.local.dir
http://spark.apache.org/docs/latest/configuration.html
On Fri, Apr 28, 2017 at 8:51 AM, Shashi Vishwakarma <
shashi.vish...@gmail.com> wrote:
> Yes I am using HDFS .Just trying to understand couple of point.
>
> There would be two kind of encryption which would be required.
>
> 1
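A sketch of pointing it at an encrypted volume (the mount point is made up; note that a cluster manager can override this setting via SPARK_LOCAL_DIRS or YARN's local dirs):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-dir-sketch")
  .set("spark.local.dir", "/mnt/encrypted/spark-tmp")   // where shuffle and spill files land

val sc = new SparkContext(conf)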
The check goal of the scalastyle plugin runs during the "verify" phase,
which is between "package" and "install"; so running just "package" will
not run scalastyle:check.
On Thu, May 4, 2017 at 7:45 AM, yiskylee wrote:
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>
This looks more like a matter for Databricks support than spark-user.
On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com
wrote:
> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>
>
>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.v
ding which
> lib to use.
>
> On 9 May 2017 at 14:30, Mark Hamstra wrote:
>
>> This looks more like a matter for Databricks support than spark-user.
>>
>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>> lucas.g...@gmail.com> wrote:
>>
>>>
> two replies are not even in the same "email conversation".
>
I don't know the mechanics of why posts do or don't show up via Nabble, but
Nabble is neither the canonical archive nor the system of record for Apache
mailing lists.
> On Thu, May 4, 2017 at 8:11 PM, Mark