ark though and it might need
some more manual tweaks.
Thanks,
Shivaram
On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson
wrote:
Thanks!
In your demo video, were you using RStudio to hit a separate EC2 Spark cluster?
I noticed that it appeared from your browser that you were using EC2 at that time,
so
I have Avro records stored in Parquet files in HDFS. I want to read these
out as an RDD and save that RDD in Tachyon for any Spark job that wants the
data.
How do I save the RDD in Tachyon? What format do I use? Which RDD
'saveAs...' method do I want?
Thanks
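For reference, a minimal sketch of one way to do this, assuming Spark 1.4+ with a SQLContext, the usual shell SparkContext sc, a Tachyon deployment reachable at a tachyon:// URI, and illustrative paths:

import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val sqlContext = new SQLContext(sc)

// Read the Parquet files (written from Avro records) into a DataFrame
val df = sqlContext.read.parquet("hdfs:///data/events.parquet")

// Option 1: write it back out as Parquet to Tachyon, which Spark treats as
// just another Hadoop-compatible filesystem (illustrative master address)
df.write.parquet("tachyon://tachyon-master:19998/shared/events")

// Option 2: keep it as an RDD[Row] and persist it off-heap (Tachyon-backed in 1.x)
val rowRdd = df.rdd
rowRdd.persist(StorageLevel.OFF_HEAP)

Any other job can then read the shared copy back with sqlContext.read.parquet on the tachyon:// path.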
Hi All
I need to be able to create, submit and report on Spark jobs
programmatically in response to events arriving on a Kafka bus. I also need
end-users to be able to create data queries that launch Spark jobs 'behind
the scenes'.
I would expect to use the same API for both, and be able to provi
I have a Spark job that computes some values and needs to write those
values to a data store. The classes that write to the data store are not
serializable (e.g., Cassandra session objects, etc.).
I don't want to collect all the results at the driver, I want each worker
to write the data - what is the
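The usual answer here is to construct the non-serializable client inside foreachPartition, on the executor, rather than on the driver. A rough sketch, where DataStoreClient and its connect/write/close calls are hypothetical stand-ins and results is the RDD of computed values:

results.foreachPartition { records =>
  // constructed inside the closure, so it is never serialized from the driver
  val client = DataStoreClient.connect("datastore-host")   // hypothetical API
  try {
    records.foreach(record => client.write(record))        // hypothetical API
  } finally {
    client.close()
  }
}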
I am trying to tune a Spark job and have noticed some strange behavior -
tasks in a stage vary in execution time, ranging from 2 seconds to 20
seconds. I assume tasks should all run in roughly the same amount of time
in a well tuned job.
So I did some investigation - the fast tasks appear to have
m partitioner for the keys so that they will get evenly
> distributed across tasks
>
> Thanks
> Best Regards
>
> On Fri, Sep 4, 2015 at 7:19 PM, mark wrote:
>
>> I am trying to tune a Spark job and have noticed some strange behavior -
>> tasks in a stage vary i
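For context, a minimal sketch of the kind of custom partitioner being suggested (class name and partition count are illustrative):

import org.apache.spark.Partitioner

class EvenSpreadPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  // non-negative modulo so keys land in [0, parts) even when hashCode is negative
  override def getPartition(key: Any): Int =
    ((key.hashCode % parts) + parts) % parts
}

// usage on an RDD of (key, value) pairs:
// pairRdd.partitionBy(new EvenSpreadPartitioner(64))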
http://www.stratahadoopworld.com/singapore/index.html
On 8 Sep 2015 8:35 am, "Kevin Jung" wrote:
> Is there any plan to hold Spark summit in Asia?
> I'm very much looking forward to it.
>
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spar
This post here has a bit of information
> http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
>
> Thanks
> Best Regards
>
> On Wed, Sep 9, 2015 at 6:44 AM, mark wrote:
>
>> As I understand things (maybe naively
I'm confused by the behavior. My understanding was that load() was lazily
executed on the Spark worker. Why would some elements be executing on the
driver?
Thanks for your help
--
Mark Bidewell
http://www.linkedin.com/in/markbidewell
Hello,
I noticed that the apache/spark-py image for Spark's 3.4.1 release is not
available (apache/spark@3.4.1 is available). Would it be possible to get
the 3.4.1 release build for the apache/spark-py image published?
Thanks,
Mark
This discussion belongs on the dev list. Please post any replies there.
On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park
wrote:
> Hi,
>
> I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to
> confirm whether these are bugs or not before opening a jira.
>
> *1)* I can no longer
Correct; and PairRDDFunctions#join does still exist in versions of Spark
that do have DataFrame, so you don't necessarily have to use DataFrame to
do this even then (although there are advantages to using the DataFrame
approach.)
Your basic problem is that you have an RDD of tuples, where each tup
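A small sketch of that plain-RDD join, with made-up data:

val orders = sc.parallelize(Seq((1, "order-a"), (2, "order-b")))
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))

// join is picked up via the implicit PairRDDFunctions conversion on pair RDDs
val joined = orders.join(users)   // RDD[(Int, (String, String))]
joined.collect().foreach(println) // e.g. (1,(order-a,alice))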
That would constitute a major change in Spark's architecture. It's not
happening anytime soon.
On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar wrote:
> Possibly in future, if and when spark architecture allows workers to
> launch spark jobs (the functions passed to transformation or action APIs o
Correct. Trading away scalability for increased performance is not an
option for the standard Spark API.
On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> It would be even faster to load the data on the driver and sort it there
> without using Spark :).
> Collect() would involve gathering all the data on a single machine as well.
>
> Thanks,
> Raghav
>
> On Tuesday, June 9, 2015, Mark Hamstra wrote:
>
>> Correct. Trading away scalability for increased performance is not an
>> option for the standard Spark API.
>
wondering if there's a more
efficient way.
Also posted on SO: http://stackoverflow.com/q/30785615/2687324
Thanks,
Mark
Makes sense – I suspect what you suggested should work.
However, I think the overhead between this and using `String` would be similar
enough to warrant just using `String`.
Mark
From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: June-11-15 12:58 PM
To: Mark Tse
Cc: user@spark.apache.org
>
> I would guess in such shuffles the bottleneck is serializing the data
> rather than raw IO, so I'm not sure explicitly buffering the data in the
> JVM process would yield a large improvement.
Good guess! It is very hard to beat the performance of retrieving shuffle
outputs from the OS buffe
q/30871109/2687324).
Has anyone experienced this error before while unit testing?
Thanks,
Mark
I think
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
might shed some light on the behaviour you’re seeing.
Mark
From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermediate stage will be cached automatically?
Here
hed and utilized SparkR 1.4.0 in this way, with
> RStudio Server running on top of the master instance? Are we on the right
> track, or should we manually launch a cluster and attempt to connect to it
> from another instance running R?
>
> Thank you in advance!
>
> Mark
Do you want to transform the RDD, or just produce some side effect with its
contents? If the latter, you want foreachPartition, not mapPartitions.
On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
> In rdd.mapPartition(…) if I try to iterate through
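A quick sketch of the distinction, on toy data:

val rdd = sc.parallelize(Seq("a", "bb", "ccc"))

// mapPartitions transforms the data and yields a new RDD
val lengths = rdd.mapPartitions(iter => iter.map(_.length))

// foreachPartition is an action run purely for its side effects; it returns Unit
rdd.foreachPartition { iter =>
  iter.foreach(println)   // e.g. write each element to an external system here
}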
No. He is collecting the results of the SQL query, not the whole dataset.
The REPL does retain references to prior results, so it's not really the
best tool to be using when you want no-longer-needed results to be
automatically garbage collected.
On Mon, Jun 29, 2015 at 9:13 AM, ayan guha wrote:
PM, Eugene Morozov
wrote:
> Mark,
>
> I'm trying to configure spark cluster to share resources between two pools.
>
> I can do that by assigning minimal shares (it works fine), but that means
> specific amount of cores is going to be wasted by just being ready to run
> anyth
There's probably nothing wrong other than a glitch in the reporting of
Executor state transitions to the UI -- one of those low-priority items
I've been meaning to look at for a while.
On Mon, Mar 7, 2016 at 12:15 AM, Sonal Goyal wrote:
> Maybe check the worker logs to see what's going wrong w
One issue is that RAID levels providing data replication are not necessary
since HDFS already replicates blocks on multiple nodes.
On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov wrote:
> Parallel disk IO? But the effect should be less noticeable compared to
> Hadoop which reads/writes a lot. Much
>
> For example, if you're looking to scale out to 1000 concurrent requests,
> this is 1000 concurrent Spark jobs. This would require a cluster with 1000
> cores.
This doesn't make sense. A Spark Job is a driver/DAGScheduler concept
without any 1:1 correspondence between Worker cores and Jobs.
path.
On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly wrote:
> you are correct, mark. i misspoke. apologies for the confusion.
>
> so the problem is even worse given that a typical job requires multiple
> tasks/cores.
>
> i have yet to see this particular architecture work in prod
It's not just if the RDD is explicitly cached, but also if the map outputs
for stages have been materialized into shuffle files and are still
accessible through the map output tracker. Because of that, explicitly
caching RDDs often gains you little or nothing, since even without a
call to c
g may not be 100% accurate and bug
free.
On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph
wrote:
> Okay, so out of 164 stages, 163 are skipped. And how are 41405 tasks
> skipped if the total is only 19788?
>
> On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra
> wrote:
>
>>
You're not getting what Ted is telling you. Your `dict` is an RDD[String]
-- i.e. it is a collection of a single value type, String. But
`collectAsMap` is only defined for PairRDDs that have key-value pairs for
their data elements. Both a key and a value are needed to collect into a
Map[K, V].
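A small sketch of the fix; the keys and values here are arbitrary, chosen only to give the RDD a pair type:

val dict = sc.parallelize(Seq("apple", "banana", "cherry"))   // RDD[String]

// dict.collectAsMap()   // does not compile: there are no (key, value) pairs

// Pair each element with a key first; collectAsMap is then available
val asMap: scala.collection.Map[String, Int] =
  dict.map(word => (word, word.length)).collectAsMap()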
ipe for how
the computation should be done when it is needed. Neither does the "called
before any job" comment pose any restriction in this case since no jobs
have yet been executed on the RDD.
On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu wrote:
> See the doc for checkpoint:
>
>* M
aveXXX action is
the only action being performed on the RDD, the rest of the chain being
purely transformations, then checkpointing instead of saving still wouldn't
execute any action on the RDD -- it would just mark the point at which
checkpointing should be done when an action is eventually run.
You seem to be confusing the concepts of Job and Application. A Spark
Application has a SparkContext. A Spark Application is capable of running
multiple Jobs, each with its own ID, visible in the webUI.
On Thu, Mar 24, 2016 at 6:11 AM, Max Schmidt wrote:
> Am 24.03.2016 um 10:34 schrieb Simon
Yes and no. The idea of n-tier architecture is about 20 years older than
Spark and doesn't really apply to Spark as n-tier was originally conceived.
If the n-tier model helps you make sense of some things related to Spark,
then use it; but don't get hung up on trying to force a Spark architecture
in
ue, Mar 29, 2016 at 5:44 PM, Mich Talebzadeh
wrote:
> Hi Mark,
>
> I beg to differ on the interpretation of N-tier architecture.
> Agreed that 3-tier and by extrapolation N-tier have been around since days
> of client-server architecture. However, they are as valid today as
Why would the Executors shutdown when the Job is terminated? Executors are
bound to Applications, not Jobs. Furthermore,
unless spark.job.interruptOnCancel is set to true, canceling the Job at the
Application and DAGScheduler level won't actually interrupt the Tasks
running on the Executors. If
https://spark.apache.org/docs/latest/cluster-overview.html
On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar
wrote:
> On Spark GUI I can see the list of Workers.
>
> I always understood that workers are used by executors.
>
> What is the relationship between workers and executors please. Is it one
>
That's also available in standalone.
On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov
wrote:
> Spark on Yarn supports dynamic resource allocation
>
> So, you can run several spark-shells / spark-submits / spark-jobserver /
> zeppelin on one cluster without defining upfront how many executor
To be fair, the Stratosphere project from which Flink springs was started
as a collaborative university research project in Germany about the same
time that Spark was first released as Open Source, so they are near
contemporaries rather than Flink having been started only well after Spark
was an es
Hi,
So I would like some custom metrics.
The environment we use is AWS EMR 4.5.0 with Spark 1.6.1 and Ganglia.
The code snippet below shows how we register custom metrics (this worked in EMR
4.2.0 with Spark 1.5.2).
package org.apache.spark.metrics.source
import com.codahale.metrics._
import o
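The snippet above is cut off; as a rough sketch of the shape such a source can take (class and metric names are illustrative; the file sits in the org.apache.spark.metrics.source package because the Source trait is private[spark]):

package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, MetricRegistry}

class MyAppSource extends Source {
  override val sourceName: String = "MyAppMetrics"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  // an example counter the application can bump as records are handled
  val recordsProcessed: Counter =
    metricRegistry.counter(MetricRegistry.name("recordsProcessed"))
}

// one way to register it (works from code compiled into this package):
// org.apache.spark.SparkEnv.get.metricsSystem.registerSource(new MyAppSource)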
Well, you got me fooled as well ;)
Had it on my to-do list to dive into this new component...
Mark
> On 5 May 2016 at 07:06, Derek Chan wrote the following:
>
> The blog post is an April Fool's joke. Read the last line in the post:
>
> https://databricks.com/blog/
You appear to be misunderstanding the nature of a Stage. Individual
transformation steps such as `map` do not define the boundaries of Stages.
Rather, a sequence of transformations in which there is only a
NarrowDependency between each of the transformations will be pipelined into
a single Stage.
irav Patel wrote:
> Hi Mark,
>
> I might have said stage instead of step in my last statement "UI just
> says Collect failed but in fact it could be any stage in that lazy chain of
> evaluation."
>
> Anyway, even you agree that this visibility of underlying ste
I don't know what documentation you were referring to, but this is clearly
an erroneous statement: "Threads are virtual cores." At best it is
terminology abuse by a hardware manufacturer. Regardless, Spark can't get
too concerned about how any particular hardware vendor wants to refer to
the spec
‘logical’ processors vs. cores and POSIX threaded
>> applications.
>>
>> HTH
Hi Mich,
sorry for bothering you; did you manage to solve your problem? We have a similar
problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an Oracle
Database.
Thanks,
Mark
> On 12 Feb 2016, at 11:45, Mich Talebzadeh <mailto:m...@peridale.co.uk>> wrote:
>
> H
Thanks Mich,
we have got it working using the example below ;)
Mark
> On 11 Jul 2016, at 09:45, Mich Talebzadeh wrote:
>
> Hi Mark,
>
> Hm. It should work. This is Spark 1.6.1 on Oracle 12c
>
>
> scala> val HiveContext = new org.apache.spark.sql.hive.Hi
Nothing has changed in that regard, nor is there likely to be "progress",
since more sophisticated or capable resource scheduling at the Application
level is really beyond the design goals for standalone mode. If you want
more in the way of multi-Application resource scheduling, then you should
be
Don't use Spark 2.0.0-preview. That was a preview release with known
issues, and was intended to be used only for early, pre-release testing
purposes. Spark 2.0.0 is now released, and you should be using that.
On Thu, Jul 28, 2016 at 3:48 AM, Carlo.Allocca
wrote:
> and, of course I am using
>
>
> ts.groupBy("b").count().orderBy(col("count"), ascending=False)
Sent from my iPhone
> On Jul 30, 2016, at 2:54 PM, Don Drake wrote:
>
> Try:
>
> ts.groupBy("b").count().orderBy(col("count").desc());
>
> -Don
>
>> On Sat, Jul 30, 2016 at 1:30 PM, Tony Lane wrote:
>> just to clarify I am try
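For reference, the Scala equivalent on the DataFrame from this thread (ts), showing ascending versus descending order:

import org.apache.spark.sql.functions.col

val counts = ts.groupBy("b").count()

counts.orderBy(col("count")).show()        // ascending is the default
counts.orderBy(col("count").desc).show()   // descending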
What are you expecting to find? There currently are no releases beyond
Spark 2.0.0.
On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote:
> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I can't
> find the
No, publishing a spark assembly jar is not fine. See the doc attached to
https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a
likely goal of Spark 2.0 will be the elimination of assemblies.
On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com
wrote:
> Using maven to download the
It can be used, and is used in user code, but it isn't always as
straightforward as you might think. This is mostly because a Job often
isn't a Job -- or rather it is more than one Job. There are several RDD
transformations that aren't lazy, so they end up launching "hidden" Jobs
that you may not
ing cancellation, or more extensive reuse functionality as in
https://issues.apache.org/jira/browse/SPARK-11838 If you don't want to
spend a lot of time looking at Job cancellation issues, best to back away
now! :)
On Wed, Dec 16, 2015 at 4:26 PM, Jacek Laskowski wrote:
> Thanks Mark f
files with
“sparkContext.sequenceFile('hdfs:///data/*/*/*/*.seq')” correct?
Looking at our results it seems to be working fine and as described above.
Thanks,
Mark
I don't understand. If you're using fair scheduling and don't set a pool,
the default pool will be used.
On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote:
>
> It seems currently spark.scheduler.pool must be set as localProperties
> (associate with thread). Any reason why spark.scheduler.pool ca
ant is the default pool is fair
> scheduling. But seems if I want to use fair scheduling now, I have to set
> spark.scheduler.pool explicitly.
>
> On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra
> wrote:
>
>> I don't understand. If you're using fair scheduling an
I can override the root pool in the configuration file. Thanks, Mark.
>
> On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra
> wrote:
>
>> Just configure with <schedulingMode>FAIR</schedulingMode> in fairscheduler.xml (or
>> in spark.scheduler.allocation.file if you have overridden the default name
>>
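For completeness, a sketch of wiring this up from the application side; the pool name and file path are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // optional override
val sc = new SparkContext(conf)

// jobs submitted from this thread go to the named pool defined in the XML file
sc.setLocalProperty("spark.scheduler.pool", "production")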
It's not a bug, but a larger heap is required with the new
UnifiedMemoryManager:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172
On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Hi All
Same SparkContext means same pool of Workers. It's up to the Scheduler,
not the SparkContext, whether the exact same Workers or Executors will be
used to calculate simultaneous actions against the same RDD. It is likely
that many of the same Workers and Executors will be used as the Scheduler
tri
-dev
What do you mean by JobContext? That is a Hadoop mapreduce concept, not
Spark.
On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou wrote:
> Dear all,
>
> Is there a way to reuse executor JVM across different JobContexts? Thanks.
>
> Best Regards,
> Jia
>
ice surprise to me
>
> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra
> wrote:
>
>> Same SparkContext means same pool of Workers. It's up to the Scheduler,
>> not the SparkContext, whether the exact same Workers or Executors will be
>> used to calculate simultaneou
a Zou wrote:
> Hi, Mark, sorry, I mean SparkContext.
> I mean to change Spark into running all submitted jobs (SparkContexts) in
> one executor JVM.
>
> Best Regards,
> Jia
>
> On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra
> wrote:
>
>> -dev
>>
>>
, Jan 17, 2016 at 1:15 PM, Jia wrote:
> Hi, Mark, sorry for the confusion.
>
> Let me clarify, when an application is submitted, the master will tell
> each Spark worker to spawn an executor JVM process. All the task sets of
> the application will be executed by the exec
Yes, that is one of the basic reasons to use a
jobserver/shared-SparkContext. Otherwise, in order share the data in an
RDD you have to use an external storage system, such as a distributed
filesystem or Tachyon.
On Sun, Jan 17, 2016 at 1:52 PM, Jia wrote:
> Thanks, Mark. Then, I gu
What do you think is preventing you from optimizing your own RDD-level
transformations and actions? AFAIK, nothing that has been added in
Catalyst precludes you from doing that. The fact of the matter is, though,
that there is less type and semantic information available to Spark from
the raw RDD
https://github.com/apache/spark/pull/10608
On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote:
> I'm not an authoritative source but I think it is indeed the plan to
> move the default build to 2.11.
>
> See this discussion for more detail
>
> http://apache-spark-developers-list.1001551.n3.na
I am running Spark on Windows. When I try to view the Executor logs in the UI
I get the following error:
HTTP ERROR 500
Problem accessing /logPage/. Reason:
Server Error
Caused by:
java.net.URISyntaxException: Illegal character in path at index 1:
.\work/app-20160129154716-0038/2/
a
We have created JIRA ticket
https://issues.apache.org/jira/browse/SPARK-13142 and will submit a pull
request next week.
Mark
-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: 01 February 2016 14:24
To: Mark Pavey
Cc: user@spark.apache.org
Subject: Re: Can't
I have submitted a pull request: https://github.com/apache/spark/pull/11135.
Mark
-Original Message-
From: Mark Pavey [mailto:mark.pa...@thefilter.com]
Sent: 05 February 2016 17:09
To: 'Ted Yu'
Cc: user@spark.apache.org
Subject: RE: Can't view executor logs in web UI
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_72)
Type in expressions to have them evaluated.
Type :he
It's 2 -- and it's pretty hard to point to a line of code, a method, or
even a class since the scheduling of Tasks involves a pretty complex
interaction of several Spark components -- mostly the DAGScheduler,
TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as
well as the Sche
Yes.
On Mon, Sep 28, 2015 at 12:46 PM, Robert Grandl
wrote:
> Hi guys,
>
> I was wondering if it's possible to submit SQL queries to Spark SQL, when
> Spark is running atop YARN instead of standalone mode.
>
> Thanks,
> Robert
>
Here is the log file from the worker node
15/09/30 23:49:37 INFO Worker: Executor app-20150930233113-/8 finished
with state EXITED message Command exited with code 1 exitStatus \
1
15/09/30 23:49:37 INFO Worker: Asked to launch executor
app-20150930233113-/9 for PythonPi
15/09/30 23:49:37
Hi Everyone,
I am busy trying out ‘Spark-Testing-Base
<https://github.com/holdenk/spark-testing-base>’. I have the following
questions:
Can you test Spark Streaming Jobs using Java?
Can I use Spark-Testing-Base 1.3.0_0.1.1 together with Spark 1.3.1?
Thanks.
Greetings,
Mark
Hi Holden,
Thanks for the information. I think that a Java base class for testing
Spark Streaming using Java would be useful for the community. Unfortunately not
all of our customers are willing to use Scala or Python.
If I am not wrong, it's 4:00 AM for you in California ;)
Regards,
Mark
.1",
"submissionId" : "driver-20151028115443-0002",
"success" : true
}
The driver is created correctly, but it never starts the application.
What am I missing?
Regards,
Mark
The closure is sent to and executed on an Executor, so you need to be looking
at the stdout of the Executors, not on the Driver.
On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> I'm just trying to do some operation inside foreachPartition, but I can't
> even
Hah! No, that is not a "starter" issue. It touches on some fairly deep
Spark architecture, and there have already been a few attempts to resolve
the issue -- none entirely satisfactory, but you should definitely search
out the work that has already been done.
On Mon, Nov 2, 2015 at 5:51 AM, Jace
For more than a small number of files, you'd be better off using
SparkContext#union instead of RDD#union. That will avoid building up a
lengthy lineage.
On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote:
> Hey Jeff,
> Do you mean reading from multiple text files? In that case, as a
> workar
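A small sketch of the difference; paths is assumed to be a Seq[String] of input file locations:

// RDD#union chained pairwise builds up a long lineage:
// val all = rdds.reduce(_ union _)

// SparkContext#union builds a single UnionRDD over all inputs at once:
val rdds = paths.map(p => sc.textFile(p))
val all  = sc.union(rdds)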
Those are from the Application Web UI -- look for the "DAG Visualization"
and "Event Timeline" elements on Job and Stage pages.
On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky wrote:
> Hi Simone,
> I'm afraid I don't have an answer to your question. However I noticed the
> DAG figures in the att
>
> In the near future, I guess GUI interfaces of Spark will be available
> soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all.
> They can analyze their data by clicking a few buttons, instead of writing
> the programs. : )
That's not in the future. :)
On Mon, Nov 23, 201
Standalone mode also supports running the driver on a cluster node. See
"cluster" mode in
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
. Also,
http://spark.apache.org/docs/latest/spark-standalone.html#high-availability
On Mon, Nov 30, 2015 at 9:47 AM, Ja
>
> It is not designed for interactive queries.
You might want to ask the designers of Spark, Spark SQL, and particularly
some things built on top of Spark (such as BlinkDB) about their intent with
regard to interactive queries. Interactive queries are not the only
designed use of Spark, but it
ncy it may
> not be a good fit.
>
> M
>
> On Dec 1, 2015, at 7:23 PM, Andrés Ivaldi wrote:
>
> Ok, so latency problem is being generated because I'm using SQL as source?
> how about csv, hive, or another source?
>
> On Tue, Dec 1, 2015 at 9:18 PM, Mark
some other solution.
On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi wrote:
> Ok, so latency problem is being generated because I'm using SQL as source?
> how about csv, hive, or another source?
>
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra
> wrote:
>
>> It is no
Where it could start to make some sense is if you wanted a single
application to be able to work with more than one Spark cluster -- but
that's a pretty weird or unusual thing to do, and I'm pretty sure it
wouldn't work correctly at present.
On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust
wrote
https://spark-summit.org/2015/events/making-sense-of-spark-performance/
On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote:
> Hi All!
>
> How important would be a significant performance improvement to TCP/IP
> itself, in terms of
> overall job performance improvement. Which part would be most
SparkContext#setJobDescription or SparkContext#setJobGroup
On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica wrote:
> Hello,
>
> My Spark application is written in Scala and submitted to a Spark cluster
> in standalone mode. The Spark Jobs for my application are listed in the
> Spark UI like this:
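A minimal sketch of using those two calls; the group id and descriptions are made up:

sc.setJobGroup("nightly-etl", "Nightly ETL for the sales tables")
sc.setJobDescription("Aggregate daily totals")

sc.parallelize(1 to 1000).map(_ * 2).count()   // appears in the UI under the description above

sc.clearJobGroup()   // later jobs fall back to the default naming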
for piping the
result of the first *map* operation as a parameter into the following *map*
operation?
Any ideas and feedback appreciated, thanks a lot.
Best regards,
Mark
Thanks a lot guys, that's exactly what I hoped for :-).
Cheers,
Mark
2015-08-13 6:35 GMT+02:00 Hemant Bhanawat :
> A chain of map and flatmap does not cause any
> serialization-deserialization.
>
>
>
> On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann
> wrote:
>
Hi everyone,
I have two questions regarding the random forest implementation in mllib
1- maxBins: Say the value of a feature is between [0,100]. In my dataset there
are a lot of data points between [0,10] and one datapoint at 100 and nothing
between (10, 100). I am wondering how does the binning
There is the Async API (
https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala),
which makes use of FutureAction (
https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala).
You could als
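A short sketch of the async API in use; the execution context import is needed for the callback:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

val rdd = sc.parallelize(1 to 1000000)

// countAsync returns a FutureAction instead of blocking the calling thread
val future = rdd.countAsync()

future.onComplete {
  case Success(n)  => println(s"count finished: $n")
  case Failure(ex) => println(s"count failed: $ex")
}

// the FutureAction handle can also cancel the underlying job:
// future.cancel()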
Where do you see a race in the DAGScheduler? On a quick look at your stack
trace, this just looks to me like a Job where a Stage failed and then the
DAGScheduler aborted the failed Job.
On Thu, Sep 24, 2015 at 12:00 PM, robin_up wrote:
> Hi
>
> After upgrade to 1.5, we found a possible racing c
One way you can start to make this make more sense, Sean, is if you exploit
the code/data duality so that the non-distributed data that you are sending
out from the driver is actually playing a role more like code (or at least
parameters). What is sent from the driver to an Executor is then used
(t