-dev +user
> 1). Is that the reason why it's always slow in the first run? Or are
> there any other reasons? Apparently it loads data into memory every
> time, so it shouldn't have anything to do with disk reads, should it?
>
You are probably seeing the effect of the JVM's JIT. The first run is
executing…
Hi Spark Devs,
I am doing a performance evaluation of Spark using pyspark. I am using
Spark 1.5 with a Hadoop 2.6 cluster of 4 nodes and ran these tests in
local mode.
After a few dozen test executions, it turned out that the very first
SparkSQL query execution is always slower than the subsequent ones…
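A quick way to observe that warm-up (a sketch, not from the thread: it
assumes an existing sqlContext and a registered table t, and is written in
Scala rather than pyspark):

    // Run the same query twice; on the second run the JVM has already
    // loaded the relevant classes and JIT-compiled the hot code paths.
    def time[A](label: String)(body: => A): A = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    time("first run")  { sqlContext.sql("SELECT COUNT(*) FROM t").collect() }
    time("second run") { sqlContext.sql("SELECT COUNT(*) FROM t").collect() }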
Hi Ankur,
Could you help explain the problem below?
Best regards, Alexander
From: Ulanov, Alexander
Sent: Friday, October 02, 2015 11:39 AM
To: 'Robin East'
Cc: dev@spark.apache.org
Subject: RE: GraphX PageRank keeps 3 copies of graph in memory
Hi Robin,
Sounds interesting. I am running…
Hi,
I want to understand the code flow starting from the Spark jar that I submit
through spark-submit: how does Spark identify and extract the closures, clean
and serialize them, and ship them to workers to execute as tasks? Can someone
point me to any documentation or a pointer to the source…
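For the source, the entry points are SparkContext.clean and
org.apache.spark.util.ClosureCleaner
(core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala). At heart,
the serializability check is plain Java serialization of the function object;
a rough standalone sketch (names here are illustrative, not Spark's internals):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    object ClosureSketch {
      def main(args: Array[String]): Unit = {
        val factor = 3                       // free variable captured by the closure
        val f: Int => Int = x => x * factor  // the "task" to ship to a worker

        // Roughly what Spark does before scheduling a task: serialize the
        // closure, failing fast with NotSerializableException if it captures
        // something that cannot be sent over the wire.
        val out = new ObjectOutputStream(new ByteArrayOutputStream())
        out.writeObject(f)
        out.close()
        println("closure is serializable, safe to ship")
      }
    }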
Is this limited only to grand multiple count distincts, or does it extend
to all kinds of multiple count distincts? More precisely, would the
following multiple count distinct query also be affected?
select a, b, count(distinct x), count(distinct y) from foo group by a, b;
It would be unfortunate to…
We could also fall back to approximate count distincts when the user
requests multiple count distincts. This is less invasive than throwing an
AnalysisException, but it could violate the principle of least surprise.
Met vriendelijke groet/Kind regards,
Herman van Hövell tot Westerflier
QuestTec
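As a sketch of that fallback (the query is the one from the thread; an
existing sqlContext is assumed, and in Spark 1.5 the approximate aggregate is
exposed as approxCountDistinct in org.apache.spark.sql.functions):

    import org.apache.spark.sql.functions.approxCountDistinct

    // Approximate rewrite of:
    //   select a, b, count(distinct x), count(distinct y) from foo group by a, b
    val result = sqlContext.table("foo")
      .groupBy("a", "b")
      .agg(approxCountDistinct("x"), approxCountDistinct("y"))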
Adding user list too.
-- Forwarded message --
From: Reynold Xin
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org"
To provide more context: if we do remove this feature, the following SQL
query would throw an AnalysisException…
This is about the s3.amazonaws.com files, not dist.apache.org, right?
Or does it affect both?
(BTW you can keep as many old release artifacts around on the
apache.org archives as you like; I think the suggestion is to remove
all but the most recent releases from the set that's replicated to all
the mirrors.)
Sounds good to me.
For my purposes, I'm less concerned about old Spark artifacts and more
concerned about the consistency of the set of artifacts that get generated
with new releases. (E.g., each new release will always include one artifact
each for Hadoop 1, Hadoop 1 + Scala 2.11, etc.)
It sounds…
I don't think we have a firm contract around that. So far we've never
removed old artifacts, but the ASF has asked us at times to decrease the
size of the binaries we post. In the future we may at some point drop older
ones, since we keep adding new ones.
If downstream projects are depending on our artifacts…
When running a stand-alone cluster-mode job, the process hangs randomly
during a DataFrame flatMap or explode operation, in HiveContext:
    df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r)
This does not happen with SQLContext in cluster mode, or with Hive/SQL in
local mode, where it works…
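For reference, a self-contained version of that call (the schema and values
are made up; ind is the index of the integer column, and a hiveContext is
assumed):

    // Hypothetical one-column DataFrame; each row r is then replicated
    // r.getInt(ind) times, the same shape as the reported call.
    val df = hiveContext.createDataFrame(Seq(Tuple1(2), Tuple1(3))).toDF("n")
    val ind = 0
    val replicated = df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r)
    replicated.collect().foreach(println) // 2 copies of Row(2), 3 of Row(3)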
Hi YiZhi Liu,
The spark.ml classes are part of the higher-level "Pipelines" API, which
works with DataFrames. When creating this API, we decided to separate it
from the old API to avoid confusion. You can read more about it here:
http://spark.apache.org/docs/latest/ml-guide.html
For (3): We use…
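From that guide, a minimal Pipeline to show the DataFrame-based style (the
training DataFrame with "text" and "label" columns is assumed here):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // A Pipeline is an ordered sequence of stages, each transforming a DataFrame.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit() produces a PipelineModel that can transform new DataFrames.
    val model = pipeline.fit(training)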
Please do.
On Wed, Oct 7, 2015 at 9:49 AM, Russell Spitzer
wrote:
> Should I make up a new ticket for this? Or is there something already
> underway?
>
> On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer
> wrote:
>
>> That sounds fine to me; we already do the filtering, so populating that
>> field…
Thanks guys.
Regarding this earlier question:
More importantly, is there some rough specification for what packages we
should be able to expect in this S3 bucket with every release?
Is the implied answer that we should continue to expect the same set of
artifacts for every release for the foreseeable future?
Should I make up a new ticket for this? Or is there something already
underway?
On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer
wrote:
> That sounds fine to me; we already do the filtering, so populating that
> field would be pretty simple.
>
> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust
> wrote:…