Thank you
On Wed, Apr 5, 2023 at 21:32, yangjie01 wrote:
> +1
>
>
>
> From: Yuming Wang
> Date: Wednesday, April 5, 2023, 14:39
> To: Xinrong Meng
> Cc: Hyukjin Kwon , Chao Sun ,
> Holden Karau , "L. C. Hsieh" ,
> Mridul Muralidharan , "dev@spark.apache.org"
It's very likely a GitHub issue
On Wed, 11 May 2022 at 18:01, Yang,Jie(INF) wrote:
> Hi, teams
>
>
>
> The contributor data on the following page seems to have stopped updating after
> May 1. Can anyone fix it?
>
>
>
>
> https://github.com/apache/spark/graphs/c
Hi, teams
The contributor data on the following page seems to have stopped updating after May
1. Can anyone fix it?
https://github.com/apache/spark/graphs/contributors?from=2022-05-01&to=2022-05-11&type=c
Warm regards,
YangJie
e a
> manifest file to the job attempt dir pointing to the successful task
> attempt; commit that with their atomic file rename. The committer plugin
> point in MR lets you declare a committer factory for each FS, so it could
> be done without any further changes to Spark.
>
> On Thu, 25
, so it could
be done without any further changes to Spark.
On Thu, 25 Jun 2020 at 22:38, Waleed Fateem wrote:
> I was trying to make my email short and concise, but the rationale behind
> setting it to 1 by default is that it's safer. With algorithm version
> 2 you run the r
I was trying to make my email short and concise, but the rationale behind
setting it to 1 by default is that it's safer. With algorithm version
2 you run the risk of having bad data in cases where tasks fail, or even
duplicate data if a task fails and then succeeds on a reattempt (I don'
I think this is a Hadoop property that is just passed through? If the
default is different in Hadoop 3, we could mention that in the docs. I
don't know if we want to always set it to 1 as a Spark default, even
in Hadoop 3, right?
On Thu, Jun 25, 2020 at 2:43 PM Waleed Fateem wrote:
>
> H
Hello!
I noticed that in the documentation starting with 2.2.0 it states that the
parameter spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1
by default:
https://issues.apache.org/jira/browse/SPARK-20107
I don't actually see this being set anywhere explicitly in the Spark
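For reference, a minimal sketch of pinning the committer algorithm to the safer v1 setting from the Spark side; the session, app name, and output path below are illustrative, and spark.hadoop.* values are simply copied into the Hadoop Configuration used at write time:

import org.apache.spark.sql.SparkSession

// Sketch only: pin the commit algorithm to v1 rather than relying on the
// Hadoop default; the app name and output path are placeholders.
val spark = SparkSession.builder()
  .appName("commit-algorithm-v1-example")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()

spark.range(10).write.mode("overwrite").parquet("/tmp/commit-v1-demo")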
Here are my notes for the latest DSv2 community sync. As usual, if you have
comments or corrections, please reply. If you’d like to be invited to the
next sync, email me directly. Everyone is welcome to attend.
*Attendees*:
Ryan Blue
John Zhuge
Andrew Long
Bruce Robbins
Dilip Biswal
Gengliang Wang
Germany.
(Springer LNCS Proceedings)
Date: June 20, 2019
Workshop URL: http://vhpc.org
Paper Submission Deadline: May 1, 2019 (extended)
Springer LNCS, rolling abstract submission
Abstract/Paper Submission Link: https://edas
Hi,
Just ran into it today and wonder whether it's a bug or something I may
have missed before.
scala> spark.version
res21: String = 2.3.2
// that's OK
scala> spark.range(1).write.saveAsTable("t1")
org.apache.spark.sql.AnalysisException: Table
In fact, not all tasks belong to the same stage, so the memory each task
depends on may differ. For example, suppose an executor
is running two tasks (A and B), and the ExecutionMemoryPool owns 1000 MB. We
may want task A to occupy 900 MB and task B to occupy 100 MB because task A
needs much mo
>> > you might wanna have a look into using a PartitionPruningRDD to select
>> > a subset of partitions by ID. This approach worked very well for
>> > multi-key lookups for us [1].
>> >
>> > A major advantage compared to scan-based operations is that,
to work?
>
> - Thodoris
>
>
> > On 15 Apr 2018, at 01:40, Matthias Boehm wrote:
> >
> > you might wanna have a look into using a PartitionPruningRDD to select
> > a subset of partitions by ID. This approach worked very well for
> > multi-key lookups for us [1].
for
> multi-key lookups for us [1].
>
> A major advantage compared to scan-based operations is that, if your
> source RDD has an existing partitioner, only relevant partitions are
> accessed.
>
> [1]
> https://github.com/apache/systemml/blob/master/src/main/java/org/apa
you might wanna have a look into using a PartitionPruningRDD to select
a subset of partitions by ID. This approach worked very well for
multi-key lookups for us [1].
A major advantage compared to scan-based operations is that, if your
source RDD has an existing partitioner, only relevant
Hello list,
I am sorry for sending this message here, but I could not manage to get any
response in “users”. For specific purposes I would like to isolate 1 partition
of the RDD and perform computations only on it.
For instance, suppose that a user asks Spark to create 500 partitions for
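A minimal sketch of the PartitionPruningRDD approach suggested above, assuming a spark-shell session where sc is available; the partitioner, key layout, and partition id are illustrative:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.PartitionPruningRDD

// Source RDD with an existing partitioner, e.g. 500 hash partitions.
val pairs = sc.parallelize(1 to 100000).map(i => (i % 500, i))
  .partitionBy(new HashPartitioner(500))

// Keep only partition 42; the other 499 partitions are never computed.
val pruned = PartitionPruningRDD.create(pairs, partitionId => partitionId == 42)
pruned.count()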
One more option is to override writeReplace [1] in
LegacyAccumulatorWrapper to prevent such failures.
What do you think?
[1]
https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L158
On Fri, Mar 16, 2018 at
Hi there,
I've noticed that accumulators of Spark 1.x no longer work with Spark
2.x failing with
java.lang.AssertionError: assertion failed: copyAndReset must return a
zero value copy
It happens while serializing an accumulator here [1] although
copyAndReset returns zero-value copy for
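For context, the assertion comes from copyAndReset needing to produce a zero-value copy; below is a minimal AccumulatorV2 sketch that satisfies it (the class and field names are illustrative, not the wrapper from the thread):

import org.apache.spark.util.AccumulatorV2

// Simple long-sum accumulator; copy() + reset() (i.e. copyAndReset) yields an
// instance whose isZero is true, which is what the serialization check asserts.
class SumAccumulator extends AccumulatorV2[Long, Long] {
  private var sum = 0L
  override def isZero: Boolean = sum == 0L
  override def copy(): SumAccumulator = { val acc = new SumAccumulator; acc.sum = sum; acc }
  override def reset(): Unit = { sum = 0L }
  override def add(v: Long): Unit = { sum += v }
  override def merge(other: AccumulatorV2[Long, Long]): Unit = { sum += other.value }
  override def value: Long = sum
}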
Any suggestions from the Spark dev group?
From: Link Qian
Sent: Friday, June 23, 2017 9:58 AM
To: u...@spark.apache.org
Subject: Container exited with a non-zero exit code 1
Hello,
I submit a Spark job to a YARN cluster with the spark-submit command. The environment
---
From: "Sean Owen"
Date: 2017/6/15 16:13:11
To:
"user";"dev";"??"<1427357...@qq.com>;
Subject: Re: the dependence length of RDD, can its size be greater than 1
pleaae?
Yes. Imagine an RDD that results from a union of other RDDs.
O
Yes. Imagine an RDD that results from a union of other RDDs.
On Thu, Jun 15, 2017, 09:11 萝卜丝炒饭 <1427357...@qq.com> wrote:
> Hi all,
>
> The RDD code keeps a member as below:
> dependencies_ : seq[Dependency[_]]
>
> It is a seq, that means it can keep more than one dependency.
>
> I have an issue
A join?
On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote:
> Hi all,
>
> The RDD code keeps a member as below:
> dependencies_ : seq[Dependency[_]]
>
> It is a seq, that means it can keep more than one dependency.
>
> I have an issue about this.
>
Hi all,
The RDD code keeps a member as shown below:
dependencies_ : Seq[Dependency[_]]
It is a Seq, which means it can keep more than one dependency.
I have a question about this.
Is it possible for its size to be greater than one?
If yes, how can I produce that? Could you show me some cod
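A quick illustration of the union case mentioned above, assuming a spark-shell session where sc is available:

// Each parent RDD contributes its own dependency to the union, so the
// resulting RDD's dependencies sequence has size 2 here.
val a = sc.parallelize(1 to 10)
val b = sc.parallelize(11 to 20)
val u = a.union(b)
println(u.dependencies.size)                 // 2
u.dependencies.foreach(d => println(d.rdd))  // the two parent RDDs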
and we're back! :)
On Thu, Feb 16, 2017 at 10:22 AM, shane knapp wrote:
> we don't have many builds running right now, and i need to restart the
> daemon quickly to enable a new plugin.
>
> i'll wait until the pull request builder jobs are finished and then
> (gently) kick jenkins.
>
> updates a
we don't have many builds running right now, and i need to restart the
daemon quickly to enable a new plugin.
i'll wait until the pull request builder jobs are finished and then
(gently) kick jenkins.
updates as they come,
shane (who's always nervous about touching this house of cards)
vecmodel-exceeds-max-rpc-size-for-saving)
>> * "feature parity" with mllib (issues with one large model file already
>> solved for mllib in SPARK-11994
>> <https://issues.apache.org/jira/browse/SPARK-11994>)
>>
>>
>> On Fri, Jan 13, 2017 at 1:02 PM, Nick Pe
ipate in saving the model
> * avoids rpc issues (
> http://stackoverflow.com/questions/40842736/spark-word2vecmodel-exceeds-max-rpc-size-for-saving
> )
> * "feature parity" with mllib (issues with one large model file already
> solved for mllib in SPARK-11994
> <https://issues.apa
parity" with mllib (issues with one large model file already
solved for mllib in SPARK-11994
<https://issues.apache.org/jira/browse/SPARK-11994>)
On Fri, Jan 13, 2017 at 1:02 PM, Nick Pentreath
wrote:
> Yup - it's because almost all model data in spark ML (model coefficients
're referring to code that serializes models, which are quite small.
> For example a PCA model consists of a few principal component vectors. It's
> a Dataset of just one element being saved here. It's re-using the code path
> normally used to save big data sets, to output 1 file with 1 th
You're referring to code that serializes models, which are quite small. For
example a PCA model consists of a few principal component vectors. It's a
Dataset of just one element being saved here. It's re-using the code path
normally used to save big data sets, to output 1 file
Fri, Jan 13, 2017 at 5:23 PM Asher Krim wrote:
>
>> Hi,
>>
>> I'm curious why it's common for data to be repartitioned to 1 partition
>> when saving ml models:
>>
>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.
>> parquet(dataPath)
That is usually so the result comes out in one file, not partitioned over n
files.
On Fri, Jan 13, 2017 at 5:23 PM Asher Krim wrote:
> Hi,
>
> I'm curious why it's common for data to be repartitioned to 1 partition
> when saving ml models:
>
> sqlContext.createDataFr
Hi,
I'm curious why it's common for data to be repartitioned to 1 partition
when saving ml models:
sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
This shows up in most ml models I've seen (Word2Vec
<https://github.com/apache/spark/blob/master/ml
If you create a HiveContext before starting StreamingContext, then
`SQLContext.getOrCreate` in foreachRDD will return the HiveContext you
created. You can just call asInstanceOf[HiveContext] to convert it to
HiveContext.
On Tue, Nov 22, 2016 at 8:25 AM, Dirceu Semighini Filho <
dirceu.semigh...@gm
Hi Koert,
Certainly it's not a good idea. I was trying to use SQLContext.getOrCreate,
but it returns a SQLContext and not a HiveContext.
As I'm using a checkpoint, whenever I start the context by reading the
checkpoint it doesn't create my Hive context, unless I create it for each
microbatch.
I did
You are creating a new HiveContext per microbatch? Is that a good idea?
On Tue, Nov 22, 2016 at 8:51 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:
> Has anybody seen this behavior (see the attached picture) in Spark
> Streaming?
> It started to happen here after I changed the H
Has anybody seen this behavior (see the attached picture) in Spark
Streaming?
It started to happen here after I changed the HiveContext creation to
stream.foreachRDD {
rdd =>
val hiveContext = new HiveContext(rdd.sparkContext)
}
Is this expected?
Kind Regards,
Dirceu
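A minimal sketch of the suggested pattern (create the HiveContext before starting the StreamingContext, then recover it inside foreachRDD); the app name, batch interval, and input source are illustrative, and this uses the 1.x APIs from the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("hive-in-foreachRDD"))
val hiveContext = new HiveContext(sc)   // created once, up front
val ssc = new StreamingContext(sc, Seconds(10))

ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
  // getOrCreate returns the context created above rather than building a new
  // one per microbatch; cast it back to HiveContext to use Hive features.
  val hc = SQLContext.getOrCreate(rdd.sparkContext).asInstanceOf[HiveContext]
  hc.sql("SHOW TABLES").show()
}

ssc.start()
ssc.awaitTermination()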
n't
> mean you can't use classifiers.
>
It is worse (or better) than that, profiles didn't work for us in
combination with Scala 2.10/2.11, so we modify the POM in place as part of
CI and the release process.
> I have seen it used for HBase, core Hadoop. I am not sure I've seen
ore Hadoop. I am not sure I've seen it
used for Spark 2 vs 1 but no reason it couldn't be. Frequently
projects would instead declare that as of some version, Spark 2 is
required, rather than support both. Or shim over an API difference
with reflection if that's all there was to it.
Have you seen any successful applications of this for Spark 1.x/2.x?
From the doc "The classifier allows to distinguish artifacts that were
built from the same POM but differ in their content."
We'd be building from different POMs, since we'd be modifying the Spark
This is also what "classifiers" are for in Maven, to have variations
on one artifact and version. https://maven.apache.org/pom.html
It has been used to ship code for Hadoop 1 vs 2 APIs.
In a way it's the same idea as Scala's "_2.xx" naming convention, with
a less un
Ah yes, thank you for the clarification.
On Wed, Aug 24, 2016 at 11:44 AM, Ted Yu wrote:
> 'Spark 1.x and Scala 2.10 & 2.11' was repeated.
>
> I guess your second line should read:
>
> org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1] for Spark 2.x
>
'Spark 1.x and Scala 2.10 & 2.11' was repeated.
I guess your second line should read:
org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1] for Spark 2.x and
Scala 2.10 & 2.11
On Wed, Aug 24, 2016 at 9:41 AM, Michael Heuer wrote:
> Hello,
>
> We're a proj
Hello,
We're a project downstream of Spark and need to provide separate artifacts
for Spark 1.x and Spark 2.x. Has any convention been established or even
proposed for artifact names and/or qualifiers?
We are currently thinking
org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1] for Spar
Using the dominant resource calculator instead of the default resource
calculator will get you the expected vcores. Basically, by default YARN does
not honor CPU cores as a resource, so you will always see 1 vcore no
matter what number of cores you set in Spark.
On Wed, Aug 3, 2016 at 12:11 PM
Hi All,
I am trying to run a Spark job using YARN, and I specify the --executor-cores
value as 20.
But when I check the "nodes of the cluster" page at
http://hostname:8088/cluster/nodes, I see 4 containers getting created
on each of the nodes in the cluster.
But I can only see 1 vco
BlockManager)
16/06/15 19:45:43 ERROR RetryingBlockFetcher: Exception while beginning
fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.56.1:56413
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at
Here is the JIRA and PR for supporting PolynomialExpansion with degree 1,
and it has been merged.
https://issues.apache.org/jira/browse/SPARK-13338
https://github.com/apache/spark/pull/11216
2016-05-02 9:20 GMT-07:00 Nick Pentreath :
> There is a JIRA and PR around for supporting polynom
There is a JIRA and PR around for supporting polynomial expansion with
degree 1. Offhand I can't recall if it's been merged
On Mon, 2 May 2016 at 17:45, Julio Antonio Soto de Vicente
wrote:
> Hi,
>
> Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It
Hi,
Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It would
be nice to cross-validate with degree 1 polynomial expansion (that is, with no
expansion at all) vs other-degree polynomial expansions. Unfortunately, degree
is forced to be >= 2.
--
Julio
> On 2 May 2
it be an okay idea to generalize the cross validator so it can work
with a k-fold value of 1? The only purpose for this is to avoid maintaining
two different code paths, and functionally it should behave as if
cross validation were not present.
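As a small follow-up, a sketch of what degree-1 expansion looks like with the ml API once SPARK-13338 is in (effectively an identity transform, which lets a tuning grid include "no expansion"); the column names and data are illustrative and a Spark 2.x spark-shell is assumed:

import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(Tuple1(Vectors.dense(2.0, 3.0)))).toDF("features")

val px = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("expanded")
  .setDegree(1)   // degree 1 == pass the features through unchanged

px.transform(df).show(false)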
>> one of the basic guarantees of kafka, which is in-order processing on
>> a per-topicpartition basis.
>>
>> As far as PRs go, because of the new consumer interface for kafka 0.9
>> and 0.10, there's a lot of potential change already underway.
>>
>>
a lot of potential change already underway.
>
> See
>
> https://issues.apache.org/jira/browse/SPARK-12177
>
> On Thu, Mar 10, 2016 at 1:59 PM, Renyi Xiong
> wrote:
> > Hi TD,
> >
> > Thanks a lot for offering to look at our PR (if we fire one) at the
> > conf
erway.
See
https://issues.apache.org/jira/browse/SPARK-12177
On Thu, Mar 10, 2016 at 1:59 PM, Renyi Xiong wrote:
> Hi TD,
>
> Thanks a lot for offering to look at our PR (if we fire one) at the
> conference NYC.
>
> As we discussed briefly the issues of unbalanced and
take that dependency
c. Topic partition split happens only when configured
There are some other, more complicated changes related to fault
tolerance which are irrelevant here (but you're more than welcome to
comment on them too) and were introduced to unblock the scenarios we're
exper
(SPARK-3424 <https://issues.apache.org/jira/browse/SPARK-3424>) in the
initialisation phase and is local to the driver, using only 1 core.
If I use random, the job completes in 1.5 minutes compared to 1 hour+.
Should I move this to the dev list?
Regards,
Liming
Fr
ue, Feb 9, 2016 at 12:36 PM Alexander Pivovarov
> wrote:
>
>> Thanks Jonathan
>>
>> Actually I'd like to use maximizeResourceAllocation.
>>
>> Ideally for me would be to add new instance group having single small box
>> labelled as AM
>> I'm n
eResourceAllocation.
>
> Ideally for me would be to add new instance group having single small box
> labelled as AM
> I'm not sure "aws emr create-cluster" supports setting custom LABELS, the
> only settings available are:
>
> InstanceCount=1,BidPrice=0.5,Name=spark
t
> Spark AMs are probably running even on TASK instances currently, which is
> OK but not consistent with what we do for MapReduce. I'll make sure we
> set spark.yarn.am.nodeLabelExpression appropriately in the next EMR release.
>
> ~ Jonathan
>
> On Tue, Feb 9, 2016 at 1:3
ake sure we
set spark.yarn.am.nodeLabelExpression appropriately in the next EMR release.
~ Jonathan
On Tue, Feb 9, 2016 at 1:30 PM Marcelo Vanzin wrote:
> On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly
> wrote:
> > And we do set yarn.app.mapreduce.am.labels=CORE
>
> That sou
On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly wrote:
> And we do set yarn.app.mapreduce.am.labels=CORE
That sounds very mapreduce-specific, so I doubt Spark (or anything
non-MR) would honor it.
--
Marcelo
Thanks Jonathan
Actually I'd like to use maximizeResourceAllocation.
Ideally I would add a new instance group with a single small box
labelled as AM.
I'm not sure "aws emr create-cluster" supports setting custom LABELS; the
only settings available are:
InstanceCount
>> >> > nodes
> >> >> > being utilized only by the AM and not an executor.
> >> >> >
> >> >> > However, as you point out, the only viable fix seems to be to
> reserve
> >> >> > enough
> >> >> > memo
enabled .
Sent from Samsung Mobile.
Original message
From: Alexander Pivovarov
Date: 09/02/2016 10:33 (GMT+05:30)
To: dev@spark.apache.org
Cc:
Subject: spark on yarn wastes one box (or 1 GB on each box) for am container
Lets say that yarn has 53GB
I mean Jonathan
On Tue, Feb 9, 2016 at 10:41 AM, Alexander Pivovarov
wrote:
> I decided to do YARN over-commit and add 896
> to yarn.nodemanager.resource.memory-mb
> it was 54,272
> now I set it to 54,272+896 = 55,168
>
> Kelly, can I ask you couple questions
> 1. it is
I decided to do YARN over-commit and add 896
to yarn.nodemanager.resource.memory-mb;
it was 54,272 and
now I set it to 54,272 + 896 = 55,168.
Kelly, can I ask you a couple of questions?
1. Is it possible to add a YARN label to particular instance group boxes on
EMR?
2. In addition to maximizeResourceAllocation
of somewhat of a bug that makes
> it
> >> >> > not
> >> >> > reserve any space for the AM, which ultimately results in one of
> the
> >> >> > nodes
> >> >> > being utilized only by the AM and not an executor.
> >> &
reserve any space for the AM, which ultimately results in one of the
>> >> > nodes
>> >> > being utilized only by the AM and not an executor.
>> >> >
>> >> > However, as you point out, the only viable fix seems to be to reserve
>> &
or.
> >> >
> >> > However, as you point out, the only viable fix seems to be to reserve
> >> > enough
> >> > memory for the AM on *every single node*, which in some cases might
> >> > actually
> >> > be worse than wasting a lot o
Praveen,
You mean cluster mode, right? That would still in a sense cause one box to
be "wasted", but at least it would be used a bit more to its full
potential, especially if you set spark.driver.memory to higher than its 1g
default. Also, cluster mode is not an option for some applications, such
which in some cases might
> >> > actually
> >> > be worse than wasting a lot of memory on a single node.
> >> >
> >> > So yeah, I also don't like either option. Is this just the price you
> pay
> >> > for
> >> > running o
How about running in client mode, so that the client from which it is run
becomes the driver.
Regards,
Praveen
On 9 Feb 2016 16:59, "Steve Loughran" wrote:
>
> > On 9 Feb 2016, at 06:53, Sean Owen wrote:
> >
> >
> > I think you can let YARN over-commit RAM though, and allocate more
> > memory t
> On 9 Feb 2016, at 06:53, Sean Owen wrote:
>
>
> I think you can let YARN over-commit RAM though, and allocate more
> memory than it actually has. It may be beneficial to let them all
> think they have an extra GB, and let one node running the AM
> technically be overcommitted, a state which w
>> > memory for the AM on *every single node*, which in some cases might
>> > actually
>> > be worse than wasting a lot of memory on a single node.
>> >
>> > So yeah, I also don't like either option. Is this just the price you pay
>> > for
> > be worse than wasting a lot of memory on a single node.
> >
> > So yeah, I also don't like either option. Is this just the price you pay
> for
> > running on YARN?
> >
> >
> > ~ Jonathan
> >
> > On Mon, Feb 8, 2016 at 9:03 PM Alexand
lot of memory on a single node.
>
> So yeah, I also don't like either option. Is this just the price you pay for
> running on YARN?
>
>
> ~ Jonathan
>
> On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov
> wrote:
>>
>> Lets say that yarn has 53GB memor
g on YARN?
~ Jonathan
On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov
wrote:
> Lets say that yarn has 53GB memory available on each slave
>
> spark.am container needs 896MB. (512 + 384)
>
> I see two options to configure spark:
>
> 1. configure spark executors to use 5
Let's say that YARN has 53GB of memory available on each slave.
The spark.am container needs 896MB (512 + 384).
I see two options to configure Spark:
1. Configure Spark executors to use 52GB and leave 1GB on each box. So
some box will also run the AM container, and 1GB of memory will not be used on all
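As a rough sketch of option 1 with the numbers above (896MB = 512MB for spark.yarn.am.memory plus the 384MB minimum overhead); the exact executor sizing is illustrative and would need tuning for a real cluster:

import org.apache.spark.SparkConf

// Leave roughly 1GB per 53GB node for the AM container; executor memory plus
// its overhead should then fit in the remaining ~52GB.
val conf = new SparkConf()
  .set("spark.yarn.am.memory", "512m")
  .set("spark.executor.memory", "48g")
  .set("spark.yarn.executor.memoryOverhead", "4096")   // MB, 1.x property name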
Thank you for your reply!
I have already made the change locally, so changing it would be fine.
I just wanted to be sure which way is correct.
On 9 Dec 2015 18:20, "Fengdong Yu" wrote:
> I don’t think there is a performance difference between the 1.x API and the 2.x API.
>
> but i
I don’t think there is a performance difference between the 1.x API and the 2.x API.
But it’s not a big issue for your change, only
com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java
<https://github.com/databricks/spark-xml/blob/master/src/main/java/com/databricks/hadoop/mapreduce/lib/in
Hi all,
I am writing this email to both user-group and dev-group since this is
applicable to both.
I am now working on Spark XML datasource (
https://github.com/databricks/spark-xml).
This uses an InputFormat implementation which I downgraded to the Hadoop 1.x API for
version compatibility.
However, I
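For context, both InputFormat generations are reachable from Spark, which is why the choice is mostly about compatibility rather than speed; a sketch with the plain text formats standing in for XmlInputFormat (paths are illustrative, spark-shell assumed):

import org.apache.hadoop.io.{LongWritable, Text}

// Hadoop 1.x-style ("mapred") API:
val oldApi = sc.hadoopFile[LongWritable, Text,
  org.apache.hadoop.mapred.TextInputFormat]("/tmp/sample.xml")

// Hadoop 2.x-style ("mapreduce") API:
val newApi = sc.newAPIHadoopFile[LongWritable, Text,
  org.apache.hadoop.mapreduce.lib.input.TextInputFormat]("/tmp/sample.xml")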
S3 bucket with every release?
>>
>> Is the implied answer that we should continue to expect the same set of
>> artifacts for every release for the foreseeable future?
>>
>> Nick
>>
>>
>> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell wrote:
>>
Sounds good to me.
For my purposes, I'm less concerned about old Spark artifacts and more
concerned about the consistency of the set of artifacts that get generated
with new releases. (e.g. Each new release will always include one artifact
each for Hadoop 1, Hadoop 1 + Scala 2.11, etc...
ture?
>
> Nick
>
>
> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell wrote:
>
>> The missing artifacts are uploaded now. Things should propagate in the
>> next 24 hours. If there are still issues past then ping this thread. Thanks!
>>
>> - Patrick
>>
foreseeable future?
Nick
On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell wrote:
> The missing artifacts are uploaded now. Things should propagate in the
> next 24 hours. If there are still issues past then ping this thread. Thanks!
>
> - Patrick
>
> On Mon, Oct 5, 2015 at 2:41 PM
The missing artifacts are uploaded now. Things should propagate in the next
24 hours. If there are still issues past then ping this thread. Thanks!
- Patrick
On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas wrote:
> Thanks for looking into this Josh.
>
> On Mon, Oct 5, 2015 at 5:39 PM Josh Rose
Thanks for looking into this Josh.
On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen wrote:
> I'm working on a fix for this right now. I'm planning to re-run a modified
> copy of the release packaging scripts which will emit only the missing
> artifacts (so we won't upload new artifacts with different S
I'm working on a fix for this right now. I'm planning to re-run a modified
copy of the release packaging scripts which will emit only the missing
artifacts (so we won't upload new artifacts with different SHAs for the
builds which *did* succeed).
I expect to have this finished in the next day or s
Blaž said:
Also missing is
http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
which breaks spark-ec2 script.
This is the package I am referring to in my original email.
Nick said:
It appears that almost every version of Spark up to and including 1.5.0 has
included a --bin
Also missing is
http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
which breaks spark-ec2 script.
On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu wrote:
> hadoop1 package for Scala 2.10 wasn't in RC1 either:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
hadoop1 package for Scala 2.10 wasn't in RC1 either:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas wrote:
> I’m looking here:
>
> https://s3.amazonaws.com/spark-related-packages/
>
> I believe this is where one set of offi
I’m looking here:
https://s3.amazonaws.com/spark-related-packages/
I believe this is where one set of official packages is published. Please
correct me if this is not the case.
It appears that almost every version of Spark up to and including 1.5.0 has
included a --bin-hadoop1.tgz release (e.g.
he
>> driver is receiving the result of only 4 tasks, which is relatively
>> small.
>>
>> Mike
>>
>>
>> On 9/26/15, Evan R. Sparks wrote:
>> > Mike,
>> >
>> > I believe the reason you're seeing near identical performance on the
wrote:
> > Mike,
> >
> > I believe the reason you're seeing near identical performance on the
> > gradient computations is twofold
> > 1) Gradient computations for GLM models are computationally pretty cheap
> > from a FLOPs/byte read perspective. They are essential
very level. Furthermore, the
driver is receiving the result of only 4 tasks, which is relatively
small.
Mike
On 9/26/15, Evan R. Sparks wrote:
> Mike,
>
> I believe the reason you're seeing near identical performance on the
> gradient computations is twofold
> 1) Gradient c
Looks like the problem is that df.rdd does not work very well with limit. In
Scala, df.limit(1).rdd will also trigger the issue you observed. I will add
this to the JIRA.
On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote:
> I just noticed you found 1.4 has the same issue. I added that as well
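A tiny repro sketch of the behaviour described above, assuming a 1.x spark-shell where sqlContext is available; the row count is arbitrary:

val df = sqlContext.range(0, 1000000)

df.take(1)               // fetches a single row cheaply
df.limit(1).rdd.count()  // goes through the RDD conversion path tracked in SPARK-10731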
I just noticed you found 1.4 has the same issue. I added that as well in
the ticket.
On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote:
> Hi Yin,
>
> You are right! I just tried the scala version with the above lines, it
> works as expected.
> I'm not sure if it happens als
actually a bit. I created a
ticket for this (SPARK-10731
<https://issues.apache.org/jira/browse/SPARK-10731>).
Best Regards,
Jerry
On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai wrote:
> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset. I was surprised by
>>> the execution plan of take(1), head