+1

2023-04-05 Thread ISHIZAKI Kazuaki
Thank you On Wed, Apr 5, 2023 at 21:32, yangjie01 : > +1 > > > > *From:* Yuming Wang > *Date:* Wednesday, April 5, 2023 14:39 > *To:* Xinrong Meng > *Cc:* Hyukjin Kwon , Chao Sun , > Holden Karau , "L. C. Hsieh" , > Mridul Muralidharan , "dev@spark.apache.org&

Re: Contributor data in github-page no longer updated after May 1

2022-05-11 Thread Hyukjin Kwon
It's very likely a GitHub issue On Wed, 11 May 2022 at 18:01, Yang,Jie(INF) wrote: > Hi, teams > > > > The contributors data in the following page seems no longer updated after > May 1, Can anyone fix it? > > > > > https://github.com/apache/spark/graphs/c

Contributor data in github-page no longer updated after May 1

2022-05-11 Thread Yang,Jie(INF)
Hi, teams The contributors data in the following page seems no longer updated after May 1, Can anyone fix it? https://github.com/apache/spark/graphs/contributors?from=2022-05-01&to=2022-05-11&type=c Warm regards, YangJie

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-07-01 Thread Steve Loughran
e a > manifest file to the job attempt dir pointing to the successful task > attempt; commit that with their atomic file rename. The committer plugin > point in MR lets you declare a committer factory for each FS, so it could > be done without any further changes to spark. > > On Thu, 25

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-29 Thread Steve Loughran
, so it could be done without any further changes to spark. On Thu, 25 Jun 2020 at 22:38, Waleed Fateem wrote: > I was trying to make my email short and concise, but the rationale behind > setting that as 1 by default is because it's safer. With algorithm version > 2 you run the r

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Waleed Fateem
I was trying to make my email short and concise, but the rationale behind setting that as 1 by default is because it's safer. With algorithm version 2 you run the risk of having bad data in cases where tasks fail or even duplicate data if a task fails and succeeds on a reattempt (I don'

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Sean Owen
I think it is a Hadoop property that is just passed through? If the default is different in Hadoop 3 we could mention that in the docs. I don't know if we want to always set it to 1 as a Spark default, even in Hadoop 3, right? On Thu, Jun 25, 2020 at 2:43 PM Waleed Fateem wrote: > > H

Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Waleed Fateem
Hello! I noticed that in the documentation starting with 2.2.0 it states that the parameter spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 by default: https://issues.apache.org/jira/browse/SPARK-20107 I don't actually see this being set anywhere explicitly in the Spark
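
A minimal sketch of how this property can be pinned explicitly, assuming a Spark 2.x+ SparkSession; any spark.hadoop.* key is simply forwarded into the underlying Hadoop Configuration:

    import org.apache.spark.sql.SparkSession

    // Sketch only: explicitly pin the Hadoop output committer algorithm to version 1.
    val spark = SparkSession.builder()
      .appName("committer-v1-example")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
      .getOrCreate()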

DataSourceV2 community sync notes - 1 May 2019

2019-05-06 Thread Ryan Blue
Here are my notes for the latest DSv2 community sync. As usual, if you have comments or corrections, please reply. If you’d like to be invited to the next sync, email me directly. Everyone is welcome to attend. *Attendees*: Ryan Blue John Zhuge Andrew Long Bruce Robbins Dilip Biswal Gengliang Wang

CfP VHPC19: HPC Virtualization-Containers: Paper due May 1, 2019 (extended)

2019-04-03 Thread VHPC 19
rmany. (Springer LNCS Proceedings) Date: June 20, 2019 Workshop URL: http://vhpc.org Paper Submission Deadline: May 1, 2019 (extended) Springer LNCS, rolling abstract submission Abstract/Paper Submission Link: https://edas

Why does spark.range(1).write.mode("overwrite").saveAsTable("t1") throw an Exception?

2018-10-30 Thread Jacek Laskowski
Hi, Just ran into it today and wonder whether it's a bug or something I may have missed before. scala> spark.version res21: String = 2.3.2 // that's OK scala> spark.range(1).write.saveAsTable("t1") org.apache.spark.sql.AnalysisException: Table

Why can per task's memory only reach 1 / numTasks, not greater than 1 / numTasks, in ExecutionMemoryPool?

2018-06-05 Thread John Fang
In fact, not all tasks belong to the same stage, so each task's memory demand may be different. For example, the executor is running two tasks (A and B), and the ExecutionMemoryPool owns 1000M. We would hope task-A can occupy 900M and task-B 100M, because task-A needs much mo

Re: Isolate 1 partition and perform computations

2018-04-16 Thread Thodoris Zois
> >> > you might wanna have a look into using a PartitionPruningRDD to select >> > a subset of partitions by ID. This approach worked very well for >> > multi-key lookups for us [1]. >> > >> > A major advantage compared to scan-based operations is that,

Re: Isolate 1 partition and perform computations

2018-04-16 Thread Anastasios Zouzias
to work? > > - Thodoris > > > > On 15 Apr 2018, at 01:40, Matthias Boehm wrote: > > > > you might wanna have a look into using a PartitionPruningRDD to select > > a subset of partitions by ID. This approach worked very well for > > multi-key lookups for us [1]. &

Re: Isolate 1 partition and perform computations

2018-04-14 Thread Thodoris Zois
for > multi-key lookups for us [1]. > > A major advantage compared to scan-based operations is that, if your > source RDD has an existing partitioner, only relevant partitions are > accessed. > > [1] > https://github.com/apache/systemml/blob/master/src/main/java/org/apa

Re: Isolate 1 partition and perform computations

2018-04-14 Thread Matthias Boehm
you might wanna have a look into using a PartitionPruningRDD to select a subset of partitions by ID. This approach worked very well for multi-key lookups for us [1]. A major advantage compared to scan-based operations is that, if your source RDD has an existing partitioner, only relevant
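
A minimal sketch of the PartitionPruningRDD approach described above, assuming an existing SparkContext sc; the partition count and ID are illustrative:

    import org.apache.spark.rdd.PartitionPruningRDD

    val sourceRdd = sc.parallelize(1 to 1000, numSlices = 500)
    // Keep only partition 42; the other partitions are never computed or scanned.
    val pruned = PartitionPruningRDD.create(sourceRdd, partitionId => partitionId == 42)
    pruned.collect()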

Isolate 1 partition and perform computations

2018-04-14 Thread Thodoris Zois
Hello list, I am sorry for sending this message here, but I could not manage to get any response in “users”. For specific purposes I would like to isolate 1 partition of the RDD and perform computations only to this. For instance, suppose that a user asks Spark to create 500 partitions for

Re: Accumulators of Spark 1.x no longer work with Spark 2.x

2018-03-15 Thread Sergey Zhemzhitsky
One more option is to override writeReplace [1] in LegacyAccumulatorWrapper to prevent such failures. What do you think? [1] https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L158 On Fri, Mar 16, 2018 at

Accumulators of Spark 1.x no longer work with Spark 2.x

2018-03-15 Thread Sergey Zhemzhitsky
Hi there, I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x failing with java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy It happens while serializing an accumulator here [1] although copyAndReset returns zero-value copy for
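
For context, a minimal sketch of the Spark 2.x AccumulatorV2 contract that the quoted assertion enforces: copyAndReset (copy() followed by reset() by default) must return a copy for which isZero is true. The class below is illustrative, not taken from the thread:

    import org.apache.spark.util.AccumulatorV2

    class LongSetAccumulator extends AccumulatorV2[Long, Set[Long]] {
      private var set = Set.empty[Long]
      override def isZero: Boolean = set.isEmpty            // must hold after copyAndReset
      override def copy(): LongSetAccumulator = { val a = new LongSetAccumulator; a.set = set; a }
      override def reset(): Unit = { set = Set.empty }
      override def add(v: Long): Unit = { set += v }
      override def merge(other: AccumulatorV2[Long, Set[Long]]): Unit = { set ++= other.value }
      override def value: Set[Long] = set
    }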

Container exited with a non-zero exit code 1

2017-06-24 Thread Link Qian
any suggestion from spark dev group? From: Link Qian Sent: Friday, June 23, 2017 9:58 AM To: u...@spark.apache.org Subject: Container exited with a non-zero exit code 1 Hello, I submit a spark job to YARN cluster with spark-submit command. the environment

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread 萝卜丝炒饭
--- From: "Sean Owen" Date: 2017/6/15 16:13:11 To: "user";"dev";"萝卜丝炒饭"<1427357...@qq.com>; Subject: Re: the dependence length of RDD, can its size be greater than 1 please? Yes. Imagine an RDD that results from a union of other RDDs. O

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Sean Owen
Yes. Imagine an RDD that results from a union of other RDDs. On Thu, Jun 15, 2017, 09:11 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue
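
A minimal sketch of the union example above, assuming an existing SparkContext sc:

    val a = sc.parallelize(1 to 10)
    val b = sc.parallelize(11 to 20)
    val u = a.union(b)
    // One dependency per parent RDD, so the length is 2 here;
    // a join/cogroup likewise carries one dependency per parent.
    println(u.dependencies.length)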

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Reynold Xin
A join? On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue about this. >

the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread 萝卜丝炒饭
Hi all, The RDD code keeps a member as below: dependencies_ : seq[Dependency[_]] It is a seq, that means it can keep more than one dependency. I have an issue about this. Is it possible that its size is greater than one please? If yes, how to produce it please? Would you like show me some cod

Re: [build system] jenkins restart in ~1 hour

2017-02-16 Thread shane knapp
and we're back! :) On Thu, Feb 16, 2017 at 10:22 AM, shane knapp wrote: > we don't have many builds running right now, and i need to restart the > daemon quickly to enable a new plugin. > > i'll wait until the pull request builder jobs are finished and then > (gently) kick jenkins. > > updates a

[build system] jenkins restart in ~1 hour

2017-02-16 Thread shane knapp
we don't have many builds running right now, and i need to restart the daemon quickly to enable a new plugin. i'll wait until the pull request builder jobs are finished and then (gently) kick jenkins. updates as they come, shane (who's always nervous about touching this house of cards)

Re: Why are ml models repartition(1)'d in save methods?

2017-01-16 Thread Asher Krim
vecmodel-exceeds-max-rpc-size-for-saving) >> * "feature parity" with mllib (issues with one large model file already >> solved for mllib in SPARK-11994 >> <https://issues.apache.org/jira/browse/SPARK-11994>) >> >> >> On Fri, Jan 13, 2017 at 1:02 PM, Nick Pe

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen
ipate in saving the model > * avoids rpc issues ( > http://stackoverflow.com/questions/40842736/spark-word2vecmodel-exceeds-max-rpc-size-for-saving > ) > * "feature parity" with mllib (issues with one large model file already > solved for mllib in SPARK-11994 > <https://issues.apa

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
parity" with mllib (issues with one large model file already solved for mllib in SPARK-11994 <https://issues.apache.org/jira/browse/SPARK-11994>) On Fri, Jan 13, 2017 at 1:02 PM, Nick Pentreath wrote: > Yup - it's because almost all model data in spark ML (model coefficients

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
're referring to code that serializes models, which are quite small. > For example a PCA model consists of a few principal component vector. It's > a Dataset of just one element being saved here. It's re-using the code path > normally used to save big data sets, to output 1 file with 1 th

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen
You're referring to code that serializes models, which are quite small. For example a PCA model consists of a few principal component vector. It's a Dataset of just one element being saved here. It's re-using the code path normally used to save big data sets, to output 1 file

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
Fri, Jan 13, 2017 at 5:23 PM Asher Krim wrote: > >> Hi, >> >> I'm curious why it's common for data to be repartitioned to 1 partition >> when saving ml models: >> >> sqlContext.createDataFrame(Seq(data)).repartition(1).write. >> parquet(dataPath)

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen
That is usually so the result comes out in one file, not partitioned over n files. On Fri, Jan 13, 2017 at 5:23 PM Asher Krim wrote: > Hi, > > I'm curious why it's common for data to be repartitioned to 1 partition > when saving ml models: > > sqlContext.createDataFr
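
A small sketch of the effect described above; modelDf and the output paths are placeholders:

    // With repartition(1) the parquet output directory holds a single part file.
    modelDf.repartition(1).write.parquet(singleFilePath)
    // Without it, one part file is written per partition of modelDf.
    modelDf.write.parquet(multiFilePath)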

Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
Hi, I'm curious why it's common for data to be repartitioned to 1 partition when saving ml models: sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath) This shows up in most ml models I've seen (Word2Vec <https://github.com/apache/spark/blob/master/ml

Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Shixiong(Ryan) Zhu
If you create a HiveContext before starting StreamingContext, then `SQLContext.getOrCreate` in foreachRDD will return the HiveContext you created. You can just call asInstanceOf[HiveContext] to convert it to HiveContext. On Tue, Nov 22, 2016 at 8:25 AM, Dirceu Semighini Filho < dirceu.semigh...@gm
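
A sketch of the pattern suggested above, assuming Spark 1.x streaming APIs and an existing SparkContext sc and DStream stream:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext

    // Create the HiveContext once, before the StreamingContext starts.
    val hiveContext = new HiveContext(sc)

    stream.foreachRDD { rdd =>
      // getOrCreate returns the HiveContext created above, not a fresh SQLContext.
      val hc = SQLContext.getOrCreate(rdd.sparkContext).asInstanceOf[HiveContext]
      // ... run Hive-backed SQL for this micro-batch using hc
    }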

Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Dirceu Semighini Filho
Hi Koert, Certainly it's not a good idea, I was trying to use SQLContext.getOrCreate but it will return a SQLContext and not a HiveContext. As I'm using a checkpoint, whenever I start the context by reading the checkpoint it didn't create my hive context, unless I create it for each microbatch. I did

Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Koert Kuipers
you are creating a new hive context per microbatch? is that a good idea? On Tue, Nov 22, 2016 at 8:51 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Has anybody seen this behavior (see the attached picture) in Spark > Streaming? It started to happen here after I changed the H

[SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Dirceu Semighini Filho
Has anybody seen this behavior (see the attached picture) in Spark Streaming? It started to happen here after I changed the HiveContext creation to stream.foreachRDD { rdd => val hiveContext = new HiveContext(rdd.sparkContext) } Is this expected? Kind Regards, Dirceu

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-09-16 Thread Michael Heuer
n't > mean you can't use classifiers. > It is worse (or better) than that, profiles didn't work for us in combination with Scala 2.10/2.11, so we modify the POM in place as part of CI and the release process. > I have seen it used for HBase, core Hadoop. I am not sure I've seen

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Sean Owen
ore Hadoop. I am not sure I've seen it used for Spark 2 vs 1 but no reason it couldn't be. Frequently projects would instead declare that as of some version, Spark 2 is required, rather than support both. Or shim over an API difference with reflection if that's all there was to it.

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Have you seen any successful applications of this for Spark 1.x/2.x? >From the doc "The classifier allows to distinguish artifacts that were built from the same POM but differ in their content." We'd be building from different POMs, since we'd be modifying the Spark

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Sean Owen
This is also what "classifiers" are for in Maven, to have variations on one artifact and version. https://maven.apache.org/pom.html It has been used to ship code for Hadoop 1 vs 2 APIs. In a way it's the same idea as Scala's "_2.xx" naming convention, with a less un
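
For illustration, in sbt (Scala) syntax a dependency on a classified artifact would look roughly like the following; the coordinates and classifier name are made up:

    // Pull the variant of the artifact that was published under the "spark2" classifier.
    libraryDependencies += "org.example" % "some-lib_2.11" % "1.0.0" classifier "spark2"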

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Ah yes, thank you for the clarification. On Wed, Aug 24, 2016 at 11:44 AM, Ted Yu wrote: > 'Spark 1.x and Scala 2.10 & 2.11' was repeated. > > I guess your second line should read: > > org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1] for Spark 2.x >

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Ted Yu
'Spark 1.x and Scala 2.10 & 2.11' was repeated. I guess your second line should read: org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1] for Spark 2.x and Scala 2.10 & 2.11 On Wed, Aug 24, 2016 at 9:41 AM, Michael Heuer wrote: > Hello, > > We're a proj

Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Hello, We're a project downstream of Spark and need to provide separate artifacts for Spark 1.x and Spark 2.x. Has any convention been established or even proposed for artifact names and/or qualifiers? We are currently thinking org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1] for Spar

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Saisai Shao
Using the dominant resource calculator instead of the default resource calculator will get you the expected vcores, as you wanted. Basically, by default YARN does not honor CPU cores as a resource, so you will always see vcore as 1 no matter what number of cores you set in Spark. On Wed, Aug 3, 2016 at 12:11 PM

Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-02 Thread satyajit vegesna
Hi All, I am trying to run a spark job using yarn, and i specify --executor-cores value as 20. But when i go check the "nodes of the cluster" page in http://hostname:8088/cluster/nodes then i see 4 containers getting created on each of the node in cluster. But can only see 1 vco

ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
BlockManager) 16/06/15 19:45:43 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks java.io.IOException: Failed to connect to /192.168.56.1:56413 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at

Re: Cross Validator to work with K-Fold value of 1?

2016-05-03 Thread Yanbo Liang
Here is the JIRA and PR for supporting PolynomialExpansion with degree 1, and it has been merged. https://issues.apache.org/jira/browse/SPARK-13338 https://github.com/apache/spark/pull/11216 2016-05-02 9:20 GMT-07:00 Nick Pentreath : > There is a JIRA and PR around for supporting polynom
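
A sketch of what the merged change allows, assuming the spark.ml API; column names are illustrative:

    import org.apache.spark.ml.feature.PolynomialExpansion

    // After SPARK-13338, degree 1 is accepted and amounts to a no-op expansion.
    val px = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("expanded")
      .setDegree(1)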

Re: Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Nick Pentreath
There is a JIRA and PR around for supporting polynomial expansion with degree 1. Offhand I can't recall if it's been merged On Mon, 2 May 2016 at 17:45, Julio Antonio Soto de Vicente wrote: > Hi, > > Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It

Re: Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Julio Antonio Soto de Vicente
Hi, Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It would be nice to cross-validate with degree 1 polynomial expansion (that is, with no expansion at all) vs other degree polynomial expansions. Unfortunately, degree is forced to be >= 2. -- Julio > On 2 May 2

Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Rahul Tanwani
it be an okay idea to generalize the cross validator so it can work with k-fold value of 1? The only purpose for this is to avoid maintaining two different code paths and in functionality it should be similar to as if the cross validation is not present. -- View this message in context: http

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-15 Thread Cody Koeninger
>> one of the basic guarantees of kafka, which is in-order processing on >> a per-topicpartition basis. >> >> As far as PRs go, because of the new consumer interface for kafka 0.9 >> and 0.10, there's a lot of potential change already underway. >> >>

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-14 Thread Renyi Xiong
a lot of potential change already underway. > > See > > https://issues.apache.org/jira/browse/SPARK-12177 > > On Thu, Mar 10, 2016 at 1:59 PM, Renyi Xiong > wrote: > > Hi TD, > > > > Thanks a lot for offering to look at our PR (if we fire one) at the > > conf

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Cody Koeninger
erway. See https://issues.apache.org/jira/browse/SPARK-12177 On Thu, Mar 10, 2016 at 1:59 PM, Renyi Xiong wrote: > Hi TD, > > Thanks a lot for offering to look at our PR (if we fire one) at the > conference NYC. > > As we discussed briefly the issues of unbalanced and

DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Renyi Xiong
take that dependency c. Topic partition split happens only when configured there're some other more complicated changes related to fault tolerance which are irrelevant here (but you're more than welcome to comment on them too) and are introduced to unblock the scenarios we're exper

Re: Kmeans++ using 1 core only Was: Slowness in Kmeans calculating fastSquaredDistance

2016-02-09 Thread Li Ming Tsai
(SPARK-3424<https://issues.apache.org/jira/browse/SPARK-3424>) in the initialisation phase and is local to driver using 1 core only. If I use random, the job completed in 1.5mins compared to 1hr+. Should I move this to the dev list? Regards, Liming Fr
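
A sketch of the workaround mentioned above (switching from the default k-means|| initialization to random), assuming an MLlib RDD[Vector] called vectors:

    import org.apache.spark.mllib.clustering.KMeans

    // Random initialization skips the driver-local k-means|| init phase discussed above.
    val model = new KMeans()
      .setK(100)
      .setInitializationMode(KMeans.RANDOM)
      .run(vectors)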

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
ue, Feb 9, 2016 at 12:36 PM Alexander Pivovarov > wrote: > >> Thanks Jonathan >> >> Actually I'd like to use maximizeResourceAllocation. >> >> Ideally for me would be to add new instance group having single small box >> labelled as AM >> I'm n

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
eResourceAllocation. > > Ideally for me would be to add new instance group having single small box > labelled as AM > I'm not sure "aws emr create-cluster" supports setting custom LABELS, the > only settings available are: > > InstanceCount=1,BidPrice=0.5,Name=spark

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
t > Spark AMs are probably running even on TASK instances currently, which is > OK but not consistent with what we do for MapReduce. I'll make sure we > set spark.yarn.am.nodeLabelExpression appropriately in the next EMR release. > > ~ Jonathan > > On Tue, Feb 9, 2016 at 1:3

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
ake sure we set spark.yarn.am.nodeLabelExpression appropriately in the next EMR release. ~ Jonathan On Tue, Feb 9, 2016 at 1:30 PM Marcelo Vanzin wrote: > On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly > wrote: > > And we do set yarn.app.mapreduce.am.labels=CORE > > That sou

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Marcelo Vanzin
On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly wrote: > And we do set yarn.app.mapreduce.am.labels=CORE That sounds very mapreduce-specific, so I doubt Spark (or anything non-MR) would honor it. -- Marcelo - To unsubscribe, e

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
Thanks Jonathan Actually I'd like to use maximizeResourceAllocation. Ideally for me would be to add a new instance group having a single small box labelled as AM I'm not sure "aws emr create-cluster" supports setting custom LABELS, the only settings available are: InstanceCount

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
>> >> > nodes > >> >> > being utilized only by the AM and not an executor. > >> >> > > >> >> > However, as you point out, the only viable fix seems to be to > reserve > >> >> > enough > >> >> > memo

RE: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Diwakar Dhanuskodi
 enabled . Sent from Samsung Mobile. Sent from Samsung Mobile. Original message From: Alexander Pivovarov Date:09/02/2016 10:33 (GMT+05:30) To: dev@spark.apache.org Cc: Subject: spark on yarn wastes one box (or 1 GB on each box) for am container Lets say that yarn has 53GB

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
I mean Jonathan On Tue, Feb 9, 2016 at 10:41 AM, Alexander Pivovarov wrote: > I decided to do YARN over-commit and add 896 > to yarn.nodemanager.resource.memory-mb > it was 54,272 > now I set it to 54,272+896 = 55,168 > > Kelly, can I ask you couple questions > 1. it is

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
I decided to do YARN over-commit and add 896 to yarn.nodemanager.resource.memory-mb it was 54,272 now I set it to 54,272+896 = 55,168 Kelly, can I ask you couple questions 1. it is possible to add yarn label to particular instance group boxes on EMR? 2. in addition to maximizeResourceAllocation

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
of somewhat of a bug that makes > it > >> >> > not > >> >> > reserve any space for the AM, which ultimately results in one of > the > >> >> > nodes > >> >> > being utilized only by the AM and not an executor. > >> &

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Marcelo Vanzin
reserve any space for the AM, which ultimately results in one of the >> >> > nodes >> >> > being utilized only by the AM and not an executor. >> >> > >> >> > However, as you point out, the only viable fix seems to be to reserve >> &

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
or. > >> > > >> > However, as you point out, the only viable fix seems to be to reserve > >> > enough > >> > memory for the AM on *every single node*, which in some cases might > >> > actually > >> > be worse than wasting a lot o

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Praveen, You mean cluster mode, right? That would still in a sense cause one box to be "wasted", but at least it would be used a bit more to its full potential, especially if you set spark.driver.memory to higher than its 1g default. Also, cluster mode is not an option for some applications, such

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
which in some cases might > >> > actually > >> > be worse than wasting a lot of memory on a single node. > >> > > >> > So yeah, I also don't like either option. Is this just the price you > pay > >> > for > >> > running o

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread praveen S
How about running in client mode, so that the client from which it is run becomes the driver. Regards, Praveen On 9 Feb 2016 16:59, "Steve Loughran" wrote: > > > On 9 Feb 2016, at 06:53, Sean Owen wrote: > > > > > > I think you can let YARN over-commit RAM though, and allocate more > > memory t

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Steve Loughran
> On 9 Feb 2016, at 06:53, Sean Owen wrote: > > > I think you can let YARN over-commit RAM though, and allocate more > memory than it actually has. It may be beneficial to let them all > think they have an extra GB, and let one node running the AM > technically be overcommitted, a state which w

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Sean Owen
t;> > memory for the AM on *every single node*, which in some cases might >> > actually >> > be worse than wasting a lot of memory on a single node. >> > >> > So yeah, I also don't like either option. Is this just the price you pay >> > for

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
> > be worse than wasting a lot of memory on a single node. > > > > So yeah, I also don't like either option. Is this just the price you pay > for > > running on YARN? > > > > > > ~ Jonathan > > > > On Mon, Feb 8, 2016 at 9:03 PM Alexand

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Sean Owen
lot of memory on a single node. > > So yeah, I also don't like either option. Is this just the price you pay for > running on YARN? > > > ~ Jonathan > > On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov > wrote: >> >> Lets say that yarn has 53GB memor

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Jonathan Kelly
g on YARN? ~ Jonathan On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov wrote: > Lets say that yarn has 53GB memory available on each slave > > spark.am container needs 896MB. (512 + 384) > > I see two options to configure spark: > > 1. configure spark executors to use 5

spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Alexander Pivovarov
Lets say that yarn has 53GB memory available on each slave spark.am container needs 896MB. (512 + 384) I see two options to configure spark: 1. configure spark executors to use 52GB and leave 1 GB on each box. So, some box will also run am container. So, 1GB memory will not be used on all
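
For reference, the 896 MB figure above matches the standard client-mode AM settings (the 512 MB default AM heap plus the 384 MB minimum overhead); a sketch of setting them explicitly, values in MiB:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.am.memory", "512m")
      .set("spark.yarn.am.memoryOverhead", "384")   // defaults shown; 512 + 384 = 896 MB container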

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Thank you for your reply! I have already done the change locally, so changing it would be fine. I just wanted to be sure which way is correct. On 9 Dec 2015 18:20, "Fengdong Yu" wrote: > I don’t think there is performance difference between 1.x API and 2.x API. > > but i

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is performance difference between 1.x API and 2.x API. but it’s not a big issue for your change, only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java <https://github.com/databricks/spark-xml/blob/master/src/main/java/com/databricks/hadoop/mapreduce/lib/in

Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all, I am writing this email to both user-group and dev-group since this is applicable to both. I am now working on Spark XML datasource ( https://github.com/databricks/spark-xml). This uses an InputFormat implementation which I downgraded to Hadoop 1.x for version compatibility. However, I
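
For readers unfamiliar with the two API generations being compared, a sketch of reading the same path through each, assuming an existing SparkContext sc (paths illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
    import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}

    // Hadoop 1.x-style ("mapred") API
    val oldApiRdd = sc.hadoopFile[LongWritable, Text, OldTextInputFormat]("hdfs:///data/in")
    // Hadoop 2.x-style ("mapreduce") API
    val newApiRdd = sc.newAPIHadoopFile[LongWritable, Text, NewTextInputFormat]("hdfs:///data/in")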

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Sean Owen
S3 bucket with every release? >> >> Is the implied answer that we should continue to expect the same set of >> artifacts for every release for the foreseeable future? >> >> Nick >> >> >> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell wrote: >>

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
Sounds good to me. For my purposes, I'm less concerned about old Spark artifacts and more concerned about the consistency of the set of artifacts that get generated with new releases. (e.g. Each new release will always include one artifact each for Hadoop 1, Hadoop 1 + Scala 2.11, etc...

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Patrick Wendell
ture? > > Nick > ​ > > On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell wrote: > >> The missing artifacts are uploaded now. Things should propagate in the >> next 24 hours. If there are still issues past then ping this thread. Thanks! >> >> - Patrick >>

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
foreseeable future? Nick ​ On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell wrote: > The missing artifacts are uploaded now. Things should propagate in the > next 24 hours. If there are still issues past then ping this thread. Thanks! > > - Patrick > > On Mon, Oct 5, 2015 at 2:41 PM

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Patrick Wendell
The missing artifacts are uploaded now. Things should propagate in the next 24 hours. If there are still issues past then ping this thread. Thanks! - Patrick On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas wrote: > Thanks for looking into this Josh. > > On Mon, Oct 5, 2015 at 5:39 PM Josh Rose

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Thanks for looking into this Josh. On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen wrote: > I'm working on a fix for this right now. I'm planning to re-run a modified > copy of the release packaging scripts which will emit only the missing > artifacts (so we won't upload new artifacts with different S

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Josh Rosen
I'm working on a fix for this right now. I'm planning to re-run a modified copy of the release packaging scripts which will emit only the missing artifacts (so we won't upload new artifacts with different SHAs for the builds which *did* succeed). I expect to have this finished in the next day or s

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Blaž said: Also missing is http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script. This is the package I am referring to in my original email. Nick said: It appears that almost every version of Spark up to and including 1.5.0 has included a —bin

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Blaž Šnuderl
Also missing is http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script. On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu wrote: > hadoop1 package for Scala 2.10 wasn't in RC1 either: > http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-04 Thread Ted Yu
hadoop1 package for Scala 2.10 wasn't in RC1 either: http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/ On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas wrote: > I’m looking here: > > https://s3.amazonaws.com/spark-related-packages/ > > I believe this is where one set of offi

Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-04 Thread Nicholas Chammas
I’m looking here: https://s3.amazonaws.com/spark-related-packages/ I believe this is where one set of official packages is published. Please correct me if this is not the case. It appears that almost every version of Spark up to and including 1.5.0 has included a --bin-hadoop1.tgz release (e.g.

Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
he >> driver is receiving the result of only 4 tasks, which is relatively >> small. >> >> Mike >> >> >> On 9/26/15, Evan R. Sparks wrote: >> > Mike, >> > >> > I believe the reason you're seeing near identical performance on the >&

Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Evan R. Sparks
wrote: > > Mike, > > > > I believe the reason you're seeing near identical performance on the > > gradient computations is twofold > > 1) Gradient computations for GLM models are computationally pretty cheap > > from a FLOPs/byte read perspective. They are essential

treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
very level. Furthermore, the driver is receiving the result of only 4 tasks, which is relatively small. Mike On 9/26/15, Evan R. Sparks wrote: > Mike, > > I believe the reason you're seeing near identical performance on the > gradient computations is twofold > 1) Gradient c
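
A small sketch of a treeAggregate call of the kind being timed in this thread, assuming an existing SparkContext sc; with depth 2 the partial results are combined on executors before reaching the driver:

    val data = sc.parallelize(1 to 1000000).map(_.toDouble)
    val sum = data.treeAggregate(0.0)(
      seqOp  = (acc, x) => acc + x,   // fold values within a partition
      combOp = (a, b)   => a + b,     // merge partial sums across partitions
      depth  = 2)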

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is df.rdd does not work very well with limit. In scala, df.limit(1).rdd will also trigger the issue you observed. I will add this in the jira. On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote: > I just noticed you found 1.4 has the same issue. I added that as well
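
For quick reference, a sketch of the calls being contrasted in this thread; df stands for any DataFrame:

    df.take(1)        // reported extremely slow in the subject (same for head() and first())
    df.limit(1).rdd   // per the reply above, converting a limited DataFrame to an RDD hits the same issue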

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm not sure if it happens als

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
actually a bit. I created a ticket for this (SPARK-10731 <https://issues.apache.org/jira/browse/SPARK-10731>). Best Regards, Jerry On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai wrote: > btw, does 1.4 has the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > &

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
> >> Thanks, >> >> Yin >> >> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: >> >>> Hi Spark Developers, >>> >>> I just ran some very simple operations on a dataset. I was surprised by >>> the execution plan of take(1), head
