Both Spark AM and Client are trying to delete Staging Directory

2017-01-13 Thread Rostyslav Sotnychenko
Hi all! I am a bit confused why Spark AM and Client are both trying to delete Staging Directory. https://github.com/apache/spark/blob/branch-2.1/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1110 https://github.com/apache/spark/blob/branch-2.1/yarn/src/main/scala/org/apache/spark

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen
Yes, certainly debatable for word2vec. You have a good point that this could overrun the 2GB limit if the model is one big datum, for large but not crazy models. This model could probably easily be serialized as individual vectors in this case. It would introduce a backwards-compatibility issue but

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
I guess it depends on the definition of "small". A Word2vec model with vectorSize=300 and vocabulary=3m takes nearly 4gb. While it does fit on a single machine (so isn't really "big" data), I don't see the benefit in having the model stored in one file. On the contrary, it seems that we would want
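
Editorial note: the size quoted above can be checked with a rough back-of-the-envelope calculation. The sketch below assumes each vector component is stored as a 4-byte float and ignores the vocabulary strings and any Parquet overhead. It also reads the "2GB limit" mentioned elsewhere in this thread as 2^31 - 1 bytes. Both are assumptions, not details confirmed by the messages, and the function name is purely illustrative.

```python
# Rough estimate of the raw size of a Word2Vec model's vectors.
# Assumption: 4 bytes per component (32-bit floats); word strings and
# file-format overhead are ignored.
BYTES_PER_FLOAT32 = 4

# Assumption: the "2GB limit" is the largest signed 32-bit byte count.
TWO_GB_LIMIT = 2**31 - 1


def word2vec_vector_bytes(vocabulary: int, vector_size: int,
                          bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the number of bytes needed for all word vectors."""
    return vocabulary * vector_size * bytes_per_value


# Figures quoted in the message: vectorSize=300, vocabulary=3m.
size = word2vec_vector_bytes(vocabulary=3_000_000, vector_size=300)

print(f"raw vectors: {size / 1e9:.1f} GB")
print(f"exceeds 2GB limit as a single datum: {size > TWO_GB_LIMIT}")
```

Under these assumptions the vectors alone come to 3.6 GB, which is consistent with the "nearly 4gb" figure and is well past a 2GB single-record bound.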

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in Spark ML (model coefficients) is "small" - i.e. non-distributed. If you look at ALS you'll see there is no repartitioning, since the factor DataFrames can be large. On Fri, 13 Jan 2017 at 19:42, Sean Owen wrote: > You're referring to code that serializes

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen
You're referring to code that serializes models, which are quite small. For example, a PCA model consists of a few principal component vectors. It's a Dataset of just one element being saved here. It's re-using the code path normally used to save big data sets, to output 1 file with 1 thing as Parque

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
But why is that beneficial? The data is supposedly quite large; distributing it across many partitions/files would seem to make sense. On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen wrote: > That is usually so the result comes out in one file, not partitioned over > n files. > > On Fri, Jan 13, 201

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen
That is usually so the result comes out in one file, not partitioned over n files. On Fri, Jan 13, 2017 at 5:23 PM Asher Krim wrote: > Hi, > > I'm curious why it's common for data to be repartitioned to 1 partition > when saving ml models: > > sqlContext.createDataFrame(Seq(data)).repartition(1

Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
Hi, I'm curious why it's common for data to be repartitioned to 1 partition when saving ml models: sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath) This shows up in most ml models I've seen (Word2Vec

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Sean Owen
A-ha. I'll try to clean them up and ask for new JIRAs to be created in some cases. On Fri, Jan 13, 2017 at 4:15 PM Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > FWIW there is an option to Delete the issue (in More -> Delete). > > Shivaram > >

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Shivaram Venkataraman
FWIW there is an option to Delete the issue (in More -> Delete). Shivaram On Fri, Jan 13, 2017 at 8:11 AM, Shivaram Venkataraman wrote: > I can't see the resolve button either - Maybe we can forward this to > Apache Infra and see if they can close these issues ? > > Shivaram > > On Fri, Jan 13,

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Shivaram Venkataraman
I can't see the resolve button either - Maybe we can forward this to Apache Infra and see if they can close these issues ? Shivaram On Fri, Jan 13, 2017 at 6:35 AM, Sean Owen wrote: > Yes, I'm asking about a specific range: 19191 - 19202. These seem to be the > ones created during the downtime.

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Artur Sukhenko
None of the JIRA issues in the range 19191 - 19202 have a Resolve/Close Issue button. E.g. the SPARK-19214 "Resolve Issue" button has the link https://issues.apache.org/jira/secure/WorkflowUIDispatcher.jspa?id=13034656&action=5&atl_token= ... Meaning that this is something to do with *Workflow* (action 5) (as it ha

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Sean Owen
Yes, I'm asking about a specific range: 19191 - 19202. These seem to be the ones created during the downtime. Most are duplicates or incomplete. On Fri, Jan 13, 2017 at 2:32 PM Artur Sukhenko wrote: > Yes, I can resolve/close SPARK-19214 for example. > > On Fri, Jan 13, 2017 at 4:29 PM Sean Owen

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Artur Sukhenko
Yes, I can resolve/close SPARK-19214 for example. On Fri, Jan 13, 2017 at 4:29 PM Sean Owen wrote: > Do you see a button to resolve other issues? you may not be able to > resolve any of them. I am a JIRA admin though, like most other devs, so > should be able to resolve anything. > > Yes, I cert

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Sean Owen
Do you see a button to resolve other issues? You may not be able to resolve any of them. I am a JIRA admin though, like most other devs, so should be able to resolve anything. Yes, I certainly know how resolving issues works but it's suddenly today only working for a subset of issues, and I bet it

Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Artur Sukhenko
Hello Sean, I can't resolve SPARK-19191 to SPARK-19202 either. I believe this is a bug. Here is the JIRA documentation that describes this or similar problems - How to Edit the Resolution of an Issue On Fri,

Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Sean Owen
Looks like the JIRA maintenance left a bunch of duplicate JIRAs, from SPARK-19191 to SPARK-19202. For some reason, I can't resolve these issues, but I can resolve others. Does anyone else see the same? I know SPARK-19190 was similarly borked but closed by its owner.

Both Spark AM and Client are trying to delete Staging Directory

2017-01-13 Thread Rostyslav Sotnychenko
Hi all! I am a bit confused why Spark AM and Client are both trying to delete Staging Directory. https://github.com/apache/spark/blob/branch-2.1/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1110 https://github.com/apache/spark/blob/branch-2.1/yarn/src/main/scala/org/apache/spa