Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-06 Thread Nick Pentreath
Wow, end of an era! Thanks so much to you Shane for all your work over 10 (!!) years. And to Amplab also! Farewell Spark Jenkins! N On Tue, Dec 7, 2021 at 6:49 AM Nicholas Chammas wrote: > Farewell to Jenkins and its classic weather forecast build status icons: > > [image: health-80plus.png][im

Re: Welcoming six new Apache Spark committers

2021-03-29 Thread Nick Pentreath
Congratulations to all the new committers. Welcome! On Fri, 26 Mar 2021 at 22:22, Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor to

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Nick Pentreath
Congratulations and welcome as Apache Spark committers! On Wed, 15 Jul 2020 at 06:59, Prashant Sharma wrote: > Congratulations all ! It's great to have such committed folks as > committers. :) > > On Wed, Jul 15, 2020 at 9:24 AM Yi Wu wrote: > >> Congrats!! >> >> On Wed, Jul 15, 2020 at 8:02 AM

PySpark .collect() output to Scala Array[Row]

2020-05-25 Thread Nick Ruest
Hi, I've hit a wall with trying to implement a couple of Scala methods in a Python version of our project. My Python function looks like this: def Write_Graphml(data, graphml_path, sc): return sc.getOrCreate()._jvm.io.archivesunleashed.app.WriteGraphML(data, graphml_path).apply Whe
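
The preview cuts off before the error, but the usual stumbling block with this pattern is that the output of a Python-side .collect() is a plain Python list, which py4j will not convert into a Scala Array[Row]. A minimal PySpark sketch of the common workaround, handing the underlying Java DataFrame to the JVM instead, is below; the WriteGraphML call mirrors the snippet above, and its exact signature, the column names and the output path are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Placeholder DataFrame standing in for 'data' above; the real project
    # would build this from its own sources (its JAR must be on the classpath).
    edges = spark.createDataFrame([(1, 2, 0.5), (2, 3, 0.8)], ["src", "dst", "weight"])

    jvm = spark.sparkContext._jvm
    # Pass edges._jdf (the Java Dataset[Row] behind the Python DataFrame)
    # rather than the result of edges.collect(), which py4j cannot turn into
    # a Scala Array[Row]. The WriteGraphML call form is assumed from the thread.
    jvm.io.archivesunleashed.app.WriteGraphML(edges._jdf, "/tmp/graph.graphml").apply()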

Re: [Dataset API] SPARK-27249

2020-01-26 Thread nick
> On Jan 22, 2020, at 8:35 AM, Nick Afshartous wrote: > > Hello, > > I'm looking into starting work on this ticket > > https://issues.apache.org/jira/browse/SPARK-27249 > <https://issues.apache.org/jira/browse/SPARK-27249> > > which involves add

[Dataset API] SPARK-27249

2020-01-22 Thread Nick Afshartous
dvise. Cheers, -- Nick - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Revisiting Online serving of Spark models?

2018-06-05 Thread Nick Pentreath
the meetup that >>> includes the signup link: >>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/* >>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> >>> >>> We have an awesome

Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Nick Pentreath
Congratulations! On Tue, 3 Apr 2018 at 05:34 wangzhenhua (G) wrote: > > > Thanks everyone! It’s my great pleasure to be part of such a professional > and innovative community! > > > > > > best regards, > > -Zhenhua(Xander) > > >

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Nick Pentreath
+1 (binding) Built and ran Scala tests with "-Phadoop-2.6 -Pyarn -Phive", all passed. Python tests passed (also including pyspark-streaming w/kafka-0.8 and flume packages built) On Tue, 27 Feb 2018 at 10:09 Felix Cheung wrote: > +1 > > Tested R: > > install from package, CRAN tests, manual tes

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-14 Thread Nick Pentreath
-1 for me as we elevated https://issues.apache.org/jira/browse/SPARK-23377 to a Blocker. It should be fixed before release. On Thu, 15 Feb 2018 at 07:25 Holden Karau wrote: > If this is a blocker in your view then the vote thread is an important > place to mention it. I'm not super sure all of t

Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it: https://issues.apache.org/jira/browse/SPARK-3155. It is probably still a useful feature to have for trees but the priority is not that high since it may not be that useful for the tree ensemble models. On Tue, 13 Feb 2018 at 11:52 Alessandro Solima

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Nick Pentreath
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that should be everything outstanding. On Thu, 1 Feb 2018 at 06:21 Yin Huai wrote: > seems we are not running tests related to pandas in pyspark tests (see my > email "python tests related to pandas are skipped in jenkins").

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Nick Pentreath
I think this has come up before (and Sean mentions it above), but the sub-items on: SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella are actually marked as Blockers, but are not targeted to 2.3.0. I think they should be, and I'm not comfortable with those not being resolved before voting positivel

Re: CrossValidation distribution - is it in the roadmap?

2017-11-29 Thread Nick Pentreath
Hi Tomasz Parallel evaluation for CrossValidation and TrainValidationSplit was added for Spark 2.3 in https://issues.apache.org/jira/browse/SPARK-19357 On Wed, 29 Nov 2017 at 16:31 Tomasz Dudek wrote: > Hey, > > is there a way to make the following code: > > val paramGrid = new ParamGridBuilde
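
For reference, a minimal PySpark sketch of the parallelism setting added by SPARK-19357 in 2.3; the estimator, grid and input DataFrame names are placeholders.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

    # parallelism (Spark 2.3+) controls how many models are fit and evaluated
    # concurrently; the default of 1 keeps the old serial behaviour.
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3,
                        parallelism=4)
    # cv_model = cv.fit(train_df)  # train_df: DataFrame with "label"/"features" columns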

Re: Timeline for Spark 2.3

2017-11-09 Thread Nick Pentreath
+1 I think that’s practical On Fri, 10 Nov 2017 at 03:13, Erik Erlandson wrote: > +1 on extending the deadline. It will significantly improve the logistics > for upstreaming the Kubernetes back-end. Also agreed, on the general > realities of reduced bandwidth over the Nov-Dec holiday season. >

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Ah yes - I recall that it was fixed. Forgot it was for 2.3.0 My +1 vote stands. On Fri, 6 Oct 2017 at 15:15 Hyukjin Kwon wrote: > Hi Nick, > > I believe that R test failure is due to SPARK-21093, at least the error > message looks the same, and that is fixed from 2.3.0.

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Checked sigs & hashes. Tested on RHEL build/mvn -Phadoop-2.7 -Phive -Pyarn test passed Python tests passed I ran R tests and am getting some failures: https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to recall similar issues on a previous release but I thought it was fixed)

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Nick Pentreath
Ah right! Was using a new cloud instance and didn't realize I was logged in as root! thanks On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin wrote: > Maybe you're running as root (or the admin account on your OS)? > > On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath > wrote: &g

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Nick Pentreath
Hmm I'm consistently getting this error in core tests: - SPARK-3697: ignore directories that cannot be read. *** FAILED *** 2 was not equal to 1 (FsHistoryProviderSuite.scala:146) Anyone else? Any insight? Perhaps it's my set up. >> >> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau wrote: >> >

Configuration docs pages are broken

2017-10-03 Thread Nick Dimiduk
Heya, Looks like the Configuration sections of your docs, both latest [0], and 2.1 [1] are broken. The last couple sections are smashed into a single unrendered paragraph of markdown at the bottom. Thanks, Nick [0]: https://spark.apache.org/docs/latest/configuration.html [1]: https

Re: Should Flume integration be behind a profile?

2017-10-02 Thread Nick Pentreath
I'd agree with #1 or #2. Deprecation now seems fine. Perhaps this should be raised on the user list also? And perhaps it makes sense to look at moving the Flume support into Apache Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the current state of the connector could keep

Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Nick Pentreath
Congratulations! >> >> Matei Zaharia wrote >> > Hi all, >> > >> > The Spark PMC recently added Tejas Patil as a committer on the >> > project. Tejas has been contributing across several areas of Spark for >> > a while, focusing especially on scalability issues and SQL. Please >> > join me in wel

Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for each release. I think generally we catch all breaking and most major behavior changes On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun wrote: > +1 > > On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li wrote: > >> Hi, Devs, >> >> Many

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Nick Pentreath
+1 (binding) On Mon, 3 Jul 2017 at 11:53 Yanbo Liang wrote: > +1 > > On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier < > hvanhov...@databricks.com> wrote: > >> +1 >> >> On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida < >> ricardo.alme...@actnowib.com> wrote: >> >>> +1 (non-bin

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Nick Pentreath
As before, release looks good, all Scala, Python tests pass. R tests fail with same issue in SPARK-21093 but it's not a blocker. +1 (binding) On Wed, 21 Jun 2017 at 01:49 Michael Armbrust wrote: > I will kick off the voting with a +1. > > On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust > wr

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Nick Pentreath
Here is the self-reproducer: >>> >>> irisDF <- suppressWarnings(createDataFrame (iris)) >>> schema <- structType(structField("Sepal_Length", "double"), >>> structField("Avg", "double")) >>> df4 <- gapply( >

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-13 Thread Nick Pentreath
elated > > >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning ( > https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845) > > As for CentOS - would it be possible to test against R older than 3.4.0? > This is the same error reported by Nick below. > >

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-09 Thread Nick Pentreath
All Scala, Python tests pass. ML QA and doc issues are resolved (as well as R it seems). However, I'm seeing the following test failure on R consistently: https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72 On Thu, 8 Jun 2017 at 08:48 Denny Lee wrote: > +1 non-binding > > Tested on

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
Now, on the subject of (ML) QA JIRAs. From the ML side, I believe they are required (I think others such as Joseph will agree and in fact have already said as much). Most are marked as Blockers, though of those the Python API coverage is strictly not a Blocker as we will never hold the release f

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
The website updates for ML QA (SPARK-20507) are not *actually* critical as the project website certainly can be updated separately from the source code guide and is not part of the release to be voted on. In future that particular work item for the QA process could be marked down in priority, and i

Re: RDD MLLib Deprecation Question

2017-05-30 Thread Nick Pentreath
The short answer is those distributed linalg parts will not go away. In the medium term, it's much less likely that the distributed matrix classes will be ported over to DataFrames (though the ideal would be to have DataFrame-backed distributed matrix classes) - given the time and effort it's take

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-19 Thread Nick Pentreath
All the outstanding ML QA doc and user guide items are done for 2.2 so from that side we should be good to cut another RC :) On Thu, 18 May 2017 at 00:18 Russell Spitzer wrote: > Seeing an issue with the DataScanExec and some of our integration tests > for the SCC. Running dataframe read and wri

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1 just given that it seems certain there will be another RC and there are the outstanding ML QA blocker issues. But clean build and test for JVM and Python tests LGTM on CentOS Linux 7.2.1511, OpenJDK 1.8.0_111 On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft wrote: > Hi Ryan, > > IMO,

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759, I don't think that needs to be targeted for 2.1.1 so we don't need to worry about it On Tue, 21 Mar 2017 at 13:49 Holden Karau wrote: > I agree with Michael, I think we've got some outstanding issues but none > of them seem

Re: Should we consider a Spark 2.1.1 release?

2017-03-16 Thread Nick Pentreath
Spark 1.5.1 had 87 issues with that fix version, 1 month after 1.5.0. Spark 1.6.1 had 123 issues, 2 months after 1.6.0. 2.0.1 was larger (317 issues) at 3 months after 2.0.0 - which makes sense given how large a release it was. We are at 185 for 2.1.1, 3 months after (and not released yet so it could slip furt

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-03-04 Thread Nick Pentreath
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from SPARK-19498 specifically to discuss opening up sharedParams traits. On Fri, 3 Mar 2017 at 23:17 Shouheng Yi wrote: > Hi Spark dev list, > > > > Thank you guys so much for all your inputs. We really appreciated those > su

Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
works well in practice. In the meantime, though, there are plenty of > things that we could do to help developers of other libraries to have a > great experience with Spark. Matei alluded to that in his Spark Summit > keynote when he mentioned better integration with low-level libraries. >

Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others have covered the issues well. Overall I like the proposed cleaned up roadmap & process (thanks Joseph!). As for the actual critical roadmap items mentioned on SPARK-18813, I think it makes sense and will comment a bit further

Re: Implementation of RNN/LSTM in Spark

2017-02-23 Thread Nick Pentreath
The short answer is there is none and highly unlikely to be inside of Spark MLlib any time in the near future. The best bets are to look at other DL libraries - for JVM there is Deeplearning4J and BigDL (there are others but these seem to be the most comprehensive I have come across) - that run on

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Nick Pentreath
ose backward compat guarantees. Perhaps now is a good time to think about some of the common shared params for example. Thanks Nick On Wed, 22 Feb 2017 at 22:51 Shouheng Yi wrote: Hi Spark developers, Currently my team at Microsoft is extending Spark’s machine learning functionalities to inclu

Re: Google Summer of Code 2017 is coming

2017-02-05 Thread Nick Pentreath
I think Sean raises valid points - that the result is highly dependent on the particular student, project and mentor involved, and that the actual required time investment is very significant. Having said that, it's not all bad certainly. Scikit-learn started as a GSoC project 10 years ago! Actua

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames *then that seems to point to some other underlying issue as the root cause. Even though adding checkpointing should help, we should understand why it's different between 1.6 and 2.0? On Thu, 2 Feb 2017 at 08:22 Liang

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in Spark ML (model coefficients) is "small", i.e. non-distributed. If you look at ALS you'll see there is no repartitioning since the factor dataframes can be large

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely because HashingTF returns ml vectors while you need mllib vectors for RowMatrix. I'd recommend using the vector conversion utils (I think in mllib.linalg.Vectors but I'm on mobile right now so can't recall exactly). There are util methods for converting single vectors as well as
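
A small PySpark sketch of that conversion, mapping an ml-vector column to old mllib vectors before building a RowMatrix; the DataFrame tf and its "features" column are assumptions standing in for the HashingTF output.

    from pyspark.mllib.linalg import Vectors as OldVectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    # tf is assumed to be the output of ml's HashingTF, with a "features"
    # column of the new ml Vector type; RowMatrix expects old mllib vectors.
    old_rows = tf.select("features").rdd.map(lambda row: OldVectors.fromML(row[0]))
    mat = RowMatrix(old_rows)
    print(mat.numRows(), mat.numCols())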

Re: 2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Nick Pentreath
I went ahead and re-marked all the existing 2.1.1 fix version JIRAs (that had gone into branch-2.1 since RC1 but before RC2) for Spark ML to 2.1.0 On Thu, 8 Dec 2016 at 09:20 Reynold Xin wrote: > Thanks. >

Re: unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread Nick Pentreath
Indeed, it's being tracked here: https://issues.apache.org/jira/browse/SPARK-18230 though no PR has been opened yet. On Tue, 6 Dec 2016 at 13:36 chris snow wrote: > I'm using the MatrixFactorizationModel.predict() method and encountered > the following exception: > > Name: java.util.NoSuchElemen
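
Until SPARK-18230 is fixed, one workaround is to guard the lookup yourself. A hedged PySpark sketch follows; the feature RDDs are part of the public MatrixFactorizationModel API, but the helper itself is hypothetical and does a full scan per call, so pre-collect the known ids if it is called often.

    def safe_predict(model, user, product):
        # predict() throws a bare NoSuchElementException when the user or
        # product never appeared in the training data, so check the factor
        # RDDs first and return None instead of failing.
        has_user = not model.userFeatures().filter(lambda kv: kv[0] == user).isEmpty()
        has_product = not model.productFeatures().filter(lambda kv: kv[0] == product).isEmpty()
        if has_user and has_product:
            return model.predict(user, product)
        return None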

Re: Why don't we imp some adaptive learning rate methods, such as adadelat, adam?

2016-11-30 Thread Nick Pentreath
check out https://github.com/VinceShieh/Spark-AdaOptimizer On Wed, 30 Nov 2016 at 10:52 WangJianfei wrote: > Hi devs: > Normally, the adaptive learning rate methods can have a fast > convergence > then standard SGD, so why don't we imp them? > see the link for more details > http://sebastian

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it would also be super useful :) On Fri, 18 Nov 2016 at 05:29 Holden Karau wrote: > I've been working on a blog post around this and hope to have it published > early next month 😀 > > On Nov 17, 2016 10:16 PM, "Joseph Bradl

Re: Question about using collaborative filtering in MLlib

2016-11-02 Thread Nick Pentreath
I have a PR for it - https://github.com/apache/spark/pull/12574 Sadly I've been tied up and haven't had a chance to work further on it. The main issue outstanding is deciding on the transform semantics as well as performance testing. Any comments / feedback welcome especially on transform semant

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.

Re: Organizing Spark ML example packages

2016-09-12 Thread Nick Pentreath
will make ML examples more neatly. >>> >>> 2016-04-14 5:28 GMT-07:00 Nick Pentreath : >>> >>>> Hey Spark devs >>>> >>>> I noticed that we now have a large number of examples for ML & MLlib in >>>> the examples project - 57

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
g if you are an ML beginner who thinks they need a Transformer > with multiple output columns, you’ve misunderstood something. 😅 > > Nick > ​ >

Re: Java 8

2016-08-20 Thread Nick Pentreath
Spark already supports compiling with Java 8. What refactoring are you referring to, and where do you expect to see performance gains? On Sat, 20 Aug 2016 at 12:41, Timur Shenkao wrote: > Hello, guys! > > Are there any plans / tickets / branches in repository on Java 8? > > I ask because ML libr

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nick Pentreath
ARK-16365). N On Thu, 11 Aug 2016 at 06:28 Michael Allman wrote: > Nick, > > Check out MLeap: https://github.com/TrueCar/mleap. It's not python, but > we use it in production to serve a random forest model trained by a Spark > ML pipeline. > > Thanks, > > Mich

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition ( https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is actually the "number of times the term occurs in the document". So it's perhaps a bit of a
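
A small PySpark sketch of that point: HashingTF (like CountVectorizer) emits raw per-document term counts, and it is the IDF stage that rescales them into TF-IDF weights. The toy sentences are made up.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    spark = SparkSession.builder.getOrCreate()
    docs = spark.createDataFrame(
        [(0, "spark spark mllib"), (1, "term frequency counts")], ["id", "text"])

    words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
    tf = HashingTF(inputCol="words", outputCol="rawFeatures").transform(words)
    # rawFeatures holds raw counts, e.g. 2.0 for the repeated "spark" token.
    tf.select("rawFeatures").show(truncate=False)

    tfidf = IDF(inputCol="rawFeatures", outputCol="features").fit(tf).transform(tf)
    tfidf.select("features").show(truncate=False)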

Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Nick Pentreath
+1 I don't believe there's any reason for the warnings to still be there except for available dev time & focus :) On Wed, 27 Jul 2016 at 21:35, Jacek Laskowski wrote: > Kill 'em all -- one by one slowly yet gradually! :) > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskows

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-07-04 Thread Nick Pentreath
clean them up and see where they stand). If there are other blockers then we should mark them as such to help tracking progress? On Tue, 28 Jun 2016 at 11:28 Nick Pentreath wrote: > I take it there will be another RC due to some blockers and as there were > no +1 votes anyway. > > FW

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-28 Thread Nick Pentreath
) - does anyone else encounter this? ./python/run-tests --python-executables=python2.7 Running PySpark tests. Output is in /Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Pytho

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-24 Thread Nick Pentreath
sts.py", line 69, in __main__.identify_changed_files_from_git_commits Failed example: [x.name for x in determine_modules_for_files( identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] Exception raised: Traceback (most recent call last):

Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Nick Pentreath
Congratulations Yanbo and welcome On Sat, 4 Jun 2016 at 10:17, Hortonworks wrote: > Congratulations, Yanbo > > Zhan Zhang > > Sent from my iPhone > > > On Jun 3, 2016, at 8:39 PM, Dongjoon Hyun wrote: > > > > Congratulations > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for

Re: Cannot build master with sbt

2016-05-25 Thread Nick Pentreath
I've filed https://issues.apache.org/jira/browse/SPARK-15525 For now, you would have to check out sbt-antlr4 at https://github.com/ihji/sbt-antlr4/commit/23eab68b392681a7a09f6766850785afe8dfa53d (since I don't see any branches or tags in the github repo for different versions), and sbt publishLoca

Re: [VOTE] Removing module maintainer process

2016-05-22 Thread Nick Pentreath
+1 (binding) On Mon, 23 May 2016 at 04:19, Matei Zaharia wrote: > Correction, let's run this for 72 hours, so until 9 PM EST May 25th. > > > On May 22, 2016, at 8:34 PM, Matei Zaharia > wrote: > > > > It looks like the discussion thread on this has only had positive > replies, so I'm going to ca

PR for In-App Scheduling

2016-05-18 Thread Nick White
Hi - I raised a PR here: https://github.com/apache/spark/pull/12951 adding a mechanism that prevents starvation when scheduling work within a single application. Could a committer take a look? Thanks - Nick

Re: Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Nick Pentreath
There is a JIRA and PR around for supporting polynomial expansion with degree 1. Offhand I can't recall if it's been merged On Mon, 2 May 2016 at 17:45, Julio Antonio Soto de Vicente wrote: > Hi, > > Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It > would be nice to cross

Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Nick Pentreath
You should find that the first set of fits are called on the training set, and the resulting models evaluated on the validation set. The final best model is then retrained on the entire dataset. This is standard practice - usually the dataset passed to the train validation split is itself further s

Organizing Spark ML example packages

2016-04-14 Thread Nick Pentreath
Hey Spark devs I noticed that we now have a large number of examples for ML & MLlib in the examples project - 57 for ML and 67 for MLLIB to be precise. This is bound to get larger as we add features (though I know there are some PRs to clean up duplicated examples). What do you think about organi

Re: ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Ah I got it - Seq[(Int, Float)] is actually represented as Seq[Row] (seq of struct type) internally. So a further extraction is required, e.g. row => row.getSeq[Row](1).map { r => r.getInt(0) } On Wed, 6 Apr 2016 at 13:35 Nick Pentreath wrote: > Hi there, > > In writing some tes
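
The same representation shows up on the Python side, where an array-of-struct column collects back as a list of Row objects; a tiny sketch with made-up column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0, [(1, 6.0), (1, 4.0)])], ["id", "pairs"])

    # Each element of the array column is a Row (struct), so the inner
    # fields still have to be pulled out explicitly after collect().
    first = df.collect()[0]
    ints = [p[0] for p in first["pairs"]]
    print(ints)  # [1, 1]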

ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Hi there, In writing some tests for a PR I'm working on, with a more complex array type in a DF, I ran into this issue (running off latest master). Any thoughts? *// create DF with a column of Array[(Int, Double)]* val df = sc.parallelize(Seq( (0, Array((1, 6.0), (1, 4.0))), (1, Array((1, 3.0),

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Nick Pentreath
+1 for this proposal - as you mention I think it's the de facto current situation anyway. Note that from a developer view it's just the user-facing API that will be only "ml" - the majority of the actual algorithms still operate on RDDs under the hood currently. On Wed, 6 Apr 2016 at 05:03, Chris F

Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Nick Pentreath
No, I didn't yet - feel free to create a JIRA. On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann wrote: > Hi Nick, > > Thanks again for your help with this. Did you create a ticket in JIRA for > investigating sparse models in LR and / or multivariate summariser? If so, > can y

Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Also adding dev list in case anyone else has ideas / views. On Sat, 12 Mar 2016 at 12:52, Nick Pentreath wrote: > Thanks for the feedback. > > I think Spark can certainly meet your use case when your data size scales > up, as the actual model dimension is very small - you will

Re: Running ALS on comparitively large RDD

2016-03-11 Thread Nick Pentreath
ically have around a million ratings > 2. Spark 1.6 on Amazon EMR > > On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath > wrote: > >> Could you provide more details about: >> 1. Data set size (# ratings, # users and # products) >> 2. Spark cluster set up and version >

Re: Running ALS on comparitively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about: 1. Data set size (# ratings, # users and # products) 2. Spark cluster set up and version Thanks On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote: > Hello All, > > I've been running Spark's ALS on a dataset of users and rated items. I > first encode

Re: ML ALS API

2016-03-08 Thread Nick Pentreath
Hi Maciej Yes, that *train* method is intended to be public, but it is marked as *DeveloperApi*, which means that backward compatibility is not necessarily guaranteed, and that method may change. Having said that, even APIs marked as DeveloperApi do tend to be relatively stable. As the comment me

Re: Proposal

2016-01-30 Thread Nick Pentreath
Hi there Sounds like a fun project :) I'd recommend getting familiar with the existing k-means implementation as well as bisecting k-means in Spark, and then implementing yours based off that. You should focus on using the new ML pipelines API, and release it as a package on spark-packages.org

Re: Elasticsearch sink for metrics

2016-01-15 Thread Nick Pentreath
I haven't come across anything, but could you provide more detail on what issues you're encountering? On Fri, Jan 15, 2016 at 11:09 AM, Pete Robbins wrote: > Has anyone tried pushing Spark metrics into elasticsearch? We have other > metrics, eg some runtime information, going into ES and would

Re: Write access to wiki

2016-01-12 Thread Nick Pentreath
I'd also like to get Wiki write access - at the least it allows a few of us to amend the "Powered By" and similar pages when those requests come through (Sean has been doing a lot of that recently :) On Mon, Jan 11, 2016 at 11:01 PM, Sean Owen wrote: > ... I forget who can give access -- is it I

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
locally now. > Is the AWS SDK not used for reading/writing from S3 or do we get that for > free from the Hadoop dependencies? > On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath > wrote: >> cc'ing dev list >> >> Ok, looks like when the KCL version was updated in &

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
g on KCL. Is aws-java-sdk used anywhere else (AFAIK it is not, but in case I missed something let me know any good reason to keep the explicit dependency)? N On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath wrote: > Yeah also the integration tests need to be specifically run - I would have >

Spark Streaming Kinesis - DynamoDB Streams compatability

2015-12-10 Thread Nick Pentreath
Hi Spark users & devs I was just wondering if anyone out there has interest in DynamoDB Streams ( http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) as an input source for Spark Streaming Kinesis? Because DynamoDB Streams provides an adaptor client that works with the K

Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
wse/SPARK-10963 > > I think adding any ZK specific behavior to spark is a bad idea, since ZK > may no longer be the preferred storage location for Kafka offsets within > the next year. > > > > On Mon, Nov 16, 2015 at 9:53 AM, Nick Evans wrote: > >> I really like the S

Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
I really like the Streaming receiverless API for Kafka streaming jobs, but I'm finding the manual offset management adds a fair bit of complexity. I'm sure that others feel the same way, so I'm proposing that we add the ability to have consumer offsets managed via an easy-to-use API. This would be

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Nick Pentreath
Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop input/output format support: https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java On Mon, Nov 16, 2015 at 7:52 AM, Reynold Xin wrote: > This (updates) is som

Re: ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Nick Pentreath
Seems a straightforward change that purely enhances efficiency, so yes please submit a JIRA and PR for this On Tue, Nov 10, 2015 at 8:56 AM, Sean Owen wrote: > Since it's a fairly expensive operation to build the Map, I tend to agree > it should not happen in the loop. > > On Tue, Nov 10, 2015 a

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
3, 2015 at 12:09 AM, Yin Huai wrote: > Hi Nick, > The buffer exposed to UDAF interface is just a view of underlying buffer > (this underlying buffer is shared by different aggregate functions and > every function takes one or multiple slots). If you need a UDAF, extending > UserDefi

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
gt; 2353 DC Leiderdorp > hvanhov...@questtec.nl > +599 9 521 4402 > > > 2015-09-12 11:06 GMT+02:00 Nick Pentreath : > >> I should add that surely the idea behind UDT is exactly that it can (a) >> fit automatically into DFs and Tungsten and (b) that it can be used &g

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
I should add that surely the idea behind UDT is exactly that it can (a) fit automatically into DFs and Tungsten and (b) that it can be used efficiently in writing ones own UDTs and UDAFs? On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath wrote: > Can I ask why you've done this as

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
rman van Hövell tot Westerflier < hvanhov...@questtec.nl> wrote: > Hello Nick, > > I have been working on a (UDT-less) implementation of HLL++. You can find > the PR here: https://github.com/apache/spark/pull/8362. This current > implements the dense version of HLL++, which i

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
thoughts on exposing that type? Or I need to make the package spark.sql ... Nick On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin wrote: > Yes - it's very interesting. However, ideally we should have a version of > hyperloglog that can work directly against some raw bytes in memory (rathe

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
ache/spark/rdd/RDD.scala#L1153> > and > access the HLL directly, or do anything you like. > On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath > wrote: >> Any thoughts? >> >> — >> Sent from Mailbox <https://www.dropbox.com/mailbox> >> >> >> On T

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Any thoughts? — Sent from Mailbox On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath wrote: > Hey Spark devs > I've been looking at DF UDFs and UDAFs. The approx distinct is using > hyperloglog, > but there is only an option to return the count as a Long. > It can be

HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
d apply potentially to other approximate (and mergeable) data structures like T-Digest and maybe CMS. Nick

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-18 Thread Nick Pentreath
If it's going into the DataFrame API (which it probably should rather than in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT) which would mean it doesn't have to implement Serializable, as it appears that serialization is taken care of in the UDT def (e.g. https://github.com/ap

Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-10 Thread Nick Pentreath
Looks very interesting, thanks for sharing this. I haven't had much chance to do more than a quick glance over the code. Quick question - are the Word2Vec and GLOVE implementations fully parallel on Spark? On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright wrote: > > The deeplearning4j framework provi

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nick Pentreath
+1 for this, I think it's high time. We should of course do it with enough warning for users. 1.4 may be too early (not for me though!). Perhaps we specify that 1.5 will officially move to JDK7? — Sent from Mailbox On Fri, May 1, 2015 at 12:16 AM, Ram Sriharsha wrote: > +1 for end of

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point about reading multiple files together in a partition, is it not simpler to use the approach of copying the Hadoop conf and setting per-RDD min split size to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM, Imran
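
A rough PySpark sketch of that per-RDD approach, passing a one-off Hadoop configuration so only this input gets the larger minimum split size; the path and size are placeholders and sc is an existing SparkContext.

    # Only this RDD sees the overridden minimum split size; other inputs keep
    # the cluster-wide Hadoop configuration.
    per_rdd_conf = {"mapreduce.input.fileinputformat.split.minsize": str(128 * 1024 * 1024)}

    lines = sc.newAPIHadoopFile(
        "hdfs:///data/many-small-files/*",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=per_rdd_conf)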

Re: Directly broadcasting (sort of) RDDs

2015-03-20 Thread Nick Pentreath
There is BlockMatrix in Spark 1.3 - http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix However I believe it only supports dense matrix blocks. Still, it might be possible to use it or extend it. JIRAs: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-
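
For reference, a later-API PySpark sketch of the BlockMatrix route (the Python bindings arrived after the 1.3 Scala API discussed above, and block contents are dense); sc is an existing SparkContext.

    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    rows = sc.parallelize([IndexedRow(0, [1.0, 2.0]),
                           IndexedRow(1, [3.0, 4.0])])
    # Convert to a BlockMatrix so the distributed multiply can be used.
    block_mat = IndexedRowMatrix(rows).toBlockMatrix(rowsPerBlock=1, colsPerBlock=2)
    gram = block_mat.multiply(block_mat.transpose())
    print(gram.toLocalMatrix())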

Re: Welcoming three new committers

2015-02-04 Thread Nick Pentreath
Congrats and welcome Sean, Joseph and Cheng! On Wed, Feb 4, 2015 at 2:10 PM, Sean Owen wrote: > Thanks all, I appreciate the vote of trust. I'll do my best to help > keep JIRA and commits moving along, and am ramping up carefully this > week. Now get back to work reviewing things! > > On Tue, F

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
7:06 PM, Ted Yu wrote: > HBaseConverter is in Spark source tree. Therefore I think it makes sense > for this improvement to be accepted so that the example is more useful. > Cheers > On Mon, Jan 5, 2015 at 7:54 AM, Nick Pentreath > wrote: >> Hey >> >> These converters

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
Hey  These converters are actually just intended to be examples of how to set up a custom converter for a specific input format. The converter interface is there to provide flexibility where needed, although with the new SparkSQL data store interface the intention is that most common use cases
