Re: Welcoming six new Apache Spark committers

2021-03-29 Thread Nick Pentreath
Congratulations to all the new committers. Welcome! On Fri, 26 Mar 2021 at 22:22, Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor to

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Nick Pentreath
Wow! End of an era. Thanks so much to you, Shane, for all your work over 10 (!!) years. And to AMPLab also! Farewell, Spark Jenkins! N On Tue, Dec 7, 2021 at 6:49 AM Nicholas Chammas wrote: > Farewell to Jenkins and its classic weather forecast build status icons: > > [image: health-80plus.png][im

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Nick Pentreath
Congratulations and welcome as Apache Spark committers! On Wed, 15 Jul 2020 at 06:59, Prashant Sharma wrote: > Congratulations all ! It's great to have such committed folks as > committers. :) > > On Wed, Jul 15, 2020 at 9:24 AM Yi Wu wrote: > >> Congrats!! >> >> On Wed, Jul 15, 2020 at 8:02 AM

Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Nick Pentreath
Congratulations! >> >> Matei Zaharia wrote >> > Hi all, >> > >> > The Spark PMC recently added Tejas Patil as a committer on the >> > project. Tejas has been contributing across several areas of Spark for >> > a while, focusing especially on scalability issues and SQL. Please >> > join me in wel

Re: Should Flume integration be behind a profile?

2017-10-02 Thread Nick Pentreath
I'd agree with #1 or #2. Deprecation now seems fine. Perhaps this should be raised on the user list also? And perhaps it makes sense to look at moving the Flume support into Apache Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the current state of the connector could keep

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Nick Pentreath
Hmm I'm consistently getting this error in core tests: - SPARK-3697: ignore directories that cannot be read. *** FAILED *** 2 was not equal to 1 (FsHistoryProviderSuite.scala:146) Anyone else? Any insight? Perhaps it's my set up. >> >> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau wrote: >> >

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Nick Pentreath
Ah right! Was using a new cloud instance and didn't realize I was logged in as root! thanks On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin wrote: > Maybe you're running as root (or the admin account on your OS)? > > On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath > wrote: &g

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Checked sigs & hashes. Tested on RHEL build/mvn -Phadoop-2.7 -Phive -Pyarn test passed Python tests passed I ran R tests and am getting some failures: https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to recall similar issues on a previous release but I thought it was fixed)

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
l#known-issues > before due to this reason. > I believe It should be fine and probably we should note if possible. I > believe this should not be a regression anyway as, if I understood > correctly, it was there from the very first place. > > Thanks. > > > > > 2017-10

Re: Timeline for Spark 2.3

2017-11-09 Thread Nick Pentreath
+1 I think that’s practical On Fri, 10 Nov 2017 at 03:13, Erik Erlandson wrote: > +1 on extending the deadline. It will significantly improve the logistics > for upstreaming the Kubernetes back-end. Also agreed, on the general > realities of reduced bandwidth over the Nov-Dec holiday season. >

Re: CrossValidation distribution - is it in the roadmap?

2017-11-29 Thread Nick Pentreath
Hi Tomasz Parallel evaluation for CrossValidation and TrainValidationSplit was added for Spark 2.3 in https://issues.apache.org/jira/browse/SPARK-19357 On Wed, 29 Nov 2017 at 16:31 Tomasz Dudek wrote: > Hey, > > is there a way to make the following code: > > val paramGrid = new ParamGridBuilde
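The gist of SPARK-19357 is that the candidate models in the parameter grid are fitted and evaluated concurrently rather than one at a time. A minimal sketch of that idea in plain Python (not Spark's API; `fit_and_score` is a toy stand-in for fitting and evaluating one candidate):

```python
from concurrent.futures import ThreadPoolExecutor

def fit_and_score(params, train, valid):
    # toy stand-in for fitting a model on `train` and scoring it on `valid`:
    # pretend the "model" predicts a * 2 + b and we want it close to sum(valid)
    a, b = params
    return -abs(a * 2 + b - sum(valid))

def parallel_grid_search(grid, train, valid, parallelism=4):
    # evaluate all candidate parameter combinations concurrently
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(lambda p: fit_and_score(p, train, valid), grid))
    best = max(range(len(grid)), key=lambda i: scores[i])
    return grid[best], scores[best]

grid = [(a, b) for a in (0, 1, 2) for b in (0, 1)]
best_params, best_score = parallel_grid_search(grid, train=[1, 2], valid=[3])
```

In Spark 2.3 this is exposed via a parallelism setting on the validator itself; the sketch above only shows why the fits are independent and safe to run concurrently.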

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Nick Pentreath
I think this has come up before (and Sean mentions it above), but the sub-items on: SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella are actually marked as Blockers, but are not targeted to 2.3.0. I think they should be, and I'm not comfortable with those not being resolved before voting positivel

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Nick Pentreath
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that should be everything outstanding. On Thu, 1 Feb 2018 at 06:21 Yin Huai wrote: > seems we are not running tests related to pandas in pyspark tests (see my > email "python tests related to pandas are skipped in jenkins").

Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it: https://issues.apache.org/jira/browse/SPARK-3155. It is probably still a useful feature to have for trees but the priority is not that high since it may not be that useful for the tree ensemble models. On Tue, 13 Feb 2018 at 11:52 Alessandro Solima

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-14 Thread Nick Pentreath
-1 for me as we elevated https://issues.apache.org/jira/browse/SPARK-23377 to a Blocker. It should be fixed before release. On Thu, 15 Feb 2018 at 07:25 Holden Karau wrote: > If this is a blocker in your view then the vote thread is an important > place to mention it. I'm not super sure all of t

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Nick Pentreath
+1 (binding) Built and ran Scala tests with "-Phadoop-2.6 -Pyarn -Phive", all passed. Python tests passed (also including pyspark-streaming w/kafka-0.8 and flume packages built) On Tue, 27 Feb 2018 at 10:09 Felix Cheung wrote: > +1 > > Tested R: > > install from package, CRAN tests, manual tes

Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Nick Pentreath
Congratulations! On Tue, 3 Apr 2018 at 05:34 wangzhenhua (G) wrote: > > > Thanks everyone! It’s my great pleasure to be part of such a professional > and innovative community! > > > > > > best regards, > > -Zhenhua(Xander) > > >

Re: Revisiting Online serving of Spark models?

2018-06-05 Thread Nick Pentreath
I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it. On Sun, 3 Jun 2018 at 00:24 Holden Karau wrote: > On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice < > maximilianofel...@gmail.com> wrote: > >> Hi! >> >> We're already in San Francisco waiting for the summit. We even think th

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-07-04 Thread Nick Pentreath
clean them up and see where they stand). If there are other blockers then we should mark them as such to help tracking progress? On Tue, 28 Jun 2016 at 11:28 Nick Pentreath wrote: > I take it there will be another RC due to some blockers and as there were > no +1 votes anyway. > > FW

Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Nick Pentreath
+1 I don't believe there's any reason for the warnings to still be there except for available dev time & focus :) On Wed, 27 Jul 2016 at 21:35, Jacek Laskowski wrote: > Kill 'em all -- one by one slowly yet gradually! :) > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskows

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition ( https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is actually the "number of times the term occurs in the document". So it's perhaps a bit of a
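To make the definition concrete, here is a small self-contained sketch of TF-IDF using raw term counts for TF, with the smoothed idf variant `log((n + 1) / (df + 1))` (plain Python, not the MLlib implementation; the corpus is made up for illustration):

```python
import math
from collections import Counter

def term_frequencies(doc):
    # "number of times the term occurs in the document" -- raw counts
    return Counter(doc)

def inverse_doc_frequencies(corpus):
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    # smoothed idf: log((n + 1) / (df + 1)) avoids division by zero
    return {t: math.log((n + 1) / (c + 1)) for t, c in df.items()}

def tf_idf(doc, idf):
    tfs = term_frequencies(doc)
    return {t: count * idf.get(t, 0.0) for t, count in tfs.items()}

corpus = [["spark", "ml", "spark"], ["spark", "sql"]]
idf = inverse_doc_frequencies(corpus)
vec = tf_idf(corpus[0], idf)  # "spark" occurs in every doc, so its idf is 0
```

Note how a term that appears in every document ("spark" here) gets weight 0 regardless of its raw count: the idf factor, not the tf, is what down-weights ubiquitous terms.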

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nick Pentreath
Currently there is no direct way in Spark to serve models without bringing in all of Spark as a dependency. For Spark ML, there is actually no way to do it independently of DataFrames either (which for single-instance prediction makes things sub-optimal). That is covered here: https://issues.apach

Re: Java 8

2016-08-20 Thread Nick Pentreath
Spark already supports compiling with Java 8. What refactoring are you referring to, and where do you expect to see performance gains? On Sat, 20 Aug 2016 at 12:41, Timur Shenkao wrote: > Hello, guys! > > Are there any plans / tickets / branches in repository on Java 8? > > I ask because ML libr

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
It's not impossible that a Transformer could output multiple columns - it's simply because none of the current ones do. It's true that it might be a relatively less common use case in general. But take StringIndexer for example. It turns strings (categorical features) into ints (0-based indexes).
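The StringIndexer behavior described above can be sketched in a few lines of plain Python (illustrative only; the alphabetical tie-break below is an assumption of the sketch, not Spark's documented contract):

```python
from collections import Counter

def fit_string_indexer(values):
    # most frequent label gets index 0; ties broken alphabetically here
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}

labels = ["a", "b", "b", "c", "b", "a"]
index = fit_string_indexer(labels)          # {"b": 0, "a": 1, "c": 2}
encoded = [index[v] for v in labels]        # categorical strings -> 0-based ints
```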

Re: Organizing Spark ML example packages

2016-09-12 Thread Nick Pentreath
will make ML examples more neatly. >>> >>> 2016-04-14 5:28 GMT-07:00 Nick Pentreath : >>> >>>> Hey Spark devs >>>> >>>> I noticed that we now have a large number of examples for ML & MLlib in >>>> the examples project - 57

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.
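A small sketch of NDCG with arbitrary graded relevance scores, the generalization discussed above (plain Python, standard log2 position discount; not the RankingMetrics implementation):

```python
import math

def dcg(relevances):
    # graded relevance with a log2 position discount (rank 1 divides by log2(2))
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k):
    # normalize by the DCG of the ideal (descending-relevance) ordering
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1], k=3)    # already ideally ordered
imperfect = ndcg_at_k([1, 2, 3], k=3)  # most relevant item ranked last
```

With binary 0/1 relevance this reduces to the recommender-style metric; allowing arbitrary scores is what makes it usable for search and learning-to-rank settings.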

Re: Question about using collaborative filtering in MLlib

2016-11-02 Thread Nick Pentreath
I have a PR for it - https://github.com/apache/spark/pull/12574 Sadly I've been tied up and haven't had a chance to work further on it. The main issue outstanding is deciding on the transform semantics as well as performance testing. Any comments / feedback welcome especially on transform semant

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it would also be super useful :) On Fri, 18 Nov 2016 at 05:29 Holden Karau wrote: > I've been working on a blog post around this and hope to have it published > early next month 😀 > > On Nov 17, 2016 10:16 PM, "Joseph Bradl

Re: Why don't we imp some adaptive learning rate methods, such as adadelat, adam?

2016-11-30 Thread Nick Pentreath
check out https://github.com/VinceShieh/Spark-AdaOptimizer On Wed, 30 Nov 2016 at 10:52 WangJianfei wrote: > Hi devs: > Normally, the adaptive learning rate methods can have a fast > convergence > then standard SGD, so why don't we imp them? > see the link for more details > http://sebastian

Re: unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread Nick Pentreath
Indeed, it's being tracked here: https://issues.apache.org/jira/browse/SPARK-18230 though no PR has been opened yet. On Tue, 6 Dec 2016 at 13:36 chris snow wrote: > I'm using the MatrixFactorizationModel.predict() method and encountered > the following exception: > > Name: java.util.NoSuchElemen

Re: 2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Nick Pentreath
I went ahead and re-marked all the existing 2.1.1 fix version JIRAs (that had gone into branch-2.1 since RC1 but before RC2) for Spark ML to 2.1.0 On Thu, 8 Dec 2016 at 09:20 Reynold Xin wrote: > Thanks. >

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely because HashingTF returns ml vectors while you need mllib vectors for RowMatrix. I'd recommend using the vector conversion utils (I think in mllib.linalg.Vectors, but I'm on mobile right now so can't recall exactly). There are util methods for converting single vectors as well as

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in Spark ML (model coefficients) is "small" - i.e. non-distributed. If you look at ALS you'll see there is no repartitioning since the factor dataframes can be large On Fri, 13 Jan 2017 at 19:42, Sean Owen wrote: > You're referring to code that serializes

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames *then that seems to point to some other underlying issue as the root cause. Even though adding checkpointing should help, we should understand why it's different between 1.6 and 2.0? On Thu, 2 Feb 2017 at 08:22 Liang

Re: Google Summer of Code 2017 is coming

2017-02-05 Thread Nick Pentreath
I think Sean raises valid points - that the result is highly dependent on the particular student, project and mentor involved, and that the actual required time investment is very significant. Having said that, it's not all bad certainly. Scikit-learn started as a GSoC project 10 years ago! Actua

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Nick Pentreath
Currently your only option is to write (or copy) your own implementations. Logging is definitely intended to be internal use only, and it's best to use your own logging lib - Typesafe scalalogging is a common option that I've used. As for the VectorUDT, for now that is private. There are no plans

Re: Implementation of RNN/LSTM in Spark

2017-02-23 Thread Nick Pentreath
The short answer is there is none and highly unlikely to be inside of Spark MLlib any time in the near future. The best bets are to look at other DL libraries - for JVM there is Deeplearning4J and BigDL (there are others but these seem to be the most comprehensive I have come across) - that run on

Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others have covered the issues well. Overall I like the proposed cleaned up roadmap & process (thanks Joseph!). As for the actual critical roadmap items mentioned on SPARK-18813, I think it makes sense and will comment a bit further

Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
works well in practice. In the meantime, though, there are plenty of > things that we could do to help developers of other libraries to have a > great experience with Spark. Matei alluded to that in his Spark Summit > keynote when he mentioned better integration with low-level libraries. >

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-03-04 Thread Nick Pentreath
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from SPARK-19498 specifically to discuss opening up sharedParams traits. On Fri, 3 Mar 2017 at 23:17 Shouheng Yi wrote: > Hi Spark dev list, > > > > Thank you guys so much for all your inputs. We really appreciated those > su

Re: Should we consider a Spark 2.1.1 release?

2017-03-16 Thread Nick Pentreath
Spark 1.5.1 had 87 issues with that fix version, 1 month after 1.5.0. Spark 1.6.1 had 123 issues, 2 months after 1.6.0. 2.0.1 was larger (317 issues) at 3 months after 2.0.0 - makes sense due to how large a release it was. We are at 185 for 2.1.1 and 3 months after (and not released yet so it could slip furt

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759, I don't think that needs to be targeted for 2.1.1 so we don't need to worry about it On Tue, 21 Mar 2017 at 13:49 Holden Karau wrote: > I agree with Michael, I think we've got some outstanding issues but none > of them seem

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1 just given that it seems certain there will be another RC and there are the outstanding ML QA blocker issues. But clean build and test for JVM and Python tests LGTM on CentOS Linux 7.2.1511, OpenJDK 1.8.0_111 On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft wrote: > Hi Ryan, > > IMO,

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-19 Thread Nick Pentreath
All the outstanding ML QA doc and user guide items are done for 2.2 so from that side we should be good to cut another RC :) On Thu, 18 May 2017 at 00:18 Russell Spitzer wrote: > Seeing an issue with the DataScanExec and some of our integration tests > for the SCC. Running dataframe read and wri

Re: RDD MLLib Deprecation Question

2017-05-30 Thread Nick Pentreath
The short answer is those distributed linalg parts will not go away. In the medium term, it's much less likely that the distributed matrix classes will be ported over to DataFrames (though the ideal would be to have DataFrame-backed distributed matrix classes) - given the time and effort it's take

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
The website updates for ML QA (SPARK-20507) are not *actually* critical as the project website certainly can be updated separately from the source code guide and is not part of the release to be voted on. In future that particular work item for the QA process could be marked down in priority, and i

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
Now, on the subject of (ML) QA JIRAs. From the ML side, I believe they are required (I think others such as Joseph will agree and in fact have already said as much). Most are marked as Blockers, though of those the Python API coverage is strictly not a Blocker as we will never hold the release f

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-09 Thread Nick Pentreath
All Scala, Python tests pass. ML QA and doc issues are resolved (as well as R it seems). However, I'm seeing the following test failure on R consistently: https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72 On Thu, 8 Jun 2017 at 08:48 Denny Lee wrote: > +1 non-binding > > Tested on

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-13 Thread Nick Pentreath
_ > From: Hyukjin Kwon > Sent: Tuesday, June 13, 2017 8:02 PM > > Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4) > To: dev > Cc: Sean Owen , Nick Pentreath < > nick.pentre...@gmail.com>, Felix Cheung > > > > For the test failure on R,

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Nick Pentreath
Here is the self-reproducer: >>> >>> irisDF <- suppressWarnings(createDataFrame (iris)) >>> schema <- structType(structField("Sepal_Length", "double"), >>> structField("Avg", "double")) >>> df4 <- gapply( >

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Nick Pentreath
As before, release looks good, all Scala, Python tests pass. R tests fail with same issue in SPARK-21093 but it's not a blocker. +1 (binding) On Wed, 21 Jun 2017 at 01:49 Michael Armbrust wrote: > I will kick off the voting with a +1. > > On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust > wr

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Nick Pentreath
+1 (binding) On Mon, 3 Jul 2017 at 11:53 Yanbo Liang wrote: > +1 > > On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier < > hvanhov...@databricks.com> wrote: > >> +1 >> >> On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida < >> ricardo.alme...@actnowib.com> wrote: >> >>> +1 (non-bin

Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for each release. I think generally we catch all breaking and most major behavior changes On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun wrote: > +1 > > On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li wrote: > >> Hi, Devs, >> >> Many

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Nick Pentreath
+1 (binding) — Sent from Mailbox On Thu, Nov 6, 2014 at 6:52 PM, Debasish Das wrote: > +1 > The app to track PRs based on component is a great idea... > On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara > wrote: >> +1 >> >> Sean >> >> On Nov 5, 2014, at 6:32 PM, Matei Zaharia wrote: >> >> > Hi al

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Nick Pentreath
+1 — Sent from Mailbox On Sat, Dec 13, 2014 at 3:12 PM, GuoQiang Li wrote: > +1 (non-binding). Tested on CentOS 6.4 > -- Original -- > From: "Patrick Wendell";; > Date: Thu, Dec 11, 2014 05:08 AM > To: "dev@spark.apache.org"; > Subject: [VOTE] Release Apac

Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
I'm sure Spark will sign up for GSoC again this year - and I'd be surprised if there was not some interest now for projects :) If I have the time at that point in the year I'd be happy to mentor a project in MLlib but will have to see how my schedule is at that point! Manoj perhaps some of t

Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
Oh actually I was confused with another project, yours was not LSH sorry! — Sent from Mailbox On Fri, Jan 2, 2015 at 8:19 AM, Nick Pentreath wrote: > I'm sure Spark will sign up for GSoC again this year - and id be surprised if > there was not some interest now for project

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
Hey, these converters are actually just intended to be examples of how to set up a custom converter for a specific input format. The converter interface is there to provide flexibility where needed, although with the new SparkSQL data store interface the intention is that most common use cases

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
7:06 PM, Ted Yu wrote: > HBaseConverter is in Spark source tree. Therefore I think it makes sense > for this improvement to be accepted so that the example is more useful. > Cheers > On Mon, Jan 5, 2015 at 7:54 AM, Nick Pentreath > wrote: >> Hey >> >> These converters

Re: Welcoming three new committers

2015-02-04 Thread Nick Pentreath
Congrats and welcome Sean, Joseph and Cheng! On Wed, Feb 4, 2015 at 2:10 PM, Sean Owen wrote: > Thanks all, I appreciate the vote of trust. I'll do my best to help > keep JIRA and commits moving along, and am ramping up carefully this > week. Now get back to work reviewing things! > > On Tue, F

Re: Directly broadcasting (sort of) RDDs

2015-03-20 Thread Nick Pentreath
There is BlockMatrix in Spark 1.3 - http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix However I believe it only supports dense matrix blocks. Still, might be possible to use it or extend it. JIRAs: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point to read multiple files together in a partition, is it not simpler to use the approach of copy Hadoop conf and set per-RDD settings for min split to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM, Imran

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nick Pentreath
+1 for this, think it's high time. We should of course do it with enough warning for users. 1.4 may be too early (not for me though!). Perhaps we specify that 1.5 will officially move to JDK7? — Sent from Mailbox

Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-10 Thread Nick Pentreath
Looks very interesting, thanks for sharing this. I haven't had much chance to do more than a quick glance over the code. Quick question - are the Word2Vec and GLOVE implementations fully parallel on Spark? On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright wrote: > > The deeplearning4j framework provi

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-18 Thread Nick Pentreath
If it's going into the DataFrame API (which it probably should rather than in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT) which would mean it doesn't have to implement Serializable, as it appears that serialization is taken care of in the UDT def (e.g. https://github.com/ap

HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
Hey Spark devs I've been looking at DF UDFs and UDAFs. The approx distinct is using hyperloglog, but there is only an option to return the count as a Long. It can be useful to be able to return and store the actual data structure (ie serialized HLL). This effectively allows one to do aggregation
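The point of returning the serialized structure rather than just the Long count is that partial sketches can be stored, merged and re-aggregated later without rescanning the data. A toy HyperLogLog in plain Python illustrating that merge property (illustrative only; not Spark's implementation, and the register count and correction logic are simplified):

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog: 2**p registers, each tracking a max leading-zero rank."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # register-wise max: partial sketches combine losslessly, which is
        # exactly why storing the structure beats storing only a Long count
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

a, b = HLL(), HLL()
for i in range(5000):
    a.add(i)
for i in range(2500, 7500):
    b.add(i)
a.merge(b)            # combined sketch now covers 7500 distinct values
est = a.estimate()    # approximately 7500, within a few percent
```

Two sketches built over overlapping partitions merge into one that estimates the union's cardinality, which is the aggregation-over-arbitrary-dimensions use case described above.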

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Any thoughts? — Sent from Mailbox On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath wrote: > Hey Spark devs > I've been looking at DF UDFs and UDAFs. The approx distinct is using > hyperloglog, > but there is only an option to return the count as a Long. > It can be

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
ache/spark/rdd/RDD.scala#L1153> > and > access the HLL directly, or do anything you like. > On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath > wrote: >> Any thoughts? >> >> — >> Sent from Mailbox <https://www.dropbox.com/mailbox> >> >> >> On T

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
r > than java objects), in order for this to fit the Tungsten execution model > where everything is operating directly against some memory address. > > On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath > wrote: > >> Sure I can copy the code but my aim was more to understand:

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
enwacht 98 > 2353 DC Leiderdorp > hvanhov...@questtec.nl > +599 9 521 4402 > > > 2015-09-12 10:07 GMT+02:00 Nick Pentreath : > >> Inspired by this post: >> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperlog

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
I should add that surely the idea behind UDT is exactly that it can (a) fit automatically into DFs and Tungsten and (b) that it can be used efficiently in writing ones own UDTs and UDAFs? On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath wrote: > Can I ask why you've done this as

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
gt; 2353 DC Leiderdorp > hvanhov...@questtec.nl > +599 9 521 4402 > > > 2015-09-12 11:06 GMT+02:00 Nick Pentreath : > >> I should add that surely the idea behind UDT is exactly that it can (a) >> fit automatically into DFs and Tungsten and (b) that it can be used &g

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
nedAggregationFunction is the preferred > approach. AggregateFunction2 is used for built-in aggregate function. > Thanks, > Yin > On Sat, Sep 12, 2015 at 10:40 AM, Nick Pentreath > wrote: >> Ok, that makes sense. So this is (a) more efficient, since as far as I can >>

Re: ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Nick Pentreath
Seems a straightforward change that purely enhances efficiency, so yes please submit a JIRA and PR for this On Tue, Nov 10, 2015 at 8:56 AM, Sean Owen wrote: > Since it's a fairly expensive operation to build the Map, I tend to agree > it should not happen in the loop. > > On Tue, Nov 10, 2015 a

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Nick Pentreath
Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop input/output format support: https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java On Mon, Nov 16, 2015 at 7:52 AM, Reynold Xin wrote: > This (updates) is som

Spark Streaming Kinesis - DynamoDB Streams compatability

2015-12-10 Thread Nick Pentreath
Hi Spark users & devs I was just wondering if anyone out there has interest in DynamoDB Streams ( http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) as an input source for Spark Streaming Kinesis? Because DynamoDB Streams provides an adaptor client that works with the K

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
g on KCL. Is aws-java-sdk used anywhere else (AFAIK it is not, but in case I missed something let me know any good reason to keep the explicit dependency)? N On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath wrote: > Yeah also the integration tests need to be specifically run - I would have >

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
locally now. > Is the AWS SDK not used for reading/writing from S3 or do we get that for > free from the Hadoop dependencies? > On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath > wrote: >> cc'ing dev list >> >> Ok, looks like when the KCL version was updated in &

Re: Write access to wiki

2016-01-12 Thread Nick Pentreath
I'd also like to get Wiki write access - at the least it allows a few of us to amend the "Powered By" and similar pages when those requests come through (Sean has been doing a lot of that recently :) On Mon, Jan 11, 2016 at 11:01 PM, Sean Owen wrote: > ... I forget who can give access -- is it I

Re: Elasticsearch sink for metrics

2016-01-15 Thread Nick Pentreath
I haven't come across anything, but could you provide more detail on what issues you're encountering? On Fri, Jan 15, 2016 at 11:09 AM, Pete Robbins wrote: > Has anyone tried pushing Spark metrics into elasticsearch? We have other > metrics, eg some runtime information, going into ES and would

Re: Proposal

2016-01-30 Thread Nick Pentreath
Hi there Sounds like a fun project :) I'd recommend getting familiar with the existing k-means implementation as well as bisecting k-means in Spark, and then implementing yours based off that. You should focus on using the new ML pipelines API, and release it as a package on spark-packages.org

Re: ML ALS API

2016-03-08 Thread Nick Pentreath
Hi Maciej Yes, that *train* method is intended to be public, but it is marked as *DeveloperApi*, which means that backward compatibility is not necessarily guaranteed, and that method may change. Having said that, even APIs marked as DeveloperApi do tend to be relatively stable. As the comment me

Re: Running ALS on comparitively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about: 1. Data set size (# ratings, # users and # products) 2. Spark cluster set up and version Thanks On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote: > Hello All, > > I've been running Spark's ALS on a dataset of users and rated items. I > first encode

Re: Running ALS on comparitively large RDD

2016-03-11 Thread Nick Pentreath
ically have around a million ratings > 2. Spark 1.6 on Amazon EMR > > On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath > wrote: > >> Could you provide more details about: >> 1. Data set size (# ratings, # users and # products) >> 2. Spark cluster set up and version >

Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Also adding dev list in case anyone else has ideas / views. On Sat, 12 Mar 2016 at 12:52, Nick Pentreath wrote: > Thanks for the feedback. > > I think Spark can certainly meet your use case when your data size scales > up, as the actual model dimension is very small - you will

Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Nick Pentreath
ou give me the issue key(s)? If not, would you like me to create these > tickets? > > I'm going to look into this some more and see if I can figure out how to > implement these fixes. > > ~Daniel Siegmann > > On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath > wrot

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Nick Pentreath
+1 for this proposal - as you mention I think it's the de facto current situation anyway. Note that from a developer view it's just the user-facing API that will be only "ml" - the majority of the actual algorithms still operate on RDDs under the hood currently. On Wed, 6 Apr 2016 at 05:03, Chris F

ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Hi there, In writing some tests for a PR I'm working on, with a more complex array type in a DF, I ran into this issue (running off latest master). Any thoughts? *// create DF with a column of Array[(Int, Double)]* val df = sc.parallelize(Seq( (0, Array((1, 6.0), (1, 4.0))), (1, Array((1, 3.0),

Re: ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Ah I got it - Seq[(Int, Float)] is actually represented as Seq[Row] (seq of struct type) internally. So a further extraction is required, e.g. row => row.getSeq[Row](1).map { r => r.getInt(0) } On Wed, 6 Apr 2016 at 13:35 Nick Pentreath wrote: > Hi there, > > In writing some tes
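Putting the two messages together, a self-contained sketch of the extraction described above (assuming a spark-shell session with `sc` and `sqlContext` in scope, as in the original snippet):

```scala
import sqlContext.implicits._
import org.apache.spark.sql.Row

// A column of Array[(Int, Double)] is stored internally as a Seq[Row]
// of struct type, so each element must be extracted field-by-field.
val df = sc.parallelize(Seq(
  (0, Array((1, 6.0), (1, 4.0))),
  (1, Array((1, 3.0), (2, 5.0)))
)).toDF("id", "pairs")

val extracted = df.rdd.map { row =>
  (row.getInt(0), row.getSeq[Row](1).map(r => (r.getInt(0), r.getDouble(1))))
}.collect()
```

Calling `getSeq[(Int, Double)]` directly is what triggers the ClassCastException, since the runtime elements are `Row`, not tuples.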

Organizing Spark ML example packages

2016-04-14 Thread Nick Pentreath
Hey Spark devs I noticed that we now have a large number of examples for ML & MLlib in the examples project - 57 for ML and 67 for MLLIB to be precise. This is bound to get larger as we add features (though I know there are some PRs to clean up duplicated examples). What do you think about organi

Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Nick Pentreath
You should find that the first set of fits are called on the training set, and the resulting models evaluated on the validation set. The final best model is then retrained on the entire dataset. This is standard practice - usually the dataset passed to the train validation split is itself further s
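The two-phase fitting described above can be seen in a standard TrainValidationSplit setup. This is an illustrative sketch: `training` is a hypothetical DataFrame with `features` and `label` columns, and the grid values are arbitrary:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8) // 80% train, 20% validation

// Each candidate is fit on the 80% split and scored on the 20%;
// the best parameter map is then refit on ALL of `training`,
// which is the "duplicated" fit asked about in this thread.
val model = tvs.fit(training)
```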

Re: Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Nick Pentreath
There is a JIRA and PR around for supporting polynomial expansion with degree 1. Offhand I can't recall if it's been merged On Mon, 2 May 2016 at 17:45, Julio Antonio Soto de Vicente wrote: > Hi, > > Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It > would be nice to cross
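For reference, the transformer under discussion is used as sketched below (`df` is a hypothetical DataFrame with a vector `features` column). At the time of this thread the degree had to be at least 2; the JIRA mentioned above is about also allowing degree 1 (the identity expansion), which makes it easier to include "no expansion" as a point in a cross-validation grid:

```scala
import org.apache.spark.ml.feature.PolynomialExpansion

val poly = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(2) // degree 1 is what the referenced JIRA/PR would permit

val expanded = poly.transform(df)
```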

Re: [VOTE] Removing module maintainer process

2016-05-22 Thread Nick Pentreath
+1 (binding) On Mon, 23 May 2016 at 04:19, Matei Zaharia wrote: > Correction, let's run this for 72 hours, so until 9 PM EST May 25th. > > > On May 22, 2016, at 8:34 PM, Matei Zaharia > wrote: > > > > It looks like the discussion thread on this has only had positive > replies, so I'm going to ca

Re: Cannot build master with sbt

2016-05-25 Thread Nick Pentreath
I've filed https://issues.apache.org/jira/browse/SPARK-15525 For now, you would have to check out sbt-antlr4 at https://github.com/ihji/sbt-antlr4/commit/23eab68b392681a7a09f6766850785afe8dfa53d (since I don't see any branches or tags in the github repo for different versions), and sbt publishLoca

Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Nick Pentreath
Congratulations Yanbo and welcome On Sat, 4 Jun 2016 at 10:17, Hortonworks wrote: > Congratulations, Yanbo > > Zhan Zhang > > Sent from my iPhone > > > On Jun 3, 2016, at 8:39 PM, Dongjoon Hyun wrote: > > > > Congratulations > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-24 Thread Nick Pentreath
I'm getting the following when trying to run ./dev/run-tests (not happening on master) from the extracted source tar. Anyone else seeing this? error: Could not access 'fc0a1475ef' ** File "./dev/run-tests.py", line 69, in __main__

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-28 Thread Nick Pentreath
ich is a correctness regression from 1.6. >> Looks like the patch is ready though: >> https://github.com/apache/spark/pull/13884 – it would be ideal for this >> patch to make it into the release. >> >> -Matt Cheah >> >> From: Nick Pentreath >> Date: F

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-18 Thread Nick Pentreath
it’s easy to read in Python) or some kind of WholeFileInputFormat. > Matei > On Dec 19, 2013, at 10:57 AM, Nick Pentreath wrote: >> Hi >> >> >> I managed to find the time to put together a PR on this: >> https://github.com/apache/incubator-spark/pull/26

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Nick Pentreath
> at first and then add some just to keep the API small. > Matei > On Mar 18, 2014, at 11:44 PM, Nick Pentreath wrote: >> Hi Matei >> >> >> I'm afraid I haven't had enough time to focus on this as work has just been >> crazy. It's still someth

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Nick Pentreath
On the partitioning / id keys. If we would look at hash partitioning, how feasible will it be to just allow the user and item ids to be strings? A lot of the time these ids are strings anyway (UUIDs and so on), and it's really painful to translate between String <-> Int the whole time. Are there a
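While ALS requires Int ids, a common interim workaround is to build an explicit String -> Int mapping up front. A sketch under stated assumptions: `rawRatings` is a hypothetical RDD of (userString, productString, rating) triples, and the distinct id counts are assumed to fit in an Int:

```scala
// Assign each distinct String user id a unique Int.
val userIds = rawRatings.map(_._1).distinct()
  .zipWithUniqueId()
  .mapValues(_.toInt)

// Join the Int ids back onto the ratings;
// product ids would be handled the same way.
val withIntUsers = rawRatings.map { case (u, p, r) => (u, (p, r)) }
  .join(userIds)
  .map { case (_, ((p, r), uInt)) => (uInt, p, r) }
```

This is exactly the painful translation step the message complains about, which is why native String id support (or hash partitioning on arbitrary keys) keeps coming up.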
