Re: Develop custom Estimator / Transformer for pipeline

2016-11-20 Thread Georg Heiler
The estimator should perform data cleaning tasks. This means some rows will be dropped, some columns dropped, some columns added, and some values replaced in existing columns. It should also store the mean or min for some numeric columns as a NaN replacement. However, override def transformSchema(sch
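
For readers of the archive, here is a minimal sketch of one piece of the estimator described above: learning a column mean in `fit` and using it as the NaN replacement in `transform`. It is not code from the thread. The names `MeanFill`, `MeanFillModel`, `MeanFillParams` and the `inputCol` parameter are hypothetical, and the sketch assumes Spark 2.x or later with a single `DoubleType` column. Dropping rows or adding columns, which the message also mentions, is left out.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{avg, col}
import org.apache.spark.sql.types.{DoubleType, StructType}

// Hypothetical parameter trait shared by the estimator and its model.
trait MeanFillParams extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "numeric column whose NaN/null values are replaced")
  def getInputCol: String = $(inputCol)

  // Only validates the schema; this sketch does not add or drop columns.
  protected def validateSchema(schema: StructType): StructType = {
    val dt = schema($(inputCol)).dataType
    require(dt == DoubleType, s"Column ${$(inputCol)} must be DoubleType but was $dt")
    schema
  }
}

// Estimator: learns the mean of the input column, ignoring NaN and null.
class MeanFill(override val uid: String)
    extends Estimator[MeanFillModel] with MeanFillParams {

  def this() = this(Identifiable.randomUID("meanFill"))

  def setInputCol(value: String): this.type = set(inputCol, value)

  override def fit(dataset: Dataset[_]): MeanFillModel = {
    transformSchema(dataset.schema)
    val c = $(inputCol)
    // avg skips nulls but not NaN, so NaN rows are filtered out first.
    // Assumption: at least one valid value exists; otherwise getDouble fails.
    val mean = dataset.filter(!col(c).isNaN).agg(avg(col(c))).head().getDouble(0)
    copyValues(new MeanFillModel(uid, mean).setParent(this))
  }

  override def transformSchema(schema: StructType): StructType = validateSchema(schema)

  override def copy(extra: ParamMap): MeanFill = defaultCopy(extra)
}

// Model: stores the learned mean and applies it as the replacement value.
class MeanFillModel(override val uid: String, val mean: Double)
    extends Model[MeanFillModel] with MeanFillParams {

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    // na.fill replaces both null and NaN in the listed numeric columns.
    dataset.na.fill(mean, Seq($(inputCol)))
  }

  override def transformSchema(schema: StructType): StructType = validateSchema(schema)

  override def copy(extra: ParamMap): MeanFillModel =
    copyValues(new MeanFillModel(uid, mean), extra).setParent(parent)
}
```

Under these assumptions, `new MeanFill().setInputCol("x").fit(df)` returns a model that can be placed in a `Pipeline` like any built-in stage. Because `transformSchema` here returns the schema unchanged, a stage that adds or removes columns would need to return the modified `StructType` instead.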

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Georg Heiler
Looking forward to the blog post. Thanks for pointing me to some of the simpler classes. Nick Pentreath wrote on Fri, 18 Nov 2016 at 02:53: > @Holden look forward to the blog post - I think a user guide PR based on > it would also be super useful :) > > > On Fri, 18 Nov 2016 at 05:29 Holde

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it would also be super useful :) On Fri, 18 Nov 2016 at 05:29 Holden Karau wrote: > I've been working on a blog post around this and hope to have it published > early next month 😀 > > On Nov 17, 2016 10:16 PM, "Joseph Bradl

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Holden Karau
I've been working on a blog post around this and hope to have it published early next month 😀 On Nov 17, 2016 10:16 PM, "Joseph Bradley" wrote: Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for examples: ml.feature.Tokenize

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Joseph Bradley
Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for examples: ml.feature.Tokenizer ml.regression.IsotonicRegression You should not need to put your library in Spark's namespace. The shared Params in SPARK-7146 are not necessar
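
As an illustration of the advice above, here is a minimal sketch of a one-column transformer built on `UnaryTransformer`, the same base class that `ml.feature.Tokenizer` extends. It is not code from the thread. The class name `Lowercase` is hypothetical, it can live in any user package rather than Spark's namespace, and the sketch assumes non-null string input.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical transformer: lower-cases a string column.
// UnaryTransformer supplies setInputCol, setOutputCol, transformSchema and copy.
class Lowercase(override val uid: String)
    extends UnaryTransformer[String, String, Lowercase] {

  def this() = this(Identifiable.randomUID("lowercase"))

  // Applied to each value of the input column (assumes values are not null).
  override protected def createTransformFunc: String => String = _.toLowerCase

  // Fails early if the input column is not a string column.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  // Type of the output column appended by transform.
  override protected def outputDataType: DataType = StringType
}
```

A usage sketch under the same assumptions: `new Lowercase().setInputCol("text").setOutputCol("lower").transform(df)` appends a `lower` column to `df`.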