Thanks for that; it's good to know that functionality exists. But shouldn't a decision tree be able to handle missing (a.k.a. null) values more intelligently than simply using replacement values? See, for example:
http://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algorithms-deal-with-missing-values-under-the-hoo
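To make concrete what "more intelligently" could mean, here is a toy sketch (plain Scala, not Spark; every name in it is hypothetical) of the C4.5-style "fractional instances" idea discussed in answers like the one linked above: when a split's feature is missing, the instance is sent down both branches, weighted by the fraction of training rows that went each way, instead of being imputed with a replacement value.

// Toy sketch of C4.5-style "fractional instances" -- illustration only,
// not Spark code. All types and names here are hypothetical.
sealed trait Node
case class Leaf(probPositive: Double) extends Node
case class Split(feature: String,
                 threshold: Double,
                 leftFraction: Double, // fraction of training rows that went left
                 left: Node,
                 right: Node) extends Node

object FractionalInstances {
  // Predict P(positive). When the split feature is missing, descend both
  // branches and blend the results by the observed training fractions.
  def predict(node: Node, row: Map[String, Double]): Double = node match {
    case Leaf(p) => p
    case Split(f, t, leftFrac, l, r) =>
      row.get(f) match {
        case Some(v) => if (v <= t) predict(l, row) else predict(r, row)
        case None    => leftFrac * predict(l, row) + (1.0 - leftFrac) * predict(r, row)
      }
  }

  def main(args: Array[String]): Unit = {
    val tree = Split("f2", 14.0, 0.6, Leaf(0.9), Leaf(0.2))
    println(predict(tree, Map("f2" -> 13.0))) // 0.9: f2 present, goes left
    println(predict(tree, Map.empty))         // 0.62 = 0.6*0.9 + 0.4*0.2
  }
}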
On Thu, Apr 21, 2016 at 12:29 AM, John Trengrove <john.trengr...@servian.com.au> wrote:

> You could handle null values by using the DataFrame.na functions in a
> preprocessing step like DataFrame.na.fill().
>
> For reference:
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
>
> John
>
> On 21 April 2016 at 03:41, Andres Perez <and...@tresata.com> wrote:
>
>> so the missing data could be on a one-off basis, or from fields that are
>> in general optional, or from, say, a count that is only relevant for
>> certain cases (very sparse):
>>
>> f1|f2|f3|optF1|optF2|sparseF1
>> a|15|3.5|cat1|142L|
>> b|13|2.4|cat2|64L|catA
>> c|2|1.6|||
>> d|27|5.1||0|
>>
>> -Andy
>>
>> On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>
>>> Could you provide an example of what your input data looks like?
>>> Supporting missing values in a sparse result vector makes sense.
>>>
>>> On Tue, 19 Apr 2016 at 23:55, Andres Perez <and...@tresata.com> wrote:
>>>
>>>> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently
>>>> cannot handle null values. This presents a problem for us as we wish to run
>>>> a decision tree classifier on sometimes sparse data. Is there a particular
>>>> reason VectorAssembler is implemented in this way, and can anyone recommend
>>>> the best path for enabling VectorAssembler to build vectors for data that
>>>> will contain empty values?
>>>>
>>>> Thanks!
>>>>
>>>> -Andres
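P.S. For concreteness, here is a minimal sketch of the preprocessing John suggests, run against data shaped like the sample above. The sentinel values, the SparkSession entry point, and the StringIndexer step are assumptions chosen for illustration, not the only way to do this.

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("na-fill-sketch").getOrCreate()
import spark.implicits._

// Data shaped like the sample above; Option encodes the nullable columns.
val df = Seq(
  ("a", 15, 3.5, Option("cat1"), Option(142L), Option.empty[String]),
  ("b", 13, 2.4, Option("cat2"), Option(64L), Option("catA")),
  ("c", 2, 1.6, Option.empty[String], Option.empty[Long], Option.empty[String]),
  ("d", 27, 5.1, Option.empty[String], Option(0L), Option.empty[String])
).toDF("f1", "f2", "f3", "optF1", "optF2", "sparseF1")

// Replace nulls with sentinels first; VectorAssembler itself rejects nulls.
val filled = df.na.fill(-1L, Seq("optF2"))
  .na.fill("missing", Seq("optF1", "sparseF1"))

// Index the categorical columns, then assemble the feature vector.
val indexers = Seq("optF1", "sparseF1").map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
}
val indexed = indexers.foldLeft(filled)((d, idx) => idx.fit(d).transform(d))

val assembled = new VectorAssembler()
  .setInputCols(Array("f2", "f3", "optF1_idx", "optF2", "sparseF1_idx"))
  .setOutputCol("features")
  .transform(indexed)

assembled.select("f1", "features").show(truncate = false)

To be clear, this just makes the replacement-value workaround concrete; it doesn't give the tree anything like the fractional-instance treatment sketched earlier, which is exactly the gap I'm asking about.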