Thanks for that; it's good to know that functionality exists.

But shouldn't a decision tree be able to handle missing (i.e., null) values
more intelligently than simply substituting replacement values, e.g. via
C4.5-style fractional instance weighting or CART's surrogate splits?

See, for example:
http://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algorithms-deal-with-missing-values-under-the-hoo


On Thu, Apr 21, 2016 at 12:29 AM, John Trengrove
<john.trengr...@servian.com.au> wrote:

> You could handle null values in a preprocessing step using the DataFrame.na
> functions, e.g. DataFrame.na.fill().
>
> For reference:
>
>
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
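>
> For example, a minimal sketch using the column names from your sample (the
> fill values are illustrative, not a recommendation):
>
>     import org.apache.spark.sql.DataFrame
>
>     // Replace nulls in the numeric column with 0.0 and nulls in the
>     // categorical columns with a sentinel category before assembling.
>     def fillMissing(df: DataFrame): DataFrame =
>       df.na.fill(0.0, Seq("optF2"))
>         .na.fill("missing", Seq("optF1", "sparseF1"))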
>
> John
>
> On 21 April 2016 at 03:41, Andres Perez <and...@tresata.com> wrote:
>
>> So the missing data could occur on a one-off basis, come from fields that
>> are optional in general, or come from, say, a count that is only relevant
>> in certain cases (very sparse):
>>
>> f1|f2|f3|optF1|optF2|sparseF1
>> a|15|3.5|cat1|142L|
>> b|13|2.4|cat2|64L|catA
>> c|2|1.6|||
>> d|27|5.1||0|
>>
>> -Andy
>>
>> On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath
>> <nick.pentre...@gmail.com> wrote:
>>
>>> Could you provide an example of what your input data looks like?
>>> Supporting missing values in a sparse result vector makes sense.
>>>
>>> On Tue, 19 Apr 2016 at 23:55, Andres Perez <and...@tresata.com> wrote:
>>>
>>>> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently
>>>> cannot handle null values. This presents a problem for us, as we wish to
>>>> run a decision tree classifier on data that is sometimes sparse. Is there
>>>> a particular reason VectorAssembler is implemented this way, and can
>>>> anyone recommend the best path toward enabling VectorAssembler to build
>>>> vectors for data that contains empty values?
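>>>>
>>>> For illustration, a minimal sketch of the failing call (column names are
>>>> hypothetical):
>>>>
>>>>     import org.apache.spark.ml.feature.VectorAssembler
>>>>
>>>>     val assembler = new VectorAssembler()
>>>>       .setInputCols(Array("x1", "x2", "x3"))
>>>>       .setOutputCol("features")
>>>>
>>>>     // throws a SparkException at runtime as soon as it hits a row
>>>>     // with a null in any of the input columns
>>>>     val assembled = assembler.transform(df)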
>>>>
>>>> Thanks!
>>>>
>>>> -Andres
>>>>
>>>>
>>
>
