Thanks for that; it's good to know that functionality exists.
But shouldn't a decision tree be able to handle missing (i.e. null) values
more intelligently than simply substituting replacement values?
see for example here:
http://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algori
You could handle null values in a preprocessing step using the
DataFrame.na functions, e.g. DataFrame.na.fill().
For reference:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
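For concreteness, a minimal sketch of that preprocessing step (the SparkSession setup, column names, and fill values are illustrative only, loosely mirroring Andy's sample data below):

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only.
val spark = SparkSession.builder()
  .appName("na-fill-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy rows with nulls in the optional/sparse columns.
val df = Seq(
  ("a", 15, 3.5, Option("cat1"), Option(142L), Option.empty[String]),
  ("c", 2, 1.6, Option.empty[String], Option.empty[Long], Option.empty[String])
).toDF("f1", "f2", "f3", "optF1", "optF2", "sparseF1")

// Replace nulls per column before handing the frame to VectorAssembler.
val filled = df.na.fill(Map(
  "optF1"    -> "unknown", // sentinel category
  "optF2"    -> 0L,        // numeric default
  "sparseF1" -> "unknown"
))
```

After this step VectorAssembler sees no nulls; the trade-off is that the model treats the sentinel like any other value, rather than handling missingness natively as in the decision-tree discussion above.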
John
On 21 April 2016 at 03:41, Andres Perez wrote:
so the missing data could be on a one-off basis, or from fields that are in
general optional, or from, say, a count that is only relevant for certain
cases (very sparse):
f1|f2|f3|optF1|optF2|sparseF1
a|15|3.5|cat1|142L|
b|13|2.4|cat2|64L|catA
c|2|1.6|||
d|27|5.1||0|
-Andy
On Wed, Apr 20, 2016:
Could you provide an example of what your input data looks like? Supporting
missing values in a sparse result vector makes sense.
On Tue, 19 Apr 2016 at 23:55, Andres Perez wrote:
> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot
> handle null values. This presents a problem