[ https://issues.apache.org/jira/browse/IGNITE-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anton Dmitriev updated IGNITE-8059:
-----------------------------------
Description:
A partition-based dataset (a new underlying infrastructure component) was added
as part of IGNITE-7437, and now we need to adapt the decision tree algorithm to
work on top of this infrastructure.
----
The way the decision tree algorithm is implemented on top of row-partitioned
data is described below.
First, the basic idea behind any decision tree, both regression and
classification, is to find the *data split* that minimizes an
*impurity measure* such as the [Gini
coefficient|https://en.wikipedia.org/wiki/Gini_coefficient],
[entropy|https://en.wikipedia.org/wiki/Entropy_(information_theory)] or [mean
squared error|https://en.wikipedia.org/wiki/Mean_squared_error]. To calculate
the best split we need to build a _function_ that describes the dependency
between the split point (independent variable) and the impurity measure
(dependent variable) and then find the minimum of this _function_.
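To make the idea concrete, here is a minimal single-node sketch for the regression case. It is only an illustration, not the Ignite implementation: the class and method names (SplitSearch, sse, bestSplit) are hypothetical, the impurity is taken to be the sum of squared errors on both sides of the split, and candidate split points are assumed to be midpoints between consecutive distinct feature values.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of the Ignite API.
final class SplitSearch {
    /** Sum of squared errors around the mean for the labels at order[from..to). */
    static double sse(double[] y, Integer[] order, int from, int to) {
        int n = to - from;
        if (n <= 0)
            return 0.0;
        double sum = 0.0, sumSq = 0.0;
        for (int i = from; i < to; i++) {
            double v = y[order[i]];
            sum += v;
            sumSq += v * v;
        }
        return sumSq - sum * sum / n;
    }

    /**
     * Evaluates impurity(t) = SSE(left) + SSE(right) at every candidate
     * threshold t (rows with x <= t go left) and returns the minimizing t.
     * Returns NaN if all feature values are equal.
     */
    static double bestSplit(double[] x, double[] y) {
        // Row indices sorted by feature value.
        Integer[] order = new Integer[x.length];
        for (int i = 0; i < order.length; i++)
            order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> x[i]));

        double bestThreshold = Double.NaN;
        double bestImpurity = Double.POSITIVE_INFINITY;
        for (int k = 1; k < order.length; k++) {
            double lo = x[order[k - 1]], hi = x[order[k]];
            if (lo == hi)
                continue; // no threshold separates equal feature values
            double impurity = sse(y, order, 0, k) + sse(y, order, k, order.length);
            if (impurity < bestImpurity) {
                bestImpurity = impurity;
                bestThreshold = (lo + hi) / 2;
            }
        }
        return bestThreshold;
    }
}
```

The sketch recomputes the statistics for every candidate for readability; running sums would avoid the quadratic cost.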
In a distributed system where the data is partitioned by row, we can
calculate such a _function_ on every node, compress it in some way, and then
pass it to the master node. On the master node we need to sum up the
_functions_ received from all nodes and then find the minimum of the resulting
_function_. This is how the decision tree algorithm is implemented in the
Apache Ignite ML module.
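The distributed step can be sketched in the same spirit. The example below rests on assumptions that are not stated in this issue: every node evaluates the same fixed grid of candidate thresholds (which stands in for the compression step), and instead of the impurity value itself each node sends additive statistics (count, sum and sum of squares of the labels on the left side of each threshold, plus totals), so that summing on the master node is a plain element-wise addition. All names (DistributedSplit, Stats, local, bestSplit) are hypothetical and do not refer to the Ignite API.

```java
// Hypothetical helper, not part of the Ignite API.
final class DistributedSplit {
    /** Label statistics per threshold; the last index holds totals over all rows. */
    static final class Stats {
        final double[] cnt, sum, sumSq;

        Stats(int thresholds) {
            cnt = new double[thresholds + 1];
            sum = new double[thresholds + 1];
            sumSq = new double[thresholds + 1];
        }

        /** Element-wise sum: the operation the master node applies to node results. */
        Stats add(Stats other) {
            Stats res = new Stats(cnt.length - 1);
            for (int j = 0; j < cnt.length; j++) {
                res.cnt[j] = cnt[j] + other.cnt[j];
                res.sum[j] = sum[j] + other.sum[j];
                res.sumSq[j] = sumSq[j] + other.sumSq[j];
            }
            return res;
        }
    }

    /** Computed on each node from its own rows only. */
    static Stats local(double[] x, double[] y, double[] thresholds) {
        Stats s = new Stats(thresholds.length);
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j <= thresholds.length; j++) {
                // Index thresholds.length accumulates totals for all rows.
                if (j == thresholds.length || x[i] <= thresholds[j]) {
                    s.cnt[j] += 1;
                    s.sum[j] += y[i];
                    s.sumSq[j] += y[i] * y[i];
                }
            }
        }
        return s;
    }

    static double sse(double cnt, double sum, double sumSq) {
        return sumSq - sum * sum / cnt;
    }

    /** Computed on the master node from the summed statistics. */
    static double bestSplit(Stats s, double[] thresholds) {
        int all = thresholds.length;
        double best = Double.NaN, bestImpurity = Double.POSITIVE_INFINITY;
        for (int j = 0; j < all; j++) {
            double leftCnt = s.cnt[j], rightCnt = s.cnt[all] - leftCnt;
            if (leftCnt == 0 || rightCnt == 0)
                continue; // split leaves one side empty
            double impurity = sse(leftCnt, s.sum[j], s.sumSq[j])
                + sse(rightCnt, s.sum[all] - s.sum[j], s.sumSq[all] - s.sumSq[j]);
            if (impurity < bestImpurity) {
                bestImpurity = impurity;
                best = thresholds[j];
            }
        }
        return best;
    }
}
```

With two partitions, the master node would call bestSplit(local(xA, yA, grid).add(local(xB, yB, grid)), grid). Sending additive statistics rather than impurity values matters because the mean squared error of the merged data is not the sum of the per-node errors.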
> Integrate decision tree with partition based dataset
> ----------------------------------------------------
>
> Key: IGNITE-8059
> URL: https://issues.apache.org/jira/browse/IGNITE-8059
> Project: Ignite
> Issue Type: Improvement
> Components: ml
> Reporter: Anton Dmitriev
> Assignee: Anton Dmitriev
> Priority: Major
> Fix For: 2.5
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)