[
https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434577#comment-15434577
]
Sean Owen commented on SPARK-6509:
----------------------------------
There's no good answer to that question. Roughly: when a couple of people here
seem to agree, including someone who will commit it.
I think that in practice the bar is pretty high since ML already covers the
basics reasonably well, and it's perfectly possible to use third-party packages
with a Spark app. It doesn't have to be _in Spark_ to be useful, usable, and
widely used. I think it would probably merge into the project if it were widely
used but for some reason was struggling to be usable as a third party package,
maybe due to constant breakage or lack of maintenance.
Going more philosophical for a minute, in any platform-ish project, putting X
in the project discourages any alternative solutions to X from the ecosystem,
but has the benefit of making X somewhat easier to access. The same argument
has circled around, say, having an official logging package for Java, or an
official JSON parser library for Scala.
> MDLP discretizer
> ----------------
>
> Key: SPARK-6509
> URL: https://issues.apache.org/jira/browse/SPARK-6509
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Sergio Ramírez
>
> Minimum Description Length Discretizer
> This method implements Fayyad's discretizer [1], based on the Minimum
> Description Length Principle (MDLP), in order to treat continuous-valued
> datasets from a distributed perspective. We have developed a distributed
> version of the original algorithm with some important changes.
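> For intuition, here is a minimal, self-contained Scala sketch of the Fayyad &
> Irani MDLP stopping rule that drives the method; the MdlpCriterion object and
> the plain Long-array class counts are illustrative assumptions, not the API of
> this implementation.
> {code:scala}
> // Sketch of the Fayyad & Irani MDLP stopping criterion (illustrative only;
> // class counts on each side of a candidate cut are plain Long arrays here).
> object MdlpCriterion {
>   private def log2(x: Double): Double = math.log(x) / math.log(2)
>
>   // Shannon entropy of a class-count histogram.
>   def entropy(counts: Array[Long]): Double = {
>     val n = counts.sum.toDouble
>     counts.filter(_ > 0).map { c => val p = c / n; -p * log2(p) }.sum
>   }
>
>   // Accept a candidate cut point iff the information gain of the split
>   // exceeds the MDL cost of encoding it (Fayyad & Irani, 1993).
>   def acceptCut(left: Array[Long], right: Array[Long]): Boolean = {
>     val total = left.zip(right).map { case (a, b) => a + b }
>     val n  = total.sum.toDouble
>     val n1 = left.sum.toDouble
>     val n2 = right.sum.toDouble
>
>     val entS  = entropy(total)
>     val entS1 = entropy(left)
>     val entS2 = entropy(right)
>     val gain  = entS - (n1 / n) * entS1 - (n2 / n) * entS2
>
>     val k  = total.count(_ > 0)  // classes present in S
>     val k1 = left.count(_ > 0)   // classes present in S1
>     val k2 = right.count(_ > 0)  // classes present in S2
>     val delta = log2(math.pow(3, k) - 2) - (k * entS - k1 * entS1 - k2 * entS2)
>
>     gain > (log2(n - 1) + delta) / n
>   }
> }
> {code}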
> Associated paper:
> Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D.,
> Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J. M. and Herrera, F. (2016),
> Data discretization: taxonomy and big data challenge. WIREs Data Mining
> Knowledge Discovery, 6: 5–21. doi:10.1002/widm.1173
> URL: http://onlinelibrary.wiley.com/doi/10.1002/widm.1173/abstract
> -- Improvements on the discretizer (a hypothetical usage sketch follows this
> list):
> - Support for sparse data.
> - Multi-attribute processing. The whole process is carried out in a single
> step when the number of boundary points per attribute fits in one partition
> (<= 100K boundary points per attribute).
> - Support for attributes with a huge number of boundary points (> 100K
> boundary points per attribute), a rare situation.
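> To give an idea of the intended call pattern, here is a hypothetical usage
> sketch; MDLPDiscretizer.train, its parameters (continuousFeats, maxBins,
> maxByPart), model.transform, and the existing SparkContext sc are assumptions
> for illustration, not the final interface (see the design doc below).
> {code:scala}
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.util.MLUtils
>
> // Load a LIBSVM-format dataset (e.g. epsilon) as an RDD[LabeledPoint],
> // assuming an existing SparkContext sc.
> val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/epsilon.libsvm").cache()
>
> // Hypothetical entry point: fit a discretization model with at most
> // `maxBins` intervals per attribute, handling up to `maxByPart` boundary
> // points per attribute within a single partition (the <= 100K case above).
> val model = MDLPDiscretizer.train(data, continuousFeats = None,
>   maxBins = 15, maxByPart = 100000)
>
> // Apply the learned thresholds to every feature vector.
> val discretized = data.map(lp =>
>   LabeledPoint(lp.label, model.transform(lp.features)))
> {code}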
> This software has been evaluated on two large real-world datasets:
> - A dataset from the Protein Structure Prediction field, selected for the
> GECCO-2014 competition (Vancouver, July 13th, 2014;
> http://cruncher.ncl.ac.uk/bdcomp/). It has 32 million instances, 631
> attributes, 2 classes, 98% negative examples, and occupies about 56GB of
> disk space when uncompressed.
> - The Epsilon dataset
> (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon):
> 400K instances and 2K attributes.
> We have demonstrated that our method runs 300 times faster than the
> sequential version on the first dataset, and also improves the accuracy of
> Naive Bayes.
> Design doc:
> https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing
> References
> [1] Fayyad, U., & Irani, K. (1993). "Multi-interval discretization of
> continuous-valued attributes for classification learning." In Proceedings of
> the 13th International Joint Conference on Artificial Intelligence (IJCAI-93).