[ https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866685#comment-15866685 ]
Stavros Kontopoulos edited comment on FLINK-5588 at 2/15/17 2:09 AM:
---------------------------------------------------------------------

[~till.rohrmann] I have already implemented the Normalizer. I still need to check the floating-point arithmetic for the UnitScaler, because the sum might overflow, so I have to find a proper algorithm. I am thinking of dividing by Xmax to avoid overflow and using https://en.wikipedia.org/wiki/Kahan_summation_algorithm for the sum of many small numbers. The standard scaler uses this algorithm: http://www.cs.yale.edu/publications/techreports/tr222.pdf for the variance on big vectors; I need something similar.


was (Author: skonto):
[~till.rohrmann] I have already implemented the Normalizer. I still need to check the floating-point arithmetic for the UnitScaler, because the sum might overflow, so I have to find a proper algorithm. I am thinking of dividing by Xmax to avoid overflow and using https://en.wikipedia.org/wiki/Kahan_summation_algorithm for the sum of many small numbers. The standard scaler uses this algorithm: http://www.cs.yale.edu/publications/techreports/tr222.pdf for the variance on big vectors.

> Add a unit scaler based on different norms
> ------------------------------------------
>
>                 Key: FLINK-5588
>                 URL: https://issues.apache.org/jira/browse/FLINK-5588
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Stavros Kontopoulos
>            Assignee: Stavros Kontopoulos
>            Priority: Minor
>
> So far ML has two scalers: min-max and the standard scaler.
> A third one that is frequently used is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms
> available to the user.
> I will make a separate class for the per-sample normalization procedure by
> using the Transformer API, because it is easy to add;
> the fit method does nothing in this case.
> Scikit-learn also has some calls available outside the Transform API; we
> might want to add that in the future.
> These calls work on any axis but they are not re-usable in a pipeline [4]
> Right now the existing scalers in Flink ML support per feature normalization
> by using the Transformer API.
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
> [4] http://scikit-learn.org/stable/modules/preprocessing.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
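
For illustration only, the approach mentioned in the comment (divide by Xmax before squaring so the sum cannot overflow, and use Kahan summation for the sum of many small numbers) could be sketched roughly as below. This is not the Flink ML code: Python is used just for brevity, and the names kahan_sum, l2_norm, vector_norm and scale_to_unit are hypothetical helpers made up for this sketch.

```python
import math
from typing import List, Sequence


def kahan_sum(values: Sequence[float]) -> float:
    """Compensated (Kahan) summation of a sequence of floats."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y  # what was rounded away in total + y
        total = t
    return total


def l2_norm(x: Sequence[float]) -> float:
    """L2 norm computed on x / max|x_i| so that squaring cannot overflow."""
    x_max = max((abs(v) for v in x), default=0.0)
    if x_max == 0.0 or math.isinf(x_max):
        return x_max
    # Every ratio lies in [-1, 1], so each square is at most 1.
    squares = [(v / x_max) * (v / x_max) for v in x]
    return x_max * math.sqrt(kahan_sum(squares))


def vector_norm(x: Sequence[float], kind: str = "l2") -> float:
    """Hypothetical dispatcher over the norms a unit scaler might offer."""
    if kind == "l1":
        return kahan_sum([abs(v) for v in x])
    if kind == "max":
        return max((abs(v) for v in x), default=0.0)
    if kind == "l2":
        return l2_norm(x)
    raise ValueError("unknown norm: " + kind)


def scale_to_unit(x: Sequence[float], kind: str = "l2") -> List[float]:
    """Per-sample scaling to unit norm; nothing has to be fitted beforehand."""
    norm = vector_norm(x, kind)
    if norm == 0.0 or not math.isfinite(norm):
        return list(x)  # leave degenerate samples unchanged
    return [v / norm for v in x]
```

With this sketch, l2_norm([3.0, 4.0]) returns 5.0, and l2_norm([1e200, 1e200]) stays finite even though the plain sum of squares would exceed the double range. Because each sample is scaled on its own, no statistics need to be collected up front, which matches the remark in the issue that the fit method does nothing for per-sample normalization.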