[ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532251#comment-14532251 ]
Till Rohrmann commented on FLINK-1735:
--------------------------------------

Hi [~Felix Neutatz],

great to see that you have already implemented a feature hasher.

The feature hasher should probably go into a `feature.extraction` package. I'll move the `PolynomialBase` transformer into the `preprocessing` package, where it belongs.

There is no test data for the feature hasher yet, so you should create some.

Usually the result of the feature hasher is really sparse. If it is not, you have chosen too few features, and your feature vectors won't be meaningful at all. However, one could also think about a threshold that defines how many entries have to be non-zero for the vector to be stored in a `DenseVector`. If the threshold is not exceeded, a `SparseVector` is used.

I have some comments on your implementation:

The idea of the feature hasher is to transform non-numerical data (images, text) into a numerical representation. Defining the `FeatureHasher` as a `Transformer[Vector, Vector]` is therefore not really useful. It would be better to define it for textual input or to introduce a type parameter there.

For further comments, it would be good to open a PR; then I can comment directly on the code.

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature
> values. The hash of the feature value is used to calculate its index for a
> vector entry. In order to mitigate possible collisions, a second hashing
> function is used to calculate the sign for the update value which is added to
> the vector entry. This way, it is likely that collisions will simply cancel
> out.
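To make the scheme in the quoted paragraph above concrete, here is a minimal Python sketch. It is not Flink code and not the implementation under discussion. It assumes string tokens as input and uses salted SHA-256 as a stand-in for the two hash functions. The names `hash_features` and `to_dense_if_full`, and the default threshold of 0.5, are hypothetical. The second helper illustrates the density threshold suggested in the comment, with a plain list standing in for a `DenseVector` and a dict for a `SparseVector`.

```python
import hashlib


def _hash(value: str, salt: str) -> int:
    """Deterministic 64-bit hash; `salt` selects which hash function is used (assumption)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, num_features):
    """Map tokens to a sparse vector {index: value} using the signed hashing trick."""
    vector = {}
    for token in tokens:
        # First hash: which vector entry the token updates.
        index = _hash(token, "index:") % num_features
        # Second hash: the sign of the update, so that colliding tokens tend to cancel.
        sign = 1.0 if _hash(token, "sign:") % 2 == 0 else -1.0
        vector[index] = vector.get(index, 0.0) + sign
    # Drop entries that cancelled to zero so the result stays sparse.
    return {i: v for i, v in vector.items() if v != 0.0}


def to_dense_if_full(sparse, num_features, threshold=0.5):
    """Return a dense list if the non-zero fraction exceeds `threshold`, else the sparse dict."""
    if len(sparse) / num_features > threshold:
        return [sparse.get(i, 0.0) for i in range(num_features)]
    return sparse
```

For example, `hash_features("the quick brown fox".split(), 2 ** 10)` returns a dict with at most four entries. With a sensible number of features the non-zero fraction stays far below the threshold, so `to_dense_if_full` hands the sparse form back unchanged.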
> A feature hasher would also be helpful for NLP problems where it could be
> used to vectorize bag of words or ngrams feature vectors.
>
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)