[ 
https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532251#comment-14532251
 ] 

Till Rohrmann commented on FLINK-1735:
--------------------------------------

Hi [~Felix Neutatz],

great to see that you have already implemented a feature hasher.

Well, the feature hasher should maybe go into a `feature.extraction` package. 
I'll move the `PolynomialBase` transformer into the `preprocessing` package 
where it belongs.

There is no test data for the feature hasher yet, so you should create some.

Usually the result of the feature hasher is really sparse; otherwise you have 
chosen too small a number of features and your feature vectors won't be 
meaningful at all. However, one could also think about a threshold which 
defines how many entries have to be non-zero in order for the vector to be 
stored as a `DenseVector`. If the threshold is not exceeded, a `SparseVector` 
is used instead.
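To make the threshold idea concrete, here is a minimal sketch (not Flink ML code; the threshold value and the names `to_vector` and `DENSITY_THRESHOLD` are illustrative assumptions) of picking a dense or sparse representation based on the fraction of non-zero entries:

```python
# Hedged sketch: choose vector storage by a density threshold.
# DENSITY_THRESHOLD and to_vector are illustrative, not Flink ML API.
DENSITY_THRESHOLD = 0.5  # fraction of entries that must be non-zero for dense storage

def to_vector(entries, size):
    """entries: dict mapping index -> value; returns ('dense', list) or ('sparse', dict)."""
    non_zero = sum(1 for v in entries.values() if v != 0)
    if non_zero / size >= DENSITY_THRESHOLD:
        # Dense: materialize every slot, including the zeros.
        dense = [0.0] * size
        for i, v in entries.items():
            dense[i] = v
        return ('dense', dense)
    # Sparse: keep only the non-zero entries.
    return ('sparse', {i: v for i, v in entries.items() if v != 0})
```

With hashed text features the non-zero fraction is typically far below any sensible threshold, so the sparse branch would dominate in practice.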

I have some comments on your implementation: The idea of the feature hasher is 
that you transform non-numerical data (images, text) into a numerical 
representation. Thus, defining the `FeatureHasher` as a `Transformer[Vector, 
Vector]` is not really useful. It would be better to define it for textual 
input or to introduce a type parameter there. 

For further comments, it would be good to open a PR, then I can directly 
comment on the code.
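To illustrate what "defined for textual input" means, here is a minimal language-agnostic sketch (Python here for brevity; `hash_token`, `hash_features`, and `NUM_FEATURES` are illustrative names, not the Flink ML API, and a real implementation would use MurmurHash3 as scikit-learn does) mapping tokens straight to a fixed-size numerical vector:

```python
# Hedged sketch: a feature hasher that takes textual tokens as input,
# not an already-numerical Vector. All names are illustrative assumptions.
NUM_FEATURES = 8

def hash_token(token, seed=0):
    # Simple deterministic string hash (FNV-1a style), standing in for
    # a production hash such as MurmurHash3.
    h = (2166136261 ^ seed) & 0xFFFFFFFF
    for b in token.encode('utf-8'):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def hash_features(tokens, num_features=NUM_FEATURES):
    # Each token increments the vector entry its hash maps to.
    vec = [0.0] * num_features
    for t in tokens:
        vec[hash_token(t) % num_features] += 1.0
    return vec
```

The input type is a sequence of tokens rather than a `Vector`, which is the point: the hasher is the step that produces the numerical representation in the first place.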

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature 
> values. The hash of the feature value is used to calculate its index for a 
> vector entry. In order to mitigate possible collisions, a second hashing 
> function is used to calculate the sign for the update value which is added to 
> the vector entry. This way, it is likely that collisions will simply cancel 
> out.
> A feature hasher would also be helpful for NLP problems where it could be 
> used to vectorize bag-of-words or n-gram feature vectors.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] 
> [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
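The sign trick described above can be sketched as follows (a hedged illustration of the technique from [1,2], not the actual implementation; the FNV-style hash and all names are illustrative stand-ins):

```python
# Hedged sketch of the signed hashing trick: one hash picks the vector
# index, a second hash picks the sign of the update, so features that
# collide on an index tend to cancel rather than accumulate bias.
def fnv1a(token, seed=0):
    # Deterministic string hash (FNV-1a style); a stand-in for MurmurHash3.
    h = (2166136261 ^ seed) & 0xFFFFFFFF
    for b in token.encode('utf-8'):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def signed_hash_features(tokens, num_features=8):
    vec = [0.0] * num_features
    for t in tokens:
        index = fnv1a(t) % num_features                      # first hash: index
        sign = 1.0 if fnv1a(t, seed=1) % 2 == 0 else -1.0    # second hash: sign
        vec[index] += sign
    return vec
```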



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)