[ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532547#comment-14532547 ]
Till Rohrmann commented on FLINK-1735:
--------------------------------------

I guess that a tokenizer/sentence splitter makes a lot of sense if we want to do text classification. If you want to, you can open a JIRA issue and implement it.

I guess that a Seq[T] would be the most generic input for the feature hasher, right? If that is not possible, then you can start with Set[String] and see how to generalize it later. I'm currently reworking the pipelining, and this will allow you to define specialized implementations for different input types (e.g. String, Image, ...).

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature
> values. The hash of the feature value is used to calculate its index for a
> vector entry. In order to mitigate possible collisions, a second hashing
> function is used to calculate the sign for the update value which is added to
> the vector entry. This way, it is likely that collisions will simply cancel
> out.
> A feature hasher would also be helpful for NLP problems where it could be
> used to vectorize bag of words or ngrams feature vectors.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
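
For illustration, the hashing trick described in the issue text could be sketched roughly as follows. This is only a minimal sketch and not the FlinkML implementation: the object name `FeatureHasherSketch`, the method `hash`, the parameter `numFeatures`, and the two seed values are all hypothetical, and hashing a generic `T` through its `toString` is an assumption made so that a `Seq[T]` input works. One hash picks the vector index, a second, differently seeded hash picks the sign of the update.

```scala
import scala.util.hashing.MurmurHash3

object FeatureHasherSketch {
  // Hypothetical seeds: two distinct seeds yield two different hash functions.
  private val IndexSeed = 0
  private val SignSeed = 1

  // Hash values may be negative, so fold the remainder into [0, mod).
  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  /** Vectorizes the given features into a dense vector of length numFeatures. */
  def hash[T](features: Seq[T], numFeatures: Int): Array[Double] = {
    require(numFeatures > 0, "numFeatures must be positive")
    val vector = new Array[Double](numFeatures)
    for (feature <- features) {
      // Assumption: the string form of a feature identifies it.
      val key = feature.toString
      // First hash: index of the vector entry to update.
      val index = nonNegativeMod(MurmurHash3.stringHash(key, IndexSeed), numFeatures)
      // Second hash: sign of the update, so colliding features tend to cancel.
      val sign = if ((MurmurHash3.stringHash(key, SignSeed) & 1) == 0) 1.0 else -1.0
      vector(index) += sign
    }
    vector
  }
}
```

Calling `FeatureHasherSketch.hash(Seq("to", "be", "or", "not", "to", "be"), 16)` would give a vector of length 16 in which repeated tokens accumulate in the same entry with the same sign. A real implementation would more likely emit a sparse vector and might provide specialized hashing per input type, as the comment above suggests.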