[ https://issues.apache.org/jira/browse/FLINK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602901#comment-14602901 ]
Till Rohrmann commented on FLINK-1736:
--------------------------------------

I actually implemented a simple {{CountVectorizer}} for one of my presentations [1]. I thought about making a PR out of it.

[1] [https://github.com/tillrohrmann/flink/blob/zeppelin/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/CountVectorizer.scala]

> Add CountVectorizer to machine learning library
> -----------------------------------------------
>
>                 Key: FLINK-1736
>                 URL: https://issues.apache.org/jira/browse/FLINK-1736
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Alexander Alexandrov
>              Labels: ML, Starter
>
> A {{CountVectorizer}} feature extractor [1] assigns each word occurring in a
> corpus a unique identifier. With this mapping, it can efficiently vectorize
> models such as bag of words or n-grams. The unique identifier assigned to a
> word acts as an index into a vector. The number of occurrences of the word is
> stored as the vector value at that index.
> The advantage of the {{CountVectorizer}} compared to the FeatureHasher is
> that the mapping of words to indices can be obtained, which makes it easier to
> understand the resulting feature vectors.
> The {{CountVectorizer}} could be generalized to support arbitrary feature
> values.
> The {{CountVectorizer}} should be implemented as a {{Transformer}}.
> Resources:
> [1]
> [http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage]
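
To illustrate the mapping described in the issue, here is a minimal sketch in plain Python. It is not the Flink ML API and not the implementation linked in the comment. The function names {{fit_vocabulary}} and {{transform}}, the whitespace tokenization, and the dense list output are all assumptions made for the example.

```python
from collections import Counter


def fit_vocabulary(corpus):
    """Assign each distinct word a unique index, in order of first appearance.

    Hypothetical helper; tokenization by whitespace is an assumption.
    """
    vocabulary = {}
    for document in corpus:
        for word in document.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def transform(document, vocabulary):
    """Return a dense count vector; words missing from the vocabulary are ignored."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        index = vocabulary.get(word)
        if index is not None:
            # The word's identifier is the vector index; its count is the value.
            vector[index] = count
    return vector


corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = fit_vocabulary(corpus)
vectors = [transform(document, vocabulary) for document in corpus]
```

Because the vocabulary is an explicit dictionary, each vector index can be traced back to its word, which is the interpretability advantage over a hashing approach that the issue mentions.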