[ https://issues.apache.org/jira/browse/FLINK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602847#comment-14602847 ]
Sachin Goel commented on FLINK-1736:
------------------------------------

Hi Alexander, are there any updates on this?

> Add CountVectorizer to machine learning library
> -----------------------------------------------
>
> Key: FLINK-1736
> URL: https://issues.apache.org/jira/browse/FLINK-1736
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Alexander Alexandrov
> Labels: ML, Starter
>
> A {{CountVectorizer}} feature extractor [1] assigns each word occurring in a
> corpus a unique identifier. With this mapping it can vectorize models such
> as bag of words or n-grams in an efficient way. The unique identifier assigned
> to a word acts as the index into a vector, and the number of occurrences of
> that word is stored as the vector value at that index.
> The advantage of the {{CountVectorizer}} compared to the FeatureHasher is
> that the mapping of words to indices can be obtained, which makes it easier to
> understand the resulting feature vectors.
> The {{CountVectorizer}} could be generalized to support arbitrary feature
> values.
> The {{CountVectorizer}} should be implemented as a {{Transformer}}.
> Resources:
> [1]
> [http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
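
For illustration only, the word-to-index mapping and count vectors described in the issue can be sketched in plain Python. This is not Flink ML code and not a proposed implementation: the function names `fit_vocabulary` and `transform`, the whitespace tokenization, and the sparse `{index: count}` representation are all assumptions made for this sketch.

```python
from collections import Counter


def fit_vocabulary(corpus):
    """Assign each distinct word a unique index, in order of first appearance.

    Hypothetical helper: tokenization by whitespace is an assumption.
    """
    vocabulary = {}
    for document in corpus:
        for word in document.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def transform(document, vocabulary):
    """Return a sparse count vector as {index: count}.

    Words that are not in the vocabulary are skipped.
    """
    counts = Counter(word for word in document.split() if word in vocabulary)
    return {vocabulary[word]: n for word, n in counts.items()}


corpus = ["the cat sat", "the dog sat on the cat"]
vocabulary = fit_vocabulary(corpus)
vectors = [transform(doc, vocabulary) for doc in corpus]

# Invert the mapping to see which word each vector index refers to.
index_to_word = {index: word for word, index in vocabulary.items()}
```

Because the vocabulary is an explicit dictionary, it can be inverted (as in `index_to_word`) to interpret any entry of a feature vector, which is the advantage over feature hashing that the issue points out.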