[ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537332#comment-14537332 ]
ASF GitHub Bot commented on FLINK-1735: --------------------------------------- Github user FelixNeutatz commented on a diff in the pull request: https://github.com/apache/flink/pull/665#discussion_r30005430 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/extraction/FeatureHasher.scala --- @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.feature.extraction + +import java.nio.charset.Charset + +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer} +import org.apache.flink.ml.feature.extraction.FeatureHasher.{NonNegative, NumFeatures} +import org.apache.flink.ml.math.{Vector, SparseVector} + +import scala.util.hashing.MurmurHash3 + + +/** This transformer turns sequences of symbolic feature names (strings) into + * flink.ml.math.SparseVectors, using a hash function to compute the matrix column corresponding + * to a name. Aka the hashing trick. + * The hash function employed is the signed 32-bit version of Murmurhash3. + * + * By default for [[FeatureHasher]] transformer numFeatures=2#94;20 and nonNegative=false. + * + * This transformer takes a [[Seq]] of strings and maps it to a + * feature [[Vector]]. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.common.Learner]] implementations which expect an input of + * [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Seq[String]] = env.fromCollection(data) + * val transformer = FeatureHasher().setNumFeatures(65536).setNonNegative(false) + * + * transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[FeatureHasher.NumFeatures]]: The number of features (entries) in the output vector; + * by default equal to 2^20 + * - [[FeatureHasher.NonNegative]]: Whether output vector should contain non-negative values only. + * When True, output values can be interpreted as frequencies. When False, output values will have + * expected value zero; by default equal to false + */ +class FeatureHasher extends Transformer[Seq[String], Vector] with Serializable { + + // The seed used to initialize the hasher + val Seed = 0 + + /** Sets the number of features (entries) in the output vector + * + * @param numFeatures the user-specified numFeatures value. In case the user gives a value less + * than 1, numFeatures is set to its default value: 2^20 + * @return the FeatureHasher instance with its numFeatures value set to the user-specified value + */ + def setNumFeatures(numFeatures: Int): FeatureHasher = { + // number of features must be greater zero + if(numFeatures < 1) { + return this + } + parameters.add(NumFeatures, numFeatures) + this + } + + /** Sets whether output vector should contain non-negative values only + * + * @param nonNegative the user-specified nonNegative value. + * @return the FeatureHasher instance with its nonNegative value set to the user-specified value + */ + def setNonNegative(nonNegative: Boolean): FeatureHasher = { + parameters.add(NonNegative, nonNegative) + this + } + + override def transform(input: DataSet[Seq[String]], parameters: ParameterMap): + DataSet[Vector] = { + val resultingParameters = this.parameters ++ parameters + + val nonNegative = resultingParameters(NonNegative) + val numFeatures = resultingParameters(NumFeatures) + + // each item of the sequence is hashed and transformed into a tuple (index, value) + input.map { + inputSeq => { + val entries = inputSeq.map { + s => { + // unicode strings are converted to utf-8 + // bytesHash is faster than arrayHash, because it hashes 4 bytes at once + val h = MurmurHash3.bytesHash(s.getBytes(Charset.forName("UTF-8")), Seed) % numFeatures --- End diff -- we are using MurmurHash3 because we followed the implementation of [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher) which uses the MurmurHash3 > Add FeatureHasher to machine learning library > --------------------------------------------- > > Key: FLINK-1735 > URL: https://issues.apache.org/jira/browse/FLINK-1735 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library > Reporter: Till Rohrmann > Assignee: Felix Neutatz > Labels: ML > > Using the hashing trick [1,2] is a common way to vectorize arbitrary feature > values. The hash of the feature value is used to calculate its index for a > vector entry. In order to mitigate possible collisions, a second hashing > function is used to calculate the sign for the update value which is added to > the vector entry. This way, it is likely that collision will simply cancel > out. > A feature hasher would also be helpful for NLP problems where it could be > used to vectorize bag of words or ngrams feature vectors. > Resources: > [1] [https://en.wikipedia.org/wiki/Feature_hashing] > [2] > [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction] -- This message was sent by Atlassian JIRA (v6.3.4#6332)