On 25/10/2014, 5:25 AM, Wush Wu wrote:
> Dear all,
>
> Sorry, I am not sure whether I should ask this question here or on
> R-devel. Are there any existing packages that implement, or are working
> on implementing, feature hashing or a similar function?
>
> For those who do not know "feature hashing", please let me give a brief
> explanation here.
>
> Feature hashing is a technique to convert a large number of strings to dummy
> variables quickly (similar to `stats::contrasts`). For example, if I want
> to convert a character vector `x <- c("asdfa", "adsfausd", .....)` to dummy
> variables, I need to construct a mapping between the strings and the indices
> (`base::factor`). However, if `x` has lots of distinct elements and
> the size of `x` is huge, the overhead of constructing the index is large.
> Moreover, the overhead is larger in a distributed environment.
>
> A good hash function could be used to map each string to an index
> quickly without the overhead of constructing the index. The probability of
> a "collision" might be small if we pick a good hash function. For details,
> please see http://en.wikipedia.org/wiki/Feature_hashing
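To make the idea in the question concrete, here is a minimal base-R sketch of the hashing trick. The function names `hash_bucket` and `hash_features`, the bucket count of 16, and the simple polynomial rolling hash are all my own assumptions for illustration. This is not a production-quality hash and not taken from any package.

```r
# Hypothetical helper: map one string to a bucket in 1..n_buckets using a
# simple polynomial rolling hash (illustrative only, not a strong hash).
hash_bucket <- function(s, n_buckets) {
  h <- 0
  for (code in utf8ToInt(s)) {
    # Reduce modulo n_buckets at every step so h stays small
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1  # R indices are 1-based
}

# Hypothetical helper: build a dummy-variable matrix without first
# constructing a string-to-index mapping such as factor levels.
hash_features <- function(x, n_buckets = 16) {
  idx <- vapply(x, hash_bucket, numeric(1),
                n_buckets = n_buckets, USE.NAMES = FALSE)
  m <- matrix(0L, nrow = length(x), ncol = n_buckets)
  m[cbind(seq_along(x), idx)] <- 1L  # one indicator per row
  m
}

x <- c("asdfa", "adsfausd", "asdfa")
m <- hash_features(x)
```

Equal strings always land in the same column, so rows 1 and 3 of `m` are identical. Distinct strings may also share a column (a collision), which becomes less likely as `n_buckets` grows. For large data, a sparse matrix would be the more realistic choice than the dense one used here.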
The "digest" package implements several different hash functions. You
could use the hash values as names in an environment to index arbitrary
objects associated with the values.

Duncan Murdoch
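A rough sketch of that suggestion, assuming the digest package is installed. The choice of MD5, the variable names (`keys`, `lookup`, `idx`), and the step that reduces a hash to one of 16 buckets are my own assumptions for illustration, not the only way to do it.

```r
library(digest)

x <- c("asdfa", "adsfausd", "asdfa")

# serialize = FALSE hashes the string itself rather than its serialized
# R representation, so the key depends only on the characters.
keys <- vapply(x, digest, character(1),
               algo = "md5", serialize = FALSE, USE.NAMES = FALSE)

# Use the hash values as names in an environment to index associated objects.
lookup <- new.env(hash = TRUE)
for (i in seq_along(x)) {
  assign(keys[i], x[i], envir = lookup)
}

# Assumption: to get a column index for feature hashing, take the first
# 7 hex digits (which fit in an R integer) modulo the number of buckets.
n_buckets <- 16
idx <- strtoi(substr(keys, 1, 7), base = 16L) %% n_buckets + 1
```

Identical strings yield identical keys, so `get(keys[1], envir = lookup)` returns `"asdfa"` and `idx[1]` equals `idx[3]`. Note that the environment still stores one entry per distinct string, so for pure feature hashing only the `idx` step is needed.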