Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Xiangrui Meng
Yes, the "bigram" in that demo only has two characters, which could separate different character sets. -Xiangrui On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei wrote: > The program computes hashing bi-gram frequency normalized by total number of > bigrams then filter out zero values. hashing is a eff

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Liquan Pei
The program computes hashed bigram frequencies, normalized by the total number of bigrams, then filters out zero values. Hashing is an effective trick for vectorizing features. Take a look at http://en.wikipedia.org/wiki/Feature_hashing Liquan On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta wrote: > I'm
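
For readers following the thread, here is a minimal sketch of the idea described above: hash each two-character bigram into a fixed number of buckets, normalize the counts by the total number of bigrams, and keep only the nonzero entries. This is not the demo program discussed in the thread and does not use the MLlib API. It is plain Python, and the function name `hashed_bigram_features`, the `num_buckets` default, and the choice of `zlib.crc32` as the hash function are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def hashed_bigram_features(text, num_buckets=1000):
    """Return a sparse {bucket_index: relative_frequency} dict for text.

    Hypothetical helper for illustration only; not part of MLlib.
    """
    # Slide a two-character window over the text to get character bigrams.
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    if not bigrams:
        return {}

    # The hashing trick: map each bigram to one of num_buckets indices.
    # zlib.crc32 is an assumed choice; it is stable across runs, unlike
    # Python's built-in hash() for strings.
    counts = Counter(
        zlib.crc32(bigram.encode("utf-8")) % num_buckets for bigram in bigrams
    )

    # Normalize by the total number of bigrams. Only buckets that were hit
    # are stored, so zero values never appear in the result.
    total = len(bigrams)
    return {index: count / total for index, count in sorted(counts.items())}


if __name__ == "__main__":
    features = hashed_bigram_features("creating a feature vector from text")
    print(features)
```

Because the number of buckets is fixed up front, texts of any length and alphabet map to vectors of the same dimension. The trade-off is that distinct bigrams can collide in the same bucket, so `num_buckets` should be large enough for the collision rate to be acceptable.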