subject:"Re\: \[MLlib\] Term Frequency in TF\-IDF seems incorrect"

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath

Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition ( https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is actually the "number of times the term occurs in the document". So it's perhaps a bit of a

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang

Hi Hao, HashingTF directly apply a hash function (Murmurhash3) to the features to determine their column index. It excluded any thought about the term frequency or the length of the document. It does similar work compared with sklearn FeatureHasher. The result is increased speed and reduced memory