Hi Ted,

On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:

> On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <[email protected]> wrote:
> 
>> 
>> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
>> 
>>> The first thing to try is feature hashing to reduce your feature vector
>> size.
>> 
>> Unfortunately LibLinear takes feature indices directly (assumes they're
>> sequential ints from 0..n-1), so I don't think feature hashing will help
>> here.
>> 
> 
> I am sure that it would.  The feature indices that you give to liblinear
> don't have to be your original indices.
> 
> The simplest level of feature hashing would be to take the original feature
> indices and use multiple hashing to get 1, 2 or more new feature index
> values for each original index.  Then take these modulo the new feature
> vector size (which can be much smaller than your original).

Thanks for clarifying - I was stuck on using the hash trick only to get rid of 
the terms-to-index map, rather than to create a denser matrix.
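To make sure I've got the scheme right, here's a rough Python sketch of what I 
understand you to be describing (the bucket count, hash choice, and function 
names are mine, not anything LibLinear-specific):

```python
import hashlib

def hash_features(indexed_values, num_buckets, num_hashes=2):
    """Map (original_index, value) pairs into a smaller dense vector.

    Each original feature index is hashed num_hashes times (with
    different seeds), each hash is taken modulo num_buckets, and the
    feature's value is added at each resulting bucket. The nonzero
    buckets, numbered 0..num_buckets-1, can then be fed to LibLinear
    as sequential feature indices.
    """
    vec = [0.0] * num_buckets
    for idx, value in indexed_values:
        for seed in range(num_hashes):
            # Seed the hash so each of the num_hashes functions differs.
            digest = hashlib.md5(f"{seed}:{idx}".encode()).digest()
            bucket = int.from_bytes(digest[:8], "big") % num_buckets
            vec[bucket] += value
    return vec

# e.g. two original features with large, non-sequential indices
dense = hash_features([(5, 1.0), (123456, 2.0)], num_buckets=64)
```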

Though I haven't yet found a good write-up on the value of generating more than 
one hash per feature - it seems like multiple hash values would increase the 
odds of collisions.

With a not-so-sparse matrix and a single hash function, I got a 6% drop in 
accuracy. I'll have to try it with a more realistic/sparser data set.

-- Ken

> There will be some collisions, but the result here is a linear
> transformation of the original space and if you use multiple indexes for
> each original feature, you will lose very little, if anything.  The SVM
> will almost always be able to learn around the effects of collisions.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
