Given a table with hundreds of columns, a mix of categorical and
numerical attributes, and unknown value distributions, what's the best
way to detect outliers?
For example, given a table:

    Category  Price
    A         1
    A         1.3
    A         100
    C
Or given a table:
> $cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1
Expected output: the rows that are anomalous within their (State, City)
group, e.g. ID 3 (Price 195 vs. 96 and 100 in the other CA/A rows) and
ID 6 (Number 3 vs. ~30000 in the other NY/C rows).
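For concreteness, here is a minimal sketch of one direct approach
(mine, not from the thread; it assumes pandas and treats each
(State, City) pair as a group): score each numeric value by its
distance from its group's median in MAD units, which makes no
assumption about the distribution.

    import pandas as pd

    def group_outliers(df, group_cols, value_col, threshold=3.5):
        # Robust z-score: distance from the group's median, in units of
        # the group's median absolute deviation (MAD). No distributional
        # assumption, and a single huge value can't mask itself the way
        # it would with a mean/stddev rule.
        def robust_z(s):
            med = s.median()
            mad = (s - med).abs().median()
            if mad == 0:
                # Group too uniform (or too small) to score.
                return pd.Series(0.0, index=s.index)
            # 0.6745 rescales MAD so the score is comparable to a z-score.
            return 0.6745 * (s - med) / mad
        scores = df.groupby(group_cols)[value_col].transform(robust_z)
        return df[scores.abs() > threshold]

    df = pd.read_csv("data.csv")
    print(group_outliers(df, ["State", "City"], "Price"))   # flags row 3
    print(group_outliers(df, ["State", "City"], "Number"))  # flags row 6

The 3.5 cutoff is a common default for MAD-based scores; it's a knob,
not a rule.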
> …of speed for a small cost in accuracy.
>
> However, if your rule is really like "must match columns A and B,
> then take the closest value in column C", then just ordering
> everything by A, B, C lets you pretty much read off the answer from
> the result set directly: each row's closest match is adjacent to it
> in the sorted order.
>
> Without a continuous metric, I don't think you can use things like
> LSH, which more or less depend on a continuous metric space. This is
> too specific to fit into a general framework usefully, I think, but I
> think you can solve this directly with some code without much trouble.
>
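To make that concrete, a sketch of the sort-and-scan idea (the column
mapping is my assumption, not the poster's: A, B = State, City and
C = Price from data.csv). After sorting, each row's closest C value
within its (A, B) group is one of its two immediate neighbours, so a
single linear pass reads off the answer.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("data.csv").sort_values(["State", "City", "Price"])

    def neighbour_gap(group):
        # Within a sorted group, the closest Price to each row is either
        # the previous or the next row.
        p = group["Price"].to_numpy(dtype=float)
        gap_prev = np.abs(np.diff(p, prepend=np.inf))  # inf for the first row
        gap_next = np.abs(np.diff(p, append=np.inf))   # inf for the last row
        group = group.copy()
        group["gap"] = np.minimum(gap_prev, gap_next)  # singletons get inf
        return group

    out = df.groupby(["State", "City"], group_keys=False).apply(neighbour_gap)
    # A row with a large gap has no close match within its group.
    print(out.sort_values("gap", ascending=False))

That's an O(n log n) sort plus a linear scan, with no index structure
needed, which is the point of the suggestion.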
> State Flag ID Number Price City    State Flag ID Number Price City    ID ID Dist  Rank
> NY    0    4  2000   124   B       NY    1    5  2001   128   B       4  5  0.041 1
> NY    0    6  3      24    C       NY    1    7  30100  27    C       6  7  0.13  1
> NY    0    6  3      24    C       NY    1    9  33000  39    C       6  9  3.15  2
> NY    0    8  30200  29    C       NY    1    7  30100  27    C
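Here is a sketch of one way a pair table like the one above could be
produced (the assumptions are mine, since the message introducing it is
lost: each Flag=0 row is compared to every Flag=1 row sharing its State
and City, the distance is a relative Price difference, and rank orders
each Flag=0 row's matches). This simple distance roughly tracks the
0.041 and 0.13 figures but not the 3.15, so the original metric was
presumably richer.

    import pandas as pd

    df = pd.read_csv("data.csv")
    left = df[df["Flag"] == 0]
    right = df[df["Flag"] == 1]

    # Cross-join within each (State, City) group: Flag=0 rows on the
    # left, candidate Flag=1 matches on the right.
    pairs = left.merge(right, on=["State", "City"], suffixes=("_0", "_1"))

    # Hypothetical distance: relative Price difference. The thread's
    # exact formula is not recoverable from the quoted output.
    pairs["dist"] = (pairs["Price_1"] - pairs["Price_0"]).abs() / pairs["Price_0"]
    pairs["rank"] = pairs.groupby("ID_0")["dist"].rank(method="first").astype(int)

    cols = ["State", "Flag_0", "ID_0", "Number_0", "Price_0",
            "Flag_1", "ID_1", "Number_1", "Price_1", "City", "dist", "rank"]
    print(pairs.sort_values(["ID_0", "rank"])[cols])

Pairs with a large distance (like 6 and 9 above) are the ones with no
close match in their group.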