What's the best way to detect and remove outliers in a table?

2016-09-01 Thread Mobius ReX
Given a table with hundreds of columns mixing categorical and numerical attributes, where the distribution of values is unknown, what's the best way to detect outliers? For example, given a table:

Category  Price
A         1
A         1.3
A         100
C
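One common baseline (not stated in the thread itself) is Tukey's IQR fences computed per category: flag any row whose numeric value falls outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for its group. A minimal pure-Python sketch, assuming the column names `Category` and `Price` from the example above:

```python
from collections import defaultdict
import statistics

def iqr_outliers(rows, cat_key, num_key, k=1.5):
    """Flag rows whose numeric value falls outside Tukey's fences
    (Q1 - k*IQR, Q3 + k*IQR), computed separately per category."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[cat_key]].append(r[num_key])
    fences = {}
    for cat, vals in groups.items():
        if len(vals) < 2:
            # Too few points to estimate quartiles; never flag.
            fences[cat] = (float("-inf"), float("inf"))
            continue
        q1, _, q3 = statistics.quantiles(vals, n=4, method="inclusive")
        iqr = q3 - q1
        fences[cat] = (q1 - k * iqr, q3 + k * iqr)
    return [r for r in rows
            if not fences[r[cat_key]][0] <= r[num_key] <= fences[r[cat_key]][1]]

rows = [
    {"Category": "A", "Price": 1.0},
    {"Category": "A", "Price": 1.1},
    {"Category": "A", "Price": 1.2},
    {"Category": "A", "Price": 1.3},
    {"Category": "A", "Price": 100.0},
]
outliers = iqr_outliers(rows, "Category", "Price")
```

The same per-group quantile logic translates to Spark via `approxQuantile` or a window over the category column; the sketch above only illustrates the rule itself.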

What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Given a table

> $ cat data.csv
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1

Expecte
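The truncated message doesn't show the expected output, but a plausible reading (an assumption, consistent with the result table quoted later in the thread) is: for each Flag=0 row, find the Flag=1 row in the same (State, City) group whose Price is closest. A pure-Python sketch over the sample CSV:

```python
import csv
import io

DATA = """ID,State,City,Price,Number,Flag
1,CA,A,100,1000,0
2,CA,A,96,1010,1
3,CA,A,195,1010,1
4,NY,B,124,2000,0
5,NY,B,128,2001,1
6,NY,C,24,3,0
7,NY,C,27,30100,1
8,NY,C,29,30200,0
9,NY,C,39,33000,1
"""

rows = list(csv.DictReader(io.StringIO(DATA)))
for r in rows:
    r["Price"] = float(r["Price"])

# For every Flag=0 row, pick the Flag=1 row in the same (State, City)
# group with the closest Price (assumed matching rule).
flagged = [r for r in rows if r["Flag"] == "1"]
matches = {}
for r in rows:
    if r["Flag"] != "0":
        continue
    candidates = [c for c in flagged
                  if (c["State"], c["City"]) == (r["State"], r["City"])]
    if candidates:
        matches[r["ID"]] = min(
            candidates, key=lambda c: abs(c["Price"] - r["Price"]))["ID"]
# matches → {'1': '2', '4': '5', '6': '7', '8': '7'}
```

In Spark the same join-within-group-then-rank pattern can be expressed with a self-join on (State, City) plus `row_number()` over a window ordered by the absolute price difference.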

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
> ...of speed for a small cost in accuracy.
>
> However, if your rule is really like "must match column A and B and then closest value in column C", then just ordering everything by A, B, C lets you pretty much read off the answer from the result set directly. Everything is closest
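The sort-then-scan idea quoted above can be sketched in a few lines of pure Python (an illustration, not code from the thread): after sorting by (A, B, C), the closest C-value with equal (A, B) is always an adjacent row, so a single linear scan finds every match.

```python
def nearest_by_sort(records):
    """records: tuples (a, b, c). After sorting by (a, b, c), each record's
    nearest c-neighbor with equal (a, b) is an adjacent row in sort order."""
    recs = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    out = {}
    for i, r in enumerate(recs):
        best = None
        for j in (i - 1, i + 1):  # only the two sorted neighbors can win
            if 0 <= j < len(recs) and recs[j][:2] == r[:2]:
                cand = recs[j]
                if best is None or abs(cand[2] - r[2]) < abs(best[2] - r[2]):
                    best = cand
        out[r] = best
    return out
```

This is the approach the reply describes: one global sort (cheap in Spark via `sortWithinPartitions` after partitioning on A, B) replaces any pairwise distance computation.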

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
> ...metric, I don't think you can use things like LSH, which more or less depend on a continuous metric space. This is too specific to fit into a general framework usefully, I think, but I think you can solve this directly with some code without much trouble.
>
> On Tue, Sep 13,

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
> ...          1 5 2001  128 B    4 5 0.041 1
> NY 0 6 3     24 C   NY 1 7 30100 27 C    6 7 0.13 1
> NY 0 6 3     24 C   NY 1 9 33000 39 C    6 9 3.15 2
> NY 0 8 30200 29 C   NY 1 7 30100 27 C