Actually, I think it is possible that a user/developer needs features standardized with the population mean and std in some cases. It would be better if StandardScaler could offer an option to do that.
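In the meantime, one possible workaround (a minimal sketch, not an existing StandardScaler option) is to rescale the sample-std output by sqrt((n - 1) / n), which gives population-std scaling over the n observed rows. This assumes the usual shell SparkContext `sc`, as in the quoted example below:

    # Workaround sketch, not an existing StandardScaler option:
    # rescale the sample-std output by sqrt((n - 1) / n).
    import math
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.feature import StandardScaler

    # `sc` is the usual shell SparkContext, as in the quoted example below.
    dataset = sc.parallelize([Vectors.dense([-2.0, 2.3, 0.0]),
                              Vectors.dense([3.8, 0.0, 1.9])])

    # StandardScaler divides by the sample std (n - 1 in the denominator).
    model = StandardScaler(withMean=True, withStd=True).fit(dataset)
    sample_scaled = model.transform(dataset)

    # Multiplying by sqrt((n - 1) / n) converts the result to population-std
    # scaling, i.e. unit variance over the n observed rows.
    n = dataset.count()
    factor = math.sqrt((n - 1.0) / n)
    population_scaled = sample_scaled.map(lambda v: Vectors.dense(v.toArray() * factor))

    for v in population_scaled.collect():
        print(v)  # expected: [-1.0, 1.0, -1.0] and [1.0, -1.0, 1.0]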
Holden Karau wrote
> Hi Gilad,
>
> Spark uses the sample standard deviation inside of the StandardScaler (see
> https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler),
> which I think would explain the results you are seeing. I believe the
> scalers are intended to be used on larger sized datasets. You can verify
> this yourself by doing the same computation in Python and seeing that
> scaling with the sample deviation results in the values you are seeing
> from Spark.
>
> Cheers,
>
> Holden :)
>
>
> On Sun, Jan 8, 2017 at 12:06 PM, Gilad Barkan <gilad.barkan@...> wrote:
>
>> Hi
>>
>> It seems that the output of MLlib's *StandardScaler*(*withMean*=True,
>> *withStd*=True) is not as expected.
>>
>> The above configuration is expected to do the following transformation:
>>
>> X -> Y = (X - Mean) / Std    - Eq.1
>>
>> This transformation (a.k.a. standardization) should result in a
>> "standardized" vector with unit variance and zero mean.
>>
>> I'll demonstrate my claim using the current documentation example:
>>
>> >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>> >>> dataset = sc.parallelize(vs)
>> >>> standardizer = StandardScaler(True, True)
>> >>> model = standardizer.fit(dataset)
>> >>> result = model.transform(dataset)
>> >>> for r in result.collect(): print r
>> DenseVector([-0.7071, 0.7071, -0.7071])
>> DenseVector([0.7071, -0.7071, 0.7071])
>>
>> This results in std = sqrt(1/2) for each column instead of std = 1.
>>
>> Applying the standardization transformation to the above 2 vectors should
>> result in the following output:
>>
>> DenseVector([-1.0, 1.0, -1.0])
>> DenseVector([1.0, -1.0, 1.0])
>>
>> Another example:
>>
>> Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3 rows of
>> DenseVectors:
>> [DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]),
>> DenseVector([2.4, 0.8, 3.5])]
>>
>> The StandardScaler returns the following scaled vectors:
>> [DenseVector([-1.12339, 1.084829, -1.02731]), DenseVector([0.792982,
>> -0.88499, 0.057073]), DenseVector([0.330409, -0.19984, 0.970241])]
>>
>> This result has std = sqrt(2/3).
>>
>> Instead it should have resulted in 3 vectors that form std = 1 for each
>> column.
>>
>> Adding another vector (4 total) results in 4 scaled vectors that form
>> std = sqrt(3/4) instead of std = 1.
>>
>> I hope all the examples help to make my point clear.
>>
>> I hope I don't miss something here.
>>
>> Thank you
>>
>> Gilad Barkan
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
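To make the sample-vs-population point above concrete, here is a quick check outside Spark (a minimal sketch using plain NumPy, not part of the original thread): with n = 2 rows, dividing by the sample standard deviation (ddof=1) reproduces the +/-0.7071 values from the documentation example, while the population standard deviation (ddof=0) gives the +/-1.0 values Gilad expected.

    import numpy as np

    # The two rows from the documentation example.
    X = np.array([[-2.0, 2.3, 0.0],
                  [ 3.8, 0.0, 1.9]])

    mean = X.mean(axis=0)
    sample_std = X.std(axis=0, ddof=1)       # what StandardScaler divides by
    population_std = X.std(axis=0, ddof=0)   # what Eq.1 in the report assumes

    print((X - mean) / sample_std)       # [[-0.7071  0.7071 -0.7071]
                                         #  [ 0.7071 -0.7071  0.7071]]
    print((X - mean) / population_std)   # [[-1.  1. -1.]
                                         #  [ 1. -1.  1.]]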