Positive log-likelihoods for continuous distributions are not unusual: you are evaluating a probability density, not a probability, and a density can exceed 1. For example, a univariate Gaussian pdf at its mean is 1 / (sigma * sqrt(2 * pi)), which exceeds 1 once the standard deviation drops below 1 / sqrt(2 * pi) ~ 0.399, at which point the log pdf at the mean is positive.
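For instance (a quick sketch using scipy.stats.norm, since the thread's examples already use scipy; the printed values are approximate):

    from scipy.stats import norm

    # the peak of a univariate Gaussian pdf is 1 / (sigma * sqrt(2 * pi)),
    # so the pdf exceeds 1 (and the log pdf turns positive) once
    # sigma < 1 / sqrt(2 * pi) ~ 0.3989
    for sigma in [1.0, 0.5, 0.3989, 0.1]:
        print(sigma, norm.logpdf(0.0, loc=0.0, scale=sigma))

    # 1.0    -0.9189  (negative)
    # 0.5    -0.2258
    # 0.3989 ~0.0
    # 0.1     1.3836  (positive)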
On Tue, 29 May 2018 at 12:08 Simon Dirmeier <simon.dirme...@web.de> wrote:

> Hey,
>
> sorry for the late reply. I cannot share the data, but the problem can
> be reproduced easily, as below.
> I wanted to check with sklearn and observed a similar behaviour, i.e. a
> positive per-sample average log-likelihood (
> http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.score
> ).
>
> I don't think it is necessarily an issue with the implementation; maybe
> it is due to parameter identifiability or something like that?
> As far as I can tell, the variances seem to be ok.
>
> Thanks for looking into this.
>
> Best,
> Simon
>
>
> import scipy
> import sklearn.mixture
> import pyspark.ml.clustering
> from scipy.stats import multivariate_normal
> from pyspark.ml.linalg import Vectors
>
> # assumes a running pyspark shell, where `spark` is the SparkSession
> scipy.random.seed(23)
> X = multivariate_normal.rvs(mean=scipy.ones(10), size=100)
>
> # the first coordinate doubles as an integer "label" column
> dff = [(int(x[0]), Vectors.dense(x)) for x in X]
> df = spark.createDataFrame(dff, schema=["label", "features"])
>
> # note: Spark's summary.logLikelihood is the *total* log-likelihood,
> # while sklearn's score() is the per-sample *average*
> for i in [100, 90, 80, 70, 60, 50]:
>     km = pyspark.ml.clustering.GaussianMixture(k=10, seed=23).fit(df.limit(i))
>     sk_gmm = sklearn.mixture.GaussianMixture(10, random_state=23).fit(X[:i, :])
>     print(df.limit(i).count(), X[:i, :].shape[0],
>           km.summary.logLikelihood, sk_gmm.score(X[:i, :]))
>
> 100 100  368.37475644171036  -1.54949312502
>  90  90 1026.084529101155     1.16196607062
>  80  80 2245.427539835042     4.25769131857
>  70  70 1940.0122633489268   10.0949992881
>  60  60 2255.002313247103    14.0497823725
>  50  50 -140.82605873444814  21.2423016046