Hi,

In my picture search project, I need a clustering algorithm to narrow down the dataset, in order to accelerate search over millions of pictures. At first we used Python + PyTorch + KMeans. As the data grew from thousands to millions of points, KMeans clustering became slower and slower (from seconds to minutes). We then found that MiniBatchKMeans could finish the clustering in 1 to 2 seconds on millions of data points.

Meanwhile, we were still limited by Python's insufficient concurrency, so we switched to Kotlin on the JVM. But there was no MiniBatchKMeans implementation on the JVM yet, so I wrote one in Kotlin, with reference to (Python) sklearn's MiniBatchKMeans and Apache Commons Math (Deeplearning4j was also considered, but it is too slow because of ND4J's design).
I'd like to contribute it to Apache Commons Math, and I wrote a Java version:
https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans

From my tests (Kotlin version), it is very fast. In most cases its results differ only slightly from those of KMeans++, but sometimes the difference is large (possibly an effect of the randomness of the mini-batches).

Some bad cases:

It is even worse when I use RandomSource.create(RandomSource.MT_64, 0) as the random generator ┐(´-`)┌.

My brief understanding of MiniBatchKMeans: it uses a subset of the points to initialize the cluster centers, and a random mini-batch in each training iteration. It can finish in a few seconds when clustering millions of data points, and its results differ little from those of KMeans.

More information about MiniBatchKMeans:
https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
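To make my understanding above concrete, here is a minimal Java sketch of the mini-batch update as I read it in the fastkmeans.pdf paper linked above. It is not the code in my branch. The class name MiniBatchKMeansSketch and its methods (fit, nearest, demoData) are made up for illustration, it uses java.util.Random instead of Commons RNG, and it simplifies initialization to picking k distinct random points rather than running KMeans++ on a subset.

```java
import java.util.Random;

// Illustrative sketch only; all names here are hypothetical, not Commons Math API.
final class MiniBatchKMeansSketch {

    // Index of the center closest to p (squared Euclidean distance).
    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = centers[c][d] - p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    static double[][] fit(double[][] points, int k, int batchSize, int iterations, long seed) {
        Random rng = new Random(seed);
        int n = points.length;
        int dim = points[0].length;

        // Simplified init (assumption): k distinct random points via partial shuffle.
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + rng.nextInt(n - c);
            int tmp = idx[c];
            idx[c] = idx[j];
            idx[j] = tmp;
            centers[c] = points[idx[c]].clone();
        }

        int[] counts = new int[k];          // points seen so far per center
        int[] batch = new int[batchSize];
        int[] assigned = new int[batchSize];
        for (int it = 0; it < iterations; it++) {
            // Sample a random mini-batch and cache each point's nearest center.
            for (int b = 0; b < batchSize; b++) {
                batch[b] = rng.nextInt(n);
                assigned[b] = nearest(centers, points[batch[b]]);
            }
            // Move each center toward its points with a per-center learning rate 1/count.
            for (int b = 0; b < batchSize; b++) {
                int c = assigned[b];
                counts[c]++;
                double eta = 1.0 / counts[c];
                double[] p = points[batch[b]];
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = (1 - eta) * centers[c][d] + eta * p[d];
                }
            }
        }
        return centers;
    }

    // Two well-separated 10x10 grids of points, near (0, 0) and near (10, 10).
    static double[][] demoData() {
        double[][] pts = new double[200][];
        for (int i = 0; i < 100; i++) {
            double x = (i % 10) * 0.1;
            double y = (i / 10) * 0.1;
            pts[i] = new double[] {x, y};
            pts[100 + i] = new double[] {x + 10, y + 10};
        }
        return pts;
    }

    public static void main(String[] args) {
        double[][] centers = fit(demoData(), 2, 20, 100, 42L);
        for (double[] c : centers) {
            System.out.println(c[0] + ", " + c[1]);
        }
    }
}
```

Because each center is a running mean of every point ever assigned to it, an unlucky early batch keeps influencing the result. That may be part of why the output depends so much on the random generator and the seed.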