Hello, I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on Hadoop 0.20.2-cdh3u6, but it does not work on a sparse dataset, even though the number of training examples used in the evaluation is only 1,000.
It works fine for the dataset *news20.binary.1000*, which has 178,560 features. However, it does not work for *news20.random.1000*, where the number of features is much larger (1,354,731), even though we load the data as sparse vectors through MLUtils.loadLibSVMFile(). The execution makes no progress, and no error is reported in the spark-shell or in the stdout/stderr of the executors. We used 32 executors, each allocating 7GB of working memory (of which 2GB is for RDD storage).

Any suggestions? Your help is really appreciated.

==============
Executed code
==============

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

//val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/dataset/news20-binary/news20.binary.1000", multiclass=false)
val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/dataset/news20-binary/news20.random.1000", multiclass=false)

val numFeatures = training.take(1)(0).features.size
//numFeatures: Int = 178560 for news20.binary.1000
//numFeatures: Int = 1354731 for news20.random.1000

val model = LogisticRegressionWithSGD.train(training, numIterations=1)

==================================
The dataset used in the evaluation
==================================

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary

$ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.binary.1000
$ sort -R news20.binary > news20.random
$ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.random.1000

You can find the datasets at:
https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000

Thanks,
Makoto
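P.S. In case it helps narrow this down, below is a minimal sketch of the check I plan to run next. It loads the same file as above, caches the RDD (so that each SGD iteration reuses the parsed data instead of re-reading it from HDFS), and verifies that every example was actually loaded as a SparseVector rather than a dense one. The cache() call and the sparsity check are my own additions and were not part of the evaluation above.

import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.util.MLUtils

// Same input as in the executed code above; cache() is an addition
// so that repeated passes over the data do not re-read HDFS.
val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false).cache()

// Confirm that every feature vector is sparse, and report basic stats.
val allSparse = training.map(_.features.isInstanceOf[SparseVector]).reduce(_ && _)
println("all sparse: " + allSparse)
println("examples:   " + training.count())
println("features:   " + training.take(1)(0).features.size)
println("partitions: " + training.partitions.size)

If allSparse prints true and the counts look right, the input side should be fine, which would point at the training step itself (for instance, the dense weight vector of 1,354,731 doubles that the optimizer has to handle on the driver) rather than at data loading.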