Hi Alexander,

The stack trace is a little misleading here: all of the time appears to be spent in MemoryStore, but that's because MemoryStore is unrolling an iterator (note the iterator.next() call) so that the partition can be stored in memory. Essentially all of the computation for the tasks happens as part of that iterator.next() call, which is why you're seeing a combination of deserializing input data with Snappy (the InputStream reading) and some MLlib processing.
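To make the lazy-evaluation point concrete, here is a rough sketch (plain Scala, not Spark's actual MemoryStore internals) of why the unroll loop shows up as the hot spot: the upstream transformations are lazy, so the deserialization and gradient work only execute when the caching code pulls elements from the iterator. The function names and the toy "parse + compute" steps are illustrative stand-ins, not real Spark APIs:

    // Sketch only: upstream work (deserialization, parsing, math) is deferred
    // inside a lazy iterator, so it runs during the cache's unroll loop.
    import scala.collection.mutable.ArrayBuffer

    def computeIterator(raw: Iterator[String]): Iterator[Double] =
      raw.map { line =>
        val parsed = line.trim.toDouble // stands in for Snappy + libsvm parsing
        parsed * 2.0                    // stands in for the per-record math
      }

    def unrollToMemory[T](iter: Iterator[T]): ArrayBuffer[T] = {
      val values = new ArrayBuffer[T]
      // A profiler samples this loop, because iter.next() triggers
      // all of the deferred upstream computation.
      while (iter.hasNext) values += iter.next()
      values
    }

    val cached = unrollToMemory(computeIterator(Iterator("1.0", "2.0", "3.0")))

So the time attributed to MemoryStore is really the cost of the whole compute chain being materialized on first access.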
-Kay

On Thu, Mar 12, 2015 at 5:34 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi,
>
> I am working on artificial neural networks for Spark. It is trained with
> gradient descent, so each step the data is read, the sum of gradients is
> calculated for each data partition (on each worker), aggregated (on the
> driver) and broadcast back. I noticed that the gradient computation time
> is a few times less than the total time needed for each step. To narrow
> down my observation, I ran the gradient on a single machine with a single
> partition of data of size 100MB that I persist (data.persist). This should
> minimize the overhead for aggregation at least, but the gradient
> computation still takes much less time than the whole step. Just in case,
> data is loaded by MLUtils.loadLibSVMFile into RDD[LabeledPoint]; this is
> my code:
>
> val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
> val train = MLUtils.loadLibSVMFile(new SparkContext(conf),
>   "/data/mnist/mnist.scale").repartition(1).persist()
> val model = ANN2Classifier.train(train, 1000, Array[Int](32), 10, 1e-4)
> // training data, batch size, hidden layer size, iterations, LBFGS tolerance
>
> Profiler shows that there are two threads: one is doing Gradient and the
> other I don't know what. The Gradient takes 10% of this thread. Almost all
> other time is spent by MemoryStore. Below is the screenshot (first thread):
>
> https://drive.google.com/file/d/0BzYMzvDiCep5bGp2S2F6eE9TRlk/view?usp=sharing
>
> Second thread:
>
> https://drive.google.com/file/d/0BzYMzvDiCep5OHA0WUtQbXd3WmM/view?usp=sharing
>
> Could Spark developers please elaborate on what's going on in MemoryStore?
> It seems that it does some string operations (parsing the libsvm file? Why
> every step?) and a lot of InputStream reading. It seems that the overall
> time depends on the size of the data batch (or size of vector) I am
> processing. However, it does not seem linear to me.
>
> Also, I would like to know how to speed up these operations.
>
> Best regards,
> Alexander