Hi Alexander,

The stack trace is a little misleading here: all of the time appears to be
spent in MemoryStore, but that's because MemoryStore is unrolling an
iterator (note the iterator.next() call) so that the partition can be stored
in memory.  Essentially all of the computation for the tasks happens as part
of that iterator.next() call, which is why you're seeing a combination of
deserializing input data with Snappy (the InputStream reading) and some
MLlib processing.
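
To illustrate the laziness involved, here's a plain-Scala sketch (not actual
MemoryStore code): the per-record work sits inside a lazy iterator, so
whoever unrolls that iterator shows up in the profiler as the one doing the
work.

```scala
// Plain-Scala sketch, not Spark internals: a lazy iterator defers all
// per-record work until next() is pulled, so the code that unrolls it
// appears to spend the time in its own stack frames.
object LazyIteratorDemo {
  def main(args: Array[String]): Unit = {
    val computed = Iterator(1, 2, 3, 4, 5).map { x =>
      x * x // the "expensive" per-record work actually runs here, inside next()
    }
    // Unroll the iterator into an in-memory buffer, analogous to what
    // MemoryStore does when caching a partition:
    val unrolled = scala.collection.mutable.ArrayBuffer.empty[Int]
    while (computed.hasNext) unrolled += computed.next()
    println(unrolled.mkString(",")) // 1,4,9,16,25
  }
}
```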

-Kay

On Thu, Mar 12, 2015 at 5:34 PM, Ulanov, Alexander <alexander.ula...@hp.com>
wrote:

> Hi,
>
> I am working on artificial neural networks for Spark. The model is trained
> with gradient descent, so at each step the data is read, the sum of
> gradients is calculated for each data partition (on each worker),
> aggregated (on the driver), and broadcast back. I noticed that the
> gradient computation takes a few times less than the total time needed for
> each step. To narrow down my observation, I ran the gradient on a single
> machine with a single partition of data of size 100MB that I persist
> (data.persist). This should at least minimize the overhead for
> aggregation, but the gradient computation still takes much less time than
> the whole step. Just in case: the data is loaded by
> MLUtils.loadLibSVMFile into an RDD[LabeledPoint]. This is my code:
>
>     val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
>     val train = MLUtils.loadLibSVMFile(new SparkContext(conf),
> "/data/mnist/mnist.scale").repartition(1).persist()
>     val model = ANN2Classifier.train(train, 1000, Array[Int](32), 10,
> 1e-4) //training data, batch size, hidden layer size, iterations, LBFGS
> tolerance
>
> The profiler shows that there are two threads: one is doing the gradient
> computation, and I don't know what the other is doing. The gradient takes
> 10% of the first thread's time; almost all of the remaining time is spent
> in MemoryStore. Below is a screenshot of the first thread:
>
> https://drive.google.com/file/d/0BzYMzvDiCep5bGp2S2F6eE9TRlk/view?usp=sharing
> Second thread:
>
> https://drive.google.com/file/d/0BzYMzvDiCep5OHA0WUtQbXd3WmM/view?usp=sharing
>
> Could the Spark developers please elaborate on what's going on in
> MemoryStore? It seems to do some string operations (parsing the libsvm
> file? Why on every step?) and a lot of InputStream reading. The overall
> time seems to depend on the size of the data batch (or the size of the
> vector) I am processing; however, it does not seem linear to me.
>
> Also, I would like to know how to speed up these operations.
>
> Best regards, Alexander
>
>
