I haven't looked closely at the sampling issues, but regarding the aggregation latency: there are fixed overheads (in both local and distributed mode) with the way aggregation is done in Spark. Launching a stage of tasks, fetching outputs from the previous stage, etc. all have overhead, so I would say it's neither efficient nor recommended to run stages where the computation takes less than 500 ms or so. You could increase your batch size accordingly, and hopefully that will help.
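A back-of-envelope sketch of that sizing rule, using the rough numbers reported later in this thread (0.002 s of gradient compute for a 60-example mini-batch, a ~500 ms stage floor). This is plain arithmetic for illustration, not a Spark API; the function name and numbers are illustrative:

```python
import math

# Given the measured compute time for one mini-batch, find the smallest
# mini-batch whose compute time reaches the target stage duration, so that
# fixed per-stage overhead (task launch, shuffle fetch) no longer dominates.
def min_batch_size(batch_ms, batch_n, target_ms):
    """Smallest mini-batch size whose compute time is >= target_ms,
    assuming compute scales linearly with the number of examples."""
    return math.ceil(target_ms * batch_n / batch_ms)

# Thread's numbers: 60 examples take ~2 ms; target a ~500 ms stage.
print(min_batch_size(2.0, 60, 500.0))  # -> 15000 (~25% of the 60K MNIST set)
```

With these measurements, the mini-batch would have to grow by roughly 250x before per-stage compute dominates the fixed overhead, which is why tiny minibatchFraction values perform so poorly.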
Regarding reducing these overheads by an order of magnitude: that is a challenging problem given the architecture of Spark. I have some ideas for this, but they are very much at a research stage.

Thanks
Shivaram

On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley <jos...@databricks.com> wrote:

> When you say "It seems that instead of sample it is better to shuffle data
> and then access it sequentially by mini-batches," are you sure that holds
> true for a big dataset in a cluster? As far as implementing it, I haven't
> looked carefully at GapSamplingIterator (in RandomSampler.scala) myself,
> but that looks like it could be modified to be deterministic.
>
> Hopefully someone else can comment on aggregation in local mode. I'm not
> sure how much effort has gone into optimizing for local mode.
>
> Joseph
>
> On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
> > Hi Joseph,
> >
> > Thank you for the suggestion!
> >
> > It seems that instead of sample it is better to shuffle data and then
> > access it sequentially by mini-batches. Could you suggest how to
> > implement it?
> >
> > With regard to aggregate (reduce), I am wondering why it works so slowly
> > in local mode? Could you elaborate on this? I do understand that in
> > cluster mode the network speed will kick in and then one can blame it.
> >
> > Best regards, Alexander
> >
> > *From:* Joseph Bradley [mailto:jos...@databricks.com]
> > *Sent:* Thursday, April 02, 2015 10:51 AM
> > *To:* Ulanov, Alexander
> > *Cc:* dev@spark.apache.org
> > *Subject:* Re: Stochastic gradient descent performance
> >
> > It looks like SPARK-3250 was applied to the sample() which GradientDescent
> > uses, and that should kick in for your minibatchFraction <= 0.4. Based on
> > your numbers, aggregation seems like the main issue, though I hesitate to
> > optimize aggregation based on local tests for data sizes that small.
> > The first thing I'd check for is unnecessary object creation, and to
> > profile in a cluster or larger data setting.
> >
> > On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >
> > Sorry for bothering you again, but I think that this is an important issue
> > for the applicability of SGD in Spark MLlib. Could Spark developers please
> > comment on it?
> >
> > -----Original Message-----
> > From: Ulanov, Alexander
> > Sent: Monday, March 30, 2015 5:00 PM
> > To: dev@spark.apache.org
> > Subject: Stochastic gradient descent performance
> >
> > Hi,
> >
> > It seems to me that there is an overhead in the "runMiniBatchSGD" function
> > of MLlib's "GradientDescent". In particular, "sample" and "treeAggregate"
> > might take time that is an order of magnitude greater than the actual
> > gradient computation. For the MNIST dataset of 60K instances with minibatch
> > size = 0.001 (i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to
> > aggregate in local mode with 1 data partition on a Core i5 processor. The
> > actual gradient computation takes 0.002 s. I searched through the Spark Jira
> > and found that there was recently an update for more efficient sampling
> > (SPARK-3250) that is already included in the Spark codebase. Is there a way
> > to reduce the sampling time and local treeReduce by an order of magnitude?
> >
> > Best regards, Alexander
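For readers unfamiliar with the SPARK-3250 change discussed above, here is a minimal sketch of the gap-sampling idea in plain Python. It is not Spark's actual GapSamplingIterator (the function name and constants here are illustrative): instead of a Bernoulli coin flip per element, it draws the gap to the next accepted element from a geometric distribution, so the work is proportional to the number of elements kept rather than the number scanned.

```python
import math
import random

def gap_sample_indices(n, fraction, seed=None):
    """Return sorted indices of an approximate Bernoulli(fraction) sample
    of range(n), visiting only the accepted elements."""
    rng = random.Random(seed)  # fixing the seed makes the sample deterministic
    log_q = math.log(1.0 - fraction)
    out = []
    # Geometric "gap" before the first accepted element.
    i = int(math.log(max(rng.random(), 1e-12)) / log_q)
    while i < n:
        out.append(i)
        # Skip a geometric number of elements to the next accepted one.
        i += 1 + int(math.log(max(rng.random(), 1e-12)) / log_q)
    return out
```

Note that Joseph's suggestion above (making the sampler deterministic) falls out naturally here: seeding the generator with a fixed value yields the same sample on every pass, which is what mini-batch SGD needs for reproducibility.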