Yeah, a simple way to estimate the time for an iterative algorithm is (number of iterations required) * (time per iteration). The time per iteration depends on the batch size, the computation required, and the fixed overheads I mentioned before. The number of iterations, of course, depends on the convergence rate for the problem being solved.
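As a back-of-the-envelope sketch in Scala (all numbers below are illustrative assumptions, not measurements), the estimate looks something like this:

    object SgdTimeEstimate {
      def main(args: Array[String]): Unit = {
        // Assumed fixed per-iteration overhead (task launch, result fetch, etc.)
        val fixedOverheadSec = 0.5
        // Assumed gradient computation time for one mini-batch
        val computePerIterSec = 0.1
        // Assumed number of iterations to converge; depends on the problem
        val iterations = 10000
        val totalHours = iterations * (fixedOverheadSec + computePerIterSec) / 3600.0
        println(f"Estimated wall-clock time: $totalHours%.1f hours")
      }
    }

Plugging in your own measured per-iteration compute time and overhead gives a quick feasibility check before running the full job.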
Thanks
Shivaram

On Thu, Apr 2, 2015 at 2:19 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi Shivaram,
>
> It sounds really interesting! With this time we can estimate whether it is
> worth running an iterative algorithm on Spark. For example, for SGD on
> ImageNet (450K samples) we will spend 450K * 500 ms = 62.5 hours to traverse
> all of the data one example at a time, not counting the data loading,
> computation and update times. One may need to traverse all of the data a
> number of times to converge. Let's say this number is equal to the batch
> size. So we remain with a 62.5-hour overhead. Is that reasonable?
>
> Best regards, Alexander
>
> *From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
> *Sent:* Thursday, April 02, 2015 1:26 PM
> *To:* Joseph Bradley
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Stochastic gradient descent performance
>
> I haven't looked closely at the sampling issues, but regarding the
> aggregation latency, there are fixed overheads (in local and distributed
> mode) with the way aggregation is done in Spark. Launching a stage of
> tasks, fetching outputs from the previous stage, etc. all have overhead, so
> I would say it's not efficient / recommended to run stages where the
> computation is less than 500 ms or so. You could increase your batch size
> based on this, and hopefully that will help.
>
> Reducing these overheads by an order of magnitude is a challenging problem
> given Spark's architecture -- I have some ideas for this, but they are very
> much at a research stage.
>
> Thanks
> Shivaram
>
> On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley <jos...@databricks.com> wrote:
>
> When you say "It seems that instead of sample it is better to shuffle data
> and then access it sequentially by mini-batches," are you sure that holds
> true for a big dataset in a cluster? As far as implementing it, I haven't
> looked carefully at GapSamplingIterator (in RandomSampler.scala) myself,
> but it looks like it could be modified to be deterministic.
>
> Hopefully someone else can comment on aggregation in local mode. I'm not
> sure how much effort has gone into optimizing for local mode.
>
> Joseph
>
> On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
> > Hi Joseph,
> >
> > Thank you for the suggestion!
> >
> > It seems that instead of sample it is better to shuffle the data and
> > then access it sequentially by mini-batches. Could you suggest how to
> > implement this?
> >
> > With regard to aggregate (reduce), I am wondering why it works so slowly
> > in local mode? Could you elaborate on this? I do understand that in
> > cluster mode the network speed kicks in and can then be blamed.
> >
> > Best regards, Alexander
> >
> > *From:* Joseph Bradley [mailto:jos...@databricks.com]
> > *Sent:* Thursday, April 02, 2015 10:51 AM
> > *To:* Ulanov, Alexander
> > *Cc:* dev@spark.apache.org
> > *Subject:* Re: Stochastic gradient descent performance
> >
> > It looks like SPARK-3250 was applied to the sample() which GradientDescent
> > uses, and that should kick in for your minibatchFraction <= 0.4. Based on
> > your numbers, aggregation seems like the main issue, though I hesitate to
> > optimize aggregation based on local tests for data sizes that small.
> >
> > The first thing I'd check for is unnecessary object creation; I'd also
> > profile in a cluster or larger-data setting.
> >
> > On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >
> > Sorry for bothering you again, but I think this is an important issue for
> > the applicability of SGD in Spark MLlib. Could the Spark developers please
> > comment on it?
> >
> > -----Original Message-----
> > From: Ulanov, Alexander
> > Sent: Monday, March 30, 2015 5:00 PM
> > To: dev@spark.apache.org
> > Subject: Stochastic gradient descent performance
> >
> > Hi,
> >
> > It seems to me that there is an overhead in the "runMiniBatchSGD" function
> > of MLlib's "GradientDescent". In particular, "sample" and "treeAggregate"
> > might take time that is an order of magnitude greater than the actual
> > gradient computation. For the MNIST dataset of 60K instances with minibatch
> > fraction = 0.001 (i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to
> > aggregate in local mode with 1 data partition on a Core i5 processor. The
> > actual gradient computation takes 0.002 s. I searched through the Spark JIRA
> > and found that there was recently an update for more efficient sampling
> > (SPARK-3250) that is already included in the Spark codebase. Is there a way
> > to reduce the sampling time and the local treeAggregate by an order of
> > magnitude?
> >
> > Best regards, Alexander
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
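P.S. In case it helps with the timing question above, here is a minimal local-mode probe for the fixed per-stage aggregation overhead being discussed. It is only a sketch; the object name, data size, and partition count are arbitrary assumptions:

    import org.apache.spark.{SparkConf, SparkContext}

    object AggregationOverheadProbe {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("AggregationOverheadProbe").setMaster("local[4]")
        val sc = new SparkContext(conf)

        // Cache a small dataset so later timings exclude data loading.
        val data = sc.parallelize(1 to 1000000, 4).cache()
        data.count()

        // The work per record is negligible, so the measured time is dominated
        // by the fixed per-stage cost (task launch, result collection, ...).
        for (i <- 1 to 5) {
          val start = System.nanoTime()
          data.treeAggregate(0L)((acc, _) => acc + 1, _ + _)
          val elapsedMs = (System.nanoTime() - start) / 1e6
          println(f"run $i: $elapsedMs%.1f ms")
        }

        sc.stop()
      }
    }

Comparing the printed times against the actual per-minibatch gradient computation time gives a rough idea of how large the batch needs to be before the fixed overhead stops dominating.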