I agree that for a small data set it's probably better to solve the linear regression problem directly. However, I'm not so sure how well this performs once the data gets really big (in terms of the number of data points). Maybe we can find a sweet spot at which to switch between the two methods. And maybe a distributed conjugate gradient method can also beat SGD if the data is too large to be processed on a single machine.
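Just to make the trade-off concrete, here is a rough sketch (plain Scala on toy data, not Flink code; all names and constants are made up for illustration) contrasting the closed-form solution with a simple SGD loop for 1-D linear regression. The direct solve is exact after one pass, while SGD does many cheap updates whose behaviour hinges on the learning rate we're discussing:

object DirectVsSgd {
  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val ys = xs.map(x => 2.0 * x + 1.0) // true model: w = 2, b = 1

    // Closed-form least squares: exact, one pass over the data.
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val wDirect =
      xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
        xs.map(x => (x - meanX) * (x - meanX)).sum
    val bDirect = meanY - wDirect * meanX

    // Plain SGD: many cheap updates, sensitive to the learning rate eta.
    var w = 0.0
    var b = 0.0
    val eta = 0.01
    for (_ <- 1 to 1000; i <- xs.indices) {
      val err = (w * xs(i) + b) - ys(i)
      w -= eta * err * xs(i)
      b -= eta * err
    }

    println(s"direct: w=$wDirect, b=$bDirect; sgd: w=$w, b=$b")
  }
}

With more features the direct route means solving a system whose size grows with the feature dimension rather than the number of points, so I'd expect the sweet spot to depend mostly on the dimension.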
Until we have adagrad or another more robust learning rate strategy, we could also remove the default value for the learning rate of simple SGD. This would make users aware that they have to tweak this parameter.

On Thu, Jun 4, 2015 at 2:54 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Thu, Jun 4, 2015 at 1:26 PM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > Maybe also the default learning rate of 0.1 is set too high.
>
> Could be.
>
> But grid search on learning rate is pretty standard practice. Running
> multiple learning engines at the same time with different learning rates
> is pretty plausible.
>
> Also, using something like adagrad will knock down high learning rates
> very quickly if you get a nearly divergent step. This can make initially
> high learning rates quite plausible.
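To illustrate the adagrad point above: the per-coordinate step size is divided by the square root of the accumulated squared gradients, so a single near-divergent gradient immediately shrinks the effective rate even if the initial learning rate was high. Rough sketch in plain Scala (not Flink's optimizer interface; the names are made up):

object AdagradSketch {
  // One adagrad-style update: accum holds the running sum of squared
  // gradients per coordinate, eta is the initial learning rate (e.g. 0.1).
  def adagradStep(w: Array[Double],
                  grad: Array[Double],
                  accum: Array[Double],
                  eta: Double,
                  eps: Double = 1e-8): Unit = {
    for (i <- w.indices) {
      accum(i) += grad(i) * grad(i)
      // Effective step size eta / sqrt(accum) decays as gradients pile up.
      w(i) -= eta / math.sqrt(accum(i) + eps) * grad(i)
    }
  }
}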