I agree that, given a small data set, it's probably better to solve the
linear regression problem directly. However, I'm not sure how well this
performs once the data gets really big (mainly in terms of the number of
data points). Maybe we can find something like a sweet spot for when to
switch between the two methods. And maybe a distributed conjugate gradient
method can also beat SGD if the data is too large to be processed on a
single machine.
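
Just to make the idea concrete, here is a rough sketch of the direct solve
and a size-based switch (assuming Breeze for the linear algebra; the
threshold is made up and not part of any existing API):

import breeze.linalg.{DenseMatrix, DenseVector}

object LinearRegressionSolvers {

  // Direct solve via the normal equations: w = (X^T X)^{-1} X^T y.
  // Forming X^T X costs O(n * d^2) and the solve O(d^3), which is cheap for
  // small data sets but scales linearly with the number of points n.
  def solveDirect(x: DenseMatrix[Double], y: DenseVector[Double]): DenseVector[Double] =
    (x.t * x) \ (x.t * y)

  // Purely illustrative threshold; the actual sweet spot for switching to an
  // iterative method (SGD, conjugate gradient, ...) would have to be found
  // empirically or exposed as a user parameter.
  val directSolveThreshold: Long = 100000L

  def chooseSolver(numPoints: Long): String =
    if (numPoints <= directSolveThreshold) "direct" else "iterative"
}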

Until we have AdaGrad or another more robust learning rate strategy, we
could also remove the default value for the simple SGD step size. This
would make users aware that they have to tune this parameter.
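
A minimal sketch of what I mean (hypothetical parameter handling, not the
current Flink ML API):

// Without a default step size, resolving the parameter fails loudly,
// forcing users to pick a value consciously instead of silently falling
// back to 0.1.
case class SimpleSGDParameters(stepSize: Option[Double] = None, iterations: Int = 10) {
  def resolvedStepSize: Double =
    stepSize.getOrElse(throw new IllegalArgumentException(
      "Simple SGD has no default step size; please set it explicitly."))
}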

On Thu, Jun 4, 2015 at 2:54 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Thu, Jun 4, 2015 at 1:26 PM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > Maybe also the default learning rate of 0.1 is set too high.
> >
>
> Could be.
>
> But grid search on learning rate is pretty standard practice. Running
> multiple learning engines at the same time with different learning rates is
> pretty plausible.
>
> Also, using something like adagrad will knock down high learning rates very
> quickly if you get a nearly divergent step. This can make initially high
> learning rates quite plausible.
>
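
For reference, the per-coordinate rate damping described above could be
sketched roughly like this (illustrative only, not Flink code):

// The accumulated squared gradients shrink the effective step size per
// coordinate, so a single near-divergent step with a large gradient
// immediately damps the learning rate.
class AdagradStep(dimension: Int, baseStepSize: Double, epsilon: Double = 1e-8) {
  private val squaredGradientSum = Array.fill(dimension)(0.0)

  def update(weights: Array[Double], gradient: Array[Double]): Array[Double] = {
    val updated = new Array[Double](dimension)
    var i = 0
    while (i < dimension) {
      squaredGradientSum(i) += gradient(i) * gradient(i)
      val effectiveRate = baseStepSize / math.sqrt(squaredGradientSum(i) + epsilon)
      updated(i) = weights(i) - effectiveRate * gradient(i)
      i += 1
    }
    updated
  }
}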
