Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Theodore Vasiloudis
+1 This separation was the idea from the start; there is a trade-off between having highly configurable optimizers and ensuring that the right types of regularization can only be applied to optimization algorithms that support them. It comes down to viewing the optimization framework mostly as a b…

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Till Rohrmann
I think so too. OK, I'll try to update the PR accordingly. On Thu, May 28, 2015 at 5:36 PM, Mikio Braun wrote: …

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Mikio Braun
Ah yeah, I see. Yes, it's right that many algorithms perform quite differently depending on the kind of regularization. The same holds for cutting plane algorithms, which reduce to either linear or quadratic programs depending on L1 or L2 regularization. Generally speaking, I think this is also not surprising…

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Till Rohrmann
Yes, GradientDescent == (batch-)SGD. That was also my first idea of how to implement it. However, what happens if the regularization is specific to the algorithm actually used? For example, for L-BFGS with L1 regularization you have a different `parameterUpdate` step (Orthant-Wise Limited-memory Quasi-Newton, OWL-QN)…
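To illustrate why the parameter update cannot be the same for all regularizers, here is a minimal sketch contrasting a plain gradient step with L2 against a proximal (soft-thresholding) step for L1. The object and method names are purely illustrative, not FlinkML's actual API:

```scala
// Hypothetical sketch: with L2 the regularization gradient can simply be
// added to the loss gradient, but with L1 the update is a proximal
// (soft-thresholding) step, which is why `parameterUpdate` has to be owned
// by the regularization scheme rather than hard-coded in the solver.
object ParameterUpdateSketch {
  // Plain gradient step with L2: w := w - lr * (gradLoss + lambda * w)
  def updateL2(w: Array[Double], gradLoss: Array[Double],
               lr: Double, lambda: Double): Array[Double] =
    w.zip(gradLoss).map { case (wi, gi) => wi - lr * (gi + lambda * wi) }

  // L1 via soft-thresholding: take the loss-gradient step first, then
  // shrink every component towards zero by lr * lambda (the proximal
  // operator of the L1 norm). Small weights snap exactly to zero.
  def updateL1(w: Array[Double], gradLoss: Array[Double],
               lr: Double, lambda: Double): Array[Double] =
    w.zip(gradLoss).map { case (wi, gi) =>
      val shifted = wi - lr * gi
      math.signum(shifted) * math.max(0.0, math.abs(shifted) - lr * lambda)
    }
}
```

Note how `updateL1` is not expressible as "add a regularization gradient": the thresholding produces exact zeros, which a gradient step never does.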

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Mikio Braun
GradientDescent is just the (batch-)SGD optimizer, right? Actually, I think the parameter update should be done by a RegularizationFunction. IMHO the structure should be like this:

GradientDescent
- collects gradient and regularization updates from
CostFunction
LinearModelCostFunction
- i…
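The separation proposed above could look roughly as follows. This is a hypothetical sketch: the trait names (RegularizationFunction, CostFunction) come from the thread, but the signatures are invented for illustration and are not the actual FlinkML interfaces:

```scala
// The regularization owns the parameter update, so e.g. an L1 proximal
// step can differ from a plain L2 gradient step without the solver knowing.
trait RegularizationFunction {
  def regLoss(w: Array[Double]): Double
  def parameterUpdate(w: Array[Double], gradLoss: Array[Double],
                      lr: Double): Array[Double]
}

// The cost function knows the model (e.g. a linear model) and the data.
trait CostFunction {
  def loss(w: Array[Double]): Double
  def gradient(w: Array[Double]): Array[Double]
}

// The solver only sees the two abstractions and stays model-agnostic.
class GradientDescent(cost: CostFunction, reg: RegularizationFunction,
                      lr: Double, iterations: Int) {
  def optimize(w0: Array[Double]): Array[Double] =
    (1 to iterations).foldLeft(w0) { (w, _) =>
      reg.parameterUpdate(w, cost.gradient(w), lr)
    }
}
```

With this split, adding L-BFGS would mean a second solver class consuming the same `CostFunction`, while each `RegularizationFunction` decides which solvers it can legally be combined with.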

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Till Rohrmann
Hey Mikio, yes you’re right. The SGD only needs to know the gradient of the loss function and some means of updating the weights in accordance with the regularization scheme. Additionally, we also need to be able to compute the loss for the convergence criterion. That’s also how it is implemented in…
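Since the optimizer also needs the loss value for its convergence criterion, a relative-change check along the following lines would suffice. This is an illustrative sketch; the name and threshold parameter are not FlinkML's actual code:

```scala
object ConvergenceSketch {
  // Converged when the relative change of the loss between two consecutive
  // iterations drops below `tol` (an illustrative threshold name).
  def converged(previousLoss: Double, currentLoss: Double,
                tol: Double): Boolean =
    math.abs(previousLoss - currentLoss) / math.abs(previousLoss) < tol
}
```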

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Mikio Braun
[Ok, so maybe this is exactly what is implemented, sorry if I'm just repeating you…] So C(w, xys) = C * regularization(w) + sum over xys of losses(w, xy). The gradient is C * grad reg(w) + sum of grad losses(w, xy). For some regularization functions, regularization is better performed by some explicit op…
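The decomposition above can be sketched concretely. Assuming, for illustration only, a squared loss on scalar examples and L2 regularization with constant C (names and signatures are hypothetical, not FlinkML's):

```scala
// C(w, xys) = C * reg(w) + sum over (x, y) in xys of loss(w, x, y)
// grad C    = C * gradReg(w) + sum over (x, y) of gradLoss(w, x, y)
object CostSketch {
  // Squared loss of a scalar linear prediction w * x against target y.
  def loss(w: Double, x: Double, y: Double): Double = {
    val d = w * x - y; 0.5 * d * d
  }
  def gradLoss(w: Double, x: Double, y: Double): Double = (w * x - y) * x

  // Total cost: L2 regularization term plus the sum of per-example losses.
  def cost(w: Double, xys: Seq[(Double, Double)], c: Double): Double =
    c * 0.5 * w * w + xys.map { case (x, y) => loss(w, x, y) }.sum

  // Total gradient: regularization gradient plus summed loss gradients.
  def gradient(w: Double, xys: Seq[(Double, Double)], c: Double): Double =
    c * w + xys.map { case (x, y) => gradLoss(w, x, y) }.sum
}
```

For L1, the "explicit operation" mentioned above would replace the `c * w` term: instead of adding a regularization gradient, the solver applies a soft-thresholding step after the loss-gradient update.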

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Mikio Braun
Oh wait… continuing to type; I accidentally sent out the message too early. On Thu, May 28, 2015 at 4:03 PM, Mikio Braun wrote: …

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Mikio Braun
Hi Till and Theodore, I think the code is cleaned up a lot now; introducing mapWithBcVariable helped a lot. I also get that the goal was to make the cost function for learning a linear model well configurable. My main concern was that the solver itself was already too specifically bound to the ki…

Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Till Rohrmann
What tweaks would those be? I mean, what is required to implement L-BFGS? I guess that we won’t get rid of the case statements because we have to decide between two code paths: one with and the other without a convergence criterion. But I think that by pulling each branch into its own function, it becomes cl…
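Pulling each branch into its own function could look like the sketch below. All names are illustrative (the real FlinkML solver iterates over DataSets, not plain doubles); the point is only that the `match` stays small and each path reads on its own:

```scala
// Hypothetical sketch: one small match at the top, one private function
// per code path (with vs. without a convergence criterion).
class SolverSketch(maxIterations: Int, lr: Double) {

  def optimize(w0: Double, grad: Double => Double, loss: Double => Double,
               convergenceThreshold: Option[Double]): Double =
    convergenceThreshold match {
      case Some(tol) => optimizeWithConvergence(w0, grad, loss, tol)
      case None      => optimizeFixedIterations(w0, grad)
    }

  // Path 1: run exactly maxIterations gradient steps.
  private def optimizeFixedIterations(w0: Double,
                                      grad: Double => Double): Double =
    (1 to maxIterations).foldLeft(w0)((w, _) => w - lr * grad(w))

  // Path 2: additionally track the loss and stop early once its change
  // between iterations falls below the threshold.
  private def optimizeWithConvergence(w0: Double, grad: Double => Double,
                                      loss: Double => Double,
                                      tol: Double): Double = {
    var w = w0
    var previous = loss(w)
    var it = 0
    var done = false
    while (it < maxIterations && !done) {
      w -= lr * grad(w)
      val current = loss(w)
      if (math.abs(previous - current) < tol) done = true
      previous = current
      it += 1
    }
    w
  }
}
```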