Yes, GradientDescent == (batch-)SGD.

That was also my first idea of how to implement it. However, what happens
if the regularization is specific to the algorithm actually used? For
example, for L-BFGS with L1 regularization you have a different
`parameterUpdate` step (the Orthant-Wise Limited-memory Quasi-Newton
method, OWL-QN) than for SGD with L1 (the proximal operator). Thus, we
would have to make sure that the right `RegularizationFunction` is used
with the `CostFunction` depending on the optimization algorithm you
choose. Maybe there are also some optimization algorithms which cannot
work with certain regularization types.

Do you have some more insights into this?
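To make the contrast concrete, here is a rough sketch of the proximal-operator update for SGD with L1 (component-wise soft-thresholding). All names are hypothetical and `WeightVector` is simplified to `Array[Double]`; this is just an illustration, not code from the PR:

```scala
// Sketch: L1 parameter update for SGD via the proximal operator
// (component-wise soft-thresholding). Hypothetical names; WeightVector
// is simplified to Array[Double] for the example.
object L1ProximalUpdate {
  // Plain gradient step, then shrink each component towards zero by
  // learningRate * regularizationConstant; components that would cross
  // zero are clipped to exactly zero (this is what produces sparsity).
  def parameterUpdate(
      weights: Array[Double],
      gradient: Array[Double],
      learningRate: Double,
      regularizationConstant: Double): Array[Double] = {
    val shrinkage = learningRate * regularizationConstant
    weights.zip(gradient).map { case (w, g) =>
      val stepped = w - learningRate * g
      math.signum(stepped) * math.max(math.abs(stepped) - shrinkage, 0.0)
    }
  }
}
```

OWL-QN, in contrast, folds the L1 term into the quasi-Newton search direction and restricts each step to the orthant of the current iterate, so the two algorithms cannot share a single `parameterUpdate` implementation.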

On Thu, May 28, 2015 at 5:14 PM, Mikio Braun <mikiobr...@googlemail.com>
wrote:

> GradientDescent is just the (batch-)SGD optimizer, right? Actually
> I think the parameter update should be done by a
> RegularizationFunction.
>
> IMHO the structure should be like this:
>
> GradientDescent
>   - collects gradient and regularization updates from CostFunction
> LinearModelCostFunction
>   - inherits from CostFunction
>   - parameterized by LossFunction and RegularizationFunction
>
> Does that make sense?
>
>
>
> On Thu, May 28, 2015 at 5:00 PM, Till Rohrmann <till.rohrm...@gmail.com>
> wrote:
> > Hey Mikio,
> >
> > yes you’re right. The SGD only needs to know the gradient of the loss
> > function and some means to update the weights in accordance with the
> > regularization scheme. Additionally, we also need to be able to
> > compute the loss for the convergence criterion.
> >
> > That’s also how it is implemented in the aforementioned PR right now.
> > The question which arises now is whether the LossFunction should be
> > responsible for updating the weight vector or not. So what we could do
> > is the following: remove the regularization from the LossFunction and
> > let the GradientDescent solver do the parameterUpdate. This either
> > requires specialized implementations such as GradientDescentL1 and
> > GradientDescentL2, or a configuration parameter which defines a
> > regularization function which is then called for the parameterUpdate.
> >
> > trait RegularizationFunction {
> >   def parameterUpdate(weightVector: WeightVector, gradient: WeightVector,
> >     learningRate: Double, regularizationConstant: Double): WeightVector
> >
> >   def regularizationValue(weightVector: WeightVector): Double
> > }
> >
> > Both approaches are semantically equivalent. I have no strong
> > preference for either of them. What do you think is the better
> > approach?
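For example, an L2 instance of this trait could look like the following sketch (hypothetical code, with `WeightVector` simplified to `Array[Double]`; not from the PR):

```scala
// Hypothetical L2 instance of the RegularizationFunction trait sketched
// above; WeightVector is simplified to Array[Double] for the example.
trait RegularizationFunction {
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
      learningRate: Double, regularizationConstant: Double): Array[Double]

  def regularizationValue(weights: Array[Double]): Double
}

object L2Regularization extends RegularizationFunction {
  // Standard gradient step on loss + (lambda/2) * ||w||^2:
  //   w <- w - learningRate * (gradient + lambda * w)
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
      learningRate: Double, regularizationConstant: Double): Array[Double] =
    weights.zip(gradient).map { case (w, g) =>
      w - learningRate * (g + regularizationConstant * w)
    }

  // (lambda/2) * ||w||^2 without the constant, i.e. 0.5 * ||w||^2
  def regularizationValue(weights: Array[Double]): Double =
    0.5 * weights.map(w => w * w).sum
}
```

An L1 instance would instead implement `parameterUpdate` via the proximal operator, which is exactly why tying the update to the regularization function (rather than the solver) is attractive.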
> >
> >
> > On Thu, May 28, 2015 at 4:13 PM, Mikio Braun <mikiobr...@googlemail.com>
> > wrote:
> >>
> >> [Ok, so maybe this is exactly what is implemented, sorry if I'm just
> >> repeating you... ]
> >>
> >> So
> >>
> >>    C(w, xys) = c * regularization(w) + sum over xys of loss(w, xy)
> >>
> >> The gradient is
> >>
> >>    c * grad regularization(w) + sum over xys of grad loss(w, xy)
> >>
> >> For some regularization functions, regularization is better performed
> >> by some explicit operation on the weights, like subtracting some
> >> value if it is still too large or so (I forget how that works exactly).
> >>
> >> The losses consist of the model (for now, linear, I guess) plus the
> >> kind of loss (hinge, squared, and maybe logistic for LR).
> >>
> >> So from the viewpoint of SGD, we only need a way to compute the
> >> gradient for a set of examples, and maybe some way to update (e.g.
> >> shrink) the weights for regularization.
> >>
> >> So SGD takes some
> >>
> >> trait CostFunction {
> >>   def gradient(weightVector: Vector,
> >>     xys: DataSet[LabeledVector]): Vector
> >>
> >>   def parameterUpdate(weightVector: Vector): Vector
> >> }
> >>
> >> so that SGD can run.
> >>
> >> On the other hand, we could have a
> >>
> >> class LinearModelCostFunction {
> >>   val lossFunction: LossFunction
> >>   val regularizationFunction: RegularizationFunction
> >>
> >>   val regularizationConstant: Double
> >>
> >>   def gradient(weightVector: Vector,
> >>       xys: DataSet[LabeledVector]): Vector = {
> >>     regularizationConstant *
> >>       regularizationFunction.gradient(weightVector) +
> >>       xys.map(xy =>
> >>         lossFunction.gradient(predict(weightVector, xy.vector), xy.label) *
> >>           xy.vector
> >>       ).sum()
> >>   }
> >>
> >>   def parameterUpdate = regularizationFunction.parameterUpdate
> >> }
> >>
> >> That way the optimizer and the cost function would be entirely
> >> uncoupled, and you could use the SGD to train on practically anything
> >> which has a data-dependent gradient.
> >>
> >> What do you think?
> >>
> >> -M
> >>
> >> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobr...@googlemail.com
> >
> >> wrote:
> >> > Oh wait... continuing to type. I accidentally sent out the message
> >> > too early.
> >> >
> >> > On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <
> mikiobr...@googlemail.com>
> >> > wrote:
> >> >> Hi Till and Theodore,
> >> >>
> >> >> I think the code is cleaned up a lot now, introducing the
> >> >> mapWithBcVariable helped a lot.
> >> >>
> >> >> I also get that the goal was to make the cost function for learning
> >> >> linear models easily configurable. My main concern was that the
> >> >> solver itself was already bound too specifically to the kind of cost
> >> >> function we wanted to optimize.
> >> >>
> >> >> Maybe it helps to discuss this in terms of the math of the
> >> >> optimization methods instead of in the code. The mathematics tends
> >> >> to be much more concise and compact than the completely unrolled
> >> >> code derived from it.
> >> >>
> >> >> Just very briefly, and correct me if I'm wrong, this is about
> >> >> optimizing some cost function C(w, xys), where w is the weight
> >> >> vector and xys is a set of labeled vectors. For the update, we only
> >> >> need to consider the gradient of C(w, xys).
> >> >>
> >> >> Typically
> >> >>
> >> >> C(w, xys) = c * regularization(w) + sum over xys of loss(w, xy)
> >> >>
> >> >> so the gradient is
> >> >>
> >> >> c * grad regularization(w) + sum over xys of grad loss(w, xy)
> >> >>
> >> >> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <
> t...@data-artisans.com>
> >> >> wrote:
> >> >>> What tweaks would that be? I mean what is required to implement
> >> >>> L-BFGS?
> >> >>>
> >> >>> I guess that we won’t get rid of the case statements because we have
> >> >>> to
> >> >>> decide between two code paths: One with and the other without
> >> >>> convergence
> >> >>> criterion. But I think by pulling each branch into its own
> >> >>> function, it becomes clearer now.
> >> >>>
> >> >>> I agree that the update step is not really the responsibility of the
> >> >>> LossFunction. The only reason why the current prototype does it this
> >> >>> way is
> >> >>> that the regularization is part of the LossFunction. By moving the
> >> >>> regularization out of the LossFunction and making it part of the
> >> >>> Solver, we can avoid cluttering the LossFunction interface. However,
> >> >>> then we have to provide different implementations of the Solver, one
> >> >>> for each regularization type (L1, L2, etc.). Or we provide the
> >> >>> regularization as a parameter which is responsible for updating the
> >> >>> weight vector.
> >> >>>
> >> >>> What do you prefer?
> >> >>>
> >> >>> Cheers,
> >> >>> Till
> >> >>>
> >> >>>
> >> >>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <t...@kth.se>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hello Mikio and Till,
> >> >>>>
> >> >>>> I really like the idea of syntactic sugar to reduce the boilerplate
> >> >>>> as
> >> >>>> well.
> >> >>>>
> >> >>>> I'm looking at the PR now. For me the goal was to decouple as many
> >> >>>> things
> >> >>>> as possible, but in the end a lot of the
> >> >>>> logic ended up in SGD itself. The previous design allowed us to
> >> >>>> implement
> >> >>>> LBFGS using Breeze in a clean manner,
> >> >>>> but I think that this new design from Till can also work for that,
> >> >>>> with
> >> >>>> some tweaks. But we have to really look closely
> >> >>>> and see if it makes things simpler. The SGD implementation looks
> >> >>>> simpler I
> >> >>>> think, but there is still the need for the case
> >> >>>> statements, and I'm not sure yet on how we can get rid of them.
> >> >>>>
> >> >>>> The one thing that I'm not sure we want is the coupling of the
> >> >>>> update
> >> >>>> step to the loss function (and the regularization by extension).
> >> >>>> It's something that is also in the current implementation of the
> >> >>>> optimization framework, but I was hoping we can get rid of that.
> >> >>>> As Mikio said we want interfaces that are clean, and updating the
> >> >>>> weights
> >> >>>> should not be the "responsibility" of the loss or the
> regularization,
> >> >>>> rather it should be handled by a function that takes the required
> >> >>>> information to perform the update.
> >> >>>>
> >> >>>> I'm a fan of how Breeze does optimization, where they have a
> >> >>>> FirstOrderMinimizer interface that has takeStep and
> >> >>>> chooseDescentDirection
> >> >>>> functions,
> >> >>>> and the actual optimization steps are performed in an
> >> >>>> infiniteIterations function implemented in FirstOrderMinimizer
> >> >>>> that all the first-order minimization algorithms use; the thing
> >> >>>> that changes in the various algorithms is how the gradients are
> >> >>>> calculated, the weights updated, etc.
> >> >>>>
> >> >>>> I would definitely recommend taking a look at their code.
> >> >>>>
> >> >>>>
> >> >>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
> >> >>>>
> >> >>>> I opened a PR [1] with the above-mentioned changes. Would be great
> >> >>>> if you could take a look and give me some feedback on whether this
> >> >>>> is now easier to understand.
> >> >>>>
> >> >>>> Cheers,
> >> >>>> Till
> >> >>>>
> >> >>>> [1] https://github.com/apache/flink/pull/740
> >> >>>>
> >> >>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann
> >> >>>> <t...@data-artisans.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Hi Mikio,
> >> >>>>>
> >> >>>>> I agree that we should add some more syntactic sugar to make
> >> >>>>> programming
> >> >>>>> with Scala more succinct and more fun :-) Especially, for the use
> >> >>>>> case where
> >> >>>>> you have a simple map operation with a broadcast variable. I like
> >> >>>>> your
> >> >>>>> approach of wrapping the, admittedly, cumbersome RichMapFunction
> >> >>>>> creation
> >> >>>>> into the helper function. We could even use the pimp-my-library
> >> >>>>> pattern to make it “part” of the usual DataSet.
> >> >>>>>
> >> >>>>> Something like that works:
> >> >>>>>
> >> >>>>> val x: DataSet[X] = ...
> >> >>>>> val bcVar: DataSet[B] = ...
> >> >>>>>
> >> >>>>> x.mapWithBroadcast(bcVar) {
> >> >>>>>   (element: X, b: B) => element - b
> >> >>>>> }
> >> >>>>>
> >> >>>>> Great idea Mikio. I really like that :-) I guess that we all got
> >> >>>>> a little bit “betriebsblind” (blind through routine) and stopped
> >> >>>>> questioning ;-)
> >> >>>>>
> >> >>>>> Concerning the strong coupling of SGD with the different parts of
> >> >>>>> the
> >> >>>>> loss function, the idea was to make it easily configurable. I
> agree,
> >> >>>>> though,
> >> >>>>> that one could at least encapsulate the prediction function as
> part
> >> >>>>> of the
> >> >>>>> loss function and also add the regularization function to it. This
> >> >>>>> would
> >> >>>>> simplify the code of SGD. A possible interface for a loss function
> >> >>>>> could
> >> >>>>> look like
> >> >>>>>
> >> >>>>> trait LossFunction {
> >> >>>>>   def loss(labeledVector: LabeledVector,
> >> >>>>>     weightVector: WeightVector): Double
> >> >>>>>
> >> >>>>>   def gradient(labeledVector: LabeledVector,
> >> >>>>>     weightVector: WeightVector): WeightVector
> >> >>>>>
> >> >>>>>   def updateWeightVector(oldWeightVector: WeightVector,
> >> >>>>>     newWeightVector: WeightVector,
> >> >>>>>     regularizationValue: Double): WeightVector
> >> >>>>> }
> >> >>>>>
> >> >>>>> I can try to make a prototype of this interface to see how it
> >> >>>>> looks and feels.
> >> >>>>>
> >> >>>>> Cheers,
> >> >>>>>
> >> >>>>> Till
> >> >>>>>
> >> >>>>>
> >> >>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun
> >> >>>>> <mikiobr...@googlemail.com>
> >> >>>>> wrote:
> >> >>>>>>
> >> >>>>>> I've been playing around with a few ideas, but you could have a
> >> >>>>>> helper
> >> >>>>>> RichMapFunction like
> >> >>>>>>
> >> >>>>>> class MapWithBroadcast[B, In, Out](name: String)
> >> >>>>>>     (fct: (B) => (In) => Out)
> >> >>>>>>   extends RichMapFunction[In, Out] {
> >> >>>>>>
> >> >>>>>>   var f: (In) => Out = _
> >> >>>>>>
> >> >>>>>>   override def open(parameters: Configuration): Unit = {
> >> >>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
> >> >>>>>>   }
> >> >>>>>>
> >> >>>>>>   def map(in: In): Out = f(in)
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>> which takes a function which gets the broadcast value and then
> >> >>>>>> returns
> >> >>>>>> a function doing the actual map, and you would then use it like
> >> >>>>>> this:
> >> >>>>>>
> >> >>>>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
> >> >>>>>>     xs: DataSet[I], bc: DataSet[B])
> >> >>>>>>   (fct: (B) => (I) => O): DataSet[O] = {
> >> >>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct))
> >> >>>>>>     .withBroadcastSet(bc, "dummy")
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>> And then...
> >> >>>>>>
> >> >>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
> >> >>>>>> val sum = x.reduce(_ + _)
> >> >>>>>>
> >> >>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
> >> >>>>>>
> >> >>>>>> Has something like this been considered at any point? Or are you
> >> >>>>>> all
> >> >>>>>> at the point where you can type in the broadcasting stuff while
> you
> >> >>>>>> sleep ;) ?
> >> >>>>>>
> >> >>>>>> -M
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun
> >> >>>>>> <mikiobr...@googlemail.com>
> >> >>>>>> wrote:
> >> >>>>>> > Hi Theodore and Till,
> >> >>>>>> >
> >> >>>>>> > last Friday Till suggested I have a look at the gradient code.
> >> >>>>>> > Took me
> >> >>>>>> > a while, but I think I got the main structure.
> >> >>>>>> >
> >> >>>>>> > I think I understand why the code is written how it is, but
> >> >>>>>> > just speaking from a purely data-science perspective, I find
> >> >>>>>> > that a lot of the code is super-heavy on boilerplate. The
> >> >>>>>> > LossCalculation and similar RichMapFunctions in GradientDescent
> >> >>>>>> > easily take up 20 lines of code just to unpack a few broadcast
> >> >>>>>> > variables and call a single function. I understand why we need
> >> >>>>>> > to use broadcast variables here, and that this is a common
> >> >>>>>> > idiom in Flink, but I wonder whether one cannot reduce the
> >> >>>>>> > boilerplate code.
> >> >>>>>> >
> >> >>>>>> > In my experience, it is pretty important when implementing ML
> >> >>>>>> > algorithms that the mapping back from the code to the algorithm
> >> >>>>>> > is
> >> >>>>>> > still somewhat clear, because if you need to do modifications
> or
> >> >>>>>> > debugging, it's important that you still know what corresponds
> to
> >> >>>>>> > what, and with so many lines of code just moving data around,
> >> >>>>>> > that's
> >> >>>>>> > hardly the case.
> >> >>>>>> >
> >> >>>>>> > A general design decision I would like to question is the tight
> >> >>>>>> > integration of the specific loss functions, regularizers, etc.
> >> >>>>>> > into
> >> >>>>>> > the GradientDescent code. Given that it is really a pretty
> >> >>>>>> > general
> >> >>>>>> > approach which is applicable to many, many different models, I
> >> >>>>>> > think it's better to have the stochastic gradient code be as
> >> >>>>>> > agnostic as possible about exactly what it is optimizing.
> >> >>>>>> > Instead, you would pass
> >> >>>>>> > it a function which contains a method to compute a function
> >> >>>>>> > value, and
> >> >>>>>> > its derivative, and possibly some action to regularize the
> >> >>>>>> > parameters
> >> >>>>>> > (if this cannot be already done with the function itself).
> >> >>>>>> >
> >> >>>>>> > Then, such a function for learning linear models can be
> >> >>>>>> > configured in
> >> >>>>>> > the same way as it is right now. I personally would opt for
> >> >>>>>> > optional
> >> >>>>>> > parameters instead of the builder-method style, but that is ok
> >> >>>>>> > with
> >> >>>>>> > me.
> >> >>>>>> >
> >> >>>>>> > This would in particular mean that I would try to eliminate the
> >> >>>>>> > case
> >> >>>>>> > statements as much as possible.
> >> >>>>>> >
> >> >>>>>> > The result would then be a bunch of classes around the gradient
> >> >>>>>> > descent optimizer which very clearly just encode that
> algorithm,
> >> >>>>>> > and a
> >> >>>>>> > bunch of classes around cost functions for linear models which
> >> >>>>>> > encode
> >> >>>>>> > that, and both parts talk with one another only through a
> small,
> >> >>>>>> > clean
> >> >>>>>> > interface. Plus, this interface could be reused for other
> >> >>>>>> > optimization
> >> >>>>>> > algorithms like BFGS, or the Trust Region Newton method we
> talked
> >> >>>>>> > about last time.
> >> >>>>>> >
> >> >>>>>> > So my suggestions:
> >> >>>>>> >   - think of some way to reduce boilerplate code for broadcast
> >> >>>>>> > variables.
> >> >>>>>> >   - split gradient and the specific model for linear models to
> >> >>>>>> > make
> >> >>>>>> > the individual parts cleaner and more extensible
> >> >>>>>> >
> >> >>>>>> > Hope this helps,
> >> >>>>>> >
> >> >>>>>> > -M
> >> >>>>>> >
> >> >>>>>> >
> >> >>>>>> > --
> >> >>>>>> > Mikio Braun - http://blog.mikiobraun.de,
> >> >>>>>> > http://twitter.com/mikiobraun
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
> >> >>>>>
> >> >>>>> i...@data-artisans.com
> >> >>>>> phone +493055599146
> >> >>>>> mobile +4917671721261
> >> >>>>>
> >> >>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
> >> >>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>
>
