GradientDescent is just the (batch) SGD optimizer, right? Actually, I think the parameter update should be done by a RegularizationFunction.
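For L1, the parameter update such a RegularizationFunction would own is the usual soft-thresholding ("shrink") step. Here is a minimal plain-Scala sketch of that idea; Array[Double] stands in for Flink ML's WeightVector, and all names are hypothetical rather than the PR's actual API:

```scala
// Hypothetical sketch: L1 regularization as a parameter-update step.
// Array[Double] stands in for Flink ML's WeightVector.
trait RegularizationFunction {
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double, regularizationConstant: Double): Array[Double]
  def regularizationValue(weights: Array[Double]): Double
}

object L1Regularization extends RegularizationFunction {
  // Plain gradient step followed by soft-thresholding:
  //   w_i <- sign(w_i) * max(0, |w_i| - learningRate * lambda)
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double, regularizationConstant: Double): Array[Double] = {
    val stepped = weights.zip(gradient).map { case (w, g) => w - learningRate * g }
    val shrink = learningRate * regularizationConstant
    stepped.map(w => math.signum(w) * math.max(0.0, math.abs(w) - shrink))
  }

  // L1 penalty value, used for the convergence criterion.
  def regularizationValue(weights: Array[Double]): Double =
    weights.map(math.abs).sum
}
```

The point is that GradientDescent never needs to know which regularization is in play; it just calls parameterUpdate after computing the gradient.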
IMHO the structure should be like this:

GradientDescent
  - collects gradient and regularization updates from -> CostFunction

LinearModelCostFunction
  - inherits from CostFunction
  - parameterized by LossFunction and RegularizationFunction

Does that make sense?

On Thu, May 28, 2015 at 5:00 PM, Till Rohrmann <[email protected]> wrote:
> Hey Mikio,
>
> Yes, you're right. The SGD only needs to know the gradient of the loss
> function and some means to update the weights in accordance with the
> regularization scheme. Additionally, we also need to be able to compute
> the loss for the convergence criterion.
>
> That's also how it is implemented in the aforementioned PR right now.
> The question which arises now is whether the LossFunction should be
> responsible for updating the weight vector or not. So what we could do
> is the following: remove the regularization from the LossFunction and
> let the GradientDescent solver do the parameterUpdate. This either
> requires specialized implementations such as GradientDescentL1 and
> GradientDescentL2, or a configuration parameter which defines a
> regularization function which is then called for the parameterUpdate.
>
> trait RegularizationFunction {
>   def parameterUpdate(weightVector: WeightVector, gradient: WeightVector,
>     learningRate: Double, regularizationConstant: Double): WeightVector
>
>   def regularizationValue(weightVector: WeightVector): Double
> }
>
> Both approaches are semantically equivalent. I have no strong preference
> for either of them. What do you think is the better approach?
>
>
> On Thu, May 28, 2015 at 4:13 PM, Mikio Braun <[email protected]>
> wrote:
>>
>> [Ok, so maybe this is exactly what is implemented, sorry if I'm just
>> repeating you...
>> ]
>>
>> So
>>
>> C(w, xys) = C_reg(w) + sum over xys of loss(w, xy)
>>
>> The gradient is
>>
>> grad C_reg(w) + sum over xys of grad loss(w, xy)
>>
>> For some regularization functions, regularization is better performed
>> by some explicit operation on the weights, like subtracting some value
>> if it's still too large or so (I forgot how exactly that works).
>>
>> The losses consist of the model (for now, linear I guess), plus the
>> kind of loss (hinge, squared, and maybe logistic for LR).
>>
>> So from the viewpoint of SGD, we only need a way to compute the
>> gradient for a set of examples, and maybe some way to update (e.g.
>> shrink) weights for regularization.
>>
>> So SGD takes some
>>
>> trait CostFunction {
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector
>>
>>   def parameterUpdate(weightVector: Vector): Vector
>> }
>>
>> so that SGD can run.
>>
>> On the other hand, we could have a
>>
>> class LinearModelCostFunction {
>>   val lossFunction: LossFunction
>>   val regularizationFunction: RegularizationFunction
>>
>>   val regularizationConstant
>>
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector = {
>>     regularizationConstant * regularizationFunction.gradient(weightVector) +
>>       xys.map(xy => lossFunction.gradient(predict(weightVector, xy.vector),
>>         xy.label) * xy.vector).sum()
>>   }
>>
>>   def parameterUpdate = regularizationFunction.parameterUpdate
>> }
>>
>> That way the optimizer and the cost function would be entirely
>> uncoupled, and you could use the SGD to train on practically anything
>> which has a data-dependent gradient.
>>
>> What do you think?
>>
>> -M
>>
>> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <[email protected]>
>> wrote:
>> > Oh wait... accidentally sent out the message too early. Continuing
>> > to type.
>> >
>> > On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <[email protected]>
>> > wrote:
>> >> Hi Till and Theodore,
>> >>
>> >> I think the code is cleaned up a lot now; introducing the
>> >> mapWithBcVariable helped a lot.
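To make the uncoupling sketched above concrete, here is a self-contained toy version of the split in plain Scala; Seq stands in for DataSet, Array[Double] for Vector, and all names are illustrative only, not the proposed Flink ML API:

```scala
// Toy version of the proposed split: the optimizer only sees CostFunction.
case class LabeledPoint(label: Double, features: Array[Double])

trait CostFunction {
  def gradient(w: Array[Double], xys: Seq[LabeledPoint]): Array[Double]
  def parameterUpdate(w: Array[Double]): Array[Double] // e.g. shrinkage; identity here
}

// Least-squares cost for a linear model: C(w) = 1/2 * sum (w.x - y)^2
class LinearLeastSquares extends CostFunction {
  def gradient(w: Array[Double], xys: Seq[LabeledPoint]): Array[Double] =
    xys.map { xy =>
      val err = w.zip(xy.features).map { case (a, b) => a * b }.sum - xy.label
      xy.features.map(_ * err)
    }.reduce((a, b) => a.zip(b).map { case (p, q) => p + q })

  def parameterUpdate(w: Array[Double]): Array[Double] = w
}

// The optimizer is agnostic about what it is optimizing.
def gradientDescent(cost: CostFunction, xys: Seq[LabeledPoint],
                    w0: Array[Double], learningRate: Double, steps: Int): Array[Double] =
  (1 to steps).foldLeft(w0) { (w, _) =>
    val g = cost.gradient(w, xys)
    cost.parameterUpdate(w.zip(g).map { case (wi, gi) => wi - learningRate * gi })
  }
```

gradientDescent compiles against CostFunction alone, so swapping in a different model or regularizer never touches the optimizer.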
>> >>
>> >> I also get that the goal was to make the cost function for learning
>> >> linear models well configurable. My main concern was that the
>> >> solver itself was already too specifically bound to the kind of
>> >> cost function we wanted to optimize.
>> >>
>> >> Maybe it helps to discuss this in terms of the math of the
>> >> optimization methods instead of in the code. The mathematics tends
>> >> to be much more concise and small compared to the completely
>> >> unrolled code derived from the math.
>> >>
>> >> Just very briefly, and correct me if I'm wrong, this is about
>> >> optimizing some cost function C(w, xys) where w is the weight
>> >> vector and xys is a set of labeled vectors. For the update, we only
>> >> need to consider the gradient of C(w, xys).
>> >>
>> >> Typically
>> >>
>> >> C(w, xys) = C_reg(w) + sum over xys of loss(w, xy)
>> >>
>> >> so the gradient is
>> >>
>> >> grad C_reg(w) + sum over xys of grad loss(w, xy)
>> >>
>> >> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <[email protected]>
>> >> wrote:
>> >>> What tweaks would that be? I mean, what is required to implement
>> >>> L-BFGS?
>> >>>
>> >>> I guess that we won't get rid of the case statements because we
>> >>> have to decide between two code paths: one with and the other
>> >>> without a convergence criterion. But I think by pulling each
>> >>> branch into its own function, it becomes clearer now.
>> >>>
>> >>> I agree that the update step is not really the responsibility of
>> >>> the LossFunction. The only reason why the current prototype does
>> >>> it this way is that the regularization is part of the
>> >>> LossFunction. By moving the regularization out of the LossFunction
>> >>> and making it part of the Solver, we can avoid cluttering the
>> >>> LossFunction interface. However, then we have to provide different
>> >>> implementations for the Solver, one for each regularization type
>> >>> (L1, L2, etc.).
>> >>> Or we provide the regularization as a parameter which is
>> >>> responsible for updating the weight vector.
>> >>>
>> >>> What do you prefer?
>> >>>
>> >>> Cheers,
>> >>> Till
>> >>>
>> >>>
>> >>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hello Mikio and Till,
>> >>>>
>> >>>> I really like the idea of syntactic sugar to reduce the
>> >>>> boilerplate as well.
>> >>>>
>> >>>> I'm looking at the PR now. For me the goal was to decouple as
>> >>>> many things as possible, but in the end a lot of the logic ended
>> >>>> up in SGD itself. The previous design allowed us to implement
>> >>>> L-BFGS using Breeze in a clean manner, but I think that this new
>> >>>> design from Till can also work for that, with some tweaks. But we
>> >>>> have to really look closely and see if it makes things simpler.
>> >>>> The SGD implementation looks simpler, I think, but there is still
>> >>>> the need for the case statements, and I'm not sure yet how we can
>> >>>> get rid of them.
>> >>>>
>> >>>> The one thing that I'm not sure we want is the coupling of the
>> >>>> update step to the loss function (and to the regularization by
>> >>>> extension). It's something that is also in the current
>> >>>> implementation of the optimization framework, but I was hoping we
>> >>>> could get rid of that. As Mikio said, we want interfaces that are
>> >>>> clean, and updating the weights should not be the
>> >>>> "responsibility" of the loss or the regularization; rather, it
>> >>>> should be handled by a function that takes the required
>> >>>> information to perform the update.
>> >>>>
>> >>>> I'm a fan of how Breeze does optimization: they have a
>> >>>> FirstOrderMinimizer interface with takeStep and
>> >>>> chooseDescentDirection functions, and the actual optimization
>> >>>> steps are performed in an infiniteIterations function implemented
>> >>>> in FirstOrderMinimizer that all the first-order minimization
>> >>>> algorithms use. The thing that changes in the various algorithms
>> >>>> is how the gradients are calculated, how the weights are updated,
>> >>>> etc.
>> >>>>
>> >>>> I would definitely recommend taking a look at their code.
>> >>>>
>> >>>>
>> >>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>> >>>>
>> >>>> I opened a PR [1] with the above-mentioned changes. Would be
>> >>>> great if you could take a look and give me some feedback on
>> >>>> whether this is now easier to understand.
>> >>>>
>> >>>> Cheers,
>> >>>> Till
>> >>>>
>> >>>> [1] https://github.com/apache/flink/pull/740
>> >>>>
>> >>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann
>> >>>> <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi Mikio,
>> >>>>>
>> >>>>> I agree that we should add some more syntactic sugar to make
>> >>>>> programming with Scala more succinct and more fun :-)
>> >>>>> Especially for the use case where you have a simple map
>> >>>>> operation with a broadcast variable. I like your approach of
>> >>>>> wrapping the, admittedly, cumbersome RichMapFunction creation in
>> >>>>> a helper function. We could even use the pimp-my-library pattern
>> >>>>> to make it "part" of the usual DataSet.
>> >>>>>
>> >>>>> Something like that works:
>> >>>>>
>> >>>>> val x: DataSet[X] = ...
>> >>>>> val bcVar: DataSet[B] = ...
>> >>>>>
>> >>>>> x.mapWithBroadcast(bcVar) {
>> >>>>>   (element: X, b: B) => element - b
>> >>>>> }
>> >>>>>
>> >>>>> Great idea Mikio.
>> >>>>> I really like that :-) I guess that we all got a little bit
>> >>>>> "betriebsblind" (blind from routine) and stopped questioning ;-)
>> >>>>>
>> >>>>> Concerning the strong coupling of SGD with the different parts
>> >>>>> of the loss function, the idea was to make it easily
>> >>>>> configurable. I agree, though, that one could at least
>> >>>>> encapsulate the prediction function as part of the loss function
>> >>>>> and also add the regularization function to it. This would
>> >>>>> simplify the code of SGD. A possible interface for a loss
>> >>>>> function could look like
>> >>>>>
>> >>>>> trait LossFunction {
>> >>>>>   def loss(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): Double
>> >>>>>
>> >>>>>   def gradient(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): WeightVector
>> >>>>>
>> >>>>>   def updateWeightVector(oldWeightVector: WeightVector,
>> >>>>>     newWeightVector: WeightVector,
>> >>>>>     regularizationValue: Double): WeightVector
>> >>>>> }
>> >>>>>
>> >>>>> I can try to make a prototype of this interface to see how it
>> >>>>> looks and feels.
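A concrete instance of such a trait, squared loss for a linear model, might look like the following plain-Scala sketch. LabeledVector and WeightVector are stand-in case classes here, not the actual Flink ML types, and the names are illustrative:

```scala
// Stand-ins for the Flink ML types discussed in the thread.
case class WeightVector(weights: Array[Double], intercept: Double)
case class LabeledVector(label: Double, vector: Array[Double])

trait LossFunction {
  def loss(lv: LabeledVector, wv: WeightVector): Double
  def gradient(lv: LabeledVector, wv: WeightVector): WeightVector
}

object SquaredLoss extends LossFunction {
  // Linear model prediction: w . x + b
  private def predict(lv: LabeledVector, wv: WeightVector): Double =
    wv.weights.zip(lv.vector).map { case (w, x) => w * x }.sum + wv.intercept

  // 1/2 * (prediction - label)^2
  def loss(lv: LabeledVector, wv: WeightVector): Double = {
    val residual = predict(lv, wv) - lv.label
    0.5 * residual * residual
  }

  // Gradient w.r.t. the weights is residual * x; w.r.t. the intercept, residual.
  def gradient(lv: LabeledVector, wv: WeightVector): WeightVector = {
    val residual = predict(lv, wv) - lv.label
    WeightVector(lv.vector.map(_ * residual), residual)
  }
}
```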
>> >>>>>
>> >>>>> Cheers,
>> >>>>>
>> >>>>> Till
>> >>>>>
>> >>>>>
>> >>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun
>> >>>>> <[email protected]> wrote:
>> >>>>>>
>> >>>>>> I've been playing around with a few ideas, but you could have
>> >>>>>> a helper RichMapFunction like
>> >>>>>>
>> >>>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>> >>>>>>     extends RichMapFunction[In, Out] {
>> >>>>>>
>> >>>>>>   var f: (In) => Out = _
>> >>>>>>
>> >>>>>>   override def open(parameters: Configuration): Unit = {
>> >>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>> >>>>>>   }
>> >>>>>>
>> >>>>>>   def map(in: In): Out = f(in)
>> >>>>>> }
>> >>>>>>
>> >>>>>> which takes a function which gets the broadcast value and then
>> >>>>>> returns a function doing the actual map, and you would then
>> >>>>>> use it like this:
>> >>>>>>
>> >>>>>> def mapWithBroadcast[B, I, O: ClassTag: TypeInformation](
>> >>>>>>     xs: DataSet[I], bc: DataSet[B])(fct: (B) => (I) => O): DataSet[O] = {
>> >>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct))
>> >>>>>>     .withBroadcastSet(bc, "dummy")
>> >>>>>> }
>> >>>>>>
>> >>>>>> And then...
>> >>>>>>
>> >>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>> >>>>>> val sum = x.reduce(_ + _)
>> >>>>>>
>> >>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>> >>>>>>
>> >>>>>> Has something like this been considered at any point? Or are
>> >>>>>> you all at the point where you can type in the broadcasting
>> >>>>>> stuff while you sleep ;) ?
>> >>>>>>
>> >>>>>> -M
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun
>> >>>>>> <[email protected]> wrote:
>> >>>>>> > Hi Theodore and Till,
>> >>>>>> >
>> >>>>>> > last Friday Till suggested I have a look at the gradient
>> >>>>>> > code.
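For what it's worth, the shape of this helper can be mimicked with plain collections, which makes the closure pattern visible without any Flink machinery. Seq stands in for DataSet and a plain value for the reduced broadcast set; this is purely illustrative:

```scala
// The helper's shape: first consume the broadcast value, then map.
def mapWithBroadcast[B, I, O](xs: Seq[I], bc: B)(fct: B => I => O): Seq[O] = {
  val f = fct(bc) // corresponds to open(): resolve the broadcast value once
  xs.map(f)       // corresponds to map(): apply the resulting function per element
}

val xs = Seq(1, 2, 3, 4, 5, 6, 7)
val sum = xs.sum // stands in for the reduced DataSet
val centered = mapWithBroadcast(xs, sum)(s => x => x - s)
```

The curried shape (B) => (I) => O is what lets the expensive part (resolving the broadcast value) happen once, while the returned function is applied per element.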
>> >>>>>> > Took me a while, but I think I got the main structure.
>> >>>>>> >
>> >>>>>> > I think I understand why the code is written the way it is,
>> >>>>>> > but just speaking from a purely data science perspective, I
>> >>>>>> > find that a lot of the code is super-heavy on boilerplate.
>> >>>>>> > The LossCalculation and similar RichMapFunctions in
>> >>>>>> > GradientDescent easily take up 20 lines of code just to
>> >>>>>> > unpack a few broadcast variables and call a single function.
>> >>>>>> > I understand why we need to use broadcast variables here,
>> >>>>>> > and that this is a common idiom in Flink, but I wonder
>> >>>>>> > whether one cannot reduce the boilerplate code.
>> >>>>>> >
>> >>>>>> > In my experience, it is pretty important when implementing
>> >>>>>> > ML algorithms that the mapping back from the code to the
>> >>>>>> > algorithm is still somewhat clear, because if you need to do
>> >>>>>> > modifications or debugging, it's important that you still
>> >>>>>> > know what corresponds to what, and with so many lines of
>> >>>>>> > code just moving data around, that's hardly the case.
>> >>>>>> >
>> >>>>>> > A general design decision I would like to question is the
>> >>>>>> > tight integration of the specific loss functions,
>> >>>>>> > regularizers, etc. into the GradientDescent code. Given that
>> >>>>>> > it is really a pretty general approach which is applicable
>> >>>>>> > to many, many different models, I think it's better to have
>> >>>>>> > the stochastic gradient code be as agnostic as possible
>> >>>>>> > about exactly what it is optimizing.
>> >>>>>> > Instead, you would pass it a function which contains a
>> >>>>>> > method to compute a function value, and its derivative, and
>> >>>>>> > possibly some action to regularize the parameters (if this
>> >>>>>> > cannot already be done with the function itself).
>> >>>>>> >
>> >>>>>> > Then, such a function for learning linear models can be
>> >>>>>> > configured in the same way as it is right now. I personally
>> >>>>>> > would opt for optional parameters instead of the
>> >>>>>> > builder-method style, but that is ok with me.
>> >>>>>> >
>> >>>>>> > This would in particular mean that I would try to eliminate
>> >>>>>> > the case statements as much as possible.
>> >>>>>> >
>> >>>>>> > The result would then be a bunch of classes around the
>> >>>>>> > gradient descent optimizer which very clearly just encode
>> >>>>>> > that algorithm, and a bunch of classes around cost functions
>> >>>>>> > for linear models which encode that, and both parts talk
>> >>>>>> > with one another only through a small, clean interface.
>> >>>>>> > Plus, this interface could be reused for other optimization
>> >>>>>> > algorithms like BFGS, or the Trust Region Newton method we
>> >>>>>> > talked about last time.
>> >>>>>> >
>> >>>>>> > So my suggestions:
>> >>>>>> >   - think of some way to reduce boilerplate code for
>> >>>>>> >     broadcast variables
>> >>>>>> >   - split the gradient descent code and the specific model
>> >>>>>> >     for linear models to make the individual parts cleaner
>> >>>>>> >     and more extensible
>> >>>>>> >
>> >>>>>> > Hope this helps,
>> >>>>>> >
>> >>>>>> > -M
>> >>>>>> >
>> >>>>>> > --
>> >>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>> >>>>>
>> >>>>> --
>> >>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>> >>>>>
>> >>>>> [email protected]
>> >>>>> phone +493055599146
>> >>>>> mobile +4917671721261
>> >>>>>
>> >>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>> >>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
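As a footnote to Theodore's Breeze remark earlier in the thread: the template-method shape he describes, a base class owning the iteration loop with subclasses supplying direction and step, can be sketched as follows. These are illustrative names and signatures only, not Breeze's actual API:

```scala
// Template-method sketch: the base class owns the shared iteration loop,
// subclasses supply direction and step. Names are illustrative only.
abstract class FirstOrderMinimizer(maxIter: Int) {
  type Weights = Array[Double]

  def chooseDescentDirection(w: Weights, grad: Weights): Weights
  def takeStep(w: Weights, direction: Weights, iter: Int): Weights

  // The shared loop all first-order methods reuse.
  def minimize(gradient: Weights => Weights, w0: Weights): Weights =
    (1 to maxIter).foldLeft(w0) { (w, iter) =>
      takeStep(w, chooseDescentDirection(w, gradient(w)), iter)
    }
}

// Plain gradient descent is one instantiation of the template.
class SimpleGradientDescent(learningRate: Double, maxIter: Int)
    extends FirstOrderMinimizer(maxIter) {
  def chooseDescentDirection(w: Weights, grad: Weights): Weights =
    grad.map(g => -g)
  def takeStep(w: Weights, direction: Weights, iter: Int): Weights =
    w.zip(direction).map { case (wi, di) => wi + learningRate * di }
}
```

L-BFGS, SGD, and similar first-order methods would then differ only in those two overrides.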
