[Ok, so maybe this is exactly what is implemented, sorry if I'm just repeating you... ]
So

    C(w, xys) = C_reg(w) + sum over xys of losses

and the gradient is

    grad C_reg(w) + sum over xys of grad loss(w, xy)

For some regularization functions, regularization is better performed by an
explicit operation on the weights, like subtracting some value from a weight
if it is still too large (I forgot how exactly that works). The losses
consist of the model (for now, linear I guess) plus the kind of loss (hinge,
squared, and maybe logistic for LR).

So from the viewpoint of SGD, we only need a way to compute the gradient for
a set of examples, and maybe some way to update (e.g. shrink) the weights for
regularization. So SGD takes some

    trait CostFunction {
      def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector
      def parameterUpdate(weightVector: Vector): Vector
    }

so that SGD can run. On the other hand, we could have a

    class LinearModelCostFunction {
      val lossFunction: LossFunction
      val regularizationFunction: RegularizationFunction
      val regularizationConstant: Double

      def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector = {
        regularizationConstant * regularizationFunction.gradient(weightVector) +
          xys.map(xy => lossFunction.gradient(predict(weightVector, xy.vector), xy.label) * xy.vector).sum()
      }

      def parameterUpdate = regularizationFunction.parameterUpdate
    }

That way the optimizer and the cost function would be entirely decoupled, and
you could use SGD to train on practically anything which has a data-dependent
gradient.

What do you think?

-M

On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobr...@googlemail.com> wrote:
> Oh wait... continuing to type. Accidentally sent the message out too early.
>
> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobr...@googlemail.com>
> wrote:
>> Hi Till and Theodore,
>>
>> I think the code is cleaned up a lot now; introducing the
>> mapWithBcVariable helped a lot.
>>
>> I also get that the goal was to make the cost function for learning
>> linear models easily configurable.
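To make the decoupling concrete, here is a self-contained sketch of the two pieces talking only through the small interface. Everything here is illustrative: plain `Seq` and `Array[Double]` stand in for `DataSet[LabeledVector]` and `Vector`, and the names (`CostFunction`, `LinearSquaredLoss`, `descend`) are made up for this sketch, not the real API.

```scala
// Self-contained sketch; Seq and Array[Double] stand in for Flink's
// DataSet[LabeledVector] and Vector, and all names are illustrative.

case class LabeledVector(label: Double, features: Array[Double])

// The small interface the optimizer sees: a data-dependent gradient plus an
// optional explicit weight update (e.g. shrinkage for regularization).
trait CostFunction {
  def gradient(weights: Array[Double], xys: Seq[LabeledVector]): Array[Double]
  def parameterUpdate(weights: Array[Double]): Array[Double] = weights
}

// Squared loss for a linear model with L2 regularization, expressed against
// the interface above; the optimizer never sees these details.
class LinearSquaredLoss(regularizationConstant: Double) extends CostFunction {
  private def predict(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  def gradient(w: Array[Double], xys: Seq[LabeledVector]): Array[Double] = {
    // sum over examples of loss'(prediction, label) * x
    val dataGrad = xys
      .map { xy =>
        val residual = predict(w, xy.features) - xy.label
        xy.features.map(_ * residual)
      }
      .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
    // plus the gradient of the regularization term
    w.zip(dataGrad).map { case (wi, gi) => regularizationConstant * wi + gi }
  }
}

// Gradient descent only consumes CostFunction and knows nothing about
// linear models, losses, or regularizers.
def descend(cost: CostFunction, xys: Seq[LabeledVector],
            w0: Array[Double], stepSize: Double, iterations: Int): Array[Double] = {
  var w = w0
  for (_ <- 1 to iterations) {
    val g = cost.gradient(w, xys)
    w = cost.parameterUpdate(w.zip(g).map { case (wi, gi) => wi - stepSize * gi })
  }
  w
}
```

Fitting y = 2x with `descend(new LinearSquaredLoss(0.0), data, Array(0.0), 0.05, 200)` recovers a weight close to 2.0, without the optimizer ever touching the linear-model code.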
>> My main concern was that the solver
>> itself was already bound too specifically to the kind of cost function
>> we wanted to optimize.
>>
>> Maybe it helps to discuss this in terms of the math of the
>> optimization methods instead of the code. The mathematics tends to
>> be much more concise than the completely unrolled code derived from
>> it.
>>
>> Just very briefly, and correct me if I'm wrong: this is about
>> optimizing some cost function C(w, xys), where w is the weight vector
>> and xys is a set of labeled vectors. For the update, we only need to
>> consider the gradient of C(w, xys).
>>
>> Typically
>>
>>     C(w, xys) = C_reg(w) + sum over xys of losses
>>
>> so the gradient is
>>
>>     grad C_reg(w) + sum over xys of grad loss(w, xy)
>>
>> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <t...@data-artisans.com>
>> wrote:
>>> What tweaks would that be? I mean, what is required to implement L-BFGS?
>>>
>>> I guess that we won't get rid of the case statements, because we have to
>>> decide between two code paths: one with and the other without a convergence
>>> criterion. But I think by pulling each branch into its own function, it
>>> becomes clearer now.
>>>
>>> I agree that the update step is not really the responsibility of the
>>> LossFunction. The only reason why the current prototype does it this way is
>>> that the regularization is part of the LossFunction. By moving the
>>> regularization out of the LossFunction and making it part of the Solver, we
>>> can avoid cluttering the LossFunction interface. However, then we have to
>>> provide different implementations of the Solver, one for each
>>> regularization type (L1, L2, etc.). Or we provide the regularization as a
>>> parameter which is responsible for updating the weight vector.
>>>
>>> What do you prefer?
>>>
>>> Cheers,
>>> Till
>>>
>>>
>>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <t...@kth.se> wrote:
>>>>
>>>> Hello Mikio and Till,
>>>>
>>>> I really like the idea of syntactic sugar to reduce the boilerplate as
>>>> well.
>>>>
>>>> I'm looking at the PR now. For me the goal was to decouple as many things
>>>> as possible, but in the end a lot of the logic ended up in SGD itself.
>>>> The previous design allowed us to implement L-BFGS using Breeze in a
>>>> clean manner, but I think that this new design from Till can also work
>>>> for that, with some tweaks. But we have to look really closely and see
>>>> if it makes things simpler. The SGD implementation looks simpler, I
>>>> think, but there is still the need for the case statements, and I'm not
>>>> sure yet how we can get rid of them.
>>>>
>>>> The one thing that I'm not sure we want is the coupling of the update
>>>> step to the loss function (and the regularization by extension).
>>>> It's something that is also in the current implementation of the
>>>> optimization framework, but I was hoping we could get rid of that.
>>>> As Mikio said, we want interfaces that are clean, and updating the
>>>> weights should not be the "responsibility" of the loss or the
>>>> regularization; rather, it should be handled by a function that takes
>>>> the required information to perform the update.
>>>>
>>>> I'm a fan of how Breeze does optimization: they have a
>>>> FirstOrderMinimizer interface with takeStep and chooseDescentDirection
>>>> functions, and the actual optimization steps are performed in an
>>>> infiniteIterations function implemented in FirstOrderMinimizer that all
>>>> the first-order minimization algorithms use. The thing that changes in
>>>> the various algorithms is how the gradients are calculated, the weights
>>>> updated, etc.
>>>>
>>>> I would definitely recommend taking a look at their code.
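The Breeze-style split described above (shared iteration loop in the base trait, concrete algorithms only supplying the direction and the step) can be sketched in a few lines. The names `FirstOrderMinimizer`, `takeStep`, and `chooseDescentDirection` follow Breeze's design, but the simplified signatures and the plain `Array[Double]` state are assumptions of this sketch, not Breeze's actual API:

```scala
// Hedged sketch of a Breeze-style first-order minimizer skeleton.
// The shared loop lives in the base trait; a concrete algorithm only
// supplies the descent direction and the step.

trait FirstOrderMinimizer {
  // gradient of the cost at w (data dependence elided for brevity)
  def gradient(w: Array[Double]): Array[Double]

  // what varies between algorithms:
  def chooseDescentDirection(grad: Array[Double]): Array[Double]
  def takeStep(w: Array[Double], direction: Array[Double], iter: Int): Array[Double]

  // what is shared: the iteration loop itself
  def minimize(w0: Array[Double], iterations: Int): Array[Double] = {
    var w = w0
    for (iter <- 1 to iterations) {
      val dir = chooseDescentDirection(gradient(w))
      w = takeStep(w, dir, iter)
    }
    w
  }
}

// Plain gradient descent on f(w) = (w - 3)^2 as a minimal instance.
object Gd extends FirstOrderMinimizer {
  def gradient(w: Array[Double]): Array[Double] = w.map(wi => 2.0 * (wi - 3.0))
  def chooseDescentDirection(g: Array[Double]): Array[Double] = g.map(-_)
  def takeStep(w: Array[Double], d: Array[Double], iter: Int): Array[Double] =
    w.zip(d).map { case (wi, di) => wi + 0.1 * di }
}
```

An L-BFGS or trust-region variant would override only `chooseDescentDirection` and `takeStep`, which is exactly the decoupling being discussed for SGD.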
>>>>
>>>>
>>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>>>>
>>>> I opened a PR [1] with the above-mentioned changes. Would be great if you
>>>> could take a look and give me some feedback on whether this is now easier
>>>> to understand.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> [1] https://github.com/apache/flink/pull/740
>>>>
>>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann <t...@data-artisans.com>
>>>> wrote:
>>>>>
>>>>> Hi Mikio,
>>>>>
>>>>> I agree that we should add some more syntactic sugar to make programming
>>>>> with Scala more succinct and more fun :-) Especially for the use case
>>>>> where you have a simple map operation with a broadcast variable. I like
>>>>> your approach of wrapping the, admittedly, cumbersome RichMapFunction
>>>>> creation in a helper function. We could even use the pimp-my-library
>>>>> pattern to make it "part" of the usual DataSet.
>>>>>
>>>>> Something like that works:
>>>>>
>>>>>     val x: DataSet[X] = ...
>>>>>     val bcVar: DataSet[B] = ...
>>>>>
>>>>>     x.mapWithBroadcast(bcVar) {
>>>>>       (element: X, b: B) => element - b
>>>>>     }
>>>>>
>>>>> Great idea Mikio, I really like that :-) I guess that we all got a little
>>>>> bit "betriebsblind" (blind to our own routine) and stopped questioning ;-)
>>>>>
>>>>> Concerning the strong coupling of SGD with the different parts of the
>>>>> loss function, the idea was to make it easily configurable. I agree,
>>>>> though, that one could at least encapsulate the prediction function as
>>>>> part of the loss function and also add the regularization function to it.
>>>>> This would simplify the code of SGD.
A possible interface for a loss function could look like
>>>>>
>>>>>     trait LossFunction {
>>>>>       def loss(labeledVector: LabeledVector, weightVector: WeightVector): Double
>>>>>
>>>>>       def gradient(labeledVector: LabeledVector, weightVector: WeightVector): WeightVector
>>>>>
>>>>>       def updateWeightVector(oldWeightVector: WeightVector, newWeightVector: WeightVector,
>>>>>           regularizationValue: Double): WeightVector
>>>>>     }
>>>>>
>>>>> I can try to make a prototype of this interface to see how it looks and
>>>>> feels.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Till
>>>>>
>>>>>
>>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>>> wrote:
>>>>>>
>>>>>> I've been playing around with a few ideas, but you could have a helper
>>>>>> RichMapFunction like
>>>>>>
>>>>>>     class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>>>>>>       extends RichMapFunction[In, Out] {
>>>>>>
>>>>>>       var f: (In) => Out = _
>>>>>>
>>>>>>       override def open(parameters: Configuration): Unit = {
>>>>>>         f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>>>>>>       }
>>>>>>
>>>>>>       def map(in: In): Out = f(in)
>>>>>>     }
>>>>>>
>>>>>> which takes a function which gets the broadcast value and then returns
>>>>>> a function doing the actual map, and you would then use it like this:
>>>>>>
>>>>>>     def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>>>>>>         xs: DataSet[I], bc: DataSet[B])
>>>>>>         (fct: (B) => (I) => O): DataSet[O] = {
>>>>>>       xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct)).withBroadcastSet(bc, "dummy")
>>>>>>     }
>>>>>>
>>>>>> And then...
>>>>>>
>>>>>>     val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>>>>>>     val sum = x.reduce(_ + _)
>>>>>>
>>>>>>     val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>>>>>>
>>>>>> Has something like this been considered at any point? Or are you all
>>>>>> at the point where you can type in the broadcasting stuff while you
>>>>>> sleep ;) ?
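The helper above, combined with the pimp-my-library idea mentioned earlier in the thread, can be sketched Flink-free as an implicit class. Here `Seq` simulates `DataSet` and the single element of the second collection plays the broadcast value; all of that is a simulation for illustration, not the real broadcast mechanics:

```scala
// Flink-free sketch of the x.mapWithBroadcast(bcVar) { ... } syntax via an
// implicit class. Seq simulates DataSet, and the single element of `bc`
// plays the broadcast value; a real implementation would wrap the
// RichMapFunction helper shown in the thread instead.

implicit class MapWithBroadcastOps[I](xs: Seq[I]) {
  def mapWithBroadcast[B, O](bc: Seq[B])(fct: (I, B) => O): Seq[O] = {
    // in Flink this would come from getRuntimeContext.getBroadcastVariable
    val bcValue = bc.head
    xs.map(x => fct(x, bcValue))
  }
}

val x = Seq(1, 2, 3, 4, 5, 6, 7)
val sum = Seq(x.sum) // one-element "DataSet", standing in for x.reduce(_ + _)

val foo = x.mapWithBroadcast(sum)((element, s) => element - s)
```

The point of the design is that the broadcast plumbing lives once in the implicit class, so call sites read like a plain map.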
>>>>>>
>>>>>> -M
>>>>>>
>>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>>>> wrote:
>>>>>> > Hi Theodore and Till,
>>>>>> >
>>>>>> > last Friday Till suggested I have a look at the gradient code. Took me
>>>>>> > a while, but I think I got the main structure.
>>>>>> >
>>>>>> > I think I understand why the code is written the way it is, but just
>>>>>> > speaking from a purely data-science perspective, I find that a lot of
>>>>>> > the code is very heavy on boilerplate. The LossCalculation and similar
>>>>>> > RichMapFunctions in GradientDescent easily take up 20 lines of code
>>>>>> > just to unpack a few broadcast variables and call a single function. I
>>>>>> > understand why we need to use broadcast variables here, and that this
>>>>>> > is a common idiom in Flink, but I wonder whether one cannot reduce the
>>>>>> > boilerplate code.
>>>>>> >
>>>>>> > In my experience, it is pretty important when implementing ML
>>>>>> > algorithms that the mapping back from the code to the algorithm is
>>>>>> > still somewhat clear, because if you need to do modifications or
>>>>>> > debugging, it's important that you still know what corresponds to
>>>>>> > what, and with so many lines of code just moving data around, that's
>>>>>> > hardly the case.
>>>>>> >
>>>>>> > A general design decision I would like to question is the tight
>>>>>> > integration of the specific loss functions, regularizers, etc. into
>>>>>> > the GradientDescent code. Given that it is really a pretty general
>>>>>> > approach which is applicable to many different models, I think
>>>>>> > it's better to have the stochastic gradient code be as agnostic as
>>>>>> > possible about exactly what it is optimizing.
Instead, you would pass
>>>>>> > it a function which contains a method to compute the function value
>>>>>> > and its derivative, and possibly some action to regularize the
>>>>>> > parameters (if this cannot already be done with the function itself).
>>>>>> >
>>>>>> > Then, such a function for learning linear models can be configured in
>>>>>> > the same way as it is right now. I personally would opt for optional
>>>>>> > parameters instead of the builder-method style, but either is ok with
>>>>>> > me.
>>>>>> >
>>>>>> > This would in particular mean that I would try to eliminate the case
>>>>>> > statements as much as possible.
>>>>>> >
>>>>>> > The result would then be a bunch of classes around the gradient
>>>>>> > descent optimizer which very clearly just encode that algorithm, and a
>>>>>> > bunch of classes around cost functions for linear models which encode
>>>>>> > those, and both parts talk with one another only through a small,
>>>>>> > clean interface. Plus, this interface could be reused for other
>>>>>> > optimization algorithms like BFGS, or the trust-region Newton method
>>>>>> > we talked about last time.
>>>>>> >
>>>>>> > So my suggestions:
>>>>>> > - think of some way to reduce boilerplate code for broadcast
>>>>>> > variables.
>>>>>> > - split the gradient descent optimizer and the specific model for
>>>>>> > linear models, to make the individual parts cleaner and more
>>>>>> > extensible.
>>>>>> >
>>>>>> > Hope this helps,
>>>>>> >
>>>>>> > -M
>>>>>> >
>>>>>> > --
>>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>>>
>>>>> --
>>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>>>>
>>>>> i...@data-artisans.com
>>>>> phone +493055599146
>>>>> mobile +4917671721261
>>>>>
>>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
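As a footnote to the "explicit operation on the weights" that some regularizers need (mentioned at the top of the thread): for L1 regularization this is typically soft-thresholding applied after the gradient step. A minimal sketch, with the function name and signature assumed purely for illustration (not part of any Flink ML API):

```scala
// Hedged sketch of the explicit weight update some regularizers need:
// L1 regularization via soft-thresholding after the gradient step.
// Name and signature are illustrative, not an existing API.

def l1ParameterUpdate(weights: Array[Double],
                      regularizationConstant: Double,
                      stepSize: Double): Array[Double] = {
  val shrinkage = regularizationConstant * stepSize
  weights.map { w =>
    // move each weight towards zero by `shrinkage`, clamping at zero
    math.signum(w) * math.max(0.0, math.abs(w) - shrinkage)
  }
}
```

This is the kind of operation that would sit behind `parameterUpdate` in the decoupled `CostFunction` interface, so SGD itself stays ignorant of which regularizer is in play.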