Re: Some feedback on the Gradient Descent Code

Mikio Braun Thu, 28 May 2015 07:08:43 -0700

Hi Till and Theodore,

I think the code is cleaned up a lot now, introducing the
mapWithBcVariable helped a lot.


I also get that the goal was to make a cost function for learning
linear model configurable well. My main concern was that the solver
itself was already too specifically bound to the kind of cost function
we wanted to optimize.

Maybe it helps to discuss this in terms of the math of the
optimization methods instead of in the code. The mathematics tends to
be much more concise and small compared to the completely unrolled
code derived from the math.

Just very briefly, and correct me if I'm wrong, this is about
optimizing some cost function C(w, xys) where w the weight vector and
xys is a set of labeled vectors. For the update, we only need to
consider the gradient C(w, xys)

Typically

C(w, xys) = C regularization(w) + sum over yxs of losses

so the gradient is
C grad reg(w) + sum grad losses(w, xy)

On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <t...@data-artisans.com> wrote:
> What tweaks would that be? I mean what is required to implement L-BFGS?
>
> I guess that we won’t get rid of the case statements because we have to
> decide between two code paths: One with and the other without convergence
> criterion. But I think by pulling each branch in its own function, it
> becomes clearer now.
>
> I agree that the update step is not really the responsibility of the
> LossFunction. The only reason why the current prototype does it this way is
> that the regularization is part of the LossFunction. By moving the
> regularization out of the LossFunction and make it part of the Solver we can
> avoid to clutter the LossFunction interface. However, then we have to
> provide different implementations for the Solver, one for each
> regularization type (L1, L2, etc.). Or we provide the regularization as a
> parameter which is responsible for the updating of the weight vector.
>
> What do you prefer?
>
> Cheer,
> Till
>
>
> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <t...@kth.se> wrote:
>>
>> Hello Mikio and Till,
>>
>> I really like the idea of syntactic sugar to reduce the boilerplate as
>> well.
>>
>> I'm looking at the PR now. For me the goal was to decouple as many things
>> as possible, but in the end a lot of the
>> logic ended up in SGD itself. Tthe previous design allowed us to implement
>> LBFGS using Breeze in a clean manner,
>> but I think that this new design from Till can also work for that, with
>> some tweaks. But we have to really look closely
>> and see if it makes things simpler. The SGD implementation looks simpler I
>> think, but there is still the need for the case
>> statements, and I'm not sure yet on how we can get rid of them.
>>
>> The one thing that I'm not sure we want is the the coupling of the update
>> step to the loss function (and the regularization by extension).
>> It's something that is also in the current implementation of the
>> optimization framework, but I was hoping we can get rid of that.
>> As Mikio said we want interfaces that are clean,and updating the weights
>> should not be the "responsibility" of the loss or the regularization,
>> rather it should be handled by a function that takes the required
>> information to perform the update.
>>
>> I'm a fan of how Breeze does optimization, where they have a
>> FirstOrderMinimizer interface that has takeStep and chooseDescentDirection
>> functions,
>> and the actual optimization steps are performed in an infiniteIterations
>> function implemented in FirstOrderMinimizer that all the first order
>> minimization
>> algorithms use, the thing that changes in the various algorithms is how
>> the gradients are calculated, weights updated etc.
>>
>> I would definitely recommend taking a look at their code.
>>
>>
>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>>
>> I opened a PR [1] with the above-mentioned changes. Would be great if you
>> could take a look and give me some feedback whether this is now easier to
>> understand.
>>
>> Cheers,
>> Till
>>
>> [1] https://github.com/apache/flink/pull/740
>>
>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann <t...@data-artisans.com>
>> wrote:
>>>
>>> Hi Mikio,
>>>
>>> I agree that we should add some more syntactic sugar to make programming
>>> with Scala more succinct and more fun :-) Especially, for the use case where
>>> you have a simple map operation with a broadcast variable. I like your
>>> approach of wrapping the, admittedly, cumbersome RichMapFunction creation
>>> into the helper function. We could even use the pimp my library pattern do
>>> make it “part” of the usual DataSet.
>>>
>>> Something like that works:
>>>
>>> val x: DataSet[X] = ...
>>> val bcVar: DataSet[B] = ...
>>>
>>> x.mapWithBroadcast(bcVar){
>>>     (element: X, var: B) => element - var
>>> }
>>>
>>> Great idea Mikio. I really like that :-) I guess that we all got a little
>>> bit “betriebsblind” and stopped questioning ;-)
>>>
>>> Concerning the strong coupling of SGD with the different parts of the
>>> loss function, the idea was to make it easily configurable. I agree, though,
>>> that one could at least encapsulate the prediction function as part of the
>>> loss function and also add the regularization function to it. This would
>>> simplify the code of SGD. A possible interface for a loss function could
>>> look like
>>>
>>> trait LossFunction {
>>>   def loss(labeledVector: LabeledVector, weightVector: WeightVector):
>>> Double
>>>
>>>   def gradient(labeledVector: LabeledVector, weightVector: WeightVector):
>>> WeightVector
>>>
>>>   def updateWeightVector(oldWeightVector: WeightVector, newWeightVector:
>>> WeightVector, regularizationValue: Double): WeightVector
>>> }
>>>
>>> I can try to make a prototype of this interface to see how it looks and
>>> feels like.
>>>
>>> Cheers,
>>>
>>> Till
>>>
>>>
>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun <mikiobr...@googlemail.com>
>>> wrote:
>>>>
>>>> I've been playing around with a few ideas, but you could have a helper
>>>> RichMapFunction like
>>>>
>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) =>
>>>> Out)
>>>>     extends RichMapFunction[In, Out]
>>>> {
>>>>   var f: (In) => Out = _
>>>>
>>>>   override def open(parameters: Configuration): Unit = {
>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>>>>   }
>>>>
>>>>   def map(in: In): Out = f(in)
>>>> }
>>>>
>>>> which takes a function which gets the broadcast value and then returns
>>>> a function doing the actual map, and you would then use it like this:
>>>>
>>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>>>>     xs: DataSet[I], bc: DataSet[B])
>>>>   (fct: (B) => (I) => O): DataSet[O] = {
>>>>   xs.map(new MapWithBroadcast[B, I,
>>>> O]("dummy")(fct)).withBroadcastSet(bc, "dummy")
>>>> }
>>>>
>>>> And then...
>>>>
>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>>>> val sum = x.reduce(_ + _)
>>>>
>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>>>>
>>>> Has something like this been considered at any point? Or are you all
>>>> at the point where you can type in the broadcasting stuff while you
>>>> sleep ;) ?
>>>>
>>>> -M
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>> wrote:
>>>> > Hi Theodore and Till,
>>>> >
>>>> > last Friday Till suggested I have a look at the gradient code. Took me
>>>> > a while, but I think I got the main structure.
>>>> >
>>>> > I think I understand why the code is written how it is, but just
>>>> > speaking from a purely data science perspective, I find that a lot of
>>>> > the code is superheavy on boilterplate code. The LossCalculation and
>>>> > so on RichMapFunctions in GradientDescent easily take up 20 lines of
>>>> > code just to unpacka few broadcast variable and call a single
>>>> > function. I understand why we need to use broadcast variables here,
>>>> > and that this is a common idiom in Flink, but I wonder whether one
>>>> > cannot reduce the boilerplate code.
>>>> >
>>>> > In my experience, it is pretty important when implementing ML
>>>> > algorithms that the mapping back from the code to the algorithm is
>>>> > still somewhat clear, because if you need to do modifications or
>>>> > debugging, it's important that you still know what corresponds to
>>>> > what, and with so many lines of code just moving data around, that's
>>>> > hardly the case.
>>>> >
>>>> > A general design decision I would like to question is the tight
>>>> > integration of the specific loss functions, regularizers, etc. into
>>>> > the GradientDescent code. Given that it is really a pretty general
>>>> > approach which is applicable to many many different models, I think
>>>> > It's better to have the stochastic gradient code being as agnostic as
>>>> > possible about exactly what it is optimizing. Instead, you would pass
>>>> > it a function which contains a method to compute a function value, and
>>>> > its derivative, and possibly some action to regularize the parameters
>>>> > (if this cannot be already done with the function itself).
>>>> >
>>>> > Then, such a function for learning linear models can be configured in
>>>> > the same way as it is right now. I personally would opt for optional
>>>> > parameters instead of the builder-method style, but that is ok with
>>>> > me.
>>>> >
>>>> > This would in particular mean that I would try to eliminate the case
>>>> > statements as much as possible.
>>>> >
>>>> > The result would then be a bunch of classes around the gradient
>>>> > descent optimizer which very clearly just encode that algorithm, and a
>>>> > bunch of classes around cost functions for linear models which encode
>>>> > that, and both parts talk with one another only through a small, clean
>>>> > interface. Plus, this interface could be reused for other optimization
>>>> > algorithms like BFGS, or the Trust Region Newton method we talked
>>>> > about last time.
>>>> >
>>>> > So my suggestions:
>>>> >   - think of some way to reduce boilerplate code for broadcast
>>>> > variables.
>>>> >   - split gradient and the specific model for linear models to make
>>>> > the individual parts cleaner and more extensible
>>>> >
>>>> > Hope this helps,
>>>> >
>>>> > -M
>>>> >
>>>> >
>>>> > --
>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>>
>>>>
>>>>
>>>> --
>>>> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>
>>>
>>>
>>>
>>> --
>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>>
>>> i...@data-artisans.com
>>> phone +493055599146
>>> mobile +4917671721261
>>>
>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>>
>>
>>
>>
>> --
>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>
>> i...@data-artisans.com
>> phone +493055599146
>> mobile +4917671721261
>>
>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>>
>>
>
>
>
> --
> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>
> i...@data-artisans.com
> phone +493055599146
> mobile +4917671721261
>
> Registered at Amtsgericht Charlottenburg - HRB 158244 B
> Managing Directors: Kostas Tzoumas, Stephan Ewen



-- 
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun

Re: Some feedback on the Gradient Descent Code

Reply via email to