What tweaks would those be? I mean, what is required to implement L-BFGS?

I guess that we won't get rid of the case statements entirely, because we
have to decide between two code paths: one with and one without a
convergence criterion. But I think that by pulling each branch into its own
function, the code becomes clearer.
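
To illustrate what I mean, here is a minimal sketch of the dispatch (the
trait, method and field names are placeholders, not the actual code from
the PR):

import org.apache.flink.api.scala.DataSet
import org.apache.flink.ml.common.{LabeledVector, WeightVector}

// Hypothetical sketch: decide once which code path applies and delegate to
// a dedicated method per branch, so each branch stays readable on its own.
trait IterativeSolver {
  def convergenceThreshold: Option[Double]

  def optimize(data: DataSet[LabeledVector],
               initialWeights: DataSet[WeightVector]): DataSet[WeightVector] =
    convergenceThreshold match {
      case Some(threshold) =>
        optimizeWithConvergenceCriterion(data, initialWeights, threshold)
      case None =>
        optimizeWithoutConvergenceCriterion(data, initialWeights)
    }

  protected def optimizeWithConvergenceCriterion(
      data: DataSet[LabeledVector],
      initialWeights: DataSet[WeightVector],
      threshold: Double): DataSet[WeightVector]

  protected def optimizeWithoutConvergenceCriterion(
      data: DataSet[LabeledVector],
      initialWeights: DataSet[WeightVector]): DataSet[WeightVector]
}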

I agree that the update step is not really the responsibility of the
LossFunction. The only reason the current prototype does it this way is
that the regularization is part of the LossFunction. By moving the
regularization out of the LossFunction and making it part of the Solver, we
can avoid cluttering the LossFunction interface. However, we would then
have to provide different Solver implementations, one for each
regularization type (L1, L2, etc.). Alternatively, we could provide the
regularization as a parameter which is responsible for updating the weight
vector.
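
To make the second option concrete, here is a rough sketch (the trait,
object and method names are made up for illustration, and the body of the
L2 case is only hinted at in a comment):

import org.apache.flink.ml.common.WeightVector

// Hypothetical interface: the regularization penalty applies the update
// step itself, so the Solver stays agnostic of the penalty type.
trait RegularizationPenalty extends Serializable {
  def takeStep(weights: WeightVector,
               gradient: WeightVector,
               regularizationConstant: Double,
               learningRate: Double): WeightVector
}

// Sketch of an L2 penalty; conceptually the update would be
// w := (1 - learningRate * regularizationConstant) * w - learningRate * gradient
object L2Penalty extends RegularizationPenalty {
  override def takeStep(weights: WeightVector,
                        gradient: WeightVector,
                        regularizationConstant: Double,
                        learningRate: Double): WeightVector = ???
}

The Solver would then simply call regularization.takeStep(...) in its
update step, independent of which penalty was configured.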

What do you prefer?

Cheers,
Till

On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <t...@kth.se> wrote:

>  Hello Mikio and Till,
>
> I really like the idea of syntactic sugar to reduce the boilerplate as
> well.
>
> I'm looking at the PR now. For me the goal was to decouple as many things
> as possible, but in the end a lot of the logic ended up in SGD itself. The
> previous design allowed us to implement L-BFGS using Breeze in a clean
> manner, but I think that this new design from Till can also work for that,
> with some tweaks. We have to look closely, though, and see if it really
> makes things simpler. The SGD implementation looks simpler, I think, but
> there is still the need for the case statements, and I'm not sure yet how
> we can get rid of them.
>
> The one thing that I'm not sure we want is the coupling of the update
> step to the loss function (and, by extension, to the regularization).
> It's something that is also in the current implementation of the
> optimization framework, but I was hoping we could get rid of it.
> As Mikio said, we want interfaces that are clean, and updating the weights
> should not be the "responsibility" of the loss or the regularization;
> rather, it should be handled by a function that takes the required
> information to perform the update.
>
> I'm a fan of how Breeze does optimization: they have a FirstOrderMinimizer
> interface with takeStep and chooseDescentDirection functions, and the
> actual optimization steps are performed in an infiniteIterations function,
> implemented in FirstOrderMinimizer, that all the first-order minimization
> algorithms use. The only thing that changes between the various algorithms
> is how the gradients are calculated, how the weights are updated, and so
> on.
>
> I would definitely recommend taking a look
> <https://github.com/scalanlp/breeze/tree/master/math/src/main/scala/breeze/optimize>
> at their code.
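>
> To make the separation concrete, a very rough sketch of what a similar
> split could look like on our side (this is not Breeze's actual API, just
> an illustration; all names are made up):
>
> import org.apache.flink.ml.common.WeightVector
>
> // bookkeeping shared by all first-order methods
> case class SolverState(weights: WeightVector, iteration: Int, loss: Double)
>
> trait ObjectiveFunction {
>   def valueAndGradient(weights: WeightVector): (Double, WeightVector)
> }
>
> trait FirstOrderSolver {
>   // algorithm-specific parts: SGD, L-BFGS, etc. differ only here
>   def chooseDescentDirection(state: SolverState,
>                              objective: ObjectiveFunction): WeightVector
>   def takeStep(state: SolverState,
>                direction: WeightVector,
>                stepSize: Double): WeightVector
>
>   // shared outer loop used by all first-order algorithms (left out here)
>   def iterations(objective: ObjectiveFunction,
>                  initialWeights: WeightVector): Iterator[SolverState] = ???
> }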
>
>
> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>
> I opened a PR [1] with the above-mentioned changes. It would be great if
> you could take a look and give me some feedback on whether this is now
> easier to understand.
>
>  Cheers,
> Till
>
>  [1] https://github.com/apache/flink/pull/740
>
> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann <t...@data-artisans.com>
> wrote:
>
>>  Hi Mikio,
>>
>> I agree that we should add some more syntactic sugar to make programming
>> with Scala more succinct and more fun :-) Especially for the use case
>> where you have a simple map operation with a broadcast variable. I like
>> your approach of wrapping the, admittedly, cumbersome RichMapFunction
>> creation in a helper function. We could even use the pimp-my-library
>> pattern to make it “part” of the usual DataSet (a rough sketch of that
>> follows after the example below).
>>
>> Something like this works:
>>
>> val x: DataSet[X] = ...
>> val bcVar: DataSet[B] = ...
>>
>> x.mapWithBroadcast(bcVar) {
>>   (element: X, broadcastValue: B) => element - broadcastValue
>> }
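>>
>> A rough sketch of the pimp-my-library variant, reusing Mikio's
>> MapWithBroadcast class from further down in this thread (untested; the
>> object and class names are placeholders):
>>
>> import org.apache.flink.api.common.typeinfo.TypeInformation
>> import org.apache.flink.api.scala._
>> import scala.reflect.ClassTag
>>
>> object BroadcastSyntax {
>>   implicit class BroadcastDataSet[I](val xs: DataSet[I]) {
>>     // exposes mapWithBroadcast directly on DataSet; the broadcast set
>>     // name is arbitrary as long as both sides agree on it
>>     def mapWithBroadcast[B, O: ClassTag: TypeInformation](bc: DataSet[B])(
>>         fct: (I, B) => O): DataSet[O] = {
>>       // adapt the two-argument function to Mikio's (B) => (In) => Out shape
>>       val curried: B => I => O = b => in => fct(in, b)
>>       xs.map(new MapWithBroadcast[B, I, O]("bcVar")(curried))
>>         .withBroadcastSet(bc, "bcVar")
>>     }
>>   }
>> }
>>
>> With import BroadcastSyntax._ in scope, the call above
>> (x.mapWithBroadcast(bcVar) { ... }) should work as written.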
>>
>> Great idea, Mikio. I really like that :-) I guess that we all got a little
>> bit “betriebsblind” (blind to it from routine) and stopped questioning ;-)
>>
>> Concerning the strong coupling of SGD with the different parts of the
>> loss function, the idea was to make it easily configurable. I agree,
>> though, that one could at least encapsulate the prediction function as part
>> of the loss function and also add the regularization function to it. This
>> would simplify the code of SGD. A possible interface for a loss function
>> could look like
>>
>> trait LossFunction {
>>   def loss(labeledVector: LabeledVector, weightVector: WeightVector): Double
>>
>>   def gradient(labeledVector: LabeledVector, weightVector: WeightVector): WeightVector
>>
>>   def updateWeightVector(oldWeightVector: WeightVector, newWeightVector: WeightVector,
>>       regularizationValue: Double): WeightVector
>> }
>>
>> I can try to make a prototype of this interface to see how it looks and
>> feels.
>>
>> Cheers,
>>
>> Till
>>
>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun <mikiobr...@googlemail.com>
>> wrote:
>>
>>> I've been playing around with a few ideas, but you could have a helper
>>> RichMapFunction like
>>>
>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>>>     extends RichMapFunction[In, Out]
>>> {
>>>   var f: (In) => Out = _
>>>
>>>   override def open(parameters: Configuration): Unit = {
>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>>>   }
>>>
>>>   def map(in: In): Out = f(in)
>>> }
>>>
>>> which takes a function that receives the broadcast value and returns the
>>> function doing the actual map. You would then use it like this:
>>>
>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>>>     xs: DataSet[I], bc: DataSet[B])
>>>   (fct: (B) => (I) => O): DataSet[O] = {
>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct))
>>>     .withBroadcastSet(bc, "dummy")
>>> }
>>>
>>> And then...
>>>
>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>>> val sum = x.reduce(_ + _)
>>>
>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>>>
>>> Has something like this been considered at any point? Or are you all
>>> at the point where you can type in the broadcasting stuff while you
>>> sleep ;) ?
>>>
>>> -M
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun <mikiobr...@googlemail.com>
>>> wrote:
>>> > Hi Theodore and Till,
>>> >
>>> > last Friday Till suggested I have a look at the gradient code. Took me
>>> > a while, but I think I got the main structure.
>>> >
>>> > I think I understand why the code is written the way it is, but,
>>> > speaking from a purely data science perspective, I find that a lot of
>>> > the code is super heavy on boilerplate. The LossCalculation and
>>> > similar RichMapFunctions in GradientDescent easily take up 20 lines of
>>> > code just to unpack a few broadcast variables and call a single
>>> > function. I understand why we need to use broadcast variables here,
>>> > and that this is a common idiom in Flink, but I wonder whether one
>>> > cannot reduce the boilerplate code.
>>> >
>>> > In my experience, it is pretty important when implementing ML
>>> > algorithms that the mapping back from the code to the algorithm is
>>> > still somewhat clear, because if you need to do modifications or
>>> > debugging, it's important that you still know what corresponds to
>>> > what, and with so many lines of code just moving data around, that's
>>> > hardly the case.
>>> >
>>> > A general design decision I would like to question is the tight
>>> > integration of the specific loss functions, regularizers, etc. into
>>> > the GradientDescent code. Given that it is really a pretty general
>>> > approach which is applicable to many, many different models, I think
>>> > it's better to have the stochastic gradient code be as agnostic as
>>> > possible about what exactly it is optimizing. Instead, you would pass
>>> > it a function which provides methods to compute the function value
>>> > and its derivative, and possibly some action to regularize the
>>> > parameters (if this cannot already be done by the function itself),
>>> > something like the sketch below.
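>>> >
>>> > Just to sketch the kind of interface I have in mind (purely
>>> > illustrative, the names are made up):
>>> >
>>> > import org.apache.flink.ml.common.WeightVector
>>> >
>>> > trait CostFunction {
>>> >   // objective value at the given weights
>>> >   def value(weights: WeightVector): Double
>>> >   // derivative of the objective at the given weights
>>> >   def gradient(weights: WeightVector): WeightVector
>>> >   // optional action to regularize the parameters; default: do nothing
>>> >   def regularize(weights: WeightVector,
>>> >                  regularizationConstant: Double): WeightVector = weights
>>> > }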
>>> >
>>> > Then, such a function for learning linear models can be configured in
>>> > the same way as it is right now. I personally would opt for optional
>>> > parameters instead of the builder-method style, but that is ok with
>>> > me.
>>> >
>>> > This would in particular mean that I would try to eliminate the case
>>> > statements as much as possible.
>>> >
>>> > The result would then be a bunch of classes around the gradient
>>> > descent optimizer which very clearly just encode that algorithm, and a
>>> > bunch of classes around cost functions for linear models which encode
>>> > that, and both parts talk with one another only through a small, clean
>>> > interface. Plus, this interface could be reused for other optimization
>>> > algorithms like BFGS, or the Trust Region Newton method we talked
>>> > about last time.
>>> >
>>> > So my suggestions:
>>> >   - think of some way to reduce the boilerplate code for broadcast
>>> >     variables.
>>> >   - split the gradient descent code and the specific model for linear
>>> >     models to make the individual parts cleaner and more extensible.
>>> >
>>> > Hope this helps,
>>> >
>>> > -M
>>> >
>>> >
>>> > --
>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>
>>>
>>>
>>> --
>>> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>
>>
>>
>>
>>
>
>
>
>
>
>


-- 
Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin

i...@data-artisans.com
phone +493055599146
mobile +4917671721261

Registered at Amtsgericht Charlottenburg - HRB 158244 B
Managing Directors: Kostas Tzoumas, Stephan Ewen
