Oh wait... continuing to type. I accidentally sent the message out too early.

On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobr...@googlemail.com> wrote:
> Hi Till and Theodore,
>
> I think the code is cleaned up a lot now; introducing mapWithBcVariable
> helped a lot.
>
> I also get that the goal was to make the cost function for learning
> linear models nicely configurable. My main concern was that the solver
> itself was already bound too specifically to the kind of cost function
> we wanted to optimize.
>
> Maybe it helps to discuss this in terms of the math of the optimization
> methods instead of the code. The mathematics tends to be much more
> concise and compact than the completely unrolled code derived from it.
>
> Just very briefly, and correct me if I'm wrong, this is about
> optimizing some cost function C(w, xys), where w is the weight vector
> and xys is a set of labeled vectors. For the update, we only need to
> consider the gradient of C(w, xys).
>
> Typically
>
>   C(w, xys) = C_reg(w) + sum over xy in xys of loss(w, xy)
>
> so the gradient is
>
>   grad C(w, xys) = grad C_reg(w) + sum over xy in xys of grad loss(w, xy)
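>
> In code, that decomposition could look roughly like this (just a sketch
> with plain Scala functions and Array[Double] standing in for the weight
> vector, L2 as the example regularizer; all names are made up):
>
> // grad C_reg(w) for C_reg(w) = 0.5 * lambda * ||w||^2
> def regGradient(w: Array[Double], lambda: Double): Array[Double] =
>   w.map(_ * lambda)
>
> // grad C(w, xys) = grad C_reg(w) + sum over xys of grad loss(w, xy)
> def costGradient(w: Array[Double],
>                  xys: Seq[(Array[Double], Double)],   // (features, label)
>                  lossGrad: (Array[Double], (Array[Double], Double)) => Array[Double],
>                  lambda: Double): Array[Double] = {
>   val lossSum = xys
>     .map(xy => lossGrad(w, xy))
>     .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
>   regGradient(w, lambda).zip(lossSum).map { case (r, l) => r + l }
> }
>
> The solver never needs to know what lossGrad or the regularizer actually
> are.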
>
> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <t...@data-artisans.com> wrote:
>> What tweaks would those be? I mean, what is required to implement L-BFGS?
>>
>> I guess that we won’t get rid of the case statements because we have to
>> decide between two code paths: one with and one without a convergence
>> criterion. But I think that by pulling each branch into its own function,
>> it becomes clearer now.
>>
>> I agree that the update step is not really the responsibility of the
>> LossFunction. The only reason why the current prototype does it this way
>> is that the regularization is part of the LossFunction. By moving the
>> regularization out of the LossFunction and making it part of the Solver,
>> we can avoid cluttering the LossFunction interface. However, then we have
>> to provide different implementations of the Solver, one for each
>> regularization type (L1, L2, etc.). Or we provide the regularization as a
>> parameter which is responsible for updating the weight vector.
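>>
>> Just to sketch what that second option could look like (made-up names,
>> generic in the vector type so it stands on its own):
>>
>> // the regularization is handed to the Solver as a parameter and owns
>> // the weight update, so the LossFunction stays free of it
>> trait Regularization[V] {
>>   def takeStep(oldWeights: V, gradient: V, learningRate: Double,
>>       regParameter: Double): V
>> }
>>
>> // with one implementation per penalty: NoRegularization,
>> // L1Regularization, L2Regularization, ...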
>>
>> What do you prefer?
>>
>> Cheers,
>> Till
>>
>>
>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <t...@kth.se> wrote:
>>>
>>> Hello Mikio and Till,
>>>
>>> I really like the idea of syntactic sugar to reduce the boilerplate as
>>> well.
>>>
>>> I'm looking at the PR now. For me the goal was to decouple as many things
>>> as possible, but in the end a lot of the logic ended up in SGD itself.
>>> The previous design allowed us to implement L-BFGS using Breeze in a
>>> clean manner, but I think that this new design from Till can also work
>>> for that, with some tweaks. But we have to look really closely and see
>>> whether it makes things simpler. The SGD implementation looks simpler, I
>>> think, but there is still the need for the case statements, and I'm not
>>> sure yet how we can get rid of them.
>>>
>>> The one thing that I'm not sure we want is the coupling of the update
>>> step to the loss function (and the regularization by extension).
>>> It's something that is also in the current implementation of the
>>> optimization framework, but I was hoping we could get rid of that.
>>> As Mikio said, we want interfaces that are clean, and updating the weights
>>> should not be the "responsibility" of the loss or the regularization;
>>> rather, it should be handled by a function that takes the required
>>> information to perform the update.
>>>
>>> I'm a fan of how Breeze does optimization: they have a FirstOrderMinimizer
>>> interface with takeStep and chooseDescentDirection functions, and the
>>> actual optimization steps are performed in an infiniteIterations function
>>> implemented in FirstOrderMinimizer that all the first-order minimization
>>> algorithms use. The thing that changes between the various algorithms is
>>> how the gradients are calculated, how the weights are updated, etc.
>>>
>>> I would definitely recommend taking a look at their code.
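>>>
>>> Roughly, the split looks like this (heavily simplified and from memory,
>>> these are not Breeze's actual signatures):
>>>
>>> // the iteration loop lives in one place; concrete algorithms only
>>> // provide the descent direction and the step along it
>>> trait FirstOrderMinimizer[Vec] {
>>>   case class State(weights: Vec, iteration: Int)
>>>
>>>   def chooseDescentDirection(state: State, gradient: Vec): Vec
>>>   def takeStep(state: State, direction: Vec, stepSize: Double): Vec
>>>
>>>   // shared loop: direction -> step -> new iterate -> repeat
>>>   def infiniteIterations(gradient: Vec => Vec, init: Vec): Iterator[State] =
>>>     Iterator.iterate(State(init, 0)) { state =>
>>>       val direction = chooseDescentDirection(state, gradient(state.weights))
>>>       // step size selection omitted; fixed value as a placeholder
>>>       State(takeStep(state, direction, stepSize = 0.1), state.iteration + 1)
>>>     }
>>> }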
>>>
>>>
>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>>>
>>> I opened a PR [1] with the above-mentioned changes. It would be great if
>>> you could take a look and give me some feedback on whether this is now
>>> easier to understand.
>>>
>>> Cheers,
>>> Till
>>>
>>> [1] https://github.com/apache/flink/pull/740
>>>
>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann <t...@data-artisans.com>
>>> wrote:
>>>>
>>>> Hi Mikio,
>>>>
>>>> I agree that we should add some more syntactic sugar to make programming
>>>> with Scala more succinct and more fun :-) Especially for the use case
>>>> where you have a simple map operation with a broadcast variable. I like
>>>> your approach of wrapping the, admittedly, cumbersome RichMapFunction
>>>> creation in a helper function. We could even use the pimp-my-library
>>>> pattern to make it “part” of the usual DataSet.
>>>>
>>>> Something like this could work:
>>>>
>>>> val x: DataSet[X] = ...
>>>> val bcVar: DataSet[B] = ...
>>>>
>>>> x.mapWithBroadcast(bcVar){
>>>>     (element: X, b: B) => element - b
>>>> }
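>>>>
>>>> As a rough sketch of the wrapper itself (untested; it reuses your
>>>> MapWithBroadcast class from below, and "bcVar" is just a made-up
>>>> internal broadcast name):
>>>>
>>>> import org.apache.flink.api.common.typeinfo.TypeInformation
>>>> import org.apache.flink.api.scala._
>>>> import scala.reflect.ClassTag
>>>>
>>>> object BroadcastSyntax {
>>>>   // "pimp my library": adds mapWithBroadcast to every DataSet
>>>>   implicit class DataSetWithBroadcast[I](xs: DataSet[I]) {
>>>>     def mapWithBroadcast[B, O: ClassTag: TypeInformation](bc: DataSet[B])(
>>>>         fct: (I, B) => O): DataSet[O] =
>>>>       xs.map(new MapWithBroadcast[B, I, O]("bcVar")(b => in => fct(in, b)))
>>>>         .withBroadcastSet(bc, "bcVar")
>>>>   }
>>>> }
>>>>
>>>> After an import BroadcastSyntax._, the snippet above should work as
>>>> written.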
>>>>
>>>> Great idea Mikio. I really like that :-) I guess that we all got a little
>>>> bit “betriebsblind” (blind from routine) and stopped questioning ;-)
>>>>
>>>> Concerning the strong coupling of SGD with the different parts of the
>>>> loss function, the idea was to make it easily configurable. I agree, 
>>>> though,
>>>> that one could at least encapsulate the prediction function as part of the
>>>> loss function and also add the regularization function to it. This would
>>>> simplify the code of SGD. A possible interface for a loss function could
>>>> look like
>>>>
>>>> trait LossFunction {
>>>>   def loss(labeledVector: LabeledVector, weightVector: WeightVector): Double
>>>>
>>>>   def gradient(labeledVector: LabeledVector, weightVector: WeightVector): WeightVector
>>>>
>>>>   def updateWeightVector(oldWeightVector: WeightVector,
>>>>       newWeightVector: WeightVector, regularizationValue: Double): WeightVector
>>>> }
>>>>
>>>> I can try to make a prototype of this interface to see how it looks and
>>>> feels.
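>>>>
>>>> For instance, a squared loss with L2 regularization could look roughly
>>>> like this (plain Array[Double] instead of our vector types, just to
>>>> sketch the shape; not meant as the real implementation):
>>>>
>>>> object SquaredLossWithL2 {
>>>>   private def dot(a: Array[Double], b: Array[Double]): Double =
>>>>     a.zip(b).map { case (x, y) => x * y }.sum
>>>>
>>>>   // loss = 0.5 * (w^T x - y)^2
>>>>   def loss(features: Array[Double], label: Double, weights: Array[Double]): Double = {
>>>>     val residual = dot(weights, features) - label
>>>>     0.5 * residual * residual
>>>>   }
>>>>
>>>>   // grad_w loss = (w^T x - y) * x
>>>>   def gradient(features: Array[Double], label: Double, weights: Array[Double]): Array[Double] = {
>>>>     val residual = dot(weights, features) - label
>>>>     features.map(_ * residual)
>>>>   }
>>>>
>>>>   // L2 shrinks the new weights towards zero; the learning rate is folded
>>>>   // into regularizationValue to keep the sketch short
>>>>   def updateWeightVector(oldWeights: Array[Double], newWeights: Array[Double],
>>>>       regularizationValue: Double): Array[Double] =
>>>>     newWeights.map(_ * (1.0 - regularizationValue))
>>>> }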
>>>>
>>>> Cheers,
>>>>
>>>> Till
>>>>
>>>>
>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>> wrote:
>>>>>
>>>>> I've been playing around with a few ideas, but you could have a helper
>>>>> RichMapFunction like
>>>>>
>>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>>>>>     extends RichMapFunction[In, Out] {
>>>>>
>>>>>   var f: (In) => Out = _
>>>>>
>>>>>   override def open(parameters: Configuration): Unit = {
>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>>>>>   }
>>>>>
>>>>>   override def map(in: In): Out = f(in)
>>>>> }
>>>>>
>>>>> which takes a function which gets the broadcast value and then returns
>>>>> a function doing the actual map, and you would then use it like this:
>>>>>
>>>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>>>>>     xs: DataSet[I], bc: DataSet[B])(fct: (B) => (I) => O): DataSet[O] = {
>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct)).withBroadcastSet(bc, "dummy")
>>>>> }
>>>>>
>>>>> And then...
>>>>>
>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>>>>> val sum = x.reduce(_ + _)
>>>>>
>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>>>>>
>>>>> Has something like this been considered at any point? Or are you all
>>>>> at the point where you can type in the broadcasting stuff while you
>>>>> sleep ;) ?
>>>>>
>>>>> -M
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>>> wrote:
>>>>> > Hi Theodore and Till,
>>>>> >
>>>>> > last Friday Till suggested I have a look at the gradient code. Took me
>>>>> > a while, but I think I got the main structure.
>>>>> >
>>>>> > I think I understand why the code is written the way it is, but just
>>>>> > speaking from a purely data science perspective, I find that a lot of
>>>>> > the code is super heavy on boilerplate. The LossCalculation and similar
>>>>> > RichMapFunctions in GradientDescent easily take up 20 lines of code
>>>>> > just to unpack a few broadcast variables and call a single function.
>>>>> > I understand why we need to use broadcast variables here, and that this
>>>>> > is a common idiom in Flink, but I wonder whether one cannot reduce the
>>>>> > boilerplate code.
>>>>> >
>>>>> > In my experience, it is pretty important when implementing ML
>>>>> > algorithms that the mapping back from the code to the algorithm is
>>>>> > still somewhat clear, because if you need to do modifications or
>>>>> > debugging, it's important that you still know what corresponds to
>>>>> > what, and with so many lines of code just moving data around, that's
>>>>> > hardly the case.
>>>>> >
>>>>> > A general design decision I would like to question is the tight
>>>>> > integration of the specific loss functions, regularizers, etc. into
>>>>> > the GradientDescent code. Given that it is really a pretty general
>>>>> > approach which is applicable to many, many different models, I think
>>>>> > it's better to have the stochastic gradient code be as agnostic as
>>>>> > possible about exactly what it is optimizing. Instead, you would pass
>>>>> > it a function which contains a method to compute a function value and
>>>>> > its derivative, and possibly some action to regularize the parameters
>>>>> > (if this cannot already be done with the function itself).
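>>>>> >
>>>>> > Stripped down to the bare bones, I mean something like this (plain
>>>>> > Scala, Array[Double] as the weight vector, made-up names, batch
>>>>> > gradient descent just to keep the sketch short):
>>>>> >
>>>>> > // the only thing the solver ever sees
>>>>> > trait CostFunction {
>>>>> >   def value(weights: Array[Double]): Double
>>>>> >   def gradient(weights: Array[Double]): Array[Double]
>>>>> > }
>>>>> >
>>>>> > // gradient descent that is completely agnostic about the model
>>>>> > def gradientDescent(cost: CostFunction, init: Array[Double],
>>>>> >     stepsize: Double, iterations: Int): Array[Double] =
>>>>> >   (1 to iterations).foldLeft(init) { (w, _) =>
>>>>> >     val g = cost.gradient(w)
>>>>> >     w.zip(g).map { case (wi, gi) => wi - stepsize * gi }
>>>>> >   }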
>>>>> >
>>>>> > Then, such a function for learning linear models can be configured in
>>>>> > the same way as it is right now. I personally would opt for optional
>>>>> > parameters instead of the builder-method style, but that is ok with
>>>>> > me.
>>>>> >
>>>>> > This would in particular mean that I would try to eliminate the case
>>>>> > statements as much as possible.
>>>>> >
>>>>> > The result would then be a bunch of classes around the gradient
>>>>> > descent optimizer which very clearly just encode that algorithm, and a
>>>>> > bunch of classes around cost functions for linear models which encode
>>>>> > that, and both parts talk with one another only through a small, clean
>>>>> > interface. Plus, this interface could be reused for other optimization
>>>>> > algorithms like BFGS, or the Trust Region Newton method we talked
>>>>> > about last time.
>>>>> >
>>>>> > So my suggestions:
>>>>> >   - think of some way to reduce the boilerplate code for broadcast
>>>>> > variables.
>>>>> >   - split the gradient descent code from the specific cost functions
>>>>> > for linear models to make the individual parts cleaner and more
>>>>> > extensible.
>>>>> >
>>>>> > Hope this helps,
>>>>> >
>>>>> > -M
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>>>
>>>> i...@data-artisans.com
>>>> phone +493055599146
>>>> mobile +4917671721261
>>>>
>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>>>
>>>
>>>
>>>
>>> --
>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>>
>>> i...@data-artisans.com
>>> phone +493055599146
>>> mobile +4917671721261
>>>
>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>>>
>>>
>>
>>
>>
>> --
>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>
>> i...@data-artisans.com
>> phone +493055599146
>> mobile +4917671721261
>>
>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>
>
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun



-- 
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
