GradientDescent is just the (batch) SGD optimizer, right? Actually, I think the parameter update should be done by a RegularizationFunction.
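For L1, the parameter update such a RegularizationFunction would own is the usual soft-thresholding ("shrink") step. Here is a minimal plain-Scala sketch of that idea; Array[Double] stands in for Flink ML's WeightVector, and all names are hypothetical rather than the PR's actual API:

```scala
// Hypothetical sketch: L1 regularization as a parameter-update step.
// Array[Double] stands in for Flink ML's WeightVector.
trait RegularizationFunction {
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double, regularizationConstant: Double): Array[Double]
  def regularizationValue(weights: Array[Double]): Double
}

object L1Regularization extends RegularizationFunction {
  // Plain gradient step followed by soft-thresholding:
  //   w_i <- sign(w_i) * max(0, |w_i| - learningRate * lambda)
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double, regularizationConstant: Double): Array[Double] = {
    val stepped = weights.zip(gradient).map { case (w, g) => w - learningRate * g }
    val shrink = learningRate * regularizationConstant
    stepped.map(w => math.signum(w) * math.max(0.0, math.abs(w) - shrink))
  }

  // L1 penalty value, used for the convergence criterion.
  def regularizationValue(weights: Array[Double]): Double =
    weights.map(math.abs).sum
}
```

The point is that GradientDescent never needs to know which regularization is in play; it just calls parameterUpdate after computing the gradient.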
IMHO the structure should be like this:

GradientDescent
  - collects gradient and regularization updates from -> CostFunction

LinearModelCostFunction
  - inherits from CostFunction
  - parameterized by LossFunction and RegularizationFunction

Does that make sense?

On Thu, May 28, 2015 at 5:00 PM, Till Rohrmann <[email protected]> wrote:
> Hey Mikio,
>
> Yes, you're right. The SGD only needs to know the gradient of the loss
> function and some means to update the weights in accordance with the
> regularization scheme. Additionally, we also need to be able to compute
> the loss for the convergence criterion.
>
> That's also how it is implemented in the aforementioned PR right now.
> The question which arises now is whether the LossFunction should be
> responsible for updating the weight vector or not. So what we could do
> is the following: remove the regularization from the LossFunction and
> let the GradientDescent solver do the parameterUpdate. This either
> requires specialized implementations such as GradientDescentL1 and
> GradientDescentL2, or a configuration parameter which defines a
> regularization function which is then called for the parameterUpdate.
>
> trait RegularizationFunction {
>   def parameterUpdate(weightVector: WeightVector, gradient: WeightVector,
>     learningRate: Double, regularizationConstant: Double): WeightVector
>
>   def regularizationValue(weightVector: WeightVector): Double
> }
>
> Both approaches are semantically equivalent. I have no strong preference
> for either of them. What do you think is the better approach?
>
>
> On Thu, May 28, 2015 at 4:13 PM, Mikio Braun <[email protected]>
> wrote:
>>
>> [Ok, so maybe this is exactly what is implemented, sorry if I'm just
>> repeating you...
>> ]
>>
>> So
>>
>> C(w, xys) = C_reg(w) + sum over xys of loss(w, xy)
>>
>> The gradient is
>>
>> grad C_reg(w) + sum over xys of grad loss(w, xy)
>>
>> For some regularization functions, regularization is better performed
>> by some explicit operation on the weights, like subtracting some value
>> if it's still too large or so (I forgot how exactly that works).
>>
>> The losses consist of the model (for now, linear I guess), plus the
>> kind of loss (hinge, squared, and maybe logistic for LR).
>>
>> So from the viewpoint of SGD, we only need a way to compute the
>> gradient for a set of examples, and maybe some way to update (e.g.
>> shrink) weights for regularization.
>>
>> So SGD takes some
>>
>> trait CostFunction {
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector
>>
>>   def parameterUpdate(weightVector: Vector): Vector
>> }
>>
>> so that SGD can run.
>>
>> On the other hand, we could have a
>>
>> class LinearModelCostFunction {
>>   val lossFunction: LossFunction
>>   val regularizationFunction: RegularizationFunction
>>
>>   val regularizationConstant
>>
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector = {
>>     regularizationConstant * regularizationFunction.gradient(weightVector) +
>>       xys.map(xy => lossFunction.gradient(predict(weightVector, xy.vector),
>>         xy.label) * xy.vector).sum()
>>   }
>>
>>   def parameterUpdate = regularizationFunction.parameterUpdate
>> }
>>
>> That way the optimizer and the cost function would be entirely
>> uncoupled, and you could use the SGD to train on practically anything
>> which has a data-dependent gradient.
>>
>> What do you think?
>>
>> -M
>>
>> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <[email protected]>
>> wrote:
>> > Oh wait... accidentally sent out the message too early. Continuing
>> > to type.
>> >
>> > On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <[email protected]>
>> > wrote:
>> >> Hi Till and Theodore,
>> >>
>> >> I think the code is cleaned up a lot now; introducing the
>> >> mapWithBcVariable helped a lot.
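To make the uncoupling sketched above concrete, here is a self-contained toy version of the split in plain Scala; Seq stands in for DataSet, Array[Double] for Vector, and all names are illustrative only, not the proposed Flink ML API:

```scala
// Toy version of the proposed split: the optimizer only sees CostFunction.
case class LabeledPoint(label: Double, features: Array[Double])

trait CostFunction {
  def gradient(w: Array[Double], xys: Seq[LabeledPoint]): Array[Double]
  def parameterUpdate(w: Array[Double]): Array[Double] // e.g. shrinkage; identity here
}

// Least-squares cost for a linear model: C(w) = 1/2 * sum (w.x - y)^2
class LinearLeastSquares extends CostFunction {
  def gradient(w: Array[Double], xys: Seq[LabeledPoint]): Array[Double] =
    xys.map { xy =>
      val err = w.zip(xy.features).map { case (a, b) => a * b }.sum - xy.label
      xy.features.map(_ * err)
    }.reduce((a, b) => a.zip(b).map { case (p, q) => p + q })

  def parameterUpdate(w: Array[Double]): Array[Double] = w
}

// The optimizer is agnostic about what it is optimizing.
def gradientDescent(cost: CostFunction, xys: Seq[LabeledPoint],
                    w0: Array[Double], learningRate: Double, steps: Int): Array[Double] =
  (1 to steps).foldLeft(w0) { (w, _) =>
    val g = cost.gradient(w, xys)
    cost.parameterUpdate(w.zip(g).map { case (wi, gi) => wi - learningRate * gi })
  }
```

gradientDescent compiles against CostFunction alone, so swapping in a different model or regularizer never touches the optimizer.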
>> >>
>> >> I also get that the goal was to make the cost function for learning
>> >> linear models well configurable. My main concern was that the
>> >> solver itself was already too specifically bound to the kind of
>> >> cost function we wanted to optimize.
>> >>
>> >> Maybe it helps to discuss this in terms of the math of the
>> >> optimization methods instead of in the code. The mathematics tends
>> >> to be much more concise and small compared to the completely
>> >> unrolled code derived from the math.
>> >>
>> >> Just very briefly, and correct me if I'm wrong, this is about
>> >> optimizing some cost function C(w, xys) where w is the weight
>> >> vector and xys is a set of labeled vectors. For the update, we only
>> >> need to consider the gradient of C(w, xys).
>> >>
>> >> Typically
>> >>
>> >> C(w, xys) = C_reg(w) + sum over xys of loss(w, xy)
>> >>
>> >> so the gradient is
>> >>
>> >> grad C_reg(w) + sum over xys of grad loss(w, xy)
>> >>
>> >> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <[email protected]>
>> >> wrote:
>> >>> What tweaks would that be? I mean, what is required to implement
>> >>> L-BFGS?
>> >>>
>> >>> I guess that we won't get rid of the case statements because we
>> >>> have to decide between two code paths: one with and the other
>> >>> without a convergence criterion. But I think by pulling each
>> >>> branch into its own function, it becomes clearer now.
>> >>>
>> >>> I agree that the update step is not really the responsibility of
>> >>> the LossFunction. The only reason why the current prototype does
>> >>> it this way is that the regularization is part of the
>> >>> LossFunction. By moving the regularization out of the LossFunction
>> >>> and making it part of the Solver, we can avoid cluttering the
>> >>> LossFunction interface. However, then we have to provide different
>> >>> implementations for the Solver, one for each regularization type
>> >>> (L1, L2, etc.).
>> >>> Or we provide the regularization as a parameter which is
>> >>> responsible for updating the weight vector.
>> >>>
>> >>> What do you prefer?
>> >>>
>> >>> Cheers,
>> >>> Till
>> >>>
>> >>>
>> >>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hello Mikio and Till,
>> >>>>
>> >>>> I really like the idea of syntactic sugar to reduce the
>> >>>> boilerplate as well.
>> >>>>
>> >>>> I'm looking at the PR now. For me the goal was to decouple as
>> >>>> many things as possible, but in the end a lot of the logic ended
>> >>>> up in SGD itself. The previous design allowed us to implement
>> >>>> L-BFGS using Breeze in a clean manner, but I think that this new
>> >>>> design from Till can also work for that, with some tweaks. But we
>> >>>> have to really look closely and see if it makes things simpler.
>> >>>> The SGD implementation looks simpler, I think, but there is still
>> >>>> the need for the case statements, and I'm not sure yet how we can
>> >>>> get rid of them.
>> >>>>
>> >>>> The one thing that I'm not sure we want is the coupling of the
>> >>>> update step to the loss function (and to the regularization by
>> >>>> extension). It's something that is also in the current
>> >>>> implementation of the optimization framework, but I was hoping we
>> >>>> could get rid of that. As Mikio said, we want interfaces that are
>> >>>> clean, and updating the weights should not be the
>> >>>> "responsibility" of the loss or the regularization; rather, it
>> >>>> should be handled by a function that takes the required
>> >>>> information to perform the update.
>> >>>>
>> >>>> I'm a fan of how Breeze does optimization: they have a
>> >>>> FirstOrderMinimizer interface with takeStep and
>> >>>> chooseDescentDirection functions, and the actual optimization
>> >>>> steps are performed in an infiniteIterations function implemented
>> >>>> in FirstOrderMinimizer that all the first-order minimization
>> >>>> algorithms use. The thing that changes in the various algorithms
>> >>>> is how the gradients are calculated, how the weights are updated,
>> >>>> etc.
>> >>>>
>> >>>> I would definitely recommend taking a look at their code.
>> >>>>
>> >>>>
>> >>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>> >>>>
>> >>>> I opened a PR [1] with the above-mentioned changes. Would be
>> >>>> great if you could take a look and give me some feedback on
>> >>>> whether this is now easier to understand.
>> >>>>
>> >>>> Cheers,
>> >>>> Till
>> >>>>
>> >>>> [1] https://github.com/apache/flink/pull/740
>> >>>>
>> >>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann
>> >>>> <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi Mikio,
>> >>>>>
>> >>>>> I agree that we should add some more syntactic sugar to make
>> >>>>> programming with Scala more succinct and more fun :-)
>> >>>>> Especially for the use case where you have a simple map
>> >>>>> operation with a broadcast variable. I like your approach of
>> >>>>> wrapping the, admittedly, cumbersome RichMapFunction creation in
>> >>>>> a helper function. We could even use the pimp-my-library pattern
>> >>>>> to make it "part" of the usual DataSet.
>> >>>>>
>> >>>>> Something like that works:
>> >>>>>
>> >>>>> val x: DataSet[X] = ...
>> >>>>> val bcVar: DataSet[B] = ...
>> >>>>>
>> >>>>> x.mapWithBroadcast(bcVar) {
>> >>>>>   (element: X, b: B) => element - b
>> >>>>> }
>> >>>>>
>> >>>>> Great idea Mikio.
>> >>>>> I really like that :-) I guess that we all got a little bit
>> >>>>> "betriebsblind" (blind from routine) and stopped questioning ;-)
>> >>>>>
>> >>>>> Concerning the strong coupling of SGD with the different parts
>> >>>>> of the loss function, the idea was to make it easily
>> >>>>> configurable. I agree, though, that one could at least
>> >>>>> encapsulate the prediction function as part of the loss function
>> >>>>> and also add the regularization function to it. This would
>> >>>>> simplify the code of SGD. A possible interface for a loss
>> >>>>> function could look like
>> >>>>>
>> >>>>> trait LossFunction {
>> >>>>>   def loss(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): Double
>> >>>>>
>> >>>>>   def gradient(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): WeightVector
>> >>>>>
>> >>>>>   def updateWeightVector(oldWeightVector: WeightVector,
>> >>>>>     newWeightVector: WeightVector,
>> >>>>>     regularizationValue: Double): WeightVector
>> >>>>> }
>> >>>>>
>> >>>>> I can try to make a prototype of this interface to see how it
>> >>>>> looks and feels.
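A concrete instance of such a trait, squared loss for a linear model, might look like the following plain-Scala sketch. LabeledVector and WeightVector are stand-in case classes here, not the actual Flink ML types, and the names are illustrative:

```scala
// Stand-ins for the Flink ML types discussed in the thread.
case class WeightVector(weights: Array[Double], intercept: Double)
case class LabeledVector(label: Double, vector: Array[Double])

trait LossFunction {
  def loss(lv: LabeledVector, wv: WeightVector): Double
  def gradient(lv: LabeledVector, wv: WeightVector): WeightVector
}

object SquaredLoss extends LossFunction {
  // Linear model prediction: w . x + b
  private def predict(lv: LabeledVector, wv: WeightVector): Double =
    wv.weights.zip(lv.vector).map { case (w, x) => w * x }.sum + wv.intercept

  // 1/2 * (prediction - label)^2
  def loss(lv: LabeledVector, wv: WeightVector): Double = {
    val residual = predict(lv, wv) - lv.label
    0.5 * residual * residual
  }

  // Gradient w.r.t. the weights is residual * x; w.r.t. the intercept, residual.
  def gradient(lv: LabeledVector, wv: WeightVector): WeightVector = {
    val residual = predict(lv, wv) - lv.label
    WeightVector(lv.vector.map(_ * residual), residual)
  }
}
```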
>> >>>>>
>> >>>>> Cheers,
>> >>>>>
>> >>>>> Till
>> >>>>>
>> >>>>>
>> >>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun
>> >>>>> <[email protected]> wrote:
>> >>>>>>
>> >>>>>> I've been playing around with a few ideas, but you could have
>> >>>>>> a helper RichMapFunction like
>> >>>>>>
>> >>>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>> >>>>>>     extends RichMapFunction[In, Out] {
>> >>>>>>
>> >>>>>>   var f: (In) => Out = _
>> >>>>>>
>> >>>>>>   override def open(parameters: Configuration): Unit = {
>> >>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>> >>>>>>   }
>> >>>>>>
>> >>>>>>   def map(in: In): Out = f(in)
>> >>>>>> }
>> >>>>>>
>> >>>>>> which takes a function which gets the broadcast value and then
>> >>>>>> returns a function doing the actual map, and you would then
>> >>>>>> use it like this:
>> >>>>>>
>> >>>>>> def mapWithBroadcast[B, I, O: ClassTag: TypeInformation](
>> >>>>>>     xs: DataSet[I], bc: DataSet[B])(fct: (B) => (I) => O): DataSet[O] = {
>> >>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct))
>> >>>>>>     .withBroadcastSet(bc, "dummy")
>> >>>>>> }
>> >>>>>>
>> >>>>>> And then...
>> >>>>>>
>> >>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>> >>>>>> val sum = x.reduce(_ + _)
>> >>>>>>
>> >>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>> >>>>>>
>> >>>>>> Has something like this been considered at any point? Or are
>> >>>>>> you all at the point where you can type in the broadcasting
>> >>>>>> stuff while you sleep ;) ?
>> >>>>>>
>> >>>>>> -M
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun
>> >>>>>> <[email protected]> wrote:
>> >>>>>> > Hi Theodore and Till,
>> >>>>>> >
>> >>>>>> > last Friday Till suggested I have a look at the gradient
>> >>>>>> > code.
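For what it's worth, the shape of this helper can be mimicked with plain collections, which makes the closure pattern visible without any Flink machinery. Seq stands in for DataSet and a plain value for the reduced broadcast set; this is purely illustrative:

```scala
// The helper's shape: first consume the broadcast value, then map.
def mapWithBroadcast[B, I, O](xs: Seq[I], bc: B)(fct: B => I => O): Seq[O] = {
  val f = fct(bc) // corresponds to open(): resolve the broadcast value once
  xs.map(f)       // corresponds to map(): apply the resulting function per element
}

val xs = Seq(1, 2, 3, 4, 5, 6, 7)
val sum = xs.sum // stands in for the reduced DataSet
val centered = mapWithBroadcast(xs, sum)(s => x => x - s)
```

The curried shape (B) => (I) => O is what lets the expensive part (resolving the broadcast value) happen once, while the returned function is applied per element.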
>> >>>>>> > Took me a while, but I think I got the main structure.
>> >>>>>> >
>> >>>>>> > I think I understand why the code is written the way it is,
>> >>>>>> > but just speaking from a purely data science perspective, I
>> >>>>>> > find that a lot of the code is super-heavy on boilerplate.
>> >>>>>> > The LossCalculation and similar RichMapFunctions in
>> >>>>>> > GradientDescent easily take up 20 lines of code just to
>> >>>>>> > unpack a few broadcast variables and call a single function.
>> >>>>>> > I understand why we need to use broadcast variables here,
>> >>>>>> > and that this is a common idiom in Flink, but I wonder
>> >>>>>> > whether one cannot reduce the boilerplate code.
>> >>>>>> >
>> >>>>>> > In my experience, it is pretty important when implementing
>> >>>>>> > ML algorithms that the mapping back from the code to the
>> >>>>>> > algorithm is still somewhat clear, because if you need to do
>> >>>>>> > modifications or debugging, it's important that you still
>> >>>>>> > know what corresponds to what, and with so many lines of
>> >>>>>> > code just moving data around, that's hardly the case.
>> >>>>>> >
>> >>>>>> > A general design decision I would like to question is the
>> >>>>>> > tight integration of the specific loss functions,
>> >>>>>> > regularizers, etc. into the GradientDescent code. Given that
>> >>>>>> > it is really a pretty general approach which is applicable
>> >>>>>> > to many, many different models, I think it's better to have
>> >>>>>> > the stochastic gradient code be as agnostic as possible
>> >>>>>> > about exactly what it is optimizing.
>> >>>>>> > Instead, you would pass it a function which contains a
>> >>>>>> > method to compute a function value, and its derivative, and
>> >>>>>> > possibly some action to regularize the parameters (if this
>> >>>>>> > cannot already be done with the function itself).
>> >>>>>> >
>> >>>>>> > Then, such a function for learning linear models can be
>> >>>>>> > configured in the same way as it is right now. I personally
>> >>>>>> > would opt for optional parameters instead of the
>> >>>>>> > builder-method style, but that is ok with me.
>> >>>>>> >
>> >>>>>> > This would in particular mean that I would try to eliminate
>> >>>>>> > the case statements as much as possible.
>> >>>>>> >
>> >>>>>> > The result would then be a bunch of classes around the
>> >>>>>> > gradient descent optimizer which very clearly just encode
>> >>>>>> > that algorithm, and a bunch of classes around cost functions
>> >>>>>> > for linear models which encode that, and both parts talk
>> >>>>>> > with one another only through a small, clean interface.
>> >>>>>> > Plus, this interface could be reused for other optimization
>> >>>>>> > algorithms like BFGS, or the Trust Region Newton method we
>> >>>>>> > talked about last time.
>> >>>>>> >
>> >>>>>> > So my suggestions:
>> >>>>>> >   - think of some way to reduce boilerplate code for
>> >>>>>> >     broadcast variables
>> >>>>>> >   - split the gradient descent code and the specific model
>> >>>>>> >     for linear models to make the individual parts cleaner
>> >>>>>> >     and more extensible
>> >>>>>> >
>> >>>>>> > Hope this helps,
>> >>>>>> >
>> >>>>>> > -M
>> >>>>>> >
>> >>>>>> > --
>> >>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>> >>>>>
>> >>>>> --
>> >>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>> >>>>>
>> >>>>> [email protected]
>> >>>>> phone +493055599146
>> >>>>> mobile +4917671721261
>> >>>>>
>> >>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>> >>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
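As a footnote to Theodore's Breeze remark earlier in the thread: the template-method shape he describes, a base class owning the iteration loop with subclasses supplying direction and step, can be sketched as follows. These are illustrative names and signatures only, not Breeze's actual API:

```scala
// Template-method sketch: the base class owns the shared iteration loop,
// subclasses supply direction and step. Names are illustrative only.
abstract class FirstOrderMinimizer(maxIter: Int) {
  type Weights = Array[Double]

  def chooseDescentDirection(w: Weights, grad: Weights): Weights
  def takeStep(w: Weights, direction: Weights, iter: Int): Weights

  // The shared loop all first-order methods reuse.
  def minimize(gradient: Weights => Weights, w0: Weights): Weights =
    (1 to maxIter).foldLeft(w0) { (w, iter) =>
      takeStep(w, chooseDescentDirection(w, gradient(w)), iter)
    }
}

// Plain gradient descent is one instantiation of the template.
class SimpleGradientDescent(learningRate: Double, maxIter: Int)
    extends FirstOrderMinimizer(maxIter) {
  def chooseDescentDirection(w: Weights, grad: Weights): Weights =
    grad.map(g => -g)
  def takeStep(w: Weights, direction: Weights, iter: Int): Weights =
    w.zip(direction).map { case (wi, di) => wi + learningRate * di }
}
```

L-BFGS, SGD, and similar first-order methods would then differ only in those two overrides.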
