[Ok, so maybe this is exactly what is implemented, sorry if I'm just repeating you... ]
So

    C(w, xys) = C_reg(w) + sum over xys of losses

and the gradient is

    grad C_reg(w) + sum over xys of grad loss(w, xy)

For some regularization functions, regularization is better performed by an
explicit operation on the weights, like subtracting some value from a weight
if it is still too large (I forgot how exactly that works). The losses
consist of the model (for now, linear I guess) plus the kind of loss (hinge,
squared, and maybe logistic for LR).

So from the viewpoint of SGD, we only need a way to compute the gradient for
a set of examples, and maybe some way to update (e.g. shrink) the weights for
regularization. So SGD takes some

    trait CostFunction {
      def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector
      def parameterUpdate(weightVector: Vector): Vector
    }

so that SGD can run. On the other hand, we could have a

    class LinearModelCostFunction {
      val lossFunction: LossFunction
      val regularizationFunction: RegularizationFunction
      val regularizationConstant: Double

      def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector = {
        regularizationConstant * regularizationFunction.gradient(weightVector) +
          xys.map(xy => lossFunction.gradient(predict(weightVector, xy.vector), xy.label) * xy.vector).sum()
      }

      def parameterUpdate = regularizationFunction.parameterUpdate
    }

That way the optimizer and the cost function would be entirely decoupled, and
you could use SGD to train on practically anything which has a data-dependent
gradient.

What do you think?

-M

On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobr...@googlemail.com> wrote:
> Oh wait... continuing to type. Accidentally sent the message out too early.
>
> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobr...@googlemail.com>
> wrote:
>> Hi Till and Theodore,
>>
>> I think the code is cleaned up a lot now; introducing the
>> mapWithBcVariable helped a lot.
>>
>> I also get that the goal was to make the cost function for learning
>> linear models easily configurable.
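To make the decoupling concrete, here is a self-contained sketch of the two pieces talking only through the small interface. Everything here is illustrative: plain `Seq` and `Array[Double]` stand in for `DataSet[LabeledVector]` and `Vector`, and the names (`CostFunction`, `LinearSquaredLoss`, `descend`) are made up for this sketch, not the real API.

```scala
// Self-contained sketch; Seq and Array[Double] stand in for Flink's
// DataSet[LabeledVector] and Vector, and all names are illustrative.

case class LabeledVector(label: Double, features: Array[Double])

// The small interface the optimizer sees: a data-dependent gradient plus an
// optional explicit weight update (e.g. shrinkage for regularization).
trait CostFunction {
  def gradient(weights: Array[Double], xys: Seq[LabeledVector]): Array[Double]
  def parameterUpdate(weights: Array[Double]): Array[Double] = weights
}

// Squared loss for a linear model with L2 regularization, expressed against
// the interface above; the optimizer never sees these details.
class LinearSquaredLoss(regularizationConstant: Double) extends CostFunction {
  private def predict(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  def gradient(w: Array[Double], xys: Seq[LabeledVector]): Array[Double] = {
    // sum over examples of loss'(prediction, label) * x
    val dataGrad = xys
      .map { xy =>
        val residual = predict(w, xy.features) - xy.label
        xy.features.map(_ * residual)
      }
      .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
    // plus the gradient of the regularization term
    w.zip(dataGrad).map { case (wi, gi) => regularizationConstant * wi + gi }
  }
}

// Gradient descent only consumes CostFunction and knows nothing about
// linear models, losses, or regularizers.
def descend(cost: CostFunction, xys: Seq[LabeledVector],
            w0: Array[Double], stepSize: Double, iterations: Int): Array[Double] = {
  var w = w0
  for (_ <- 1 to iterations) {
    val g = cost.gradient(w, xys)
    w = cost.parameterUpdate(w.zip(g).map { case (wi, gi) => wi - stepSize * gi })
  }
  w
}
```

Fitting y = 2x with `descend(new LinearSquaredLoss(0.0), data, Array(0.0), 0.05, 200)` recovers a weight close to 2.0, without the optimizer ever touching the linear-model code.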
>> My main concern was that the solver
>> itself was already bound too specifically to the kind of cost function
>> we wanted to optimize.
>>
>> Maybe it helps to discuss this in terms of the math of the
>> optimization methods instead of the code. The mathematics tends to
>> be much more concise than the completely unrolled code derived from
>> it.
>>
>> Just very briefly, and correct me if I'm wrong: this is about
>> optimizing some cost function C(w, xys), where w is the weight vector
>> and xys is a set of labeled vectors. For the update, we only need to
>> consider the gradient of C(w, xys).
>>
>> Typically
>>
>>     C(w, xys) = C_reg(w) + sum over xys of losses
>>
>> so the gradient is
>>
>>     grad C_reg(w) + sum over xys of grad loss(w, xy)
>>
>> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <t...@data-artisans.com>
>> wrote:
>>> What tweaks would that be? I mean, what is required to implement L-BFGS?
>>>
>>> I guess that we won't get rid of the case statements, because we have to
>>> decide between two code paths: one with and the other without a convergence
>>> criterion. But I think by pulling each branch into its own function, it
>>> becomes clearer now.
>>>
>>> I agree that the update step is not really the responsibility of the
>>> LossFunction. The only reason why the current prototype does it this way is
>>> that the regularization is part of the LossFunction. By moving the
>>> regularization out of the LossFunction and making it part of the Solver, we
>>> can avoid cluttering the LossFunction interface. However, then we have to
>>> provide different implementations of the Solver, one for each
>>> regularization type (L1, L2, etc.). Or we provide the regularization as a
>>> parameter which is responsible for updating the weight vector.
>>>
>>> What do you prefer?
>>>
>>> Cheers,
>>> Till
>>>
>>>
>>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <t...@kth.se> wrote:
>>>>
>>>> Hello Mikio and Till,
>>>>
>>>> I really like the idea of syntactic sugar to reduce the boilerplate as
>>>> well.
>>>>
>>>> I'm looking at the PR now. For me the goal was to decouple as many things
>>>> as possible, but in the end a lot of the logic ended up in SGD itself.
>>>> The previous design allowed us to implement L-BFGS using Breeze in a
>>>> clean manner, but I think that this new design from Till can also work
>>>> for that, with some tweaks. But we have to look really closely and see
>>>> if it makes things simpler. The SGD implementation looks simpler, I
>>>> think, but there is still the need for the case statements, and I'm not
>>>> sure yet how we can get rid of them.
>>>>
>>>> The one thing that I'm not sure we want is the coupling of the update
>>>> step to the loss function (and the regularization by extension).
>>>> It's something that is also in the current implementation of the
>>>> optimization framework, but I was hoping we could get rid of that.
>>>> As Mikio said, we want interfaces that are clean, and updating the
>>>> weights should not be the "responsibility" of the loss or the
>>>> regularization; rather, it should be handled by a function that takes
>>>> the required information to perform the update.
>>>>
>>>> I'm a fan of how Breeze does optimization: they have a
>>>> FirstOrderMinimizer interface with takeStep and chooseDescentDirection
>>>> functions, and the actual optimization steps are performed in an
>>>> infiniteIterations function implemented in FirstOrderMinimizer that all
>>>> the first-order minimization algorithms use. The thing that changes in
>>>> the various algorithms is how the gradients are calculated, the weights
>>>> updated, etc.
>>>>
>>>> I would definitely recommend taking a look at their code.
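The Breeze-style split described above (shared iteration loop in the base trait, concrete algorithms only supplying the direction and the step) can be sketched in a few lines. The names `FirstOrderMinimizer`, `takeStep`, and `chooseDescentDirection` follow Breeze's design, but the simplified signatures and the plain `Array[Double]` state are assumptions of this sketch, not Breeze's actual API:

```scala
// Hedged sketch of a Breeze-style first-order minimizer skeleton.
// The shared loop lives in the base trait; a concrete algorithm only
// supplies the descent direction and the step.

trait FirstOrderMinimizer {
  // gradient of the cost at w (data dependence elided for brevity)
  def gradient(w: Array[Double]): Array[Double]

  // what varies between algorithms:
  def chooseDescentDirection(grad: Array[Double]): Array[Double]
  def takeStep(w: Array[Double], direction: Array[Double], iter: Int): Array[Double]

  // what is shared: the iteration loop itself
  def minimize(w0: Array[Double], iterations: Int): Array[Double] = {
    var w = w0
    for (iter <- 1 to iterations) {
      val dir = chooseDescentDirection(gradient(w))
      w = takeStep(w, dir, iter)
    }
    w
  }
}

// Plain gradient descent on f(w) = (w - 3)^2 as a minimal instance.
object Gd extends FirstOrderMinimizer {
  def gradient(w: Array[Double]): Array[Double] = w.map(wi => 2.0 * (wi - 3.0))
  def chooseDescentDirection(g: Array[Double]): Array[Double] = g.map(-_)
  def takeStep(w: Array[Double], d: Array[Double], iter: Int): Array[Double] =
    w.zip(d).map { case (wi, di) => wi + 0.1 * di }
}
```

An L-BFGS or trust-region variant would override only `chooseDescentDirection` and `takeStep`, which is exactly the decoupling being discussed for SGD.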
>>>>
>>>>
>>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>>>>
>>>> I opened a PR [1] with the above-mentioned changes. Would be great if you
>>>> could take a look and give me some feedback on whether this is now easier
>>>> to understand.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> [1] https://github.com/apache/flink/pull/740
>>>>
>>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann <t...@data-artisans.com>
>>>> wrote:
>>>>>
>>>>> Hi Mikio,
>>>>>
>>>>> I agree that we should add some more syntactic sugar to make programming
>>>>> with Scala more succinct and more fun :-) Especially for the use case
>>>>> where you have a simple map operation with a broadcast variable. I like
>>>>> your approach of wrapping the, admittedly, cumbersome RichMapFunction
>>>>> creation in a helper function. We could even use the pimp-my-library
>>>>> pattern to make it "part" of the usual DataSet.
>>>>>
>>>>> Something like that works:
>>>>>
>>>>>     val x: DataSet[X] = ...
>>>>>     val bcVar: DataSet[B] = ...
>>>>>
>>>>>     x.mapWithBroadcast(bcVar) {
>>>>>       (element: X, b: B) => element - b
>>>>>     }
>>>>>
>>>>> Great idea Mikio, I really like that :-) I guess that we all got a little
>>>>> bit "betriebsblind" (blind to our own routine) and stopped questioning ;-)
>>>>>
>>>>> Concerning the strong coupling of SGD with the different parts of the
>>>>> loss function, the idea was to make it easily configurable. I agree,
>>>>> though, that one could at least encapsulate the prediction function as
>>>>> part of the loss function and also add the regularization function to it.
>>>>> This would simplify the code of SGD.
A possible interface for a loss function could look like
>>>>>
>>>>>     trait LossFunction {
>>>>>       def loss(labeledVector: LabeledVector, weightVector: WeightVector): Double
>>>>>
>>>>>       def gradient(labeledVector: LabeledVector, weightVector: WeightVector): WeightVector
>>>>>
>>>>>       def updateWeightVector(oldWeightVector: WeightVector, newWeightVector: WeightVector,
>>>>>           regularizationValue: Double): WeightVector
>>>>>     }
>>>>>
>>>>> I can try to make a prototype of this interface to see how it looks and
>>>>> feels.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Till
>>>>>
>>>>>
>>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>>> wrote:
>>>>>>
>>>>>> I've been playing around with a few ideas, but you could have a helper
>>>>>> RichMapFunction like
>>>>>>
>>>>>>     class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>>>>>>       extends RichMapFunction[In, Out] {
>>>>>>
>>>>>>       var f: (In) => Out = _
>>>>>>
>>>>>>       override def open(parameters: Configuration): Unit = {
>>>>>>         f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>>>>>>       }
>>>>>>
>>>>>>       def map(in: In): Out = f(in)
>>>>>>     }
>>>>>>
>>>>>> which takes a function which gets the broadcast value and then returns
>>>>>> a function doing the actual map, and you would then use it like this:
>>>>>>
>>>>>>     def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>>>>>>         xs: DataSet[I], bc: DataSet[B])
>>>>>>         (fct: (B) => (I) => O): DataSet[O] = {
>>>>>>       xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct)).withBroadcastSet(bc, "dummy")
>>>>>>     }
>>>>>>
>>>>>> And then...
>>>>>>
>>>>>>     val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>>>>>>     val sum = x.reduce(_ + _)
>>>>>>
>>>>>>     val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>>>>>>
>>>>>> Has something like this been considered at any point? Or are you all
>>>>>> at the point where you can type in the broadcasting stuff while you
>>>>>> sleep ;) ?
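The helper above, combined with the pimp-my-library idea mentioned earlier in the thread, can be sketched Flink-free as an implicit class. Here `Seq` simulates `DataSet` and the single element of the second collection plays the broadcast value; all of that is a simulation for illustration, not the real broadcast mechanics:

```scala
// Flink-free sketch of the x.mapWithBroadcast(bcVar) { ... } syntax via an
// implicit class. Seq simulates DataSet, and the single element of `bc`
// plays the broadcast value; a real implementation would wrap the
// RichMapFunction helper shown in the thread instead.

implicit class MapWithBroadcastOps[I](xs: Seq[I]) {
  def mapWithBroadcast[B, O](bc: Seq[B])(fct: (I, B) => O): Seq[O] = {
    // in Flink this would come from getRuntimeContext.getBroadcastVariable
    val bcValue = bc.head
    xs.map(x => fct(x, bcValue))
  }
}

val x = Seq(1, 2, 3, 4, 5, 6, 7)
val sum = Seq(x.sum) // one-element "DataSet", standing in for x.reduce(_ + _)

val foo = x.mapWithBroadcast(sum)((element, s) => element - s)
```

The point of the design is that the broadcast plumbing lives once in the implicit class, so call sites read like a plain map.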
>>>>>>
>>>>>> -M
>>>>>>
>>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>>>> wrote:
>>>>>> > Hi Theodore and Till,
>>>>>> >
>>>>>> > last Friday Till suggested I have a look at the gradient code. Took me
>>>>>> > a while, but I think I got the main structure.
>>>>>> >
>>>>>> > I think I understand why the code is written the way it is, but just
>>>>>> > speaking from a purely data-science perspective, I find that a lot of
>>>>>> > the code is very heavy on boilerplate. The LossCalculation and similar
>>>>>> > RichMapFunctions in GradientDescent easily take up 20 lines of code
>>>>>> > just to unpack a few broadcast variables and call a single function. I
>>>>>> > understand why we need to use broadcast variables here, and that this
>>>>>> > is a common idiom in Flink, but I wonder whether one cannot reduce the
>>>>>> > boilerplate code.
>>>>>> >
>>>>>> > In my experience, it is pretty important when implementing ML
>>>>>> > algorithms that the mapping back from the code to the algorithm is
>>>>>> > still somewhat clear, because if you need to do modifications or
>>>>>> > debugging, it's important that you still know what corresponds to
>>>>>> > what, and with so many lines of code just moving data around, that's
>>>>>> > hardly the case.
>>>>>> >
>>>>>> > A general design decision I would like to question is the tight
>>>>>> > integration of the specific loss functions, regularizers, etc. into
>>>>>> > the GradientDescent code. Given that it is really a pretty general
>>>>>> > approach which is applicable to many different models, I think
>>>>>> > it's better to have the stochastic gradient code be as agnostic as
>>>>>> > possible about exactly what it is optimizing.
Instead, you would pass
>>>>>> > it a function which contains a method to compute the function value
>>>>>> > and its derivative, and possibly some action to regularize the
>>>>>> > parameters (if this cannot already be done with the function itself).
>>>>>> >
>>>>>> > Then, such a function for learning linear models can be configured in
>>>>>> > the same way as it is right now. I personally would opt for optional
>>>>>> > parameters instead of the builder-method style, but either is ok with
>>>>>> > me.
>>>>>> >
>>>>>> > This would in particular mean that I would try to eliminate the case
>>>>>> > statements as much as possible.
>>>>>> >
>>>>>> > The result would then be a bunch of classes around the gradient
>>>>>> > descent optimizer which very clearly just encode that algorithm, and a
>>>>>> > bunch of classes around cost functions for linear models which encode
>>>>>> > those, and both parts talk with one another only through a small,
>>>>>> > clean interface. Plus, this interface could be reused for other
>>>>>> > optimization algorithms like BFGS, or the trust-region Newton method
>>>>>> > we talked about last time.
>>>>>> >
>>>>>> > So my suggestions:
>>>>>> > - think of some way to reduce boilerplate code for broadcast
>>>>>> > variables.
>>>>>> > - split the gradient descent optimizer and the specific model for
>>>>>> > linear models, to make the individual parts cleaner and more
>>>>>> > extensible.
>>>>>> >
>>>>>> > Hope this helps,
>>>>>> >
>>>>>> > -M
>>>>>> >
>>>>>> > --
>>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>>>
>>>>> --
>>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>>>>
>>>>> i...@data-artisans.com
>>>>> phone +493055599146
>>>>> mobile +4917671721261
>>>>>
>>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
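As a footnote to the "explicit operation on the weights" that some regularizers need (mentioned at the top of the thread): for L1 regularization this is typically soft-thresholding applied after the gradient step. A minimal sketch, with the function name and signature assumed purely for illustration (not part of any Flink ML API):

```scala
// Hedged sketch of the explicit weight update some regularizers need:
// L1 regularization via soft-thresholding after the gradient step.
// Name and signature are illustrative, not an existing API.

def l1ParameterUpdate(weights: Array[Double],
                      regularizationConstant: Double,
                      stepSize: Double): Array[Double] = {
  val shrinkage = regularizationConstant * stepSize
  weights.map { w =>
    // move each weight towards zero by `shrinkage`, clamping at zero
    math.signum(w) * math.max(0.0, math.abs(w) - shrinkage)
  }
}
```

This is the kind of operation that would sit behind `parameterUpdate` in the decoupled `CostFunction` interface, so SGD itself stays ignorant of which regularizer is in play.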