Oh wait.. continuing to type. Accidentally sent the message out too early.

On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobr...@googlemail.com> wrote:
> Hi Till and Theodore,
>
> I think the code is cleaned up a lot now; introducing mapWithBcVariable
> helped a lot.
>
> I also get that the goal was to make the cost function for learning
> linear models easily configurable. My main concern was that the solver
> itself was already too specifically bound to the kind of cost function
> we wanted to optimize.
>
> Maybe it helps to discuss this in terms of the math of the
> optimization methods instead of the code. The mathematics tends to
> be much more concise than the completely unrolled code derived from it.
>
> Just very briefly, and correct me if I'm wrong, this is about
> optimizing some cost function C(w, xys), where w is the weight vector and
> xys is a set of labeled vectors. For the update, we only need to
> consider the gradient grad C(w, xys).
>
> Typically
>
>   C(w, xys) = C_reg(w) + sum over xys of loss(w, xy)
>
> so the gradient is
>
>   grad C(w, xys) = grad C_reg(w) + sum over xys of grad loss(w, xy)
>
> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <t...@data-artisans.com> wrote:
>> What tweaks would that be? I mean, what is required to implement L-BFGS?
>>
>> I guess that we won't get rid of the case statements because we have to
>> decide between two code paths: one with and one without a convergence
>> criterion. But I think that by pulling each branch into its own function,
>> it becomes clearer now.
>>
>> I agree that the update step is not really the responsibility of the
>> LossFunction. The only reason why the current prototype does it this way is
>> that the regularization is part of the LossFunction. By moving the
>> regularization out of the LossFunction and making it part of the Solver we
>> can avoid cluttering the LossFunction interface. However, then we have to
>> provide different implementations of the Solver, one for each
>> regularization type (L1, L2, etc.). Or we provide the regularization as a
>> parameter which is responsible for updating the weight vector.
>>
>> What do you prefer?
>>
>> Cheers,
>> Till
>>
>>
>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <t...@kth.se> wrote:
>>>
>>> Hello Mikio and Till,
>>>
>>> I really like the idea of syntactic sugar to reduce the boilerplate as
>>> well.
>>>
>>> I'm looking at the PR now. For me the goal was to decouple as many things
>>> as possible, but in the end a lot of the logic ended up in SGD itself. The
>>> previous design allowed us to implement L-BFGS using Breeze in a clean
>>> manner, but I think that this new design from Till can also work for that,
>>> with some tweaks. But we have to look really closely and see if it makes
>>> things simpler. The SGD implementation looks simpler, I think, but there
>>> is still the need for the case statements, and I'm not sure yet how we can
>>> get rid of them.
>>>
>>> The one thing that I'm not sure we want is the coupling of the update
>>> step to the loss function (and, by extension, to the regularization).
>>> It's something that is also in the current implementation of the
>>> optimization framework, but I was hoping we could get rid of that.
>>> As Mikio said, we want interfaces that are clean, and updating the weights
>>> should not be the "responsibility" of the loss or the regularization;
>>> rather, it should be handled by a function that takes the required
>>> information to perform the update.
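>>>
>>> Just to sketch what I mean (the names here are made up and I use plain
>>> Scala vectors instead of the FlinkML types, purely as an illustration):
>>>
>>> // The update is its own small function: it gets the current weights, a
>>> // (possibly already regularized) gradient and a step size, and returns
>>> // the new weights. Neither the loss nor the regularization appears here.
>>> trait WeightUpdate {
>>>   def takeStep(
>>>       weights: Vector[Double],
>>>       gradient: Vector[Double],
>>>       stepSize: Double): Vector[Double]
>>> }
>>>
>>> // The plain gradient step used by SGD would then be just:
>>> object SimpleGradientStep extends WeightUpdate {
>>>   def takeStep(
>>>       weights: Vector[Double],
>>>       gradient: Vector[Double],
>>>       stepSize: Double): Vector[Double] =
>>>     weights.zip(gradient).map { case (w, g) => w - stepSize * g }
>>> }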
>>>
>>> I'm a fan of how Breeze does optimization: they have a
>>> FirstOrderMinimizer interface with takeStep and chooseDescentDirection
>>> functions, and the actual optimization steps are performed in an
>>> infiniteIterations function implemented in FirstOrderMinimizer that all
>>> the first-order minimization algorithms use. The thing that changes in
>>> the various algorithms is how the gradients are calculated, how the
>>> weights are updated, etc.
>>>
>>> I would definitely recommend taking a look at their code.
>>>
>>>
>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>>>
>>> I opened a PR [1] with the above-mentioned changes. It would be great if
>>> you could take a look and give me some feedback on whether this is now
>>> easier to understand.
>>>
>>> Cheers,
>>> Till
>>>
>>> [1] https://github.com/apache/flink/pull/740
>>>
>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann <t...@data-artisans.com>
>>> wrote:
>>>>
>>>> Hi Mikio,
>>>>
>>>> I agree that we should add some more syntactic sugar to make programming
>>>> with Scala more succinct and more fun :-) Especially for the use case
>>>> where you have a simple map operation with a broadcast variable. I like
>>>> your approach of wrapping the, admittedly, cumbersome RichMapFunction
>>>> creation in a helper function. We could even use the pimp-my-library
>>>> pattern to make it "part" of the usual DataSet.
>>>>
>>>> Something like this works:
>>>>
>>>> val x: DataSet[X] = ...
>>>> val bcVar: DataSet[B] = ...
>>>>
>>>> x.mapWithBroadcast(bcVar) {
>>>>   (element: X, value: B) => element - value
>>>> }
>>>>
>>>> Great idea, Mikio. I really like that :-) I guess that we all got a
>>>> little bit routine-blind ("betriebsblind") and stopped questioning ;-)
>>>>
>>>> Concerning the strong coupling of SGD with the different parts of the
>>>> loss function, the idea was to make it easily configurable. I agree,
>>>> though, that one could at least encapsulate the prediction function as
>>>> part of the loss function and also add the regularization function to it.
>>>> This would simplify the code of SGD. A possible interface for a loss
>>>> function could look like
>>>>
>>>> trait LossFunction {
>>>>   def loss(labeledVector: LabeledVector, weightVector: WeightVector): Double
>>>>
>>>>   def gradient(labeledVector: LabeledVector, weightVector: WeightVector): WeightVector
>>>>
>>>>   def updateWeightVector(oldWeightVector: WeightVector, newWeightVector: WeightVector, regularizationValue: Double): WeightVector
>>>> }
>>>>
>>>> I can try to make a prototype of this interface to see how it looks and
>>>> feels.
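>>>>
>>>> Just to get a feel for it already, a very rough sketch of how a squared
>>>> loss for a linear model might fill this in. The types are simplified
>>>> stand-ins rather than the real LabeledVector/WeightVector, and the update
>>>> semantics (a plain L2 shrinkage of the new weights) is only one possible
>>>> choice:
>>>>
>>>> // Stand-ins to keep the sketch self-contained.
>>>> case class Example(label: Double, features: Array[Double])
>>>> case class Weights(values: Array[Double])
>>>>
>>>> object SquaredLoss {
>>>>   // prediction of the linear model: dot product of features and weights
>>>>   private def predict(example: Example, weights: Weights): Double =
>>>>     example.features.zip(weights.values).map { case (x, w) => x * w }.sum
>>>>
>>>>   // loss = 1/2 * (prediction - label)^2
>>>>   def loss(example: Example, weights: Weights): Double = {
>>>>     val residual = predict(example, weights) - example.label
>>>>     0.5 * residual * residual
>>>>   }
>>>>
>>>>   // gradient of the loss w.r.t. the weights: (prediction - label) * features
>>>>   def gradient(example: Example, weights: Weights): Weights = {
>>>>     val residual = predict(example, weights) - example.label
>>>>     Weights(example.features.map(_ * residual))
>>>>   }
>>>>
>>>>   // one possible update: shrink the new weights towards zero, which
>>>>   // corresponds to a simple L2 penalty; the old weights would only be
>>>>   // needed for fancier updates (e.g. proximal steps for L1)
>>>>   def updateWeightVector(oldWeights: Weights, newWeights: Weights,
>>>>       regularizationValue: Double): Weights =
>>>>     Weights(newWeights.values.map(_ * (1.0 - regularizationValue)))
>>>> }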
>>>>
>>>> Cheers,
>>>>
>>>> Till
>>>>
>>>>
>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>> wrote:
>>>>>
>>>>> I've been playing around with a few ideas, but you could have a helper
>>>>> RichMapFunction like
>>>>>
>>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>>>>>   extends RichMapFunction[In, Out] {
>>>>>
>>>>>   var f: (In) => Out = _
>>>>>
>>>>>   override def open(parameters: Configuration): Unit = {
>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>>>>>   }
>>>>>
>>>>>   def map(in: In): Out = f(in)
>>>>> }
>>>>>
>>>>> which takes a function that gets the broadcast value and returns a
>>>>> function doing the actual map. You would then use it like this:
>>>>>
>>>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>>>>>     xs: DataSet[I], bc: DataSet[B])
>>>>>     (fct: (B) => (I) => O): DataSet[O] = {
>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct)).withBroadcastSet(bc, "dummy")
>>>>> }
>>>>>
>>>>> And then...
>>>>>
>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>>>>> val sum = x.reduce(_ + _)
>>>>>
>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>>>>>
>>>>> Has something like this been considered at any point? Or are you all
>>>>> at the point where you can type in the broadcasting stuff while you
>>>>> sleep ;) ?
>>>>>
>>>>> -M
>>>>>
>>>>>
>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun <mikiobr...@googlemail.com>
>>>>> wrote:
>>>>> > Hi Theodore and Till,
>>>>> >
>>>>> > Last Friday Till suggested I have a look at the gradient code. It took
>>>>> > me a while, but I think I got the main structure.
>>>>> >
>>>>> > I think I understand why the code is written the way it is, but
>>>>> > speaking from a purely data science perspective, I find that a lot of
>>>>> > the code is very heavy on boilerplate. The LossCalculation and similar
>>>>> > RichMapFunctions in GradientDescent easily take up 20 lines of code
>>>>> > just to unpack a few broadcast variables and call a single function. I
>>>>> > understand why we need to use broadcast variables here, and that this
>>>>> > is a common idiom in Flink, but I wonder whether one cannot reduce the
>>>>> > boilerplate code.
>>>>> >
>>>>> > In my experience, it is pretty important when implementing ML
>>>>> > algorithms that the mapping back from the code to the algorithm is
>>>>> > still somewhat clear, because if you need to make modifications or
>>>>> > debug something, it's important that you still know what corresponds
>>>>> > to what, and with so many lines of code just moving data around,
>>>>> > that's hardly the case.
>>>>> >
>>>>> > A general design decision I would like to question is the tight
>>>>> > integration of the specific loss functions, regularizers, etc. into
>>>>> > the GradientDescent code. Given that it is really a pretty general
>>>>> > approach which is applicable to many different models, I think it's
>>>>> > better to have the stochastic gradient code be as agnostic as possible
>>>>> > about exactly what it is optimizing. Instead, you would pass it a
>>>>> > function which contains a method to compute the function value and its
>>>>> > derivative, and possibly some action to regularize the parameters (if
>>>>> > this cannot already be done with the function itself).
>>>>> >
>>>>> > Then, such a function for learning linear models can be configured in
>>>>> > the same way as it is right now.
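>>>>> >
>>>>> > Very roughly, and just as a sketch with made-up names and plain Scala
>>>>> > vectors (the real thing would of course work on DataSets and the
>>>>> > FlinkML vector types), the interface the solver would see could look
>>>>> > something like:
>>>>> >
>>>>> > // Everything the solver needs to know about what it is optimizing.
>>>>> > trait DifferentiableFunction {
>>>>> >   // value of the cost function at the given weights
>>>>> >   def value(weights: Vector[Double]): Double
>>>>> >
>>>>> >   // derivative of the cost function at the given weights
>>>>> >   def gradient(weights: Vector[Double]): Vector[Double]
>>>>> >
>>>>> >   // optional extra action on the parameters, e.g. for regularization;
>>>>> >   // by default it does nothing
>>>>> >   def regularize(weights: Vector[Double]): Vector[Double] = weights
>>>>> > }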
>>>>> >
>>>>> > I personally would opt for optional parameters instead of the
>>>>> > builder-method style, but that is OK with me.
>>>>> >
>>>>> > This would in particular mean that I would try to eliminate the case
>>>>> > statements as much as possible.
>>>>> >
>>>>> > The result would then be a bunch of classes around the gradient
>>>>> > descent optimizer which very clearly just encode that algorithm, and a
>>>>> > bunch of classes around cost functions for linear models which encode
>>>>> > those, and both parts talk with one another only through a small,
>>>>> > clean interface. Plus, this interface could be reused for other
>>>>> > optimization algorithms like BFGS, or the Trust Region Newton method
>>>>> > we talked about last time.
>>>>> >
>>>>> > So my suggestions:
>>>>> > - think of some way to reduce the boilerplate code for broadcast
>>>>> >   variables
>>>>> > - split the gradient descent solver and the specific model for linear
>>>>> >   models to make the individual parts cleaner and more extensible
>>>>> >
>>>>> > Hope this helps,
>>>>> >
>>>>> > -M
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>>>
>>>>>
>>>>> --
>>>>> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>>>>
>>>>
>>>>
>>>> --
>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>>>
>>>> i...@data-artisans.com
>>>> phone +493055599146
>>>> mobile +4917671721261
>>>>
>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>>>
>>>
>>>
>>> --
>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>>
>>> i...@data-artisans.com
>>> phone +493055599146
>>> mobile +4917671721261
>>>
>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>>>
>>>
>>
>>
>>
>> --
>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>>
>> i...@data-artisans.com
>> phone +493055599146
>> mobile +4917671721261
>>
>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>> Managing Directors: Kostas Tzoumas, Stephan Ewen
>
>
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun

--
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun