I am in favor of efficiency. Therefore I would prefer to introduce new methods, in order to save memory and network traffic. This would also solve the problem of "how to come up with ids?"
Best regards,
Felix

On 08.06.2015 12:52 PM, "Sachin Goel" <sachingoel0...@gmail.com> wrote:

> I think if the user doesn't provide IDs, we can safely assume that they
> don't need them. We can simply assign an ID of one as a temporary measure
> and return the result with no IDs [just to make the interface cleaner].
> If the IDs are provided, we simply use those IDs.
> A possible template for this would be:
>
> implicit def predictValues[T <: Vector] = {
>   new PredictOperation[SVM, T, LabeledVector] {
>     override def predict(
>         instance: SVM,
>         predictParameters: ParameterMap,
>         input: DataSet[T])
>       : DataSet[LabeledVector] = {
>       predict(predictParameters, input.map(x => (1, x))).map(x => x._2)
>     }
>   }
> }
>
> implicit def predictValuesWithId[T <: (ID, Vector)] = {
>   new PredictOperation[SVM, T, (ID, LabeledVector)] {
>     override def predict(
>         instance: SVM,
>         predictParameters: ParameterMap,
>         input: DataSet[T])
>       : DataSet[(ID, LabeledVector)] = {
>       predict(predictParameters, input)
>     }
>   }
> }
>
> Regards
> Sachin Goel
>
> On Mon, Jun 8, 2015 at 4:11 PM, Till Rohrmann <till.rohrm...@gmail.com>
> wrote:
>
> > My gut feeling is also that a `Transformer` would be a good place to
> > implement feature selection. Then you can reuse it across multiple
> > algorithms by simply chaining them together.
> >
> > However, I don't know yet what's the best way to realize the IDs. One
> > way would be to add an ID field to `Vector` and `LabeledVector`. Another
> > way would be to provide operations for `(ID, Vector)` and `(ID,
> > LabeledVector)` tuple types which reuse the implementations for `Vector`
> > and `LabeledVector`. This means that the developer doesn't have to
> > implement special operations for the tuple variants. The latter approach
> > has the advantage that you only use memory for IDs if you really need
> > them.
> >
> > Another question is how to assign the IDs. Does the user have to provide
> > them? Are they randomly chosen? Or do we assign each element an
> > increasing index based on the total number of elements?
> >
> > On Mon, Jun 8, 2015 at 12:00 PM Mikio Braun <mikiobr...@googlemail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I think there are a number of issues here:
> > >
> > > - whether or not we generally need ids for our examples. For
> > > time-series, this is a must, but I think it would also help us with
> > > many other things (like partitioning the data, or picking a consistent
> > > subset), so I would think adding (numeric) ids in general to
> > > LabeledVector would be ok.
> > > - Some machinery to select features. My biggest concern with putting
> > > that as a parameter to the learning algorithm is that it is something
> > > independent of the learning algorithm, so every algorithm would need
> > > to duplicate the code for that. I think it's better if the learning
> > > algorithm can assume that the LabeledVector already contains all the
> > > relevant features, and that there are other operations to project or
> > > extract a subset of examples.
> > >
> > > -M
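For illustration, a minimal sketch of the Transformer-based feature selection Mikio describes. `FeatureSelector` is hypothetical (it does not exist in FlinkML), and the operation signatures are mirrored from Sachin's template above:

    // Hypothetical FeatureSelector: projects every example onto the given
    // feature indices, so learners can assume the LabeledVector already
    // contains only the relevant features. A sketch, not actual FlinkML code.
    class FeatureSelector(val indices: Array[Int])
      extends Transformer[FeatureSelector]

    object FeatureSelector {
      // nothing is learned from the data, so fitting is a no-op
      implicit val fitFeatureSelector =
        new FitOperation[FeatureSelector, LabeledVector] {
          override def fit(
              instance: FeatureSelector,
              fitParameters: ParameterMap,
              input: DataSet[LabeledVector]): Unit = {}
        }

      // keep only the configured indices of every vector
      implicit val selectFeatures =
        new TransformOperation[FeatureSelector, LabeledVector, LabeledVector] {
          override def transform(
              instance: FeatureSelector,
              transformParameters: ParameterMap,
              input: DataSet[LabeledVector]): DataSet[LabeledVector] = {
            input.map { lv =>
              LabeledVector(lv.label,
                DenseVector(instance.indices.map(i => lv.vector(i))))
            }
          }
        }
    }

Chained in front of a predictor (e.g. new FeatureSelector(Array(0, 2)).chainPredictor(SVM())), this would avoid duplicating the projection logic in every algorithm.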
> > >
> > > On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann <till.rohrm...@gmail.com>
> > > wrote:
> > > > You're right Felix. You need to provide the `FitOperation` and
> > > > `PredictOperation` for the `Predictor` you want to use and the
> > > > `FitOperation` and `TransformOperation` for all `Transformer`s you
> > > > want to chain in front of the `Predictor`.
> > > >
> > > > Specifying which features to take could be a solution. However, then
> > > > you're always carrying along data which is not needed. Especially for
> > > > large scale data, this might be prohibitively expensive. I guess the
> > > > more efficient solution would be to assign an ID and later join with
> > > > the removed feature elements.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel <sachingoel0...@gmail.com>
> > > > wrote:
> > > >
> > > > > A more general approach would be to take as input which indices of
> > > > > the vector to consider as features. After that, the vector can be
> > > > > returned as such and the user can do what they wish with the
> > > > > non-feature values. This wouldn't need extending the predict
> > > > > operation; instead, it can be specified in the model itself using a
> > > > > set parameter function. Or perhaps a better approach is to just
> > > > > take this input in the predict operation.
> > > > >
> > > > > Cheers!
> > > > > Sachin
> > > > > On Jun 8, 2015 10:17 AM, "Felix Neutatz" <neut...@googlemail.com>
> > > > > wrote:
> > > > >
> > > > > > We probably also need it for the other classes of the pipeline,
> > > > > > in order to be able to pass the ID through the whole pipeline.
> > > > > >
> > > > > > Best regards,
> > > > > > Felix
> > > > > > On 06.06.2015 9:46 AM, "Till Rohrmann" <trohrm...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Then you only have to provide an implicit PredictOperation[SVM,
> > > > > > > (T, Int), (LabeledVector, Int)] value with T <: Vector in the
> > > > > > > scope where you call the predict operation.
> > > > > > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" <neut...@googlemail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > That would be great. I like the special predict operation
> > > > > > > > better because it is only necessary in some cases to return
> > > > > > > > the ID. The special predict operation would save this
> > > > > > > > overhead.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Felix
> > > > > > > > On 04.06.2015 7:56 PM, "Till Rohrmann"
> > > > > > > > <till.rohrm...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I see your problem. One way to solve it is to implement a
> > > > > > > > > special PredictOperation which takes a tuple (id, vector)
> > > > > > > > > and returns a tuple (id, labeledVector). You can take a
> > > > > > > > > look at the implementation of the vector prediction
> > > > > > > > > operation.
> > > > > > > > >
> > > > > > > > > But we can also discuss adding an ID field to the Vector
> > > > > > > > > type.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
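A rough sketch of the special PredictOperation Till describes, with the operation shape mirrored from Sachin's template above. The Long id type and the scoring placeholder are assumptions; the real decision function depends on the trained model:

    // Sketch: a predict operation that carries an id through prediction,
    // turning DataSet[(Long, Vector)] into DataSet[(Long, LabeledVector)].
    implicit def idPredictOperation[T <: Vector] = {
      new PredictOperation[SVM, (Long, T), (Long, LabeledVector)] {
        override def predict(
            instance: SVM,
            predictParameters: ParameterMap,
            input: DataSet[(Long, T)])
          : DataSet[(Long, LabeledVector)] = {
          // placeholder for the model's actual decision function
          def scoreWithModel(vector: T): Double = ???
          // the id is simply kept next to the prediction
          input.map { case (id, vector) =>
            (id, LabeledVector(scoreWithModel(vector), vector))
          }
        }
      }
    }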
> > > > > > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz"
> > > > > > > > > <neut...@googlemail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I have the following use case: I want to do regression
> > > > > > > > > > for a time-series dataset like:
> > > > > > > > > >
> > > > > > > > > > id, x1, x2, ..., xn, y
> > > > > > > > > >
> > > > > > > > > > id = point in time
> > > > > > > > > > x = features
> > > > > > > > > > y = target value
> > > > > > > > > >
> > > > > > > > > > In the Flink framework I would map this to a
> > > > > > > > > > LabeledVector (y, DenseVector(x)). (I don't want to use
> > > > > > > > > > the id as a feature.)
> > > > > > > > > >
> > > > > > > > > > When I finally apply the predict() method I get a
> > > > > > > > > > LabeledVector (y_predicted, DenseVector(x)).
> > > > > > > > > >
> > > > > > > > > > Now my problem is that I would like to plot the predicted
> > > > > > > > > > target value against its time.
> > > > > > > > > >
> > > > > > > > > > What I have to do now is:
> > > > > > > > > >
> > > > > > > > > > a = predictedDataSet.map(LabeledVector => Tuple2(x, y_p))
> > > > > > > > > > b = originalDataSet.map("id, x1, x2, ..., xn, y" =>
> > > > > > > > > >   Tuple2(x, id))
> > > > > > > > > >
> > > > > > > > > > a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }
> > > > > > > > > >
> > > > > > > > > > This is a really cumbersome process for such a simple
> > > > > > > > > > thing. Is there any approach which makes this simpler? If
> > > > > > > > > > not, can we extend the ML API to allow IDs?
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Felix
> > >
> > > --
> > > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
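Assuming the tuple-based predict operation sketched above existed, Felix's use case would collapse to a few lines. SVM, LabeledVector, and the imports are FlinkML as of this discussion; the (Long, Vector) predict support is the proposal under discussion, not existing API:

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.classification.SVM
    import org.apache.flink.ml.common.LabeledVector
    import org.apache.flink.ml.math.Vector

    // (point in time, features): the id lives next to the feature vector
    val timeSeries: DataSet[(Long, Vector)] = ???
    val trainingData: DataSet[LabeledVector] = ???

    val svm = SVM()
    svm.fit(trainingData)

    // with the proposed implicit in scope, the id survives prediction,
    // so no join on the feature vector is needed afterwards
    val predictions: DataSet[(Long, LabeledVector)] = svm.predict(timeSeries)
    val toPlot: DataSet[(Long, Double)] =
      predictions.map { case (t, lv) => (t, lv.label) }  // (time, y_predicted)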