Re: Problem with ML pipeline

Theodore Vasiloudis Mon, 08 Jun 2015 03:37:32 -0700

I agree with Mikio: ids would be useful overall, and feature selection
should not be part of the learning algorithms. Learners should be able
to assume that all features in a LabeledVector are relevant.
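
To make the separation concrete, here is a minimal sketch of what is being
proposed: examples carry an id, feature selection is its own step, and
prediction returns the id together with the predicted value. It uses plain
Scala collections rather than Flink DataSets, and the names `Example`,
`projectFeatures` and `predictWithId` are hypothetical; none of them are
part of the Flink ML API. `Vector` here is Scala's immutable collection,
not the Flink ML vector type.

```scala
// Sketch only: plain Scala collections stand in for Flink DataSets, and
// every name below is hypothetical (not part of the Flink ML API).
object IdPipelineSketch {

  // An example that carries a numeric id next to its label and features.
  final case class Example(id: Long, label: Double, features: Vector[Double])

  // Feature selection as a separate step: keep only the listed feature
  // indices. The id and the label pass through untouched.
  def projectFeatures(data: Seq[Example], keep: Seq[Int]): Seq[Example] =
    data.map(e => e.copy(features = keep.map(i => e.features(i)).toVector))

  // Prediction that returns (id, prediction), so no join on the feature
  // vector is needed afterwards. `model` is any function from features
  // to a predicted value.
  def predictWithId(
      data: Seq[Example],
      model: Vector[Double] => Double): Seq[(Long, Double)] =
    data.map(e => (e.id, model(e.features)))

  def main(args: Array[String]): Unit = {
    val raw = Seq(
      Example(1L, 0.0, Vector(1.0, 10.0, 100.0)),
      Example(2L, 0.0, Vector(2.0, 20.0, 200.0))
    )
    // Keep features 0 and 2, then apply a toy model that sums the features.
    val projected = projectFeatures(raw, Seq(0, 2))
    val predictions = predictWithId(projected, _.sum)
    predictions.foreach { case (id, y) => println(s"$id -> $y") }
  }
}
```

With this split, the learner never needs to know which columns were
dropped, and the (id, prediction) pairs can be plotted directly.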


On Mon, Jun 8, 2015 at 12:00 PM, Mikio Braun <mikiobr...@googlemail.com>
wrote:

> Hi all,
>
> I think there are a number of issues here:
>
> - whether or not we generally need ids for our examples. For
> time-series, this is a must, but I think it would also help us with
> many other things (like partitioning the data, or picking a consistent
> subset), so I would think adding (numeric) ids in general to
> LabeledVector would be ok.
> - Some machinery to select features. My biggest concern with
> putting that in as a parameter to the learning algorithm is that it is
> something independent of the learning algorithm, so every algorithm
> would need to duplicate the code for it. I think it's better if the
> learning algorithm can assume that the LabeledVector already contains
> all the relevant features, and then there should be other operations
> to project or extract a subset of examples.
>
> -M
>
> On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann <till.rohrm...@gmail.com>
> wrote:
> > You're right, Felix. You need to provide the `FitOperation` and
> > `PredictOperation` for the `Predictor` you want to use and the
> > `FitOperation` and `TransformOperation` for all `Transformer`s you want
> > to chain in front of the `Predictor`.
> >
> > Specifying which features to take could be a solution. However, then
> > you're always carrying along data which is not needed. Especially for
> > large-scale data, this might be prohibitively expensive. I guess the
> > more efficient solution would be to assign an ID and later join with
> > the removed feature elements.
> >
> > Cheers,
> > Till
> >
> > On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel <sachingoel0...@gmail.com> wrote:
> >
> >> A more general approach would be to take as input which indices of the
> >> vector to consider as features. After that, the vector can be returned
> >> as is and the user can do what they wish with the non-feature values.
> >> This wouldn't need extending the predict operation; instead, it can be
> >> specified in the model itself using a set-parameter function. Or
> >> perhaps a better approach is to just take this input in the predict
> >> operation.
> >>
> >> Cheers!
> >> Sachin
> >> On Jun 8, 2015 10:17 AM, "Felix Neutatz" <neut...@googlemail.com> wrote:
> >>
> >> > We probably need it for the other classes of the pipeline as well,
> >> > in order to be able to pass the ID through the whole pipeline.
> >> >
> >> > Best regards,
> >> > Felix
> >> > On Jun 6, 2015 9:46 AM, "Till Rohrmann" <trohrm...@apache.org> wrote:
> >> >
> >> > > Then you only have to provide an implicit
> >> > > PredictOperation[SVM, (T, Int), (LabeledVector, Int)] value with
> >> > > T <: Vector in the scope where you call the predict operation.
> >> > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" <neut...@googlemail.com> wrote:
> >> > >
> >> > > > That would be great. I like the special predict operation better
> >> > > > because returning the id is only necessary in some cases. The
> >> > > > special predict operation would save this overhead.
> >> > > >
> >> > > > Best regards,
> >> > > > Felix
> >> > > > On Jun 4, 2015 7:56 PM, "Till Rohrmann" <till.rohrm...@gmail.com> wrote:
> >> > > >
> >> > > > > I see your problem. One way to solve it is to implement a
> >> > > > > special PredictOperation which takes a tuple (id, vector) and
> >> > > > > returns a tuple (id, labeledVector). You can take a look at the
> >> > > > > implementation of the vector prediction operation.
> >> > > > >
> >> > > > > But we can also discuss adding an ID field to the Vector type.
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Till
> >> > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz" <neut...@googlemail.com> wrote:
> >> > > > >
> >> > > > > > Hi,
> >> > > > > >
> >> > > > > > I have the following use case: I want to do regression for a
> >> > > > > > time-series dataset like:
> >> > > > > >
> >> > > > > > id, x1, x2, ..., xn, y
> >> > > > > >
> >> > > > > > id = point in time
> >> > > > > > x = features
> >> > > > > > y = target value
> >> > > > > >
> >> > > > > > In the Flink framework I would map this to a
> >> > > > > > LabeledVector(y, DenseVector(x)). (I don't want to use the id
> >> > > > > > as a feature.)
> >> > > > > >
> >> > > > > > When I finally apply the predict() method, I get a
> >> > > > > > LabeledVector(y_predicted, DenseVector(x)).
> >> > > > > >
> >> > > > > > Now my problem is that I would like to plot the predicted
> >> > > > > > target value against its time.
> >> > > > > >
> >> > > > > > What I have to do now is:
> >> > > > > >
> >> > > > > > a = predictedDataSet.map ( LabeledVector => Tuple2(x,y_p))
> >> > > > > > b = originalDataSet.map("id, x1, x2, ..., xn, y" => Tuple2(x,id))
> >> > > > > >
> >> > > > > > a.join(b).where("x").equalTo("x") { (a,b) => (id, y_p) }
> >> > > > > >
> >> > > > > > This is a really cumbersome process for such a simple thing.
> >> > > > > > Is there any approach which makes this simpler? If not, can we
> >> > > > > > extend the ML API to allow ids?
> >> > > > > >
> >> > > > > > Best regards,
> >> > > > > > Felix
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
>
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>
