Re: Problem with ML pipeline

2015-06-08 Thread Sachin Goel
That would be better, of course. My point was about not implementing exactly the same thing twice. Perhaps Till could weigh in here. We really do need to come up with a general mechanism for this. Testing labeled vectors has exactly the same problem. I'll look into how Spark and scikit-learn appro

Re: Problem with ML pipeline

2015-06-08 Thread Felix Neutatz
I am in favor of efficiency. Therefore, I would prefer to introduce new methods in order to save memory and network traffic. This would also solve the problem of "how to come up with ids?" Best regards, Felix On 08.06.2015 at 12:52 PM, "Sachin Goel" wrote: > I think if the user doesn't prov

Re: Problem with ML pipeline

2015-06-08 Thread Sachin Goel
I think if the user doesn't provide IDs, we can safely assume that they don't need them. We can simply assign an ID of one as a temporary measure and return the result with no IDs [just to make the interface cleaner]. If IDs are provided, we simply use them. A possible te

Re: Problem with ML pipeline

2015-06-08 Thread Till Rohrmann
My gut feeling is also that a `Transformer` would be a good place to implement feature selection. Then you can reuse it across multiple algorithms simply by chaining them together. However, I don't know yet what the best way to realize the IDs is. One way would be to add an ID field to `Vect

Re: Problem with ML pipeline

2015-06-08 Thread Sachin Goel
Yes. I agree too. It makes no sense for the learning algorithm to have extra payload. Only relevant data makes sense. Further, adding ID to the predict operation type definition seems a legitimate choice. +1 from my side. Regards Sachin Goel On Mon, Jun 8, 2015 at 4:06 PM, Theodore Vasiloudis < t

Re: Problem with ML pipeline

2015-06-08 Thread Theodore Vasiloudis
I agree with Mikio; IDs would be useful overall, and feature selection should not be part of the learning algorithms. The learners should assume that all features in a LabeledVector are relevant. On Mon, Jun 8, 2015 at 12:00 PM, Mikio Braun wrote: > Hi all, > > I think there are a number of issu

Re: Problem with ML pipeline

2015-06-08 Thread Mikio Braun
Hi all, I think there are a number of issues here: - whether or not we generally need IDs for our examples. For time series, this is a must, but I think it would also help us with many other things (like partitioning the data, or picking a consistent subset), so I would think adding (numeric) ids i

Re: Problem with ML pipeline

2015-06-08 Thread Till Rohrmann
You're right Felix. You need to provide the `FitOperation` and `PredictOperation` for the `Predictor` you want to use and the `FitOperation` and `TransformOperation` for all `Transformer`s you want to chain in front of the `Predictor`. Specifying which features to take could be a solution. However

Re: Problem with ML pipeline

2015-06-07 Thread Sachin Goel
A more general approach would be to take as input the indices of the vector that should be treated as features. After that, the vector can be returned as is, and the user can do what they wish with the non-feature values. This wouldn't require extending the predict operation; instead, this can be specified in the
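The index-based idea in this message can be sketched in plain Scala. This is only an illustration under stated assumptions: `FeatureIndexSketch` and `splitByIndices` are hypothetical names, not Flink ML API, and a plain `Vector[Double]` from the Scala standard library stands in for Flink's vector types.

```scala
// Hypothetical helper, not part of Flink ML.
// scala.Vector[Double] is a stand-in for Flink's own vector classes.
object FeatureIndexSketch {

  // Splits a raw row into (features, rest): the values at `featureIndices`,
  // in the order given, and all remaining values in their original order.
  def splitByIndices(
      row: Vector[Double],
      featureIndices: Seq[Int]): (Vector[Double], Vector[Double]) = {
    val selected = featureIndices.toSet
    val features = featureIndices.map(i => row(i)).toVector
    val rest = row.indices.filterNot(selected.contains).map(i => row(i)).toVector
    (features, rest)
  }
}
```

For a row laid out as `id, x1, x2, y`, calling `FeatureIndexSketch.splitByIndices(Vector(0.0, 1.5, 2.5, 9.0), Seq(1, 2))` would hand only `x1` and `x2` to a learner, while the ID and target stay available to the caller in the second component.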

Re: Problem with ML pipeline

2015-06-07 Thread Felix Neutatz
We probably need it for the other classes of the pipeline as well, in order to be able to pass the ID through the whole pipeline. Best regards, Felix On 06.06.2015 at 9:46 AM, "Till Rohrmann" wrote: > Then you only have to provide an implicit PredictOperation[SVM, (T, Int), > (LabeledVect

Re: Problem with ML pipeline

2015-06-06 Thread Till Rohrmann
Then you only have to provide an implicit `PredictOperation[SVM, (T, Int), (LabeledVector, Int)]` value with `T <: Vector` in the scope where you call the predict operation. On Jun 6, 2015 8:14 AM, "Felix Neutatz" wrote: > That would be great. I like the special predict operation better because it >
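As a rough illustration of the implicit-value approach described in this message, the following self-contained Scala sketch uses simplified stand-ins. Everything in it is an assumption made for illustration: `LinearModel` replaces `SVM`, `Seq[Double]` replaces Flink's `Vector`, the `PredictOperation` trait here works on single values rather than on Flink `DataSet`s, and none of the signatures are taken from the actual Flink ML code.

```scala
object IdPredictSketch {
  // Stand-ins for the types named in the thread (assumed shapes, not the real classes).
  final case class LabeledVector(label: Double, features: Seq[Double])
  final case class LinearModel(weights: Seq[Double])

  // Simplified type class: how a model of type M maps an In to an Out.
  trait PredictOperation[M, In, Out] {
    def predict(model: M, input: In): Out
  }

  // Plain prediction on a bare feature vector (dot product as a toy score).
  implicit val vectorPredict: PredictOperation[LinearModel, Seq[Double], LabeledVector] =
    new PredictOperation[LinearModel, Seq[Double], LabeledVector] {
      def predict(model: LinearModel, input: Seq[Double]): LabeledVector = {
        val score = model.weights.zip(input).map { case (w, x) => w * x }.sum
        LabeledVector(score, input)
      }
    }

  // ID-carrying variant: the ID never reaches the model and is
  // re-attached to the prediction afterwards.
  implicit val idPredict: PredictOperation[LinearModel, (Seq[Double], Int), (LabeledVector, Int)] =
    new PredictOperation[LinearModel, (Seq[Double], Int), (LabeledVector, Int)] {
      def predict(model: LinearModel, input: (Seq[Double], Int)): (LabeledVector, Int) = {
        val (features, id) = input
        (vectorPredict.predict(model, features), id)
      }
    }

  // The compiler picks the operation from implicit scope based on the input type.
  def predict[In, Out](model: LinearModel, input: In)(
      implicit op: PredictOperation[LinearModel, In, Out]): Out =
    op.predict(model, input)
}
```

With `import IdPredictSketch._` in scope, `predict(LinearModel(Seq(0.5, 2.0)), (Seq(1.0, 2.0), 42))` resolves to the ID-carrying operation and returns the labeled vector paired with ID 42, whereas passing a bare `Seq[Double]` resolves to the plain operation. Callers that do not need IDs therefore pay no extra cost.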

Re: Problem with ML pipeline

2015-06-05 Thread Felix Neutatz
That would be great. I like the special predict operation better because it is necessary to return the ID only in some cases. The special predict operation would save this overhead. Best regards, Felix On 04.06.2015 at 7:56 PM, "Till Rohrmann" wrote: > I see your problem. One way to solve the

Re: Problem with ML pipeline

2015-06-04 Thread Till Rohrmann
I see your problem. One way to solve it is to implement a special `PredictOperation` which takes a tuple (id, vector) and returns a tuple (id, labeledVector). You can take a look at the implementation for the vector prediction operation. But we can also discuss adding an ID field to t

Problem with ML pipeline

2015-06-04 Thread Felix Neutatz
Hi, I have the following use case: I want to do regression on a time-series dataset like: id, x1, x2, ..., xn, y (id = point in time, x = features, y = target value). In the Flink framework I would map this to a LabeledVector (y, DenseVector(x)). (I don't want to use the id as a feature.) When I ap