FWIW the JIRA I was thinking about is
https://issues.apache.org/jira/browse/SPARK-3098
On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:
I vaguely remember that JIRA and AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you only use operations like map which
preserve partitioning, ordering should be guaranteed from what I know.
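The point above can be illustrated outside Spark. Here is a plain-Python sketch (not the Spark API) of why map-derived datasets stay aligned: an element-wise transform preserves order, so zipping the input with its output pairs each element with its own result. For RDDs the same holds as long as no shuffle reorders things in between.

```python
# Plain-Python analogue: an order-preserving "map" keeps elements aligned,
# so zip pairs each input with its own derived value.
docs = ["spark is fast", "mllib has tf idf", "rdds keep order"]
token_counts = [len(d.split()) for d in docs]  # order-preserving transform
aligned = list(zip(docs, token_counts))
# aligned[0] == ("spark is fast", 3)
```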
On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen wrote:
Dang I can't seem to find the JIRA now but I am sure we had a discussion
with Matei about this and the conclusion was that RDD order is not
guaranteed unless a sort is involved.
On Mar 17, 2015 12:14 AM, "Joseph Bradley" wrote:
This was brought up again in
https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one item
which was asked about the reliability of zipping RDDs. Basically, it
should be reliable, and if it is not, then it should be reported as a bug.
This general approach should work (with explicit ty
Hopefully the new pipeline API addresses this problem. We have a code
example here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
-Xiangrui
On Mon, Dec 29, 2014 at 5:22 AM, andy petrella wrote:
Here is what I did for this case : https://github.com/andypetrella/tf-idf
On Mon, Dec 29, 2014 at 11:31 AM, Sean Owen wrote:
Given (label, terms) you can just transform the values to a TF vector,
then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
make a LabeledPoint from (label, vector) pairs. Is that what you're
looking for?
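The recipe above might be sketched like this in plain Python. This is not the actual Spark API (the real classes are HashingTF, IDF/IDFModel, and LabeledPoint in MLlib); the bucket count, hash function, and helper names here are illustrative assumptions, and the IDF formula shown is the smoothed log((n+1)/(df+1)) form.

```python
import math

# Hypothetical plain-Python sketch of the HashingTF -> IDF -> LabeledPoint
# flow. NUM_FEATURES is kept tiny for readability; Spark's HashingTF
# defaults to 2**20 buckets.
NUM_FEATURES = 16

def hashing_tf(terms):
    """Term-frequency vector via the hashing trick."""
    vec = [0.0] * NUM_FEATURES
    for t in terms:
        vec[hash(t) % NUM_FEATURES] += 1.0
    return vec

# (label, terms) pairs, as in the message above
labeled_docs = [(1.0, ["spark", "mllib", "spark"]),
                (0.0, ["hadoop", "mapreduce"])]

# Step 1: transform the values to TF vectors
tf = [(label, hashing_tf(terms)) for label, terms in labeled_docs]

# Step 2: fit IDF over the corpus, idf(j) = log((n + 1) / (df(j) + 1))
n = len(tf)
df = [sum(1 for _, v in tf if v[j] > 0) for j in range(NUM_FEATURES)]
idf = [math.log((n + 1) / (df[j] + 1)) for j in range(NUM_FEATURES)]

# Step 3: the (label, tf-idf vector) pairs play the role of LabeledPoints
labeled_points = [(label, [x * w for x, w in zip(v, idf)])
                  for label, v in tf]
```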
On Mon, Dec 29, 2014 at 3:37 AM, Yao wrote:
I found the TF-IDF feature extraction, and all the MLlib code that works with
pure Vector RDDs, very difficult to work with because there is no way to
associate a vector back to the original data. Why can't Spark MLlib support
LabeledPoint?
Yeah, I initially used zip, but I was wondering how reliable it is. I mean,
is the order guaranteed? What if some node fails and the data is pulled
from different nodes?
And even if it can work, I find this implicit semantic quite
uncomfortable, don't you?
My 0.2c
On Fri, Nov 21, 2014 at 15:26,
Thanks for the info Andy. A big help.
One thing - I think you can figure out which document is responsible for which
vector without checking in more code.
Start with a PairRDD of [doc_id, doc_string] for each document and split that
into one RDD for each column.
The values in the doc_string RDD
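The keyed approach described above can be sketched in plain Python (not the Spark API): carry the doc_id alongside the document through every transform, so each derived value stays attributable to its source document with no reliance on zip ordering. In Spark this would be a pair RDD with a mapValues-style transform.

```python
# Plain-Python sketch of keeping a key paired with each document:
# keys are untouched, only the values are transformed.
pairs = [("doc-1", "spark is fast"), ("doc-2", "mllib has tf idf")]

# mapValues-style transform: tokenize the value, keep the doc_id
tokenized = [(doc_id, text.split()) for doc_id, text in pairs]
# tokenized[0] == ("doc-1", ["spark", "is", "fast"])
```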
/Someone will correct me if I'm wrong./
Actually, TF-IDF scores terms for a given document, and specifically TF does.
Internally, these things hold a Vector (hopefully sparse)
representing all the possible words (up to 2²⁰) per document. So each
document, after applying TF, will be transformed in
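The per-document representation described above might look like this in plain Python (not the Spark API): conceptually a 2²⁰-dimensional vector, but stored sparsely as index → count for only the terms that actually occur. The modulus mirrors the hashing trick; the helper name is hypothetical.

```python
from collections import Counter

# Sparse TF sketch: one entry per distinct hashed term, instead of a
# dense 2**20-element array per document.
NUM_FEATURES = 2 ** 20

def sparse_tf(terms):
    counts = Counter(hash(t) % NUM_FEATURES for t in terms)
    return dict(counts)  # {feature_index: term_frequency}

vec = sparse_tf(["spark", "spark", "mllib"])
# at most two nonzero entries, with counts summing to 3
```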