tions of rows to aid with shuffling ... if this is the
case, any suggestions on how to ameliorate this situation?
I am still a bit confused whether numbers like these can be aggregated as
double:
iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))
It should be aggregated using something like List[iVal*jVal, colMags(i),
colMags(j)]
I am not sure Algebird can aggregate deterministically over Doubles
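A minimal sketch of the aggregation being suggested here, in Spark Scala: sum the raw iVal*jVal contributions per column pair first (a plain commutative sum), and apply the min(sg, colMags) denominator once at the end. The contributions RDD, colMags array and sg value are hypothetical stand-ins, not code from the PR.

import org.apache.spark.rdd.RDD

object SimilarityAggregationSketch {
  // contributions: one record per co-occurring entry pair, ((i, j), iVal * jVal)
  // colMags: hypothetical precomputed column magnitudes; sg: hypothetical gamma
  def aggregate(contributions: RDD[((Int, Int), Double)],
                colMags: Array[Double],
                sg: Double): RDD[((Int, Int), Double)] = {
    val bcMags = contributions.sparkContext.broadcast(colMags)
    contributions
      // Sum the raw dot-product contributions first; the denominator for a
      // given (i, j) is constant, so it can be applied once at the end
      // instead of being folded into every partial aggregate.
      .reduceByKey(_ + _)
      .map { case ((i, j), dot) =>
        val mags = bcMags.value
        ((i, j), dot / (math.min(sg, mags(i)) * math.min(sg, mags(j))))
      }
  }
}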
Hi Deb,
I am currently seeding the algorithm to be pseudo-random; this is an issue
being addressed in the PR. If you pull the current version it will be
deterministic, but potentially not pseudo-random. The PR will be updated
today.
Best,
Reza
On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das
wrote:
Hi Reza,
Have you tested if different runs of the algorithm produce different
similarities (basically if the algorithm is deterministic) ?
This number does not look like a Monoid aggregation...iVal * jVal /
(math.min(sg, colMags(i)) * math.min(sg, colMags(j)))
I am noticing some weird behavior as
Yup that's what I did for now...
On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh wrote:
> Hi Deb,
>
> I am not templating RowMatrix/CoordinateMatrix since that would be a big
> deviation from the PR. We can add jaccard and other similarity measures in
> later PRs.
>
> In the meantime, you can un-no
Hi Deb,
I am not templating RowMatrix/CoordinateMatrix since that would be a big
deviation from the PR. We can add jaccard and other similarity measures in
later PRs.
In the meantime, you can un-normalize the cosine similarities to get the
dot product, and then compute the other similarity measur
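A minimal sketch of that un-normalization for the Jaccard case, assuming 0/1 column data so the dot product equals the intersection size; colNorms is a hypothetical array of precomputed column L2 norms, and none of this is code from the PR itself.

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// cosineSims: column-similarity output, one MatrixEntry(i, j, cosine) per pair
// colNorms(k): L2 norm of column k; for 0/1 data, norm squared = number of non-zeros
def jaccardFromCosine(cosineSims: CoordinateMatrix,
                      colNorms: Array[Double]): CoordinateMatrix = {
  val entries = cosineSims.entries.map { case MatrixEntry(i, j, cos) =>
    val ni = colNorms(i.toInt)
    val nj = colNorms(j.toInt)
    // Un-normalize: dot(i, j) = cosine(i, j) * ||i|| * ||j||.
    val dot = cos * ni * nj
    // For binary columns the dot product is the intersection size, so
    // Jaccard = |i n j| / (|i| + |j| - |i n j|), with |k| = ||k||^2.
    MatrixEntry(i, j, dot / (ni * ni + nj * nj - dot))
  }
  new CoordinateMatrix(entries, cosineSims.numRows(), cosineSims.numCols())
}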
Hi Reza,
In similarColumns, it seems with cosine similarity I also need other
numbers such as intersection, jaccard and other measures...
Right now I modified the code to generate jaccard but I had to run it twice
due to the design of RowMatrix / CoordinateMatrix...I feel we should modify
RowMatr
Better to do it in a PR of your own; it's not sufficiently related to dimsum
On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das
wrote:
> Cool...can I add loadRowMatrix in your PR ?
>
> Thanks.
> Deb
>
> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh wrote:
>
>> Hi Deb,
>>
>> Did you mean to message me in
Cool...can I add loadRowMatrix in your PR ?
Thanks.
Deb
On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh wrote:
> Hi Deb,
>
> Did you mean to message me instead of Xiangrui?
>
> For TS matrices, dimsum with positiveinfinity and computeGramian have the
> same cost, so you can do either one. For dense
Hi Deb,
Did you mean to message me instead of Xiangrui?
For TS matrices, dimsum with positiveinfinity and computeGramian have the
same cost, so you can do either one. For dense matrices with say, 1m
columns this won't be computationally feasible and you'll want to start
sampling with dimsum.
It
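For the tall-and-skinny case, a small sketch of the computeGramian route: A^T A holds every pairwise column dot product, and dividing by the column norms (taken from its diagonal) gives cosine similarities. The normalization step is illustrative, not part of the PR.

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// mat: a tall-and-skinny RowMatrix (many rows, few columns)
def cosineFromGramian(mat: RowMatrix): Array[Array[Double]] = {
  val gram = mat.computeGramian()   // local n x n matrix of column dot products
  val n = gram.numCols
  val norms = Array.tabulate(n)(i => math.sqrt(gram(i, i)))
  Array.tabulate(n, n) { (i, j) =>
    if (norms(i) == 0.0 || norms(j) == 0.0) 0.0
    else gram(i, j) / (norms(i) * norms(j))
  }
}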
Hi Xiangrui,
For tall skinny matrices, if I can pass a similarityMeasure to
computeGramian, I could re-use the SVD's computeGramian for similarity
computation as well...
Do you recommend using this approach for tall skinny matrices, or just
using dimsum's routines ?
Right now RowMatrix does n
I will add dice, overlap, and jaccard similarity in a future PR, probably
still for 1.2
On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das
wrote:
> Awesome...Let me try it out...
>
> Any plans of putting other similarity measures in future (jaccard is
> something that will be useful) ? I guess it mak
Awesome...Let me try it out...
Any plans to add other similarity measures in the future (jaccard is
something that will be useful) ? I guess it makes sense to add some
similarity measures to mllib...
On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh wrote:
> Yes you're right, calling dimsum with gamm
Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it
into the usual brute-force algorithm for cosine similarity; there is no
sampling. This is by design.
On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das
wrote:
> I looked at the code: similarColumns(Double.posInf) is generating t
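For reference, in the API that later shipped in MLlib these two modes correspond to RowMatrix.columnSimilarities without and with a threshold; a small usage sketch against a toy matrix (assumes a spark-shell style sc):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// A toy 3 x 4 matrix; rows are observations, columns are the items being compared.
val rows: RDD[Vector] = sc.parallelize(Seq(
  Vectors.sparse(4, Seq((0, 1.0), (2, 3.0))),
  Vectors.sparse(4, Seq((1, 2.0), (3, 1.0))),
  Vectors.sparse(4, Seq((0, 4.0), (3, 5.0)))))
val mat = new RowMatrix(rows)

// Exact cosine similarities between all column pairs (brute force, no sampling).
val exact = mat.columnSimilarities()

// DIMSUM sampling: much cheaper, but approximate; pairs below the threshold
// may be dropped or estimated with higher relative error.
val approx = mat.columnSimilarities(0.1)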
I looked at the code: similarColumns(Double.posInf) is generating the brute
force...
Basically, will dimsum with gamma as PositiveInfinity produce the exact same
result as doing cartesian products of RDD[(product, vector)] and computing
similarities, or will there be some approximation?
Sorry I hav
For 60M x 10K, brute force and dimsum thresholding should both be fine.
For 60M x 10M, brute force probably won't work depending on the cluster's
power, and dimsum thresholding should work with an appropriate threshold.
Dimensionality reduction should help, and how effective it is will depend
on your appli
Also for tall and wide (rows ~60M, columns ~10M), I am considering running a
matrix factorization to reduce the dimension to say ~60M x 50 and then
running all-pair similarity...
Did you also try similar ideas and saw positive results ?
On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das
wrote:
> Ok...ju
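A hedged sketch of the reduce-then-compare idea from the tall-and-wide message above, using MLlib's ALS to get a ~60M x 50 row representation; the input RDD, rank, iteration count and lambda are placeholders, not recommendations.

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// entries: hypothetical (rowId, colId, value) triples of the tall-and-wide matrix
def reducedRowFactors(entries: RDD[(Int, Int, Double)],
                      rank: Int = 50): RDD[(Int, Array[Double])] = {
  val ratings = entries.map { case (i, j, v) => Rating(i, j, v) }
  // Factor A ~= U * V^t; the row (user) factors are the low-dimensional
  // representation to feed into an all-pairs similarity step afterwards.
  val model = ALS.train(ratings, rank, 10, 0.01)   // 10 iterations, lambda = 0.01
  model.userFeatures
}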
Ok...just to make sure: I have RowMatrix[SparseVector] where rows are ~60M
and columns are ~10M, say, with a billion data points...
I have another version that's around 60M by ~10K...
I guess for the second one both all-pairs and dimsum will run fine...
But for tall and wide, what do you suggest ? c
You might want to wait until Wednesday, since the interface in that PR will
be changing before then, probably over the weekend, so that you don't have
to redo your code. Your call if you need it sooner.
Reza
On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das
wrote:
> Ohh cool...all-pairs
Ohh cool...all-pairs brute force is also part of this PR ? Let me pull it
in and test on our dataset...
Thanks.
Deb
On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh wrote:
> Hi Deb,
>
> We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
> https://github.com/apache/spark/pull/1
Hi Deb,
We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
https://github.com/apache/spark/pull/1778
Your question wasn't entirely clear - does this answer it?
Best,
Reza
On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das
wrote:
> Hi Reza,
>
> Have you compared with the brute
Hi Reza,
Have you compared against a brute-force algorithm for similarity
computation, using something like the following in Spark ?
https://github.com/echen/scaldingale
I am adding cosine similarity computation but I do want to compute all-pair
similarities...
Note that the data is sparse for
Hi Guillaume,
Thanks for your explanation. It helps me a lot. I will try it.
Xiaoli
On 04/12/2014 06:35 PM, Xiaoli Li
wrote:
Hi Guillaume,
This sounds like a good idea to me. I am a newbie here. Could
you further explain how you will determine which clusters to
keep? According to the distance between each el
Hi Tom,
Thank you very much for your detailed explanation. I think it is very
helpful to me.
On Sat, Apr 12, 2014 at 1:06 PM, Tom V wrote:
> The last writer is suggesting using the triangle inequality to cut down
> the search space. If c is the centroid of cluster C, then the closest any
> p
The last writer is suggesting using the triangle inequality to cut down the
search space. If c is the centroid of cluster C, then the closest any
point in C is to x is ||x-c|| - r(C), where r(C) is the (precomputed)
radius of the cluster---the distance of the farthest point in C to c.
Whether you
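A small sketch of that pruning rule; the ClusterSummary type and distance function here are illustrative, not from any particular library.

// A cluster summarized by its centroid and radius, where the radius is the
// distance from the centroid to its farthest member.
case class ClusterSummary(id: Int, centroid: Array[Double], radius: Double)

def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Keep only clusters that could still hold a point closer to x than `best`:
// the closest any point of C can be to x is at least ||x - c|| - r(C).
def candidateClusters(x: Array[Double],
                      clusters: Seq[ClusterSummary],
                      best: Double): Seq[ClusterSummary] =
  clusters.filter(c => euclidean(x, c.centroid) - c.radius <= best)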
Hi Guillaume,
This sounds like a good idea to me. I am a newbie here. Could you further
explain how you will determine which clusters to keep? According to the
distance between each element and each cluster center?
Will you keep several clusters for each element when searching for nearest
neighbours? Thank
Hi,
I'm doing this here for multiple tens of millions of elements (and
the goal is to reach multiple billions), on a relatively small
cluster (7 nodes, 4 cores, 32GB RAM). We use multiprobe KLSH. All you
have to do is run a Kmeans on your data, then compute the distanc
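A hedged sketch of that cluster-then-search idea in MLlib terms, single-probe only (each point is compared within its nearest cluster; multiprobe would assign each point to several nearby centroids). The input RDD and parameters are hypothetical.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def sqDist(a: Vector, b: Vector): Double = {
  val x = a.toArray; val y = b.toArray
  var s = 0.0; var i = 0
  while (i < x.length) { val d = x(i) - y(i); s += d * d; i += 1 }
  s
}

// points: hypothetical RDD of (id, featureVector); k clusters; topN neighbours per point
def neighboursWithinClusters(points: RDD[(Long, Vector)], k: Int, topN: Int)
    : RDD[(Long, Seq[(Long, Double)])] = {
  val model = KMeans.train(points.map(_._2), k, 20)
  points
    .map { case (id, v) => (model.predict(v), (id, v)) }   // bucket by nearest centroid
    .groupByKey()                                          // compare only within a bucket
    .flatMap { case (_, members) =>
      val all = members.toIndexedSeq
      all.map { case (id, v) =>
        val nearest: Seq[(Long, Double)] = all
          .filter(_._1 != id)
          .map { case (otherId, w) => (otherId, sqDist(v, w)) }
          .sortBy(_._2)
          .take(topN)
        (id, nearest)
      }
    }
}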
Hi Reza,
Thank you for your information. I will try it.
On Fri, Apr 11, 2014 at 11:21 PM, Reza Zadeh wrote:
> Hi Xiaoli,
>
> There is a PR currently in progress to allow this, via the sampling scheme
> described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>
> The PR is at https://git
Hi Xiaoli,
There is a PR currently in progress to allow this, via the sampling scheme
described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
The PR is at https://github.com/apache/spark/pull/336 though it will need
refactoring given the recent changes to matrix interface in MLlib. You may
Hi Andrew,
Thanks for your suggestion. I have tried that method. I used 8 nodes, and
every node has 8G of memory. The program just stalled at one stage for
several hours without any further information. Maybe I need to find a more
efficient way.
On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash wr
The naive way would be to put all the users and their attributes into an
RDD, then cartesian product that with itself. Run the similarity score on
every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)) and
take the .top(k) for each user.
I doubt that you'll be able to take this appr
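A hedged sketch of that naive baseline with a placeholder dot-product similarity; grouping the full cartesian product is only realistic on small samples, which seems to be the caveat being raised here.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Placeholder similarity; swap in cosine, Jaccard, etc. as needed.
def similarity(a: Vector, b: Vector): Double =
  a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum

// users: hypothetical RDD of (userId, attributeVector)
def topKNaive(users: RDD[(Long, Vector)], k: Int): RDD[(Long, Seq[(Double, Long)])] =
  users.cartesian(users)
    .filter { case ((id1, _), (id2, _)) => id1 != id2 }            // skip self-pairs
    .map { case ((id1, v1), (id2, v2)) => (id1, (similarity(v1, v2), id2)) }
    .groupByKey()                                                  // all scores for one user
    .mapValues(_.toSeq.sortBy(-_._1).take(k))                      // keep the top k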
Hi all,
I am implementing an algorithm using Spark. I have one million users. I
need to compute the similarity between each pair of users using some user
attributes. For each user, I need to get the top k most similar users. What
is the best way to implement this?
Thanks.