Hi Vibhor,
We worked on a project to create Lucene indexes using Spark, but the project
has not been maintained for some time now. If there is interest we can
resurrect it:
https://github.com/vsumanth10/trapezium/blob/master/dal/src/test/scala/com/verizon/bda/trapezium/dal/lucene/LuceneIndexerSuite.sc
The batch version of this is part of the rowSimilarities JIRA 4823...if your
query points can fit in memory there is a broadcast version which we are
experimenting with internally...we are using brute force KNN right now in
the PR...based on the FLANN paper, LSH did not work well, but before you go to
approx
SparkSQL was built to improve upon Hive on Spark runtime further...
On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:
> Hive on Spark and SparkSQL which should be better , and what are the key
> characteristics and the advantages and the disadvantages b
You don't need a sort...use topByKey if your topK number is small...it uses a
Java heap (bounded priority queue)...
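A minimal sketch of that, assuming a pair RDD of (key, score) with Double scores (the names below are placeholders, not from this thread):

import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD

// topByKey keeps only the k largest values per key on a bounded priority queue,
// so there is no full sort and no groupBy over the whole value set.
def topPerKey(scored: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] =
  scored.topByKey(k)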
On May 24, 2015 10:53 AM, "Tal" wrote:
> Hi,
> I'm running this piece of code in my program:
>
> smallRdd.join(largeRdd)
> .groupBy { case (id, (_, X(a, _, _))) => a }
> .map { case (a, iterable)
What is a decision list? An in-order traversal (or some other traversal) of the
fitted decision tree.
On Jun 5, 2015 1:21 AM, "Sateesh Kavuri" wrote:
> Is there an existing way in SparkML to convert a decision tree to a
> decision list?
>
> On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh wrote:
>
>> The close
It's always better to use a quasi-Newton solver if the runtime and problem
scale permit, as there are guarantees on optimization...OWLQN and BFGS are
both quasi-Newton.
Most single-node code bases will run quasi-Newton solves...if you are
using SGD it is better to use AdaDelta/AdaGrad or similar tri
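For reference, a small self-contained breeze sketch of a quasi-Newton solve (LBFGS on a toy quadratic; the objective is made up purely for illustration):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

object QuasiNewtonSketch {
  def main(args: Array[String]): Unit = {
    // Minimize f(x) = ||x - 3||^2; calculate returns (value, gradient).
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val diff = x - DenseVector.fill(x.length)(3.0)
        (diff dot diff, diff * 2.0)
      }
    }
    val lbfgs = new LBFGS[DenseVector[Double]](100, 7) // maxIter = 100, memory m = 7
    val xStar = lbfgs.minimize(f, DenseVector.zeros[Double](5))
    println(xStar) // every coordinate converges to 3.0
  }
}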
We added SPARK-3066 for this. In 1.4 you should get code that does a BLAS
dgemm based calculation.
On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat <
ayman.fara...@yahoo.com.invalid> wrote:
> Thanks Sabarish and Nick
> Would you happen to have some code snippets that you can share.
> Best
> Ayman
>
>
Also I am not sure how threading helps here, because Spark assigns a partition
to each core. On each core there may be multiple hardware threads if you are
using Intel hyperthreading, but I would let Spark handle the threading.
On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das
wrote:
> We added SPARK-3066 for t
Also, in my experiments it's much faster to do blocked BLAS through cartesian
rather than doing sc.union. Here are the details on the experiments:
https://issues.apache.org/jira/browse/SPARK-4823
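A rough sketch of the cartesian-based blocked multiply (the block layout and types are assumptions for illustration; breeze's dense matrix product goes through native BLAS dgemm when netlib is available):

import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// userBlocks / itemBlocks: (blockId, factors with one row per user/item and rank columns).
// cartesian pairs every user block with every item block; each pair is one dense gemm.
def blockedScores(userBlocks: RDD[(Int, DenseMatrix[Double])],
                  itemBlocks: RDD[(Int, DenseMatrix[Double])]): RDD[((Int, Int), DenseMatrix[Double])] =
  userBlocks.cartesian(itemBlocks).map { case ((ubId, u), (ibId, v)) =>
    ((ubId, ibId), u * v.t) // (usersInBlock x rank) * (rank x itemsInBlock)
  }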
On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das
wrote:
> Also not sure how threading helps here
Running L1 and picking the non-zero coefficients gives a good estimate of the
interesting features as well...
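A hedged sketch of that recipe with the ml API (the regularization value and data path are placeholders, and the coefficients field name assumes a recent Spark):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

object L1FeatureSelect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("l1-select").master("local[2]").getOrCreate()
    val training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt") // placeholder path
    // elasticNetParam = 1.0 is a pure L1 penalty (OWLQN under the hood), so many coefficients land at exactly 0.
    val model = new LogisticRegression().setRegParam(0.1).setElasticNetParam(1.0).fit(training)
    model.coefficients.toArray.zipWithIndex
      .filter { case (w, _) => w != 0.0 }
      .foreach { case (w, i) => println(s"feature $i weight $w") }
    spark.stop()
  }
}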
On Jun 17, 2015 4:51 PM, "Xiangrui Meng" wrote:
> We don't have it in MLlib. The closest would be the ChiSqSelector,
> which works for categorical data. -Xiangrui
>
> On Thu, Jun 11, 2015 at 4:3
Hi,
The demo of end-to-end ML pipeline including the model server component at
Spark Summit was really cool.
I was wondering if the Model Server component is based upon Velox or it
uses a completely different architecture.
https://github.com/amplab/velox-modelserver
We are looking for an open s
> On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl
>> wrote:
>>
>>> Is velox NOT open source?
>>>
>>>
>>> On Saturday, June 20, 2015, Debasish Das
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The demo
Integration of model server with ML pipeline API.
On Sat, Jun 20, 2015 at 12:25 PM, Donald Szeto wrote:
> Mind if I ask what 1.3/1.4 ML features that you are looking for?
>
>
> On Saturday, June 20, 2015, Debasish Das wrote:
>
>> After getting used to Scala, writing
The servlet engine probably doesn't matter at all in comparison.
On Sat, Jun 20, 2015, 9:40 PM Debasish Das wrote:
> After getting used to Scala, writing Java is too much work :-)
>
> I am looking for scala based project that's using netty at its core (spray
> is one example).
Model sizes are in the 10M x rank and 100K x rank range.
For recommendation/topic modeling I can run batch recommendAll and then
keep serving the model using a distributed cache, but then I can't
incorporate per-user re-prediction when user feedback makes the
current topK stale. I have to wait for the next
Spark JobServer which would allow triggering
> re-computation jobs periodically. We currently just run batch
> re-computation and reload factors from S3 periodically.
>
> We then use Elasticsearch to post-filter results and blend content-based
> stuff - which I think might be more efficient
I got a good runtime improvement from Hive partitioning, caching the dataset
and increasing the cores through repartition...I think for your case
generating MySQL-style indexing will help further...it is not supported in
Spark SQL yet...
I know the dataset might be too big for 1 node mysql but do you
but I'm interested to see how far
> it can be pushed.
>
> Thanks for your help!
>
>
> -- Eric
>
> On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das
> wrote:
>
>> I got good runtime improvement from hive partitioninp, caching the
>> dataset and increasing
What do you need in SparkR that mllib / ml don't have...most of the basic
analysis that you need on a stream can be done through mllib components...
On Jul 13, 2015 2:35 PM, "Feynman Liang" wrote:
> Sorry; I think I may have used poor wording. SparkR will let you use R to
> analyze the data, but
How do you manage the spark context elastically when your load grows from
1000 users to 1 users ?
On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif
wrote:
> I have almost the same case. I will tell you what I am actually doing, if
> it
> is according to your requirement, then I will love to help y
>> (point, distances.filter(_._2 <= kthDistance._2))
>> }
>> }
>>
>> This is part of my Local Outlier Factor implementation.
>>
>> Of course the distances can be sorted because it is an Iterable, but it
>> gives an idea. Is it possi
Simultaneous actions work fine on a cluster if they are independent...on
local mode I never paid attention, but the code path should be similar...
On Jan 18, 2016 8:00 AM, "Koert Kuipers" wrote:
> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom
> wrote:
>
>> Hi,
>>
>> I am runni
You can run 2 threads in the driver and Spark will FIFO-schedule the 2 jobs on
the same Spark context you created (executors and cores)...the same idea is
used for the Spark SQL thriftserver flow...
For streaming I think it lets you run only one stream at a time even if you
run them on multiple threads on the dri
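A minimal sketch of the two-threads-on-one-context pattern (the local master and toy jobs are just for illustration; by default the scheduler runs the submitted jobs FIFO):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object TwoJobsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("two-jobs").setMaster("local[4]"))
    val data = sc.parallelize(1 to 1000000, 8).cache()
    // Two independent actions submitted from two driver threads share the same executors/cores.
    val sumJob = Future { data.map(_ * 2L).reduce(_ + _) }
    val countJob = Future { data.filter(_ % 3 == 0).count() }
    println(Await.result(sumJob, Duration.Inf))
    println(Await.result(countJob, Duration.Inf))
    sc.stop()
  }
}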
Use the breeze simplex, which in turn uses Apache Math's simplex...if you want
to use an interior point method you can use ECOS
https://github.com/embotech/ecos-java-scala ...the Spark Summit 2014 talk on the
quadratic solver in matrix factorization will show you an example integration
with Spark. ECOS runs as a JNI proc
t be steering this a bit off topic: does this need the simplex
> method? this is just an instance of nonnegative least squares. I don't
> think it relates to LDA either.
>
> Spark doesn't have any particular support for NNLS (right?) or simplex
> though.
>
> On Mon, N
le to add. You can add an issue in
breeze for the enhancement.
Alternatively you can use the breeze LP solver as well, which uses the simplex
from Apache Math.
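A hedged sketch of the breeze LP interface, modeled on breeze's own documented example (variables are non-negative reals; the objective and constraints here are arbitrary):

import breeze.optimize.linear.LinearProgram

object LpSketch {
  def main(args: Array[String]): Unit = {
    val lp = new LinearProgram()
    import lp._
    val x0 = Real()
    val x1 = Real()
    val x2 = Real()
    // maximize x0 + 2*x1 + 3*x2 subject to three linear constraints
    val problem = ((x0 + x1 * 2 + x2 * 3)
      subjectTo (x0 * -1 + x1 + x2 <= 20)
      subjectTo (x0 * 3 - x1 * 2 + x2 <= 30)
      subjectTo (x0 <= 40))
    val result = maximize(problem)
    println(result.result) // optimal values of (x0, x1, x2)
  }
}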
On Nov 4, 2015 1:05 AM, "Zhiliang Zhu" wrote:
> Hi Debasish Das,
>
> Firstly I must show my deep appreciation towards you kind he
Does it also support insert operations ?
On Jul 22, 2015 4:53 PM, "Bing Xiao (Bing)" wrote:
> We are happy to announce the availability of the Spark SQL on HBase
> 1.0.0 release.
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>
> The main features in this package, dubbed “As
Hi Yan,
Is it possible to access the HBase table through the Spark SQL JDBC layer ?
Thanks.
Deb
On Jul 22, 2015 9:03 PM, "Yan Zhou.sc" wrote:
> Yes, but not all SQL-standard insert variants .
>
>
>
> *From:* Debasish Das [mailto:debasish.da...@gmail.com]
> *Sent:* Wedn
>
> Graphically, the access path is as follows:
>
>
>
> Spark SQL JDBC Interface -> Spark SQL Parser/Analyzer/Optimizer->Astro
> Optimizer-> HBase Scans/Gets -> … -> HBase Region server
>
>
>
>
>
> Regards,
>
>
>
> Yan
>
>
>
Not sure about dropout, but if you change the solver from breeze BFGS to breeze
OWLQN or breeze.proximal.NonlinearMinimizer you can solve the ANN loss with L1
regularization, which will yield elastic-net style sparse solutions...using
that you can clean up edges which have 0.0 as weight...
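A small breeze sketch of the OWLQN route (the toy least-squares objective and L1 strength are made up; OWLQN adds the L1 term itself):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

object OwlqnSketch {
  def main(args: Array[String]): Unit = {
    // Smooth part: 0.5 * ||x - target||^2; the L1 penalty drives small weights exactly to zero.
    val target = DenseVector(0.0, 0.0, 5.0, 0.0, -4.0)
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val r = x - target
        (0.5 * (r dot r), r)
      }
    }
    // args: maxIter, LBFGS memory, per-coordinate L1 strength, tolerance
    val owlqn = new OWLQN[Int, DenseVector[Double]](100, 7, (_: Int) => 0.5, 1e-6)
    val w = owlqn.minimize(f, DenseVector.zeros[Double](5))
    println(w) // coordinates whose target is 0 stay at exactly 0.0
  }
}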
On Sep 7, 2015 7:35
You may want to pull up the release/1.2 branch and the 1.2.0 tag to build it
yourself in case the packages are not available.
On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim"
wrote:
> Hi Ayan,
>
> Thanks a million.
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher
I am not sure why I would use a pipeline to do scoring...the idea is to build a
model, use the model ser/deser feature to put it in the row or column store of
choice and provide API access to the model...we support these primitives
in github.com/Verizon/trapezium...the API has access to the Spark context in
loc
l.Model
>predict API". The predict API is in the old mllib package not the new ml
>package.
>- "why r we using dataframe and not the ML model directly from API" -
>Because as of now the new ml package does not have the direct API.
>
>
> On Sat, F
, graph and kernel models we use a lot, and for them it turned out that
mllib-style model predict was useful if we change the underlying store...
On Feb 4, 2017 9:37 AM, "Debasish Das" wrote:
> If we expose an API to access the raw models out of PipelineModel can't we
> call p
ector.
> There is no API exposed. It is WIP but not yet released.
>
> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das
> wrote:
>
>> If we expose an API to access the raw models out of PipelineModel can't
>> we call predict directly on it from an API ? Is there a task ope
If it is 7M rows and 700K features (or say 1M features), brute force row
similarity will run fine as well...check out SPARK-4823...you can compare
quality with the approximate variant...
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote:
> Hi everyone,
> Since spark 2.1.0 introduces LSH (http://spark.ap
You can run l
On May 15, 2017 3:29 PM, "Nipun Arora" wrote:
> Thanks all for your response. I will have a look at them.
>
> Nipun
>
> On Sat, May 13, 2017 at 2:38 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> It's in scala but it should be portable in java
>> https://githu
Hi,
ECOS is a solver for second order conic programs and we showed the Spark
integration at 2014 Spark Summit
https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/.
Right now the examples show how to reformulate matrix factorization as a
SOCP and solve ea
The open source impl of Dremel is Parquet!
On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta
wrote:
> Hi,
>
> why not just use dremel?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <
> lubomir.chorbadj...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to reproduce the exa
Hi Michael,
I want to cache an RDD and define get() and set() operators on it. Basically
like memcached. Is it possible to build a memcached-like distributed cache
using Spark SQL ? If not, what do you suggest we should use for such
operations...
Thanks.
Deb
On Fri, Jul 18, 2014 at 1:00 PM, Michae
> On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das
> wrote:
>
>> Hi Michael,
>>
>> I want to cache a RDD and define get() and set() operators on it.
>> Basically like memcached. Is it possible to build a memcached like
>> distributed cache using Spark SQL ? If
PM, Debasish Das
wrote:
> Thanks...this is what I was looking for...
>
> It will be great if Ankur can give brief details about it...Basically how
> does it contrast with memcached for example...
>
> On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust
> wrote:
>
>
...
Neither Play nor Spray is being used in Spark right now...so it brings in
dependencies and we already know about the akka conflicts...the thriftserver on
the other hand is already integrated for JDBC access.
On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das
wrote:
> Also I wanted to run get() and
Hi,
I am sometimes getting WARN from running Similarity calculation:
15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms
exceeds 45000ms
Do I need to increase the default 45 s to larger values for cases wh
Hi,
I am running brute force similarity from RowMatrix on a job with a 5M x 1.5M
sparse matrix with 800M entries. With 200M entries the job runs fine, but
with 800M I am getting exceptions like too many files open and no space
left on device...
Seems like I need more nodes or should use dimsum sampling ?
I
Did you check the GC time in the Spark
> UI? -Xiangrui
>
> On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das
> wrote:
> > Hi,
> >
> > I am sometimes getting WARN from running Similarity calculation:
> >
> > 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removi
Hi,
Before I send out the keys for the network shuffle, in reduceByKey after map +
combine are done, I would like to filter the keys based on some threshold...
Is there a way to get the key/value pairs after the map+combine stages so that
I can run a filter on the keys ?
Thanks.
Deb
s and apply your
> filtering. Then you can finish with a reduceByKey.
>
> On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das
> wrote:
>
>> Hi,
>>
>> Before I send out the keys for network shuffle, in reduceByKey after map
>> + combine are done, I would like to filt
it may mean you only shuffle (key,None) for some keys if the map-side
> combine already worked out that the key would be filtered.
>
> And then after, run a flatMap or something to make Option[B] into B.
>
> On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das
> wrote:
> > Hi,
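A minimal sketch of that Option trick under one concrete assumption: the filter is "drop keys whose total count exceeds a cap", which is safe to decide during map-side combine because partial sums can only grow:

import org.apache.spark.{SparkConf, SparkContext}

object ThresholdedReduceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("thresholded-reduce").setMaster("local[2]"))
    val cap = 10L // assumed threshold
    val pairs = sc.parallelize(Seq(("a", 1L), ("a", 20L), ("b", 2L), ("b", 3L)))
    val reduced = pairs
      .mapValues(v => Option(v))
      .reduceByKey { (a, b) =>
        (a, b) match {
          case (Some(x), Some(y)) =>
            val s = x + y
            if (s > cap) None else Some(s) // once over the cap, only (key, None) gets shuffled
          case _ => None
        }
      }
      .flatMapValues(_.toList) // Option[Long] back to Long, dropping the filtered keys
    reduced.collect().foreach(println) // keeps ("b", 5), drops "a"
    sc.stop()
  }
}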
be too large to use DIMSUM. Try to increase the threshold and see
> whether it helps. -Xiangrui
>
> On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das
> wrote:
> > Hi,
> >
> > I am running brute force similarity from RowMatrix on a job with 5M x
> 1.5M
> > sparse ma
the
> boundary with 1.5m columns, because the output can potentially have 2.25 x
> 10^12 entries, which is a lot. (squares 1.5m)
>
> Best,
> Reza
>
>
> On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das
> wrote:
>
>> Is the threshold valid only for tall skinny matri
Column-based similarities work well if the number of columns is modest (10K,
100K; we actually scaled it to 1.5M columns but it really stress-tests the
shuffle and you need to tune the shuffle parameters)...You can either use dimsum
sampling or come up with your own threshold based on your application that
yo
There is also a batch prediction API in PR
https://github.com/apache/spark/pull/3098
The idea here is what Sean said...don't try to reconstruct the whole matrix,
which will be dense, but pick a set of users and calculate topK
recommendations for them using dense level-3 BLAS...we are going to merge
th
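A rough in-memory sketch of that idea with breeze (the block sizes and types are assumptions; u * v.t is a single level-3 BLAS call):

import breeze.linalg.DenseMatrix

// userFactors: usersInBatch x rank, itemFactors: numItems x rank (assumed to fit in memory for the batch).
// One gemm scores the whole batch; keep only the k best items per user instead of the dense score matrix.
def recommendBatch(userFactors: DenseMatrix[Double],
                   itemFactors: DenseMatrix[Double],
                   k: Int): IndexedSeq[Array[Int]] = {
  val scores = userFactors * itemFactors.t
  (0 until scores.rows).map { r =>
    scores(r, ::).t.toArray.zipWithIndex.sortBy { case (score, _) => -score }.take(k).map(_._2)
  }
}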
You can do it in-memory as well...get the 10% topK elements from each
partition and use the merge step from any sort algorithm like timsort...basically
aggregateBy.
Your version uses a shuffle but this version is 0 shuffle...assuming your data
set is cached you will be using in-memory allReduce through treeAggre
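A sketch of that zero-shuffle version, assuming a cached RDD[Double] of scores and a k that fits in memory per partition:

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Each partition keeps its k largest values on a bounded min-heap; treeAggregate then merges the
// heaps pairwise (the in-memory allReduce-style combine), so nothing is shuffled.
def topK(scores: RDD[Double], k: Int): Array[Double] = {
  val heap = scores.treeAggregate(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
    (h, v) => { h += v; if (h.size > k) h.dequeue(); h },
    (h1, h2) => { h2.foreach { v => h1 += v; if (h1.size > k) h1.dequeue() }; h1 }
  )
  heap.dequeueAll.toArray.sorted(Ordering[Double].reverse)
}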
ggestions. In-memory version is quite useful. I do not
> quite understand how you can use aggregateBy to get 10% top K elements. Can
> you please give an example?
>
> Thanks,
> Aung
>
> On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das
> wrote:
>
>> You can do it in-me
g a count-min data structure such as in
> https://github.com/laserson/dsq
>
> to get approximate quantiles, then use whatever values you want to filter
> the original sequence.
> ------
> *From:* Debasish Das
> *Sent:* Thursday, March 26, 2015 9:45 PM
er than N (the bound) then,
> create a new sorted list by using a priority queue and dequeuing top N
> values.
>
> In the end, I get a record for each segment with N max values for each
> segment.
>
> Regards,
> Aung
>
> On Fri, Mar
I have a version that works well for the Netflix data but now I am validating
on internal datasets...this code will work on matrix factors and sparse
matrices that have rows ~ 100x columns...if the columns are much smaller than
the rows then the column-based flow works well...basically we need both flows...
I did not
Hi,
I have some code that creates ~80 RDDs and then sc.union is applied to
combine all 80 into one for the next step (to run topByKey for example)...
While creating the 80 RDDs takes 3 minutes per RDD, doing a union over them
takes 3 hours (I am validating these numbers)...
Is there any checkpoint based
I will increase memory for the job...that will also fix it right ?
On Apr 10, 2015 12:43 PM, "Reza Zadeh" wrote:
> You should pull in this PR: https://github.com/apache/spark/pull/5364
> It should resolve that. It is in master.
> Best,
> Reza
>
> On Fri, Apr 10, 2
The cross-join shuffle space might not be needed, since most likely through
application-specific logic (topK etc.) you can cut the shuffle space...Also
most likely the brute force approach will be a benchmark tool to see how much
better your clustering-based KNN solution is, since there are several ways
you ca
Xiangrui,
Could you point to the JIRA related to treeAggregate ? ...sounds like the
allreduce idea...
I would definitely like to try it on our dataset...
Makoto,
I did run a pretty big sparse dataset (20M rows, 3M sparse features) and I
got 100 iterations of SGD running in 200 seconds...10 execu
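For reference, the treeAggregate pattern being referred to looks roughly like this (the per-example gradient function and types are placeholders):

import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// Sum per-example gradients up a 2-level tree instead of sending every partition's
// vector straight to the driver; this is the allreduce-like step the mllib optimizers use.
def sumGradients(data: RDD[(Double, DenseVector[Double])],
                 w: DenseVector[Double],
                 grad: (Double, DenseVector[Double], DenseVector[Double]) => DenseVector[Double])
  : DenseVector[Double] =
  data.treeAggregate(DenseVector.zeros[Double](w.length))(
    (acc, ex) => acc + grad(ex._1, ex._2, w),
    (a, b) => a + b,
    depth = 2)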
600s for Spark vs 5s for Redshift...the numbers look much different from
the AMPLab benchmark...
https://amplab.cs.berkeley.edu/benchmark/
Is it SSDs or something else that's helping Redshift, or is the whole data in
memory when you run the query ? Could you publish the query ?
Also after spark-s
Libsvm dataset converters are data dependent, since your input data can be
in any serialization format and not necessarily CSV...
We have flows that convert HDFS data to a libsvm/sparse vector RDD which is
sent to mllib.
I am not sure if it will be easy to standardize a libsvm converter on data
tha
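Once the data-dependent parsing is done and you have (label, indices, values), mllib can write the format itself; a small sketch (the names are placeholders):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Build sparse LabeledPoints from already-parsed records and let MLUtils emit the libsvm lines.
def writeLibsvm(parsed: RDD[(Double, Array[Int], Array[Double])], numFeatures: Int, path: String): Unit = {
  val points = parsed.map { case (label, idx, values) =>
    LabeledPoint(label, Vectors.sparse(numFeatures, idx, values))
  }
  MLUtils.saveAsLibSVMFile(points, path)
}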
Hi,
The Databricks demo at Spark Summit was amazing...what's the frontend stack
used, specifically for rendering multiple reactive charts on the same DOM? Looks
like that's an emerging pattern for correlating different data APIs...
Thanks
Deb
I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today...
No issues with mvn compilation but my sbt build keeps failing on the sql
module...
I just saw that my scala is at 2.11.0 (with brew update)...not sure if
that's why the sbt compilation is failing...retrying..
On Sat, Jul 19, 2014 at 6:16 PM,
Yup...the scala version 2.11.0 caused it...with 2.10.4, I could compile
1.0.1 and HEAD both for 2.3.0cdh5.0.2
On Sat, Jul 19, 2014 at 8:14 PM, Debasish Das
wrote:
> I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today...
>
> No issues with mvn compilation but my sbt build keeps failing o
Hi,
We have been using standalone Spark for the last 6 months and I used to run
application jars fine on the Spark cluster with the following command.
java -cp
":/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar"
-Xms2g -Xmx2g -Ds
I found the issue...
If you use spark git and generate the assembly jar then
org.apache.hadoop.io.Writable.class is packaged with it
If you use the assembly jar that ships with CDH in
/opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.2-hadoop2.3.0-cdh5.0.2.jar,
Hi Aureliano,
Will it be possible for you to give the test case ? You can add it to the JIRA
as an attachment I guess...
I am preparing the PR for the ADMM-based QuadraticMinimizer...In my Matlab
experiments with scaling the rank to 1000 and beyond (which is too high for
ALS but gives a good idea
Dennis,
If it is PLSA with a least squares loss then the QuadraticMinimizer that we
open sourced should be able to solve it for a modest number of topics (up to
1000 I believe)...if we integrate a CG solver for the equality constraint
(Nocedal's KNITRO paper is the reference) the topic size can be increased well
beyond ALS
Hi,
I have set up the SPARK_LOCAL_DIRS option in spark-env.sh so that Spark can
use more shuffle space...
Does Spark clean all the shuffle files once the runs are done ? It seems to
me that the shuffle files are not cleaned...
Do I need to set this variable: spark.cleaner.ttl ?
Right now we are pl
Actually I faced it yesterday...
I had to put it in spark-env.sh and take it out of spark-defaults.conf on
1.0.1...Note that this setting should be visible on all workers...
After that I validated that SPARK_LOCAL_DIRS was indeed getting used for
shuffling...
On Thu, Aug 14, 2014 at 10:27 AM,
Hi,
For our large ALS runs, we are considering using sc.setCheckpointDir so
that the intermediate factors are written to HDFS and the lineage is
broken...
Is there a comparison which shows the performance degradation due to these
options ? If not I will be happy to add experiments with it...
Tha
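A sketch of what we are considering (the path, rank and interval are placeholders; setCheckpointInterval is available on the mllib ALS in newer releases):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Checkpoint the intermediate factor RDDs every few iterations so the lineage stays short.
def trainWithCheckpoints(sc: SparkContext, ratings: RDD[Rating]) = {
  sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // placeholder HDFS path
  new ALS()
    .setRank(50)
    .setIterations(20)
    .setCheckpointInterval(5)
    .run(ratings)
}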
DB,
Did you compare softmax regression with one-vs-all and find that softmax
is better ?
One-vs-all can be implemented as a wrapper over the binary classifiers that we
have in mllib...I am curious if softmax multinomial is better in most cases
or if it is worthwhile to add a one-vs-all version of mlor a
Hi,
Are there any experiments detailing the performance hit due to HDFS
checkpointing in ALS ?
As we scale to large ranks with more ratings, I believe we have to cut the
RDD lineage to safeguard against the lineage issue...
Thanks.
Deb
Hi Brandon,
Looks very cool...I will try it out for ad-hoc analysis of our datasets and
provide more feedback...
Could you please give a bit more detail about the differences between the
Spindle architecture and the Hue + Spark integration (Python stack) and the
Ooyala Jobserver ?
Does Spindle allow sharing
Hi Wei,
The Sparkler code was not available for benchmarking, so I picked up
Jellyfish, which uses SGD. If you look at the paper, the ideas are very
similar to the Sparkler paper, but Jellyfish is on shared memory and uses C code
while Sparkler was built on top of Spark...Jellyfish used some interesti
Hi Burak,
This LDA implementation is friendly to the equality and positivity ALS code
that I added in the following JIRA to formulate robust PLSA:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-2426
Should I build upon the PR that you pointed to ? I want to run some
experiment
Breeze author David also has a GitHub project on CUDA bindings in
Scala...do you prefer using Java or Scala ?
On Aug 27, 2014 2:05 PM, "Frank van Lankvelt"
wrote:
> you could try looking at ScalaCL[1], it's targeting OpenCL rather than
> CUDA, but that might be close enough?
>
> cheers, Frank
>
Hi Reza,
Have you compared with the brute force algorithm for similarity computation
with something like the following in Spark ?
https://github.com/echen/scaldingale
I am adding cosine similarity computation but I do want to compute all-pair
similarities...
Note that the data is sparse for
e/spark/pull/1778
>
> Your question wasn't entirely clear - does this answer it?
>
> Best,
> Reza
>
>
> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das
> wrote:
>
>> Hi Reza,
>>
>> Have you compared with the brute force algorithm for sim
you don't have to redo your code. Your call if you need it before a week.
> Reza
>
>
> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das
> wrote:
>
>> Ohh coolall-pairs brute force is also part of this PR ? Let me pull
>> it in and test on our dataset...
>>
Also for tall and wide (rows ~60M, columns ~10M), I am considering running a
matrix factorization to reduce the dimension to say ~60M x 50 and then running
all-pair similarity...
Did you also try similar ideas and see positive results ?
On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das
wrote:
>
ring (perhaps after dimensionality
> reduction) if your goal is to find batches of similar points instead of all
> pairs above a threshold.
>
>
>
>
> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das
> wrote:
>
>> Also for tall and wide (rows ~60M, columns 10M), I am conside
sum with gamma as PositiveInfinity turns it
> into the usual brute force algorithm for cosine similarity, there is no
> sampling. This is by design.
>
>
> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das
> wrote:
>
>> I looked at the code: similarColumns(Double.posIn
Durin,
I have integrated ECOS with Spark, which uses SuiteSparse under the hood for
linear equation solves...I have exposed only the QP solver API in Spark
since I was comparing interior point with proximal algorithms, but we can
expose the SuiteSparse API as well...JNI is used to load up the LDL, AMD and
ECOS librarie
e jni version of ldl and
amd which are lgpl...
Let me know.
Thanks.
Deb
On Sep 8, 2014 7:04 AM, "Debasish Das" wrote:
> Durin,
>
> I have integrated ecos with spark which uses suitesparse under the hood
> for linear equation solvesI have exposed only the qp solver
how to do linear programming in a distributed way.
> -Xiangrui
>
> On Mon, Sep 8, 2014 at 7:12 AM, Debasish Das
> wrote:
> > Xiangrui,
> >
> > Should I open up a JIRA for this ?
> >
> > Distributed lp/socp solver through ecos/ldl/amd ?
> >
> > I c
y in a future PR, probably
> still for 1.2
>
>
> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das
> wrote:
>
>> Awesome...Let me try it out...
>>
>> Any plans of putting other similarity measures in future (jaccard is
>> something that will be useful) ? I gue
her one. For dense matrices with say, 1m
> columns this won't be computationally feasible and you'll want to start
> sampling with dimsum.
>
> It would be helpful to have a loadRowMatrix function, I would use it.
>
> Best,
> Reza
>
> On Tue, Sep 9, 2014 at 12:05
Congratulations on the 1.1 release !
On Thu, Sep 11, 2014 at 9:08 PM, Matei Zaharia
wrote:
> Thanks to everyone who contributed to implementing and testing this
> release!
>
> Matei
>
> On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote:
>
> Thanks for all the good work. V
RowMatrix and CoordinateMatrix to be templated on the value...
Are you considering this in your design ?
Thanks.
Deb
On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh wrote:
> Better to do it in a PR of your own, it's not sufficiently related to
> dimsum
>
> On Tue, Sep 9, 2014 at 7:03
We dump fairly big libsvm files to compare against liblinear/libsvm...the
following code dumps out libsvm format from a SparseVector...
def toLibSvm(features: SparseVector): String = {
  val indices = features.indices.map(_ + 1) // libsvm feature indices are 1-based
  val values = features.values
  indices.zip(values).map { case (i, v) => s"$i:$v" }.mkString(" ")
}
Hi,
I have some RowMatrices all with the same key (MatrixEntry.i,
MatrixEntry.j) and I would like to join the multiple matrices to come up with a
SQL table for each key...
What's the best way to do it ?
Right now I am doing N joins if I want to combine data from N matrices,
which does not look quite r
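One alternative to the N joins, sketched with cogroup (the entry types are assumptions; cogroup takes up to three other RDDs in one call):

import org.apache.spark.rdd.RDD

// Group entries from several matrices that share the same (i, j) key in a single pass
// instead of chaining pairwise joins.
def combineThree(a: RDD[((Long, Long), Double)],
                 b: RDD[((Long, Long), Double)],
                 c: RDD[((Long, Long), Double)])
  : RDD[((Long, Long), (Iterable[Double], Iterable[Double], Iterable[Double]))] =
  a.cogroup(b, c)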
n the meantime, you can un-normalize the cosine similarities to get the
> dot product, and then compute the other similarity measures from the dot
> product.
>
> Best,
> Reza
>
>
> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das
> wrote:
>
>> Hi Reza,
>>
>
sc.parallelize(model.weights.toArray, blocks).top(k) will get that, right ?
For logistic regression you might want both positive and negative features...so
just pass it through a filter on the absolute value and then pick top(k).
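A small sketch of that, assuming an mllib linear model and an existing SparkContext sc (k is a placeholder):

// Rank features by |weight| and keep the k largest; top uses a bounded queue per partition,
// so there is no full sort.
val k = 20
val topFeatures = sc.parallelize(model.weights.toArray.zipWithIndex)
  .top(k)(Ordering.by { case (w, _) => math.abs(w) })
topFeatures.foreach { case (w, i) => println(s"feature $i weight $w") }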
On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak wrote:
> Hi All,
>
> I am able to run LinearReg
. We can add jaccard and other similarity measures in
> later PRs.
>
> In the meantime, you can un-normalize the cosine similarities to get the
> dot product, and then compute the other similarity measures from the dot
> product.
>
> Best,
> Reza
>
>
> On Wed, S
The PR will updated
> today.
> Best,
> Reza
>
> On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das
> wrote:
>
>> Hi Reza,
>>
>> Have you tested if different runs of the algorithm produce different
>> similarities (basically if the algorithm is deterministic) ?
Hi,
I am building a dictionary as an RDD[(String, Long)] and after the dictionary
is built and cached, I find the key "almonds" at value 5187 using:
rdd.filter { case (product, index) => product == "almonds" }.collect
Output:
Debug product almonds index 5187
Now I take the same dictionary and write it out