Re: Offline elastic index creation

2022-11-10 Thread Debasish Das
Hi Vibhor, We worked on a project to create Lucene indexes using Spark, but the project has not been maintained for some time now. If there is interest we can resurrect it: https://github.com/vsumanth10/trapezium/blob/master/dal/src/test/scala/com/verizon/bda/trapezium/dal/lucene/LuceneIndexerSuite.sc

Re: Find KNN in Spark SQL

2015-05-19 Thread Debasish Das
The batch version of this is part of the rowSimilarities JIRA 4823...if your query points can fit in memory there is a broadcast version which we are experimenting with internally...we are using brute force KNN right now in the PR...based on the FLANN paper, LSH did not work well but before you go to approx
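
A minimal spark-shell sketch of the broadcast brute-force KNN idea described above; the data, query set and k below are illustrative stand-ins, not the PR's actual API:

    // hypothetical (id, vector) data and a query set small enough to broadcast
    val data = sc.parallelize(Seq((1L, Array(0.0, 1.0)), (2L, Array(1.0, 1.0)),
                                  (3L, Array(0.2, 0.8))))
    val queries = Array((10L, Array(0.0, 0.9)), (11L, Array(1.0, 0.9)))
    val bq = sc.broadcast(queries)
    val k = 2
    // brute force: score every (query, point) pair, keep the k nearest per query
    val knn = data.flatMap { case (id, v) =>
      bq.value.map { case (qid, q) =>
        val d = math.sqrt(v.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)
        (qid, (id, d))
      }
    }.groupByKey().mapValues(_.toSeq.sortBy(_._2).take(k))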

Re: Hive on Spark VS Spark SQL

2015-05-19 Thread Debasish Das
Spark SQL was built to improve upon the Hive on Spark runtime further... On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hive on Spark and SparkSQL which should be better , and what are the key > characteristics and the advantages and the disadvantages b

Re: Help optimizing some spark code

2015-05-26 Thread Debasish Das
You don't need sort...use topByKey if your topK number is small...it uses a priority queue on the java heap... On May 24, 2015 10:53 AM, "Tal" wrote: > Hi, > I'm running this piece of code in my program: > > smallRdd.join(largeRdd) > .groupBy { case (id, (_, X(a, _, _))) => a } > .map { case (a, iterable)
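
A small spark-shell sketch of the topByKey suggestion (MLlib ships it from around Spark 1.4; hedged), with illustrative data:

    import org.apache.spark.mllib.rdd.MLPairRDDFunctions.fromPairRDD

    val scored = sc.parallelize(Seq((1, 0.9), (1, 0.1), (1, 0.5), (2, 0.7)))
    // keeps the 2 largest values per key in a bounded priority queue,
    // avoiding a full sort of each group
    val top2 = scored.topByKey(2)
    top2.collect()  // e.g. Array((1, Array(0.9, 0.5)), (2, Array(0.7)))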

Re: Spark ML decision list

2015-06-07 Thread Debasish Das
What is a decision list ? An in-order traversal (or some other traversal) of a fitted decision tree ? On Jun 5, 2015 1:21 AM, "Sateesh Kavuri" wrote: > Is there an existing way in SparkML to convert a decision tree to a > decision list? > > On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh wrote: > >> The close

Re: Linear Regression with SGD

2015-06-10 Thread Debasish Das
It's always better to use a quasi-Newton solver if the runtime and problem scale permit, as there are guarantees on optimization...OWLQN and BFGS are both quasi-Newton. Most single node code bases will run quasi-Newton solves...if you are using SGD it is better to use adadelta/adagrad or similar tri
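
A self-contained Breeze sketch of the quasi-Newton route recommended here; the toy quadratic objective is a stand-in for a real regression loss:

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, LBFGS}

    // f(x) = ||x - 3||^2 with gradient 2(x - 3); any smooth loss plugs in here
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]) = {
        val r = x - 3.0
        (r dot r, r * 2.0)
      }
    }
    val solver = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)
    val xOpt = solver.minimize(f, DenseVector.zeros[Double](5))  // ~all 3.0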

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS dgemm based calculation. On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat < ayman.fara...@yahoo.com.invalid> wrote: > Thanks Sabarish and Nick > Would you happen to have some code snippets that you can share. > Best > Ayman > >

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also not sure how threading helps here because Spark assigns a partition to each core. On each core maybe there are multiple threads if you are using Intel hyperthreading, but I will let Spark handle the threading. On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das wrote: > We added SPARK-3066 for t

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also in my experiments, it's much faster to run blocked BLAS through cartesian rather than doing sc.union. Here are the details on the experiments: https://issues.apache.org/jira/browse/SPARK-4823 On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das wrote: > Also not sure how threading helps here
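
A hedged sketch of the cartesian-based blocking, using Breeze's local dense multiply as the level-3 BLAS call; block ids and sizes are illustrative:

    import breeze.linalg.DenseMatrix

    // one block of user factors and one block of item factors, keyed by block id
    val userBlocks = sc.parallelize(Seq((0, DenseMatrix.rand(100, 10))))
    val itemBlocks = sc.parallelize(Seq((0, DenseMatrix.rand(200, 10))))
    // every (user block, item block) pair gets one local gemm
    val scoreBlocks = userBlocks.cartesian(itemBlocks).map {
      case ((ubId, u), (ibId, v)) => ((ubId, ibId), u * v.t)  // 100 x 200 scores
    }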

Re: Does MLLib has attribute importance?

2015-06-18 Thread Debasish Das
Running L1 and picking the non-zero coefficients gives a good estimate of interesting features as well... On Jun 17, 2015 4:51 PM, "Xiangrui Meng" wrote: > We don't have it in MLlib. The closest would be the ChiSqSelector, > which works for categorical data. -Xiangrui > > On Thu, Jun 11, 2015 at 4:3
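
A spark-shell sketch of the L1 idea, using the mllib SGD solver of that era with an L1 updater; the data and regParam are illustrative:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.L1Updater
    import org.apache.spark.mllib.regression.LabeledPoint

    val train = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 2.0)),
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 0.5))))
    val lr = new LogisticRegressionWithSGD()
    lr.optimizer.setUpdater(new L1Updater).setRegParam(0.1).setNumIterations(100)
    val model = lr.run(train)
    // the features that survive the L1 penalty
    val selected = model.weights.toArray.zipWithIndex
      .filter { case (w, _) => math.abs(w) > 1e-6 }.map(_._2)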

Velox Model Server

2015-06-20 Thread Debasish Das
Hi, The demo of end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based upon Velox or it uses a completely different architecture. https://github.com/amplab/velox-modelserver We are looking for an open s

Re: Velox Model Server

2015-06-20 Thread Debasish Das
> On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl >> wrote: >> >>> Is velox NOT open source? >>> >>> >>> On Saturday, June 20, 2015, Debasish Das >>> wrote: >>> >>>> Hi, >>>> >>>> The demo

Re: Velox Model Server

2015-06-20 Thread Debasish Das
Integration of model server with ML pipeline API. On Sat, Jun 20, 2015 at 12:25 PM, Donald Szeto wrote: > Mind if I ask what 1.3/1.4 ML features that you are looking for? > > > On Saturday, June 20, 2015, Debasish Das wrote: > >> After getting used to Scala, writing

Re: Velox Model Server

2015-06-22 Thread Debasish Das
The servlet engine probably doesn't matter at all in comparison. On Sat, Jun 20, 2015, 9:40 PM Debasish Das wrote: > After getting used to Scala, writing Java is too much work :-) > > I am looking for scala based project that's using netty at its core (spray > is one example). &g

Re: Velox Model Server

2015-06-24 Thread Debasish Das
Model sizes are in the 10m x rank, 100k x rank range. For recommendation/topic modeling I can run batch recommendAll and then keep serving the model using a distributed cache, but then I can't incorporate per-user model re-prediction if user feedback is making the current topk stale. I have to wait for next

Re: Velox Model Server

2015-06-24 Thread Debasish Das
Spark JobServer which would allow triggering > re-computation jobs periodically. We currently just run batch > re-computation and reload factors from S3 periodically. > > We then use Elasticsearch to post-filter results and blend content-based > stuff - which I think might be more efficient

Re: Subsecond queries possible?

2015-06-30 Thread Debasish Das
I got good runtime improvement from Hive partitioning, caching the dataset and increasing the cores through repartition...I think for your case generating mysql-style indexing will help further...it is not supported in Spark SQL yet... I know the dataset might be too big for 1-node mysql but do you

Re: Subsecond queries possible?

2015-07-01 Thread Debasish Das
but I'm interested to see how far > it can be pushed. > > Thanks for your help! > > > -- Eric > > On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das > wrote: > >> I got good runtime improvement from hive partitioninp, caching the >> dataset and increasing

Re: Few basic spark questions

2015-07-14 Thread Debasish Das
What do you need in SparkR that mllib / ml don't have...most of the basic analysis that you need on a stream can be done through mllib components... On Jul 13, 2015 2:35 PM, "Feynman Liang" wrote: > Sorry; I think I may have used poor wording. SparkR will let you use R to > analyze the data, but

Re: Spark application with a RESTful API

2015-07-14 Thread Debasish Das
How do you manage the spark context elastically when your load grows from 1000 users to 1 users ? On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif wrote: > I have almost the same case. I will tell you what I am actually doing, if > it > is according to your requirement, then I will love to help y

Re: Compute pairwise distance

2016-07-07 Thread Debasish Das
>> (point, distances.filter(_._2 <= kthDistance._2)) >> } >> } >> >> This is part of my Local Outlier Factor implementation. >> >> Of course the distances can be sorted because it is an Iterable, but it >> gives an idea. Is it possi

Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent...on local I never paid attention but the code path should be similar... On Jan 18, 2016 8:00 AM, "Koert Kuipers" wrote: > stacktrace? details? > > On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom > wrote: > >> Hi, >> >> I am runni

Re: Running 2 spark application in parallel

2015-10-23 Thread Debasish Das
You can run 2 threads in the driver and Spark will FIFO-schedule the 2 jobs on the same SparkContext you created (executors and cores)...the same idea is used for the Spark SQL thriftserver flow... For streaming I think it lets you run only one stream at a time even if you run them on multiple threads on dri
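
A minimal sketch of the two-threads-in-the-driver pattern; with the default FIFO scheduler both jobs share the same executors (a fair pool can be configured via spark.scheduler.mode):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val rdd1 = sc.parallelize(1 to 1000000)
    val rdd2 = sc.parallelize(1 to 1000000)
    // two actions submitted concurrently on the same SparkContext
    val f1 = Future { rdd1.map(_ * 2).count() }
    val f2 = Future { rdd2.filter(_ % 2 == 0).count() }
    val (c1, c2) = (Await.result(f1, 10.minutes), Await.result(f2, 10.minutes))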

Re: apply simplex method to fix linear programming in spark

2015-11-02 Thread Debasish Das
Use Breeze simplex, which in turn uses the Apache Math simplex...if you want to use an interior point method you can use ECOS https://github.com/embotech/ecos-java-scala ...the Spark Summit 2014 talk on the quadratic solver in matrix factorization will show you an example integration with Spark. ECOS runs as a jni proc
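
A hedged sketch of the Breeze wrapper mentioned here (breeze.optimize.linear.LinearProgram, which delegates to the Apache Commons Math simplex solver); the toy LP is illustrative:

    import breeze.optimize.linear.LinearProgram

    val lp = new LinearProgram()
    import lp._
    val x = Real()
    val y = Real()
    // maximize x + 2y subject to two linear constraints
    val problem = ((x + y * 2)
      subjectTo (x + y <= 10.0)
      subjectTo (x <= 4.0))
    val result = maximize(problem)
    println(result.result)  // the optimal assignment as a vector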

Re: apply simplex method to fix linear programming in spark

2015-11-03 Thread Debasish Das
t be steering this a bit off topic: does this need the simplex > method? this is just an instance of nonnegative least squares. I don't > think it relates to LDA either. > > Spark doesn't have any particular support for NNLS (right?) or simplex > though. > > On Mon, N

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Debasish Das
le to add. You can add an issue in Breeze for the enhancement. Alternatively you can use the Breeze lpsolver as well, which uses simplex from Apache Math. On Nov 4, 2015 1:05 AM, "Zhiliang Zhu" wrote: > Hi Debasish Das, > > Firstly I must show my deep appreciation towards you kind he

Re: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-07-22 Thread Debasish Das
Does it also support insert operations ? On Jul 22, 2015 4:53 PM, "Bing Xiao (Bing)" wrote: > We are happy to announce the availability of the Spark SQL on HBase > 1.0.0 release. > http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase > > The main features in this package, dubbed “As

RE: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-07-27 Thread Debasish Das
Hi Yan, Is it possible to access the hbase table through spark sql jdbc layer ? Thanks. Deb On Jul 22, 2015 9:03 PM, "Yan Zhou.sc" wrote: > Yes, but not all SQL-standard insert variants . > > > > *From:* Debasish Das [mailto:debasish.da...@gmail.com] > *Sent:* Wedn

Re: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-07-28 Thread Debasish Das
t; > > Graphically, the access path is as follows: > > > > Spark SQL JDBC Interface -> Spark SQL Parser/Analyzer/Optimizer->Astro > Optimizer-> HBase Scans/Gets -> … -> HBase Region server > > > > > > Regards, > > > > Yan > > >

Re: Spark ANN

2015-09-07 Thread Debasish Das
Not sure about dropout, but if you change the solver from Breeze bfgs to Breeze owlqn or breeze.proximal.NonlinearMinimizer you can solve the ANN loss with L1 regularization, which will yield elastic-net style sparse solutions...using that you can clean up edges which have 0.0 as weight... On Sep 7, 2015 7:35

Re: Old version of Spark [v1.2.0]

2017-01-16 Thread Debasish Das
You may want to pull up the release/1.2 branch and the 1.2.0 tag to build it yourself in case the packages are not available. On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim" wrote: > Hi Ayan, > > Thanks a million. > > Regards, > _ > *Md. Rezaul Karim*, BSc, MSc > PhD Researcher

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
I am not sure why I would use a pipeline to do scoring...the idea is to build a model, use the model ser/deser feature to put it in the row or column store of choice and provide API access to the model...we support these primitives in github.com/Verizon/trapezium...the api has access to spark context in loc

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
l.Model >predict API". The predict API is in the old mllib package not the new ml >package. >- "why r we using dataframe and not the ML model directly from API" - >Because as of now the new ml package does not have the direct API. > > > On Sat, F

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
, graph and kernel models we use a lot and for them turned out that mllib style model predict were useful if we change the underlying store... On Feb 4, 2017 9:37 AM, "Debasish Das" wrote: > If we expose an API to access the raw models out of PipelineModel can't we > call p

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-05 Thread Debasish Das
ector. > There is no API exposed. It is WIP but not yet released. > > On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das > wrote: > >> If we expose an API to access the raw models out of PipelineModel can't >> we call predict directly on it from an API ? Is there a task ope

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Debasish Das
If it is 7m rows and 700k features (or say 1m features), brute force row similarity will run fine as well...check out SPARK-4823...you can compare quality with the approximate variant... On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote: > Hi everyone, > Since spark 2.1.0 introduces LSH (http://spark.ap

Re: Restful API Spark Application

2017-05-16 Thread Debasish Das
You can run l On May 15, 2017 3:29 PM, "Nipun Arora" wrote: > Thanks all for your response. I will have a look at them. > > Nipun > > On Sat, May 13, 2017 at 2:38 AM vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > >> It's in scala but it should be portable in java >> https://githu

ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi, ECOS is a solver for second order conic programs and we showed the Spark integration at 2014 Spark Summit https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/. Right now the examples show how to reformulate matrix factorization as a SOCP and solve ea

Re: dremel paper example schema

2018-10-29 Thread Debasish Das
The open source implementation of Dremel is Parquet! On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta wrote: > Hi, > > why not just use dremel? > > Regards, > Gourav Sengupta > > On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev < > lubomir.chorbadj...@gmail.com> wrote: > >> Hi, >> >> I'm trying to reproduce the exa

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Hi Michael, I want to cache an RDD and define get() and set() operators on it, basically like memcached. Is it possible to build a memcached-like distributed cache using Spark SQL ? If not, what do you suggest we should use for such operations... Thanks. Deb On Fri, Jul 18, 2014 at 1:00 PM, Michae
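
A sketch of what plain Spark offers for the get() side: point lookups on a cached, partitioned pair RDD (lookup only scans the owning partition when a partitioner is set); set() needs an indexed-RDD style abstraction since RDDs are immutable:

    import org.apache.spark.HashPartitioner

    val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .partitionBy(new HashPartitioner(8))
      .cache()
    kv.lookup("a")  // get(): Seq(1), touching only one partition
    // a "set" means deriving a new immutable RDD
    val kv2 = kv.union(sc.parallelize(Seq(("d", 4)))).cache()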

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
; > On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das > wrote: > >> Hi Michael, >> >> I want to cache a RDD and define get() and set() operators on it. >> Basically like memcached. Is it possible to build a memcached like >> distributed cache using Spark SQL ? If

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
PM, Debasish Das wrote: > Thanks...this is what I was looking for... > > It will be great if Ankur can give brief details about it...Basically how > does it contrast with memcached for example... > > On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust > wrote: > >

Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
... Neither Play nor Spray is being used in Spark right now...so it brings dependencies and we already know about the akka conflicts...thriftserver on the other hand is already integrated for JDBC access On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das wrote: > Also I wanted to run get() and

WARN from Similarity Calculation

2015-02-15 Thread Debasish Das
Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms exceeds 45000ms Do I need to increase the default 45 s to larger values for cases wh

Large Similarity Job failing

2015-02-17 Thread Debasish Das
Hi, I am running brute force similarity from RowMatrix on a job with a 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job runs fine, but with 800M I am getting exceptions like too many files open and no space left on device... Seems like I need more nodes or should use DIMSUM sampling ? I

Re: WARN from Similarity Calculation

2015-02-18 Thread Debasish Das
Did you check the GC time in the Spark > UI? -Xiangrui > > On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das > wrote: > > Hi, > > > > I am sometimes getting WARN from running Similarity calculation: > > > > 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removi

Filtering keys after map+combine

2015-02-19 Thread Debasish Das
Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based on some threshold... Is there a way to get the (key, value) pairs after the map+combine stages so that I can run a filter on the keys ? Thanks. Deb

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
s and apply your > filtering. Then you can finish with a reduceByKey. > > On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das > wrote: > >> Hi, >> >> Before I send out the keys for network shuffle, in reduceByKey after map >> + combine are done, I would like to filt

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
it may mean you only shuffle (key,None) for some keys if the map-side > combine already worked out that the key would be filtered. > > And then after, run a flatMap or something to make Option[B] into B. > > On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das > wrote: > > Hi, &g
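
A runnable sketch of the Option trick suggested in this thread: fold values into Option during the map-side combine so filtered keys ship only (key, None) across the shuffle, then flatten; the keep-below-threshold predicate is illustrative:

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 5)))
    val threshold = 4
    val kept = pairs.combineByKey[Option[Int]](
        (v: Int) => Some(v).filter(_ < threshold),
        (c: Option[Int], v: Int) => c.map(_ + v).filter(_ < threshold),
        (c1: Option[Int], c2: Option[Int]) =>
          for (a <- c1; b <- c2; if a + b < threshold) yield a + b)
      .flatMap { case (k, opt) => opt.map(k -> _) }  // drops the None keys
    // kept.collect() gives Array(("a", 3)) for this data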

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
be too large to use DIMSUM. Try to increase the threshold and see > whether it helps. -Xiangrui > > On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das > wrote: > > Hi, > > > > I am running brute force similarity from RowMatrix on a job with 5M x > 1.5M > > sparse ma

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
the > boundary with 1.5m columns, because the output can potentially have 2.25 x > 10^12 entries, which is a lot. (squares 1.5m) > > Best, > Reza > > > On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das > wrote: > >> Is the threshold valid only for tall skinny matri

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column-based similarities work well if the number of columns is modest (10K-100K; we actually scaled it to 1.5M columns but it really stress-tests the shuffle and the shuffle parameters need tuning)...You can either use dimsum sampling or come up with your own threshold based on your application that yo
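
A spark-shell sketch of both flows on a RowMatrix: exact brute force versus DIMSUM sampling with a threshold (available from around Spark 1.2; hedged). A higher threshold samples more aggressively and cuts shuffle:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0),
      Vectors.dense(5.0, 6.0, 0.0)))
    val mat = new RowMatrix(rows)
    val exact  = mat.columnSimilarities()     // all-pairs cosine, brute force
    val approx = mat.columnSimilarities(0.5)  // DIMSUM sampling at threshold 0.5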

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR https://github.com/apache/spark/pull/3098 The idea here is what Sean said...don't try to reconstruct the whole matrix, which will be dense, but pick a set of users and calculate topk recommendations for them using dense level-3 BLAS...we are going to merge th

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
You can do it in-memory as well...get the 10% topK elements from each partition and use merge from any sort algorithm like timsort...basically aggregateBy... Your version uses shuffle but this version is 0 shuffle...assuming your data set is cached you will be using in-memory allReduce through treeAggre
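
A sketch of the zero-shuffle idea under illustrative assumptions (k derived from a known count; a production version would keep a bounded priority queue instead of re-sorting):

    val data = sc.parallelize(1 to 1000).map(_.toDouble).cache()
    val k = (data.count() * 0.10).toInt  // top 10%
    // fold each partition into its k largest, then merge the partial tops
    val topTenPercent = data.treeAggregate(Array.empty[Double])(
      seqOp  = (acc, v) => (acc :+ v).sortBy(-_).take(k),
      combOp = (a, b)   => (a ++ b).sortBy(-_).take(k))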

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
ggestions. In-memory version is quite useful. I do not > quite understand how you can use aggregateBy to get 10% top K elements. Can > you please give an example? > > Thanks, > Aung > > On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das > wrote: > >> You can do it in-me

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
g a count-min data structure such as in > https://github.com/laserson/dsq​ > > to get approximate quantiles, then use whatever values you want to filter > the original sequence. > ------ > *From:* Debasish Das > *Sent:* Thursday, March 26, 2015 9:45 PM

Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
er than N (the bound) then, > create a new sorted list by using a priority queue and dequeuing top N > values. > > In the end, I get a record for each segment with N max values for each > segment. > > Regards, > Aung > > > > > > > > > On Fri, Mar

Re: Using DIMSUM with ids

2015-04-07 Thread Debasish Das
I have a version that works well for Netflix data but now I am validating on internal datasets...this code will work on matrix factors and sparse matrices that have rows = 100 * columns...if columns are much smaller than rows then the col-based flow works well...basically we need both flows... I did not

RDD union

2015-04-09 Thread Debasish Das
Hi, I have some code that creates ~80 RDDs and then a sc.union is applied to combine all 80 into one for the next step (to run topByKey for example)... While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3 hrs (I am validating these numbers)... Is there any checkpoint based
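
A sketch contrasting the two union shapes: a single sc.union builds one flat UnionRDD instead of an 80-deep chain, and a checkpoint can cut the lineage entirely (paths illustrative):

    val parts = (1 to 80).map(i => sc.parallelize(Seq(i)))
    val chained = parts.reduce(_ union _)  // 80-deep lineage chain
    val flat = sc.union(parts)             // one flat UnionRDD

    sc.setCheckpointDir("/tmp/checkpoints")  // illustrative path
    flat.checkpoint()                        // lineage is truncated on the next action
    flat.count()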

Re: Benchmaking col vs row similarities

2015-04-10 Thread Debasish Das
I will increase memory for the job...that will also fix it right ? On Apr 10, 2015 12:43 PM, "Reza Zadeh" wrote: > You should pull in this PR: https://github.com/apache/spark/pull/5364 > It should resolve that. It is in master. > Best, > Reza > > On Fri, Apr 10, 2

Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
Cross join shuffle space might not be needed since most likely through application-specific logic (topK etc) you can cut the shuffle space...Also most likely the brute force approach will be a benchmark tool to see how much better your clustering-based KNN solution is, since there are several ways you ca

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Debasish Das
Xiangrui, Could you point to the JIRA related to tree aggregate ? ...sounds like the allreduce idea... I would definitely like to try it on our dataset... Makoto, I did run a pretty big sparse dataset (20M rows, 3M sparse features) and I got 100 iterations of SGD running in 200 seconds...10 execu

Re: Shark vs Impala

2014-06-22 Thread Debasish Das
600s for Spark vs 5s for Redshift...The numbers look much different from the amplab benchmark... https://amplab.cs.berkeley.edu/benchmark/ Is it SSDs or something else that's helping Redshift, or is the whole data in memory when you run the query ? Could you publish the query ? Also after spark-s

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-25 Thread Debasish Das
Libsvm dataset converters are data dependent since your input data can be in any serialization format and not necessarily csv... We have flows that convert HDFS data to a libsvm/sparse vector RDD which is sent to MLlib... I am not sure if it will be easy to standardize a libsvm converter on data tha

Databricks demo

2014-07-11 Thread Debasish Das
Hi, The Databricks demo at Spark Summit was amazing...what's the frontend stack used, specifically for rendering multiple reactive charts on the same DOM? Looks like that's an emerging pattern for correlating different data APIs... Thanks Deb

Re: spark1.0.1 & hadoop2.2.0 issue

2014-07-19 Thread Debasish Das
I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today... No issues with mvn compilation but my sbt build keeps failing on the sql module... I just saw that my scala is at 2.11.0 (with brew update)...not sure if that's why the sbt compilation is failing...retrying.. On Sat, Jul 19, 2014 at 6:16 PM,

Re: spark1.0.1 & hadoop2.2.0 issue

2014-07-20 Thread Debasish Das
Yup...the scala version 2.11.0 caused it...with 2.10.4, I could compile 1.0.1 and HEAD both for 2.3.0cdh5.0.2 On Sat, Jul 19, 2014 at 8:14 PM, Debasish Das wrote: > I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today... > > No issues with mvn compilation but my sbt build keeps failing o

Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
Hi, We have been using standalone spark for last 6 months and I used to run application jars fine on spark cluster with the following command. java -cp ":/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar" -Xms2g -Xmx2g -Ds

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
I found the issue... If you use spark git and generate the assembly jar then org.apache.hadoop.io.Writable.class is packaged with it If you use the assembly jar that ships with CDH in /opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.2-hadoop2.3.0-cdh5.0.2.jar,

Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-28 Thread Debasish Das
Hi Aureliano, Will it be possible for you to give the test-case ? You can add it to JIRA as well as an attachment I guess... I am preparing the PR for ADMM based QuadraticMinimizer...In my matlab experiments with scaling the rank to 1000 and beyond (which is too high for ALS but gives a good idea

Re: Contribution to Spark MLLib

2014-08-13 Thread Debasish Das
Dennis, If it is PLSA with least square loss then the QuadraticMinimizer that we open sourced should be able to solve it for modest topics (till 1000 I believe)...if we integrate a cg solver for equality (Nocedal's KNITRO paper is the reference) the topic size can be increased much larger than ALS

SPARK_LOCAL_DIRS option

2014-08-13 Thread Debasish Das
Hi, I have set up the SPARK_LOCAL_DIRS option in spark-env.sh so that Spark can use more shuffle space... Does Spark clean all the shuffle files once the runs are done ? Seems to me that the shuffle files are not cleaned... Do I need to set this variable ? spark.cleaner.ttl Right now we are pl

Re: SPARK_LOCAL_DIRS

2014-08-14 Thread Debasish Das
Actually I faced it yesterday... I had to put it in spark-env.sh and take it out from spark-defaults.conf on 1.0.1...Note that this setting should be visible on all workers... After that I validated that SPARK_LOCAL_DIRS was indeed getting used for shuffling... On Thu, Aug 14, 2014 at 10:27 AM,

Performance hit for using sc.setCheckPointDir

2014-08-14 Thread Debasish Das
Hi, For our large ALS runs, we are considering using sc.setCheckPointDir so that the intermediate factors are written to HDFS and the lineage is broken... Is there a comparison which shows the performance degradation due to these options ? If not I will be happy to add experiments with it... Tha
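
A minimal sketch of the setup in question; whether ALS checkpoints its intermediate factor RDDs depends on the Spark version (later releases added an explicit checkpoint interval), so treat this as illustrative:

    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // illustrative path

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    val ratings = sc.parallelize(Seq(
      Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
    val model = ALS.train(ratings, 10, 20, 0.01)  // rank, iterations, lambda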

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread Debasish Das
DB, Did you compare softmax regression with one-vs-all and find that softmax is better ? one-vs-all can be implemented as a wrapper over the binary classifiers that we have in mllib...I am curious whether softmax multinomial is better in most cases or if it is worthwhile to add a one-vs-all version of mlor a

ALS checkpoint performance

2014-08-15 Thread Debasish Das
Hi, Are there any experiments detailing the performance hit due to HDFS checkpoint in ALS ? As we scale to large ranks with more ratings, I believe we have to cut the RDD lineage to safe guard against the lineage issue... Thanks. Deb

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Debasish Das
Hi Brandon, Looks very cool...will try it out for ad-hoc analysis of our datasets and provide more feedback... Could you please give bit more details about the differences of Spindle architecture compared to Hue + Spark integration (python stack) and Ooyala Jobserver ? Does Spindle allow sharing

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Debasish Das
Hi Wei, Sparkler code was not available for benchmarking and so I picked up Jellyfish which uses SGD and if you look at the paper, the ideas are very similar to sparkler paper but Jellyfish is on shared memory and uses C code while sparkler was built on top of spark...Jellyfish used some interesti

Re: LDA example?

2014-08-22 Thread Debasish Das
Hi Burak, This LDA implementation is friendly to the equality and positivity als code that I added in the following JIRA to formulate robust plsa https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-2426 Should I build upon the PR that you pointed ? I want to run some experiment

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Debasish Das
Breeze author David also has a github project on cuda bindings in scala...do you prefer using java or scala ? On Aug 27, 2014 2:05 PM, "Frank van Lankvelt" wrote: > you could try looking at ScalaCL[1], it's targeting OpenCL rather than > CUDA, but that might be close enough? > > cheers, Frank >

Re: Huge matrix

2014-09-05 Thread Debasish Das
Hi Reza, Have you compared with the brute force algorithm for similarity computation with something like the following in Spark ? https://github.com/echen/scaldingale I am adding cosine similarity computation but I do want to compute all-pair similarities... Note that the data is sparse for

Re: Huge matrix

2014-09-05 Thread Debasish Das
e/spark/pull/1778 > > Your question wasn't entirely clear - does this answer it? > > Best, > Reza > > > On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das > wrote: > >> Hi Reza, >> >> Have you compared with the brute force algorithm for sim

Re: Huge matrix

2014-09-05 Thread Debasish Das
you don't have to redo your code. Your call if you need it before a week. > Reza > > > On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das > wrote: > >> Ohh coolall-pairs brute force is also part of this PR ? Let me pull >> it in and test on our dataset... >> &g

Re: Huge matrix

2014-09-05 Thread Debasish Das
Also for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to say ~60M x 50 and then run all-pair similarity... Did you also try similar ideas and see positive results ? On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das wrote: >

Re: Huge matrix

2014-09-05 Thread Debasish Das
ring (perhaps after dimensionality > reduction) if your goal is to find batches of similar points instead of all > pairs above a threshold. > > > > > On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das > wrote: > >> Also for tall and wide (rows ~60M, columns 10M), I am conside

Re: Huge matrix

2014-09-05 Thread Debasish Das
sum with gamma as PositiveInfinity turns it > into the usual brute force algorithm for cosine similarity, there is no > sampling. This is by design. > > > On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das > wrote: > >> I looked at the code: similarColumns(Double.posIn

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Durin, I have integrated ECOS with Spark, which uses suitesparse under the hood for the linear equation solves...I have exposed only the QP solver API in Spark since I was comparing IP with proximal algorithms, but we can expose the suitesparse API as well...jni is used to load up the ldl, amd and ecos librarie

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
e jni version of ldl and amd which are lgpl... Let me know. Thanks. Deb On Sep 8, 2014 7:04 AM, "Debasish Das" wrote: > Durin, > > I have integrated ecos with spark which uses suitesparse under the hood > for linear equation solvesI have exposed only the qp solver

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
how to do linear programming in a distributed way. > -Xiangrui > > On Mon, Sep 8, 2014 at 7:12 AM, Debasish Das > wrote: > > Xiangrui, > > > > Should I open up a JIRA for this ? > > > > Distributed lp/socp solver through ecos/ldl/amd ? > > > > I c

Re: Huge matrix

2014-09-09 Thread Debasish Das
y in a future PR, probably > still for 1.2 > > > On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das > wrote: > >> Awesome...Let me try it out... >> >> Any plans of putting other similarity measures in future (jaccard is >> something that will be useful) ? I gue

Re: Huge matrix

2014-09-09 Thread Debasish Das
her one. For dense matrices with say, 1m > columns this won't be computationally feasible and you'll want to start > sampling with dimsum. > > It would be helpful to have a loadRowMatrix function, I would use it. > > Best, > Reza > > On Tue, Sep 9, 2014 at 12:05

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Debasish Das
Congratulations on the 1.1 release ! On Thu, Sep 11, 2014 at 9:08 PM, Matei Zaharia wrote: > Thanks to everyone who contributed to implementing and testing this > release! > > Matei > > On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote: > > Thanks for all the good work. V

Re: Huge matrix

2014-09-17 Thread Debasish Das
RowMatrix and CoordinateMatrix to be templated on the value... Are you considering this in your design ? Thanks. Deb On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh wrote: > Better to do it in a PR of your own, it's not sufficiently related to > dimsum > > On Tue, Sep 9, 2014 at 7:03

Re: MLLib: LIBSVM issue

2014-09-17 Thread Debasish Das
We dump fairly big libsvm to compare against liblinear/libsvm...the following code dumps out libsvm format from SparseVector...

    def toLibSvm(features: SparseVector): String = {
      val indices = features.indices.map(_ + 1)
      val values = features.values
      indices.zip(values).mkString(" ").r
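
The snippet above is cut off by the archive; a hedged reconstruction that builds the 1-based "index:value" terms directly (the original appears to post-process the tuple string with a regex):

    import org.apache.spark.mllib.linalg.SparseVector

    def toLibSvm(features: SparseVector): String = {
      val indices = features.indices.map(_ + 1)  // libsvm indices are 1-based
      val values = features.values
      indices.zip(values).map { case (i, v) => s"$i:$v" }.mkString(" ")
    }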

Joining multiple rowMatrix

2014-09-18 Thread Debasish Das
Hi, I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j) and I would like to join multiple matrices to come up with a sqlTable for each key... What's the best way to do it ? Right now I am doing N joins if I want to combine data from N matrices which does not look quite r

Re: Huge matrix

2014-09-18 Thread Debasish Das
n the meantime, you can un-normalize the cosine similarities to get the > dot product, and then compute the other similarity measures from the dot > product. > > Best, > Reza > > > On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das > wrote: > >> Hi Reza, >> >

Re: MLLib regression model weights

2014-09-18 Thread Debasish Das
sc.parallelize(model.weights.toArray, blocks).top(k) will get that, right ? For logistic you might want both positive and negative features...so just pass it through a filter on abs and then pick top(k) On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak wrote: > Hi All, > > I am able to run LinearReg
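
A small sketch of the abs-then-top-k suggestion for signed weights; the weights array is a stand-in for model.weights.toArray:

    val weights = Array(0.5, -2.0, 1.5, -0.1)  // stand-in for model.weights.toArray
    val k = 2
    val strongest = weights.zipWithIndex
      .sortBy { case (w, _) => -math.abs(w) }
      .take(k)  // Array((-2.0, 1), (1.5, 2)): strongest features, either sign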

Re: Huge matrix

2014-09-18 Thread Debasish Das
. We can add jaccard and other similarity measures in > later PRs. > > In the meantime, you can un-normalize the cosine similarities to get the > dot product, and then compute the other similarity measures from the dot > product. > > Best, > Reza > > > On Wed, S

Re: Huge matrix

2014-09-18 Thread Debasish Das
The PR will updated > today. > Best, > Reza > > On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das > wrote: > >> Hi Reza, >> >> Have you tested if different runs of the algorithm produce different >> similarities (basically if the algorithm is deterministic) ?

Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi, I am building a dictionary of RDD[(String, Long)] and after the dictionary is built and cached, I find key "almonds" at value 5187 using:

    rdd.filter { case (product, index) => product == "almonds" }.collect

Output: Debug product almonds index 5187

Now I take the same dictionary and write it out
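
A sketch of one deterministic way to build such a dictionary, so the term-to-index assignment is reproducible across runs (sort before assigning indices); the data is illustrative:

    val products = sc.parallelize(Seq("almonds", "bread", "cheese"))
    // sortBy fixes the order before zipWithIndex assigns ids
    val dict = products.distinct().sortBy(identity).zipWithIndex().cache()
    dict.filter { case (p, _) => p == "almonds" }.collect()
    // Array((almonds,0)) for this toy data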
