Re: SIGMOD System Award for Apache Spark

2022-05-15 Thread Debasish Das
Congratulations to the whole spark community ! It's a great achievement. On Sat, May 14, 2022, 2:49 AM Yikun Jiang wrote: > Awesome! Congrats to the whole community! > > On Fri, May 13, 2022 at 3:44 AM Matei Zaharia > wrote: > >> Hi all, >> >> We recently found out that Apache Spark received >>

Re: Welcome Xinrong Meng as a Spark committer

2022-08-10 Thread Debasish Das
Congratulations Xinrong ! On Tue, Aug 9, 2022, 10:00 PM Rui Wang wrote: > Congrats Xinrong! > > > -Rui > > On Tue, Aug 9, 2022 at 8:57 PM Xingbo Jiang wrote: > >> Congratulations! >> >> Yuanjian Li wrote on Tue, Aug 9, 2022 at 20:31: >> >>> Congratulations, Xinrong! >>> >>> XiDuo You wrote on Tue, Aug 9, 2022 at 19:18: >>

Re: Welcome two new Apache Spark committers

2023-08-06 Thread Debasish Das
Congratulations Peter and Xiduo. On Sun, Aug 6, 2023, 7:05 PM Wenchen Fan wrote: > Hi all, > > The Spark PMC recently voted to add two new committers. Please join me in > welcoming them to their new role! > > - Peter Toth (Spark SQL) > - Xiduo You (Spark SQL) > > They consistently make contribut

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Debasish Das
+1 Is there any design doc related to API/internal changes ? Will CP be the default in structured streaming or is it a mode in conjunction with existing behavior ? Thanks. Deb On Nov 1, 2017 8:37 AM, "Reynold Xin" wrote: Earlier I sent out a discussion thread for CP in Structured Streaming: ht

Hinge Gradient

2017-12-13 Thread Debasish Das
Hi, I looked into the LinearSVC flow and found the gradient for hinge as follows: Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x))) Therefore the gradient is -(2y - 1)*x max is a non-smooth function. Did we try using ReLu/Softmax function and use that to smooth the hinge los
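The loss and gradient in the message above can be sketched in plain Python. This is not Spark's LinearSVC code: it assumes a linear model f_w(x) = w . x, and the function name is hypothetical. Note that -(2y - 1)*x is the gradient only where the margin term is positive; elsewhere the subgradient is zero, which is the non-smoothness the message refers to.

```python
def hinge_loss_and_subgradient(w, x, y):
    """Hinge loss max(0, 1 - (2y - 1) * f_w(x)) for a label y in {0, 1}.

    Assumes a linear model f_w(x) = w . x (an assumption of this sketch).
    Returns (loss, subgradient with respect to w).
    """
    s = 2 * y - 1  # map {0, 1} labels to {-1, +1}
    f = sum(wi * xi for wi, xi in zip(w, x))
    margin = 1.0 - s * f
    if margin > 0:
        # Hinge is active: the gradient is -(2y - 1) * x.
        return margin, [-s * xi for xi in x]
    # Hinge is inactive: the loss is flat, so the subgradient is zero.
    return 0.0, [0.0] * len(x)
```

Calling it with w = [0.5, -1.0] and x = [1.0, 2.0] exercises both branches: the label y = 1 violates the margin, while y = 0 does not.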

Re: Hinge Gradient

2017-12-16 Thread Debasish Das
re that proves changing max to soft-max can behave > well? > I’m more than happy to see some benchmarks if you can have. > > + Yuhao, who did similar effort in this PR: https://github.com/apache/ > spark/pull/17862 > > Regards > Yanbo > > On Dec 13, 2017, at 12:20 A

Re: Hinge Gradient

2017-12-17 Thread Debasish Das
If you can point me to previous benchmarks that are done, I would like to use smoothing and see if the LBFGS convergence improved while not impacting linear svc loss. Thanks. Deb On Dec 16, 2017 7:48 PM, "Debasish Das" wrote: Hi Weichen, Traditionally svm are solved using quadratic p

ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi, ECOS is a solver for second order conic programs and we showed the Spark integration at 2014 Spark Summit https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/. Right now the examples show how to reformulate matrix factorization as a SOCP and solve ea

Re: Spark Improvement Proposals

2016-10-16 Thread Debasish Das
Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as soon as I looked into it since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun...But now as we went deeper with Spark and real-time streaming use-case gets more prominent, I thin

Re: matrix factorization cross validation

2014-11-03 Thread Debasish Das
from Mailbox > > > > > > On Thu, Oct 30, 2014 at 11:24 PM, Sean Owen wrote: > >> > >> MAP is effectively an average over all k from 1 to min(# > >> recommendations, # items rated) Getting first recommendations right is > >> more important than the

MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Debasish Das
Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to d

Issues with AbstractParams

2014-11-04 Thread Debasish Das
Hi, I built the master today and I was testing IR statistics on movielens dataset (will open up a PR in a bit)... Right now in the master examples.MovieLensALS, case class Params extends AbstractParams[Params] On my localhost spark, if I run as follows it fails: ./bin/spark-submit --master spark:// t

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Debasish Das
+1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara wrote: > +1 > > Sean > > On Nov 5, 2014, at 6:32 PM, Matei Zaharia wrote: > > > Hi all, > > > > I wanted to share a discussion we've been having on the PMC list, as > well as call for an

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
e to cache the models to make userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng wrote: > Was "user" presented in training? We can put a check there and return > NaN if the user is not included in the model. -Xiangrui > > On Mon, Nov 3,

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
that we can calculate MAP statistics on large samples of data ? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng wrote: > ALS model contains RDDs. So you cannot put `model.recommendProducts` > inside a RDD closure `userProductsRDD.map`. -Xiangrui > > On Thu, Nov 6, 2014 at 4:39 PM, Deba

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-10 Thread Debasish Das
066 > > The easiest case is when one side is small. If both sides are large, > this is a super-expensive operation. We can do block-wise cross > product and then find top-k for each user. > > Best, > Xiangrui > > On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das > wrote: > &

TimSort in 1.2

2014-11-13 Thread Debasish Das
Hi, I am noticing the first step for Spark jobs does a TimSort in 1.2 branch...and there is some time spent doing the TimSort...Is this assigning the RDD blocks to different nodes based on a sort order ? Could someone please point to a JIRA about this change so that I can read more about it ? Th

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Debasish Das
Andrew, I put up 1.1.1 branch and I am getting shuffle failures while doing flatMap followed by groupBy...My cluster memory is less than the memory I need and therefore flatMap does around 400 GB of shuffle...memory is around 120 GB... 14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stag

Using sampleByKey

2014-11-17 Thread Debasish Das
Hi, I have an RDD whose key is a userId and value is (movieId, rating)... I want to sample 80% of the (movieId,rating) that each userId has seen for train, rest is for test... val indexedRating = sc.textFile(...).map{x=> Rating(x(0), x(1), x(2)) val keyedRatings = indexedRating.map{x => (x.produ
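The per-user 80/20 split described in the message can be sketched in plain Python. This is not Spark's sampleByKey, which samples each key at a given fraction; it is an exact per-key split that assumes all ratings fit in memory. The function name and the tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def split_per_user(ratings, train_fraction=0.8, seed=42):
    """Split (user_id, movie_id, rating) tuples into train/test per user."""
    # Group the (movie_id, rating) pairs by user.
    by_user = defaultdict(list)
    for user_id, movie_id, rating in ratings:
        by_user[user_id].append((movie_id, rating))

    rng = random.Random(seed)
    train, test = [], []
    for user_id, items in by_user.items():
        # Shuffle each user's items, then cut at the requested fraction.
        rng.shuffle(items)
        cut = int(round(train_fraction * len(items)))
        train += [(user_id, m, r) for m, r in items[:cut]]
        test += [(user_id, m, r) for m, r in items[cut:]]
    return train, test
```

Because the cut is made per user, every user with enough ratings appears in both sets, which a global random split does not guarantee.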

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
t; > I am not sure why your subtract does not work. I suspect it is because > the values do not partition the same way, or they don't evaluate > equality in the expected way, but I don't see any reason why. Tuples > work as expected here. > > On Tue, Nov 18, 2014 at 4:32

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
missing in training and appears in test, we can simply > ignore it. -Xiangrui > > On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das > wrote: > > Sean, > > > > I thought sampleByKey (stratified sampling) in 1.1 was designed to solve > > the problem that randomSpl

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Debasish Das
-1 from me...same FetchFailed issue as what Hector saw... I am running Netflix dataset and dumping out recommendation for all users. It shuffles around 100 GB data on disk to run a reduceByKey per user on utils.BoundedPriorityQueue...The code runs fine with MovieLens1m dataset... I gave Spark 10

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-24 Thread Debasish Das
with Jellyfish code http://i.stanford.edu/hazy/victor/Hogwild/), will reproduce the failure... https://issues.apache.org/jira/browse/SPARK-4231 The failed job I will debug more and figure out the real cause. If needed I will open up new JIRAs. On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das wrote

Row Similarity

2014-12-10 Thread Debasish Das
Hi, It seems there are multiple places where we would like to compute row similarity (accurate or approximate similarities) Basically through RowMatrix columnSimilarities we can compute column similarities of a tall skinny matrix Similarly we should have an API in RowMatrix called rowSimilaritie
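For reference, the exact (non-approximate) version of row similarity can be sketched in plain Python as all-pairs cosine similarity, which corresponds to the entries of A A^T after normalizing each row. This is a brute-force illustration, not the RowMatrix API; the function name is hypothetical and the quadratic cost in the number of rows is why a distributed or approximate variant would be needed at scale.

```python
import math


def row_similarities(rows):
    """Brute-force cosine similarity between all pairs of rows.

    Returns a dict mapping (i, j) with i < j to the cosine similarity.
    Pairs involving an all-zero row are skipped.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in rows]
    sims = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims
```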

Re: Row Similarity

2014-12-10 Thread Debasish Das
a matrix A (i.e. computing > AA^T, which is expensive). > > There is a JIRA to track handling (1) and (2) more efficiently than > computing all pairs: https://issues.apache.org/jira/browse/SPARK-3066 > > > > On Wed, Dec 10, 2014 at 2:44 PM, Debasish Das > wrote: > >>

Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
For CDH this works well for me...tested till 5.1... ./make-distribution -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -Phive -DskipTests To build with hive thriftserver support for spark-sql On Fri, Dec 12, 2014 at 1:41 PM, Ganelin, Ilya wrote: > > Hi all – we’re running CDH 5.2 and would

Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
protobuf comes from missing -Phadoop-2.3 On Fri, Dec 12, 2014 at 2:34 PM, Sean Owen wrote: > > What errors do you see? protobuf errors usually mean you didn't build > for the right version of Hadoop, but if you are using -Phadoop-2.3 or > better -Phadoop-2.4 that should be fine. Yes, a stack trace

Re: Welcoming three new committers

2015-02-03 Thread Debasish Das
Congratulations ! Keep helping the community :-) On Tue, Feb 3, 2015 at 5:34 PM, Denny Lee wrote: > Awesome stuff - congratulations! :) > > On Tue Feb 03 2015 at 5:34:06 PM Chao Chen wrote: > > > Congratulations guys, well done! > > > > On 15-2-4 at 9:26 AM, Nan Zhu wrote: > > > Congratulations! > > >

Batch prediciton for ALS

2015-02-10 Thread Debasish Das
Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction API in ALS will be useful for us who want to cross validate on prec@k and MAP... Thanks. Deb

mllib.recommendation Design

2015-02-13 Thread Debasish Das
Hi, I am a bit confused about the mllib design in the master. I thought that core algorithms will stay in mllib and ml will define the pipelines over the core algorithm but looks like in master ALS is moved from mllib to ml... I am refactoring my PR to a factorization package and I want to build it on

Re: Batch prediciton for ALS

2015-02-17 Thread Debasish Das
r > pass on your PR today. -Xiangrui > > On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das > wrote: > > Hi, > > > > Will it be possible to merge this PR to 1.3 ? > > > > https://github.com/apache/spark/pull/3098 > > > > The batch prediction API

Re: mllib.recommendation Design

2015-02-17 Thread Debasish Das
d fit. For a general matrix factorization package, let's > make a JIRA and move our discussion there. -Xiangrui > > On Fri, Feb 13, 2015 at 7:46 AM, Debasish Das > wrote: > > Hi, > > > > I am bit confused on the mllib design in the master. I thought that core > >

Re: Batch prediciton for ALS

2015-02-18 Thread Debasish Das
7, 2015 at 4:10 PM, Debasish Das > wrote: > > It will really help us if we merge it but I guess it is already > diverged > > from the new ALS...I will also take a look at it again and try to update > with > > the new ALS... > > > > On Tue, Feb 17, 2015 at 3:2

If job fails shuffle space is not cleaned

2015-02-18 Thread Debasish Das
Hi, Some of my jobs failed due to no space left on device and on those jobs I was monitoring the shuffle space...when the job failed shuffle space did not clean and I had to manually clean it... Is there a JIRA already tracking this issue ? If no one has been assigned to it, I can take a look. T

Re: Have Friedman's glmnet algo running in Spark

2015-02-25 Thread Debasish Das
Any reason why the regularization path cannot be implemented using current owlqn pr ? We can change owlqn in breeze to fit your needs... On Feb 24, 2015 3:27 PM, "Joseph Bradley" wrote: > Hi Mike, > > I'm not aware of a "standard" big dataset, but there are a number > available: > * The YearPre

Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-18 Thread Debasish Das
Hi David, We are stress testing breeze.optimize.proximal and nnls...if you are cutting a release now, we will need another release soon once we get the runtime optimizations in place and merged to breeze. Thanks. Deb On Mar 15, 2015 9:39 PM, "David Hall" wrote: > snapshot is pushed. If you ver

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-18 Thread Debasish Das
dgemm, dgemv and dot come to Breeze and Spark through netlib-java. Right now in both dot and dgemv Breeze does an extra memory allocation but we already found the issue and we are working on adding a common trait that will provide a sink operation (basically memory will be allocated by user)...addi

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-19 Thread Debasish Das
> > Also, could someone please elaborate on the linalg.BLAS and Matrix? Are > they going to be developed further, should in long term all developers use > them? > > Best regards, Alexander > > On 18.03.2015, at 23:21, "Debasish Das" wrote: > > dgemm dg

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-19 Thread Debasish Das
nctions I need, that can be found in Breeze (and netlib-java). The same > concerns are applicable to MLlib Vector. > > Best regards, Alexander > > On 19.03.2015, at 14:16, "Debasish Das" wrote: > > I think for Breeze we are focused on dot and dgemv right now (al

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-21 Thread Debasish Das
ind the JIRA to track this here: SPARK-6442 > <https://issues.apache.org/jira/browse/SPARK-6442> > > The design doc is here: http://goo.gl/sf5LCE > > We would very much appreciate your feedback and input. > > Best, > Burak > > On Thu, Mar 19, 2015 at 3:06 PM,

Re: mllib.recommendation Design

2015-03-25 Thread Debasish Das
that ALM will support MAP (and may be KL divergence loss) with sparsity constraints (probability simplex and bounds are fine for what I am focused at right now)... Thanks. Deb On Tue, Feb 17, 2015 at 4:40 PM, Debasish Das wrote: > There is a usability difference...I am not sure if recommenda

LogisticGradient Design

2015-03-25 Thread Debasish Das
Hi, Right now LogisticGradient implements both binary and multi-class in the same class using an if-else statement which is a bit convoluted. For Generalized matrix factorization, if the data has distinct ratings I want to use LeastSquareGradient (regression has given best results to date) but if

Re: LogisticGradient Design

2015-03-25 Thread Debasish Das
rmance hit we take from combining > > binary & multiclass logistic loss/gradient. If it's not a big hit, then > it > > might be simpler from an outside API perspective to keep them in 1 class > > (even if it's more complicated within). > > Joseph > >

Re: mllib.recommendation Design

2015-03-30 Thread Debasish Das
as I see the result. I am not sure if it is supported by public packages like graphlab or scikit but the plsa papers show interesting results. On Mar 30, 2015 2:31 PM, "Xiangrui Meng" wrote: > On Wed, Mar 25, 2015 at 7:59 AM, Debasish Das > wrote: > > Hi Xiangrui, > >

ADMM based proximal flow

2015-03-31 Thread Debasish Das
Hi, We recently added ADMM based proximal algorithm in breeze.optimize.proximal.NonlinearMinimizer which uses a combination of BFGS and proximal algorithms (soft thresholding for L1 for example) to solve large scale constrained optimization problem of form f(x) + g(z). Its usage is similar to curr

Re: How can I do pair-wise computation between RDD feature columns?

2015-05-16 Thread Debasish Das
I opened it up today but it should help you: https://github.com/apache/spark/pull/6213 On Sat, May 16, 2015 at 6:18 PM, Chunnan Yao wrote: > Hi all, > Recently I've ran into a scenario to conduct two sample tests between all > paired combination of columns of an RDD. But the networking load and

IndexedRowMatrix semantics

2015-05-20 Thread Debasish Das
Hi, For indexedrowmatrix and rowmatrix, both take RDD(vector)... is it possible that it has intermixed dense and sparse vector... basically I am considering a gemv flow when indexedrowmatrix has dense flag true, dot flow otherwise... Thanks. Deb

Kryo option changed

2015-05-23 Thread Debasish Das
Hi, I am on last week's master but all the examples that set up the following .set("spark.kryoserializer.buffer", "8m") are failing with the following error: Exception in thread "main" java.lang.IllegalArgumentException: spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb. loo

spark packages

2015-05-23 Thread Debasish Das
Hi, Is it possible to add GPL/LGPL code on spark packages or it must be licensed under Apache as well ? I want to expose Professor Tim Davis's LGPL library for sparse algebra and ECOS GPL library through the package. Thanks. Deb

Re: Kryo option changed

2015-05-23 Thread Debasish Das
Tried "8mb"...still I am failing on the same error... On Sat, May 23, 2015 at 6:10 PM, Ted Yu wrote: > bq. it should be "8mb" > > Please use the above syntax. > > Cheers > > On Sat, May 23, 2015 at 6:04 PM, Debasish Das > wrote: > >> Hi,

Power iteration clustering

2015-05-23 Thread Debasish Das
Hi, What was the motivation to write power iteration clustering using graphx and not a vector matrix multiplication over similarity matrix represented as say coordinate matrix ? We can use gemv in that flow to block the computation. Over graphx can we do all k eigen vector computation together b

Re: spark packages

2015-05-24 Thread Debasish Das
, May 23, 2015, Patrick Wendell wrote: >> >>> Yes - spark packages can include non ASF licenses. >>> >>> On Sat, May 23, 2015 at 6:16 PM, Debasish Das >>> wrote: >>> > Hi, >>> > >>> > Is it possible to add GPL/LGPL c

Re: Kryo option changed

2015-05-24 Thread Debasish Das
23, 2015 at 6:37 PM, Ted Yu wrote: > >> Pardon me. >> >> Please use '8192k' >> >> Cheers >> >> On Sat, May 23, 2015 at 6:24 PM, Debasish Das >> wrote: >> >>> Tried "8mb"...still I am failing on the s

Re: Power iteration clustering

2015-05-26 Thread Debasish Das
5:53 PM, "Joseph Bradley" wrote: > That's a good question; I could imagine it being much more efficient if > kept in a BlockMatrix and using BLAS2 ops. > > On Sat, May 23, 2015 at 8:09 PM, Debasish Das > wrote: > >> Hi, >> >> What was the m

Re: GraphX implementation of ALS?

2015-05-26 Thread Debasish Das
In general for implicit feedback in als you have to do a blocked gram matrix calculation which might not fit in graphx flow and lot of blocked operations can be used...but if your loss is likelihood or kl divergence or just simple sgd update rules and not least square then graphx idea makes sense..

Streaming data + Blocked Model

2015-05-28 Thread Debasish Das
Hi, We want to keep the model created and loaded in memory through Spark batch context since blocked matrix operations are required to optimize on runtime. The data is streamed in through Kafka / raw sockets and Spark Streaming Context. We want to run some prediction operations with the streaming

Impala created parquet tables

2015-06-20 Thread Debasish Das
Hi, I have some impala created parquet tables which hive 0.13.2 can read fine. Now the same table when I want to read using Spark SQL 1.3 I am getting exception class exception that parquet.hive.serde.ParquetHiveSerde not found. I am assuming that hive somewhere is putting the parquet-hive-bundl

Re: Welcoming some new committers

2015-06-20 Thread Debasish Das
Congratulations to All. DB great work in bringing quasi newton methods to Spark ! On Wed, Jun 17, 2015 at 3:18 PM, Chester Chen wrote: > Congratulations to All. > > DB and Sandy, great works ! > > > On Wed, Jun 17, 2015 at 3:12 PM, Matei Zaharia > wrote: > >> Hey all, >> >> Over the past 1.5 m

Velox Model Server

2015-06-20 Thread Debasish Das
Hi, The demo of end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based upon Velox or it uses a completely different architecture. https://github.com/amplab/velox-modelserver We are looking for an open s

Spark SQL 1.3 Exception

2015-06-24 Thread Debasish Das
, 2015 at 12:21 AM, Debasish Das wrote: > Hi, > > I have some impala created parquet tables which hive 0.13.2 can read fine. > > Now the same table when I want to read using Spark SQL 1.3 I am getting > exception class exception that parquet.hive.serde.ParquetHiveSerde not

Gossip protocol in Master selection

2015-06-28 Thread Debasish Das
Hi, Akka cluster uses gossip protocol for Master election. The approach in Spark right now is to use Zookeeper for high availability. Interestingly Cassandra and Redis clusters are both using Gossip protocol. I am not sure what is the default behavior right now. If the master dies and zookeeper

Re: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-07-22 Thread Debasish Das
Does it also support insert operations ? On Jul 22, 2015 4:53 PM, "Bing Xiao (Bing)" wrote: > We are happy to announce the availability of the Spark SQL on HBase > 1.0.0 release. > http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase > > The main features in this package, dubbed “As

Confidence in implicit factorization

2015-07-25 Thread Debasish Das
Hi, Implicit factorization is important for us since it drives recommendation when modeling user click/no-click and also topic modeling to handle 0 counts in document x word matrices through NMF and Sparse Coding. I am a bit confused on this code: val c1 = alpha * math.abs(rating) if (rating > 0

Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
Instead the rating matrix > is the thing being factorized directly. > > On Sun, Jul 26, 2015 at 6:45 AM, Debasish Das > wrote: > > Hi, > > > > Implicit factorization is important for us since it drives recommendation > > when modeling user click/no-click and

Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
I will think further but in the current implicit formulation with confidence, looks like I am factorizing a 0/1 matrix with weights 1 + alpha*rating for observed (1) values and 1 for unobserved (0) values. It's a bit different from LSA model. >> On Sun, Jul 26, 2015 at 6:45 AM, D
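The weighting described in the message above can be written down as a short Python sketch: a weighted squared error on a 0/1 preference matrix, with weight 1 + alpha*rating on observed entries and 1 on unobserved ones. This illustrates that description only, not Spark's ALS code; regularization is omitted, and the function name and the dense list-of-lists layout are assumptions made for the example.

```python
def implicit_weighted_loss(ratings, predictions, alpha):
    """Weighted squared error for the implicit-feedback formulation.

    ratings[i][j] is the raw count (0 means unobserved) and
    predictions[i][j] is the model's score for that cell.
    """
    total = 0.0
    for r_row, p_row in zip(ratings, predictions):
        for r, p in zip(r_row, p_row):
            # Observed cells become preference 1, unobserved cells 0.
            preference = 1.0 if r > 0 else 0.0
            # Observed cells get weight 1 + alpha * rating, others weight 1.
            confidence = 1.0 + alpha * r if r > 0 else 1.0
            total += confidence * (preference - p) ** 2
    return total
```

With one observed count of 2, one unobserved cell, both predictions at 0.5 and alpha = 1, the observed cell contributes 3 * 0.25 and the unobserved cell 1 * 0.25.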

Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
10x more > than the latter. It's very heavily skewed to pay attention to the > high-count instances. > > > On Sun, Jul 26, 2015 at 9:19 AM, Debasish Das > wrote: > > Yeah, I think the idea of confidence is a bit different than what I am > > looking for using

Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
unt instances. > > > On Sun, Jul 26, 2015 at 9:19 AM, Debasish Das > wrote: > > Yeah, I think the idea of confidence is a bit different than what I am > > looking for using implicit factorization to do document clustering. > > > > I basically need (r_ij - w_ih_j)^

RE: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-07-27 Thread Debasish Das
Hi Yan, Is it possible to access the hbase table through spark sql jdbc layer ? Thanks. Deb On Jul 22, 2015 9:03 PM, "Yan Zhou.sc" wrote: > Yes, but not all SQL-standard insert variants . > > > > *From:* Debasish Das [mailto:debasish.da...@gmail.com] > *Sent:* Wedn

Re: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-07-28 Thread Debasish Das
t; > > Graphically, the access path is as follows: > > > > Spark SQL JDBC Interface -> Spark SQL Parser/Analyzer/Optimizer->Astro > Optimizer-> HBase Scans/Gets -> … -> HBase Region server > > > > > > Regards, > > > > Yan > > >

Re: RDD API patterns

2015-09-17 Thread Debasish Das
Rdd nesting can lead to recursive nesting...i would like to know the usecase and why join can't support it...you can always expose an api over a rdd and access that in another rdd mappartition...use a external data source like hbase cassandra redis to support the api... For ur case group by and th

Re: Using spark MLlib without installing Spark

2015-11-26 Thread Debasish Das
Decoupling mllib and core is difficult...it is not intended to run spark core 1.5 with spark mllib 1.6 snapshot...core is more stabilized due to new algorithms getting added to mllib and sometimes you might be tempted to do that but it's not recommended. On Nov 21, 2015 8:04 PM, "Reynold Xin" wrote:

Development methodology

2014-03-01 Thread Debasish Das
Hi, We have a mirror repo of spark at our internal stash. We are adding changes to a fork of the mirror so that down the line we can push the contributions back to Spark git. I am not sure what's the exact development methodology we should follow as things are a bit complicated due to enterp

Re: Development methodology

2014-03-01 Thread Debasish Das
mean by enterprise stash. > > But PR is a concept unique to Github. There is no PR model in normal git or > the git ASF maintains. > > > On Sat, Mar 1, 2014 at 11:28 AM, Debasish Das >wrote: > > > Hi, > > > > We have a mirror repo of spark at our internal sta

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-02 Thread Debasish Das
Hi DB, 1. Could you point to the BFGS repositories used to publish artifacts to maven central ? What's the best way to add changes to it ? I fork the repo at my github ? Basically as I mentioned before I need to add lbfgs-b, orthant wise for L1 handling and few variants of line search to lbfgs...

Re: Development methodology

2014-03-02 Thread Debasish Das
l requests has to come through github ? I could merge for example @dbtsai github lbfgs branch to my branch at stash... Thanks. Deb On Sat, Mar 1, 2014 at 12:43 PM, Debasish Das wrote: > Stash is an enterprise git from atlassian.. > > I got it...Basically the PRs are managed by gith

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-03 Thread Debasish Das
se support as well. Do we have good infrastructure > around this? > > Thanks. > > Sincerely, > > DB Tsai > Machine Learning Engineer > Alpine Data Labs > ------ > Web: http://alpinenow.com/ > > > On Sun, Mar 2, 2014 at 10:23 AM

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-04 Thread Debasish Das
Yeah we should move f2j L-BFGS and L-BFGS-B to breeze..they already have 2 line searches..also the OWL-QN outline... Hi Xiangrui, What's the plan on the PR ? https://github.com/apache/incubator-spark/pull/575 Will you add breeze as a dependency for the sparse support ? I looked at your branch h

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread Debasish Das
Hi David, Few questions on breeze solvers: 1. I feel the right place of adding useful things from RISO LBFGS (based on Professor Nocedal's fortran code) will be breeze. It will involve stress testing breeze LBFGS on large sparse datasets and contributing fixes to existing breeze LBFGS with the le

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread Debasish Das
David, There used to be standard BFGS testcases in Professor Nocedal's package...did you stress test the solver with them ? If not I will shoot him an email for them. Thanks. Deb On Wed, Mar 5, 2014 at 2:00 PM, David Hall wrote: > On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai wrote: > > > Hi Dav

ALS Solve.solvePositive

2014-03-06 Thread Debasish Das
Hi, I am running ALS on a sparse problem (10M x 1M) and I am getting the following error: org.jblas.exceptions.LapackArgumentException: LAPACK DPOSV: Leading minor of order i of A is not positive definite. at org.jblas.SimpleBlas.posv(SimpleBlas.java:373) at org.jblas.Solve.solvePositive(Solve.ja

ALS solve.solvePositive

2014-03-06 Thread Debasish Das
Hi, I am running ALS on a sparse problem (10M x 1M) and I am getting the following error: org.jblas.exceptions.LapackArgumentException: LAPACK DPOSV: Leading minor of order i of A is not positive definite. at org.jblas.SimpleBlas.posv(SimpleBlas.java:373) at org.jblas.Solve.solvePositive(Solve.ja

QR decomposition in Spark ALS

2014-03-06 Thread Debasish Das
> definite. Therefore, we chose QR decomposition to solve the linear system. > > > --sebastian > > > On 03/06/2014 03:44 PM, Debasish Das wrote: > >> Hi, >> >> I am running ALS on a sparse problem (10M x 1M) and I am getting the >> following error: >&

Re: QR decomposition in Spark ALS

2014-03-06 Thread Debasish Das
- > Sean Owen | Director, Data Science | London > > > On Thu, Mar 6, 2014 at 3:05 PM, Debasish Das > wrote: > > Hi Sebastian, > > > > Yes Mahout ALS and Oryx runs fine on the same matrix because Sean calls > QR > > decomposition. > > > >

Re: QR decomposition in Spark ALS

2014-03-06 Thread Debasish Das
those are for first order solves... On Thu, Mar 6, 2014 at 9:21 AM, Debasish Das wrote: > Yes that will be really cool if the data has linearly independent rows ! I > have to debug it more but I got it running with jblas Solve.solve.. > > I will try breeze QR decomposition next.

Re: QR decomposition in Spark ALS

2014-03-06 Thread Debasish Das
guarantee 100% > > that I haven't missed something there.) > > > > Even though your data is huge, if it was generated by some synthetic > > process, maybe it is very low rank? > > > > QR decomposition is pretty good here, yes. > > -- > > Sean Owen | Direct

Re: ALS solve.solvePositive

2014-03-07 Thread Debasish Das
Hi Xiangrui, I used lambda = 0.1...It is possible that 2 users ranked in movies in a very similar way... I agree that increasing lambda will solve the problem but you agree this is not a solution...lambda should be tuned based on sparsity / other criteria and not to make a linearly dependent hess

Re: ALS solve.solvePositive

2014-03-11 Thread Debasish Das
iangrui Meng wrote: > > Choosing lambda = 0.1 shouldn't lead to the error you got. This is > > probably a bug. Do you mind sharing a small amount of data that can > > re-produce the error? -Xiangrui > > > > On Fri, Mar 7, 2014 at 8:24 AM, Debasish Das > wrote

Maximum memory limits

2014-03-16 Thread Debasish Das
Hi, I gave my spark job 16 gb of memory and it is running on 8 executors. The job needs more memory due to ALS requirements (20M x 1M matrix) On each node I do have 96 gb of memory and I am using 16 gb out of it. I want to increase the memory but I am not sure what is the right way to do that...

Re: ALS solve.solvePositive

2014-03-19 Thread Debasish Das
a lot. Thanks. Deb On Wed, Mar 19, 2014 at 10:11 AM, Xiangrui Meng wrote: > Another question: do you have negative or out-of-range user or product > ids or? -Xiangrui > > On Tue, Mar 11, 2014 at 8:00 PM, Debasish Das > wrote: > > Nope..I did not test implicit feedback yet...w

Re: new Catalyst/SQL component merged into master

2014-03-21 Thread Debasish Das
Awesome news ! It will be great if there are any examples or usecases to look at ? We are looking into shark/ooyala job server to give in memory sql analytics, model serving/scoring features for dashboard apps... Does this feature have different usecases than shark or more cleaner as hive depende

ALS memory limits

2014-03-25 Thread Debasish Das
StackOverflow >> and >> > ALS -- that's why I snuck in a relatively paltry 40 features and pruned >> > questions with <4 tags :) ) >> > >> > I don't think jblas has anything to do with it per se, and the >> allocation >> > fails in Java cod

Re: ALS memory limits

2014-03-26 Thread Debasish Das
iles. You may have to use ulimit to increase the number of open files > allowed. > > On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das >wrote: > > > Hi, > > > > For our usecases we are looking into 20 x 1M matrices which comes in the > > similar ranges as outlined by the

Re: ALS memory limits

2014-03-26 Thread Debasish Das
useful as we deploy the solution... On Wed, Mar 26, 2014 at 7:31 AM, Debasish Das wrote: > Thanks Sean. Looking into executor memory options now... > > I am at incubator_spark head. Does that has all the fixes or I need spark > head ? I can deploy the spark head as well... > >

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-03-27 Thread Debasish Das
Hi Matei, I am hitting similar problems with 10 ALS iterations...I am running with 24 gb executor memory on 10 nodes for 20M x 3 M matrix with rank =50 The first iteration of flatMaps run fine which means that the memory requirements are good per iteration... If I do check-pointing on RDD, most

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-30 Thread Debasish Das
Hi David, I have started to experiment with BFGS solvers for Spark GLM over large scale data... I am also looking to add a good QP solver in breeze that can be used in Spark ALS for constraint solves...More details on that soon... I could not load up breeze 0.7 code onto eclipse...There is a fol

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-31 Thread Debasish Das
will need...In ALS for example X^TX = I and Y^TY = I are interesting constraints for orthogonality...and they are quadratic constraints...With BFGS and CG, it is difficult to handle quadratic constraints... On Sun, Mar 30, 2014 at 4:40 PM, David Hall wrote: > On Sun, Mar 30, 2014 at 2:01 PM,

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-31 Thread Debasish Das
el it will work fine...On Mar 31, 2014 10:34 AM, "David Hall" wrote: > On Mon, Mar 31, 2014 at 10:15 AM, Debasish Das >wrote: > > > I added eclipse support in my qp branch: > > > > https://github.com/debasish83/breeze/tree/qp > > > Ok, great. Totally fin

Recent heartbeats

2014-04-04 Thread Debasish Das
Hi, Also posted it on user but then I realized it might be more involved. In my ALS runs I am noticing messages that complain about heart beats: 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms exceed

Re: Recent heartbeats

2014-04-05 Thread Debasish Das
Thanks Patrick...I searched in the archives and found the answer...tuning the akka and gc params On Fri, Apr 4, 2014 at 10:35 PM, Patrick Wendell wrote: > I answered this over on the user list... > > > On Fri, Apr 4, 2014 at 6:13 PM, Debasish Das >wrote: > > > Hi

Master compilation

2014-04-05 Thread Debasish Das
I am synced with apache/spark master but getting error in spark/sql compilation... Is the master broken ? [info] Compiling 34 Scala sources to /home/debasish/spark_deploy/sql/core/target/scala-2.10/classes... [error] /home/debasish/spark_deploy/sql/core/src/main/scala/org/apache/spark/sql/parquet
