Re: Is there any way to control the parallelism in LogisticRegression

2014-09-04 Thread Jiusheng Chen
…ly, > DB Tsai > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > On Wed, Sep 3, 2014 at 9:28 PM, Jiusheng Chen wrote: >> Thanks DB and Xiangrui. Glad to know you guys are actively working on i…

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
…7:34 PM, Xiangrui Meng wrote: > +DB & David (They implemented OWL-QN on Spark today.) > On Sep 3, 2014 7:18 PM, "Jiusheng Chen" wrote: >> Hi Xiangrui, >> A side question about MLlib. It looks like the current LBFGS…

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
> Assuming that your data is very sparse, I would recommend RDD.repartition. But if it is not the case and you don't want to shuffle the data, you can try a CombineInputFormat and then parse the lines into labeled points. Coale…
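The trade-off the advice above hinges on can be sketched as a tiny helper. This is an illustration only, not a Spark API: `chooseReshuffle` is a hypothetical name, and the point is simply that reducing the partition count can use `coalesce` (no shuffle), while increasing it needs `repartition` (full shuffle).

```scala
// Minimal sketch of the coalesce-vs-repartition decision discussed above.
// `chooseReshuffle` is a hypothetical helper, not part of Spark.
object PartitionAdvice {
  def chooseReshuffle(current: Int, target: Int): String =
    if (target < current) "coalesce"         // narrow dependency, avoids a shuffle
    else if (target > current) "repartition" // full shuffle redistributes records
    else "none"                              // already at the target count
}
```

For example, collapsing a job from 26K partitions down to a few thousand would favor `coalesce`, matching the no-shuffle suggestion in the quoted reply.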

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread Jiusheng Chen
How about increasing the HDFS file extent size? E.g., the current value is 128M; we could make it 512M or bigger. On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong wrote: > Hi all, > We are trying to use Spark MLlib to train super large data (100M features and 5B rows). The input data in HDFS has ~26K partit…
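The arithmetic behind this suggestion can be sketched as follows, under the simplifying assumption that Spark creates one input split per HDFS block (`expectedPartitions` is an illustrative helper, not a Spark API):

```scala
// Sketch of how HDFS block size drives input partition count,
// assuming one split per block (a simplification).
object BlockSizing {
  def expectedPartitions(totalBytes: Long, blockSize: Long): Long =
    (totalBytes + blockSize - 1) / blockSize // ceiling division
}
```

Under that assumption, an input that yields ~26K partitions at 128M blocks would yield roughly a quarter as many (~6.5K) at 512M.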

LabeledPoint with weight

2014-07-21 Thread Jiusheng Chen
It seems MLlib right now doesn't support weighted training; all training samples have equal importance. Weighted training can be very useful to reduce data size and speed up training. Do you have plans to support it in the future? The data format would be something like: label:weight index1:value1 inde…
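The proposed format could be parsed along these lines. This is a minimal sketch of the idea only: `WeightedPoint` and `parseWeighted` are hypothetical names, not part of MLlib, and no validation is done on malformed input.

```scala
// Hypothetical weighted sample: label, per-sample weight, sparse features.
case class WeightedPoint(label: Double, weight: Double, features: Map[Int, Double])

object WeightedFormat {
  // Parses a line shaped like "label:weight index1:value1 index2:value2 ...".
  def parseWeighted(line: String): WeightedPoint = {
    val tokens = line.trim.split("\\s+")
    // First token carries label and weight, colon-separated.
    val Array(label, weight) = tokens.head.split(":").map(_.toDouble)
    // Remaining tokens are sparse index:value feature pairs.
    val feats = tokens.tail.map { t =>
      val Array(i, v) = t.split(":")
      i.toInt -> v.toDouble
    }.toMap
    WeightedPoint(label, weight, feats)
  }
}
```

For example, `parseWeighted("1:0.5 3:2.0 7:1.5")` would yield a point with label 1.0, weight 0.5, and two sparse features.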