Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844
On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng <men...@gmail.com> wrote:
> There are three different regParams defined in the grid and there are
> three folds. For simplicity, we didn't split the dataset into three and
> reuse the splits, but do the split for each fold. Then we need to cache
> 3*3 times. Note that the pipeline API is not yet optimized for
> performance. It would be nice to optimize its performance in 1.4.
>
> -Xiangrui
>
> On Wed, Feb 11, 2015 at 11:13 AM, Peter Rudenko <petro.rude...@gmail.com>
> wrote:
> > Hi, I have a problem using Spark 1.2 with Pipeline + GridSearch +
> > LogisticRegression. I’ve reimplemented the LogisticRegression.fit
> > method and commented out instances.unpersist():
> >
> > override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
> >   // Log a hash of the first rows to see which dataset each fit receives.
> >   println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
> >   transformSchema(dataset.schema, paramMap, logging = true)
> >   import dataset.sqlContext._
> >   val map = this.paramMap ++ paramMap
> >   val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
> >     .map {
> >       case Row(label: Double, features: Vector) =>
> >         LabeledPoint(label, features)
> >     }
> >
> >   // Persist only if the RDD is not cached yet, and report when that happens.
> >   if (instances.getStorageLevel == StorageLevel.NONE) {
> >     println("Instances not persisted")
> >     instances.persist(StorageLevel.MEMORY_AND_DISK)
> >   }
> >
> >   val lr = (new LogisticRegressionWithLBFGS)
> >     .setValidateData(false)
> >     .setIntercept(true)
> >   lr.optimizer
> >     .setRegParam(map(regParam))
> >     .setNumIterations(map(maxIter))
> >   val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
> >   // instances.unpersist()
> >   // copy model params
> >   Params.inheritValues(map, this, lrm)
> >   lrm
> > }
> >
> > CrossValidator feeds the same SchemaRDD for each parameter (same hash
> > code), but somewhere the cache is being flushed. There is enough memory.
> > Here’s the output:
> >
> > Fitting dataset 2051470010 with ParamMap {
> >   DRLogisticRegression-f35ae4d3-regParam: 0.1
> > }.
> > Instances not persisted
> > Fitting dataset 2051470010 with ParamMap {
> >   DRLogisticRegression-f35ae4d3-regParam: 0.01
> > }.
> > Instances not persisted
> > Fitting dataset 2051470010 with ParamMap {
> >   DRLogisticRegression-f35ae4d3-regParam: 0.001
> > }.
> > Instances not persisted
> > Fitting dataset 802615223 with ParamMap {
> >   DRLogisticRegression-f35ae4d3-regParam: 0.1
> > }.
> > Instances not persisted
> > Fitting dataset 802615223 with ParamMap {
> >   DRLogisticRegression-f35ae4d3-regParam: 0.01
> > }.
> > Instances not persisted
> >
> > I have 3 parameters in GridSearch and 3 folds for CrossValidation:
> >
> > val paramGrid = new ParamGridBuilder()
> >   .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
> >   .build()
> >
> > crossval.setEstimatorParamMaps(paramGrid)
> > crossval.setNumFolds(3)
> >
> > I assumed that the data would be read and cached 3 times
> > ((1 to numFolds).combinations(2)), independent of the number of
> > parameters. But I see the data being read and cached 9 times.
> >
> > Thanks,
> > Peter Rudenko
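
To make the 3*3 count in the quoted thread concrete, here is a minimal pure-Scala sketch with no Spark dependency. It is not CrossValidator's actual implementation: `CacheCountSketch`, `CacheCounter`, `kFold`, `cachePerFit` and `cachePerFold` are hypothetical names, and `persist()` is only a counter standing in for caching an RDD. It assumes that every fit call derives and persists its own training instances, as the modified fit in the quoted message does, and compares that with persisting once per fold and reusing the result across the parameter grid.

```scala
// Pure-Scala sketch; every name here is hypothetical and nothing calls Spark.
object CacheCountSketch {

  // Stand-in for persisting an RDD: only counts how often data is materialized.
  final class CacheCounter {
    var count: Int = 0
    def persist(): Unit = count += 1
  }

  // Split the indices 0 until n into k folds, returning (training, validation) pairs.
  def kFold(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
    (0 until k).map { fold =>
      val (validation, training) = (0 until n).partition(_ % k == fold)
      (training, validation)
    }

  // Strategy A: every fit call persists its own input, so the count is folds * params.
  def cachePerFit(numFolds: Int, regParams: Seq[Double]): Int = {
    val counter = new CacheCounter
    kFold(30, numFolds).foreach { _ =>
      regParams.foreach { _ =>
        counter.persist() // one persist per (fold, regParam) pair
      }
    }
    counter.count
  }

  // Strategy B: persist the training split once per fold and reuse it for the whole grid.
  def cachePerFold(numFolds: Int, regParams: Seq[Double]): Int = {
    val counter = new CacheCounter
    kFold(30, numFolds).foreach { _ =>
      counter.persist() // one persist per fold
      regParams.foreach(_ => ()) // each fit would reuse the cached split here
    }
    counter.count
  }

  def main(args: Array[String]): Unit = {
    val grid = Seq(0.1, 0.01, 0.001)
    println(s"persist per fit:  ${cachePerFit(3, grid)}")  // prints 9
    println(s"persist per fold: ${cachePerFold(3, grid)}") // prints 3
  }
}
```

With 3 folds and 3 regParam values, the first strategy persists 9 times and the second 3 times, which matches the 9 observed in the thread and the 3 that was expected.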