Re: [ml] Lost persistence for fold in crossvalidation.

Joseph Bradley Wed, 18 Feb 2015 14:55:17 -0800

Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844


On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng <men...@gmail.com> wrote:

> There are three different regParams defined in the grid and there are
> three folds. For simplicity, we didn't split the dataset into three
> parts and reuse them; instead we do the split for each fold, so the
> data is cached 3*3 times. Note that the pipeline API is not yet
> optimized for performance. It would be nice to optimize its
> performance in 1.4.
> -Xiangrui
>
> On Wed, Feb 11, 2015 at 11:13 AM, Peter Rudenko <petro.rude...@gmail.com>
> wrote:
> > Hi, I have a problem using Spark 1.2 with Pipeline + GridSearch +
> > LogisticRegression. I’ve reimplemented the LogisticRegression.fit method
> > and commented out instances.unpersist():
> >
> > |override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
> >     println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
> >     transformSchema(dataset.schema, paramMap, logging = true)
> >     import dataset.sqlContext._
> >     val map = this.paramMap ++ paramMap
> >     val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
> >       .map {
> >         case Row(label: Double, features: Vector) =>
> >           LabeledPoint(label, features)
> >       }
> >
> >     if (instances.getStorageLevel == StorageLevel.NONE) {
> >       println("Instances not persisted")
> >       instances.persist(StorageLevel.MEMORY_AND_DISK)
> >     }
> >
> >     val lr = (new LogisticRegressionWithLBFGS)
> >       .setValidateData(false)
> >       .setIntercept(true)
> >     lr.optimizer
> >       .setRegParam(map(regParam))
> >       .setNumIterations(map(maxIter))
> >     val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
> >     // instances.unpersist()
> >     // copy model params
> >     Params.inheritValues(map, this, lrm)
> >     lrm
> > }
> > |
> >
> > CrossValidator feeds the same SchemaRDD for each parameter (same hash
> > code), but somewhere the cache is being flushed. There is enough memory.
> > Here’s the output:
> >
> > |Fitting dataset 2051470010 with ParamMap {
> >     DRLogisticRegression-f35ae4d3-regParam: 0.1
> > }.
> > Instances not persisted
> > Fitting dataset 2051470010 with ParamMap {
> >     DRLogisticRegression-f35ae4d3-regParam: 0.01
> > }.
> > Instances not persisted
> > Fitting dataset 2051470010 with ParamMap {
> >     DRLogisticRegression-f35ae4d3-regParam: 0.001
> > }.
> > Instances not persisted
> > Fitting dataset 802615223 with ParamMap {
> >     DRLogisticRegression-f35ae4d3-regParam: 0.1
> > }.
> > Instances not persisted
> > Fitting dataset 802615223 with ParamMap {
> >     DRLogisticRegression-f35ae4d3-regParam: 0.01
> > }.
> > Instances not persisted
> > |
> >
> > I have 3 parameters in GridSearch and 3 folds for CrossValidation:
> >
> > |
> > val paramGrid = new ParamGridBuilder()
> >   .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
> >   .build()
> >
> > crossval.setEstimatorParamMaps(paramGrid)
> > crossval.setNumFolds(3)
> > |
> >
> > I assumed that the data would be read and cached 3 times, (1 to
> > numFolds).combinations(2), independent of the number of parameters.
> > But the data is being read and cached 9 times.
> >
> > Thanks,
> > Peter Rudenko
> >
> >
>
>
>
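
To make the 3*3 count in Xiangrui's reply concrete, here is a minimal
plain-Scala sketch of the two loop structures. It does not use Spark and is
not the CrossValidator implementation: FoldCachingSketch, kFold, perFit and
perFold are hypothetical names made up for this illustration, and incrementing
a counter stands in for "read the data and cache it". The grid values and the
fold count are the ones from the thread.

```scala
// Plain-Scala model of the two loop structures; no Spark required.
// All names here are hypothetical and exist only for illustration.
object FoldCachingSketch {
  val regParams: Seq[Double] = Seq(0.1, 0.01, 0.001) // grid from the thread
  val numFolds: Int = 3                              // folds from the thread

  // Split the indices 0 until n into (training, validation) pairs, one per fold.
  def kFold(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
    (0 until k).map { fold =>
      val (validation, training) = (0 until n).partition(_ % k == fold)
      (training, validation)
    }

  // Strategy A: every fit prepares its own training data, so the count grows
  // with both the number of folds and the size of the parameter grid.
  def perFit(n: Int): Int = {
    var materializations = 0
    for ((_, _) <- kFold(n, numFolds); _ <- regParams) {
      materializations += 1 // prepared again for every (fold, regParam) pair
    }
    materializations
  }

  // Strategy B: prepare the training data once per fold, outside the
  // parameter loop, and reuse it for every regParam.
  def perFold(n: Int): Int = {
    var materializations = 0
    for ((_, _) <- kFold(n, numFolds)) {
      materializations += 1      // prepared once per fold
      regParams.foreach(_ => ()) // each fit would reuse the prepared data here
    }
    materializations
  }

  def main(args: Array[String]): Unit = {
    println(s"per (fold, regParam): ${perFit(12)}")  // 3 * 3 = 9
    println(s"per fold:             ${perFold(12)}") // 3
  }
}
```

Strategy A matches the 9 reads reported in the original message; Strategy B
is the count Peter expected, which requires the prepared data to be shared
across all parameter settings within a fold.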
