MLUtils.kFold generates overlapping training and validation sets?

Nan Zhu Thu, 09 Oct 2014 10:50:59 -0700

Hi, all  

When we use MLUtils.kFold to generate training and validation sets for
cross-validation, we found that the two sets overlap.

From the code, it samples the same dataset twice:

  @Experimental
  def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
    val numFoldsF = numFolds.toFloat
    (1 to numFolds).map { fold =>
      val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
        complement = false)
      val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
      val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
      (training, validation)
    }.toArray
  }


Even though the second sampler is the complement of the first, it still seems
possible to generate overlapping training and validation sets, because the
sampling method looks like this:

  override def sample(items: Iterator[T]): Iterator[T] = {
    items.filter { item =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }
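
To make the question concrete, here is a minimal standalone sketch in plain
Scala (no Spark). It assumes the sampler behaves like the filter above, with a
fresh random generator created from the seed on each pass. The names
KFoldOverlapSketch and sampleRange are hypothetical and are not part of Spark.

```scala
import scala.util.Random

object KFoldOverlapSketch {
  // Hypothetical stand-in for the sampler above: keep an item when its random
  // draw falls in [lb, ub), or outside that range when complement is true.
  def sampleRange[T](items: Seq[T], lb: Double, ub: Double,
                     complement: Boolean, seed: Long): Seq[T] = {
    val rng = new Random(seed) // a new generator per pass, seeded explicitly
    items.filter { _ =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 1000).toSeq

    // Same seed and same item order: both passes see identical draws, so each
    // item lands in exactly one of the two sets.
    val validation = sampleRange(data, 0.0, 0.25, complement = false, seed = 42L)
    val training = sampleRange(data, 0.0, 0.25, complement = true, seed = 42L)
    println(s"same seed, overlap size: ${validation.intersect(training).size}")
    println(s"same seed, covers all items: ${(validation ++ training).sorted == data}")

    // Different seed for the second pass: the draws no longer line up, so the
    // complement is taken against a different random sequence.
    val training2 = sampleRange(data, 0.0, 0.25, complement = true, seed = 43L)
    println(s"different seed, overlap size: ${validation.intersect(training2).size}")
  }
}
```

In this sketch the two sets are disjoint only when both passes draw the same
random sequence over the items in the same order. What I cannot tell is whether
that is guaranteed when the same RDD is evaluated twice.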


I’m not a machine learning guy, so I guess one of the following three must be
true:

1. Overlapping training and validation sets are actually allowed?
(That seems counterintuitive to me.)

2. I have misunderstood the code?

3. It’s a bug?

Can anyone explain this to me?

Best,  

--  
Nan Zhu
