I am training a boosted trees model on a couple of million input samples (with around 300 features) and am noticing that the input size of each stage grows with every iteration. For each new tree, the first step seems to be building the decision tree metadata, which does a .count() on the input data, so that is the step I've been using to track how the input size changes. Here is what I'm seeing:
count at DecisionTreeMetadata.scala:111

1. Input Size / Records: 726.1 MB / 1295620
2. Input Size / Records: 106.9 GB / 64780816
3. Input Size / Records: 160.3 GB / 97171224
4. Input Size / Records: 214.8 GB / 129680959
5. Input Size / Records: 268.5 GB / 162533424
...
Input Size / Records: 1912.6 GB / 1382017686
...

This step goes from taking less than 10 seconds to around 5 minutes by the 15th or so iteration. I'm not quite sure what could be causing this. I am passing a memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train; a rough sketch of the call is below.

Does anybody have some insight? Is this a bug, or could it be an error on my part?
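For reference, my setup looks roughly like the following. This is a minimal sketch, not my exact code: the input path and the boosting parameters (numIterations, maxDepth) are placeholders, and my real data is loaded differently, but it ends up as a memory-only cached RDD[LabeledPoint] just like this.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("gbt-input-size"))

// Placeholder path; in reality the data comes from elsewhere,
// but it ends up as an RDD[LabeledPoint] with ~1.3M rows and ~300 features.
val trainingData: RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training.libsvm")

// Memory-only cache, as described above.
trainingData.persist(StorageLevel.MEMORY_ONLY)

// Placeholder boosting parameters; my real settings differ,
// but the input-size growth appears regardless.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.maxDepth = 5

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```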