I have my data in two tables, colors and excluded_colors.

colors contains all colors; excluded_colors contains some colors that I wish
to exclude from my training set.

I am trying to split the data into a training and testing set and ensure
that the colors in excluded_colors are not in my training set but exist in
the testing set.
To achieve this, I did the following:

import org.apache.spark.sql.functions.{col, udf}

// Keep only the colors that do NOT appear in excluded_colors
var colors = spark.sql("""
   select colors.*
   from colors
   LEFT JOIN excluded_colors
   ON excluded_colors.color_id = colors.color_id
   where excluded_colors.color_id IS NULL
""")

// UDFs that flag a row as training (0) or testing (1)
val trainer: (Int => Int) = (arg: Int) => 0
val sqlTrainer = udf(trainer)
val tester: (Int => Int) = (arg: Int) => 1
val sqlTester = udf(tester)

// Random 70/30 split of the remaining colors
val splits = colors.randomSplit(Array(0.7, 0.3))
val train_colors = splits(0).select("color_id").withColumn("test", sqlTrainer(col("color_id")))
val test_colors = splits(1).select("color_id").withColumn("test", sqlTester(col("color_id")))


However, I'm realizing that by doing the above, the colors in
excluded_colors are completely ignored, because the query filters them out
before the split. They are not even in my testing set.

How can I split the data 70/30 while also ensuring that the colors in
excluded_colors are not in training but are present in testing?
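
One possible way to get this behaviour, as a rough sketch rather than a
tested solution: split only the non-excluded colors at random, then append
the excluded colors to the testing part. The names SplitWithExclusions,
split, trainFraction and seed are hypothetical, and the sketch assumes both
DataFrames have a color_id column and a Spark version that supports the
"left_anti" and "left_semi" join types (2.0 or later).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper, not part of Spark.
object SplitWithExclusions {
  /** Returns (train, test); rows whose color_id is in `excluded` only land in test. */
  def split(colors: DataFrame,
            excluded: DataFrame,
            trainFraction: Double = 0.7,
            seed: Long = 42L): (DataFrame, DataFrame) = {
    // Colors NOT listed in excluded: eligible for either side of the split
    val eligible = colors.join(excluded, Seq("color_id"), "left_anti")
    // Colors listed in excluded: always sent to the testing set
    val held = colors.join(excluded, Seq("color_id"), "left_semi")

    // Approximate random split of the eligible colors only
    val Array(trainPart, testPart) =
      eligible.randomSplit(Array(trainFraction, 1.0 - trainFraction), seed)

    // Flag rows the same way as the UDFs above: 0 = training, 1 = testing
    val train = trainPart.withColumn("test", lit(0))
    val test = testPart.union(held).withColumn("test", lit(1))
    (train, test)
  }
}
```

Usage would be something like
val (train_colors, test_colors) = SplitWithExclusions.split(spark.table("colors"), spark.table("excluded_colors")).
Note that the 70/30 ratio here applies to the non-excluded colors only, so
the overall testing share ends up somewhat above 30% once the excluded
colors are added; trainFraction would need to be raised to compensate.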
