I have my data in two tables, colors and excluded_colors. colors contains all colors; excluded_colors contains some colors that I wish to exclude from my training set.
I am trying to split the data into a training set and a testing set, and to ensure that the colors in excluded_colors are not in my training set but do exist in the testing set. To achieve this, I did the following:

    import org.apache.spark.sql.functions.{col, udf}

    // Keep only the colors that have no match in excluded_colors
    val colors = spark.sql("""
      select colors.*
      from colors
      LEFT JOIN excluded_colors
        ON excluded_colors.color_id = colors.color_id
      where excluded_colors.color_id IS NULL
    """)

    // UDFs that label a row as training (0) or testing (1)
    val trainer: (Int => Int) = (arg: Int) => 0
    val sqlTrainer = udf(trainer)
    val tester: (Int => Int) = (arg: Int) => 1
    val sqlTester = udf(tester)

    // Random 70/30 split of the remaining colors
    val splits = colors.randomSplit(Array(0.7, 0.3))
    val train_colors = splits(0).select("color_id").withColumn("test", sqlTrainer(col("color_id")))
    val test_colors = splits(1).select("color_id").withColumn("test", sqlTester(col("color_id")))

However, I'm realizing that by doing the above, the colors in excluded_colors are ignored completely. They are not even in my testing set.

How can I split the data 70/30 while also ensuring that the colors in excluded_colors are not in training but are present in testing?
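One possible approach, as a minimal sketch rather than a definitive answer: split only the non-excluded colors at random, then append the excluded colors to the testing half. The sample DataFrames below are hypothetical stand-ins for the colors and excluded_colors tables, and the names eligible and forcedTest are made up for illustration. It assumes Spark 2.x or later and that every color_id in excluded_colors also appears in colors.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the two tables in the question.
val colors = (1 to 100).toDF("color_id")
val excluded = Seq(3, 7, 42).toDF("color_id")

// left_anti keeps the colors with no match in excluded (candidates for either set).
val eligible = colors.join(excluded, Seq("color_id"), "left_anti")
// left_semi keeps the colors that do match (these must end up in testing only).
val forcedTest = colors.join(excluded, Seq("color_id"), "left_semi")

// Random split of the eligible colors; the weights are approximate, not exact.
val Array(trainPart, testPart) = eligible.randomSplit(Array(0.7, 0.3), seed = 42L)

// lit(0) / lit(1) label the rows without needing a UDF.
val trainColors = trainPart.select("color_id").withColumn("test", lit(0))
val testColors = testPart.select("color_id")
  .union(forcedTest.select("color_id"))
  .withColumn("test", lit(1))
```

Note that appending the excluded colors makes the testing set somewhat larger than 30% of all colors. If the 70/30 ratio has to hold over the whole table, the training weight passed to randomSplit would need to be raised to compensate for the rows that are forced into testing.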