Yes, Spark needs to create the RDD first (loading all the data) before it can take the sample. If you want to avoid that, you can split the files into two sets outside of Spark and then load only the sample set.

Thank you,
Dhiraj
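For what it's worth, the "split outside of Spark" step could be as simple as the sketch below (plain Python, no Spark required; the function name, paths, and fraction are just placeholders). Each line is randomly routed to a sample file or a rest file, so Spark only ever has to load the smaller sample file:

```python
import random

def split_file(path, sample_path, rest_path, fraction, seed=42):
    """Randomly route each line of `path` into a sample file
    (with probability `fraction`) or a rest file, so only the
    sample file needs to be loaded into Spark afterwards."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    with open(path) as src, \
         open(sample_path, "w") as sample, \
         open(rest_path, "w") as rest:
        for line in src:
            (sample if rng.random() < fraction else rest).write(line)
```

After the split you would point `sc.textFile(...)` at the sample file alone, instead of sampling the full RDD.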
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Behaviour-of-RDD-sampling-tp27052p27057.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.