One observation: when the fraction is large, say 50%-80%, sampling works as expected. But when the fraction is small, for example 5%, the sampled data contains rows that should have been filtered out.
The workaround is to materialize t1 first:

t1.cache
t1.count

These operations ensure that t1 is materialized, so the subsequent sample works correctly. This approach is tested and works fine, but I still don't know why SchemaRDD.sample causes the problem when the fraction is small. Any help is appreciated.

Hao Hao

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741p20835.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
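To illustrate why cache-then-count can help, here is a minimal, self-contained sketch in Python. This is not Spark's API; `LazyTable` is a hypothetical stand-in for a lazily evaluated dataset, and it only models the pattern: `cache` marks the table for caching, the next action (`count`) forces evaluation and materializes the rows, and `sample` then reads from the stable, cached copy instead of re-running the upstream computation.

```python
import random

class LazyTable:
    """Toy model of a lazily evaluated table (illustrative only, not
    Spark's API). Without caching, every action would re-run the
    upstream computation from scratch."""

    def __init__(self, compute):
        self._compute = compute          # function producing the rows
        self._cache_requested = False
        self._materialized = None        # rows, once materialized

    def cache(self):
        # Only marks the table for caching; nothing is computed yet.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._materialized is not None:
            return self._materialized    # serve from the cached copy
        rows = list(self._compute())     # run the upstream computation
        if self._cache_requested:
            self._materialized = rows    # materialize on first action
        return rows

    def count(self):
        # An action: forces evaluation (and materialization if cached).
        return len(self._rows())

    def sample(self, fraction, seed=0):
        # Bernoulli sampling over whatever _rows() returns.
        rng = random.Random(seed)
        return [r for r in self._rows() if rng.random() < fraction]

# The workaround pattern: cache, force materialization, then sample.
t1 = LazyTable(lambda: (x for x in range(1000) if x % 2 == 0))
t1.cache()
n = t1.count()            # forces evaluation; rows are now cached
s = t1.sample(0.05)       # samples from the materialized rows
assert all(x % 2 == 0 for x in s)   # sampled rows respect the filter
```

In real Spark the equivalent is `t1.cache()` followed by an action such as `t1.count()` before calling `sample`, exactly as in the workaround above; the sketch only shows the shape of that interaction.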