One observation: if the fraction is large, say 50%-80%, sampling works as
expected. But if the fraction is small, for example 5%, the sampled data
contains wrong rows that should have been filtered out.

The workaround is to materialize t1 first:
t1.cache
t1.count

These operations ensure that t1 is materialized, so that the subsequent
sample works correctly.
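For context, the full workaround might look like the following sketch (the query, filter, fraction, and seed are hypothetical placeholders, not verbatim from my code; this targets the Spark 1.x SchemaRDD API):

```scala
// Build the filtered SchemaRDD (hypothetical query and table name).
val t1 = sqlContext.sql("SELECT * FROM events WHERE status = 'ok'")

// Materialize t1 before sampling: mark it cached, then force evaluation.
t1.cache()
t1.count()

// With t1 materialized, sampling a small fraction no longer returns
// rows that should have been filtered out.
val sampled = t1.sample(withReplacement = false, fraction = 0.05, seed = 42L)
```

The key point is that `count()` is an action, so it forces the whole lineage to run and the cached result to be populated; `sample` then draws from the cached rows instead of recomputing the filter.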

This approach is tested and works fine, but I still don't know why
SchemaRDD.sample causes the problem when the fraction is small.

Any help is appreciated.

Hao


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741p20835.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
