Here is a more cleaned-up version, which can be used in ./sbt/sbt hive/console
to easily reproduce this issue:
sql("SELECT * FROM src WHERE key % 2 = 0")
  .sample(withReplacement = false, fraction = 0.05)
  .registerTempTable("sampled")

println(table("sampled").queryExecution)
val query = sql("
One observation is that if the fraction is large, say 50% to 80%, sampling is
good and everything runs as expected. But if the fraction is small, for
example 5%, the sampled data contains wrong rows that should have been
filtered out.
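A quick sanity check on the snippet above would surface this (a sketch: it assumes the "sampled" temp table registered earlier, and that the key column is the first field of the row):

```scala
// Every row surviving the sample should still satisfy key % 2 = 0.
// Using field index 0 for "key" is an assumption about the src table's schema.
val leaked = table("sampled").collect().filter(row => row.getInt(0) % 2 != 0)

// With a small fraction and the bug present, this prints a non-zero count.
println(s"rows violating the WHERE clause: ${leaked.length}")
```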
The workaround is materializing t1 first:

t1.cache()
t1.count()

These operations make the subsequent sample return only correctly filtered
rows.
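Put together, the workaround looks roughly like this (a sketch using the definitions from the quoted message below; `hql` is the Spark 1.x HiveContext method shown there):

```scala
// Force the filtered RDD to materialize before sampling, so the sample
// is drawn from concrete, already-filtered rows.
val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
t1.cache()  // mark the RDD for caching
t1.count()  // an action, which actually computes and caches the rows

// Sampling the materialized RDD avoids the wrong-rows symptom.
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
tb1.registerTempTable("t1_tmp")
```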
Update:

t1 itself is good: after collecting t1, I find that all rows are OK
(is_new = 0). It is only after sampling that some rows appear with
is_new = 1, which should have been filtered out by the WHERE clause.
Hi,
Can you clean up the code a little bit more? It's hard to read what's going
on. You can use pastebin or gist to share the code.
On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren wrote:
>
> Hi,
>
> I am using Spark SQL on the 1.2.1 branch. The problem comes from the following
> 4-line code:
>
> *val t1: SchemaR
Hi,
I am using Spark SQL on the 1.2.1 branch. The problem comes from the following
4-line code:

val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
tb1.registerTempTable("t1_tmp")
(hiveContext sql "sele
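The fourth line is cut off in this archive, but the symptom can be checked against the temp table directly. A sketch (the `seed` parameter on `sample` makes runs deterministic, which helps when chasing a sampling bug; the column name `is_new` is taken from the message above):

```scala
// Fixing the sampler's seed makes any bad rows reproducible across runs.
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05, seed = 42L)
tb1.registerTempTable("t1_tmp")

// Count rows that slipped past the WHERE clause; non-zero means the bug hit.
val bad = hiveContext.sql("select count(*) from t1_tmp where is_new = 1")
bad.collect().foreach(println)
```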