Re: SchemaRDD.sample problem

2014-12-23 Thread Cheng Lian
Here is a cleaned-up version that can be used in |./sbt/sbt hive/console| to easily reproduce this issue: |sql("SELECT * FROM src WHERE key % 2 = 0").sample(withReplacement = false, fraction = 0.05).registerTempTable("sampled") println(table("sampled").queryExecution) val query = sql("

Re: SchemaRDD.sample problem

2014-12-23 Thread Hao Ren
One observation: if the fraction is large, say 50% - 80%, sampling works and everything runs as expected. But if the fraction is small, for example 5%, the sampled data contains wrong rows that should have been filtered out. The workaround is to materialize t1 first: t1.cache t1.count These operations make

Re: SchemaRDD.sample problem

2014-12-23 Thread Hao Ren
Update: t1 is good. After collecting t1, I find that all rows are OK (is_new = 0). Only after sampling are there some rows where is_new = 1, which should have been filtered out by the WHERE clause. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-

Re: SchemaRDD.sample problem

2014-12-18 Thread madhu phatak
Hi, Can you clean up the code a little bit? It's hard to read what's going on. You can use pastebin or gist to share the code. On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren wrote: > > Hi, > > I am using SparkSQL on the 1.2.1 branch. The problem comes from the following > 4-line code: > > *val t1: SchemaR

SchemaRDD.sample problem

2014-12-17 Thread Hao Ren
Hi, I am using SparkSQL on the 1.2.1 branch. The problem comes from the following 4-line code: *val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0" val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05) tb1.registerTempTable("t1_tmp") (hiveContext sql "sele