Here is a more cleaned-up version, which can be used in ./sbt/sbt hive/console
to easily reproduce this issue:
sql("SELECT * FROM src WHERE key % 2 = 0")
  .sample(withReplacement = false, fraction = 0.05)
  .registerTempTable("sampled")

println(table("sampled").queryExecution)
val query = sql("
One observation is that if the fraction is large, say 50% to 80%, sampling is
good and everything runs as expected. But if the fraction is small, for
example 5%, the sampled data contains wrong rows that should have been
filtered out.
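A quick sanity check on the snippet above would surface this (a sketch: it assumes the "sampled" temp table registered earlier, and that the key column is the first field of the row):

```scala
// Every row surviving the sample should still satisfy key % 2 = 0.
// Using field index 0 for "key" is an assumption about the src table's schema.
val leaked = table("sampled").collect().filter(row => row.getInt(0) % 2 != 0)

// With a small fraction and the bug present, this prints a non-zero count.
println(s"rows violating the WHERE clause: ${leaked.length}")
```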
The workaround is materializing t1 first:

t1.cache()
t1.count()

These operations make the subsequent sample return only correctly filtered
rows.
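Put together, the workaround looks roughly like this (a sketch using the definitions from the quoted message below; `hql` is the Spark 1.x HiveContext method shown there):

```scala
// Force the filtered RDD to materialize before sampling, so the sample
// is drawn from concrete, already-filtered rows.
val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
t1.cache()  // mark the RDD for caching
t1.count()  // an action, which actually computes and caches the rows

// Sampling the materialized RDD avoids the wrong-rows symptom.
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
tb1.registerTempTable("t1_tmp")
```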
Update:

t1 itself is good: after collecting t1, I find that all rows are OK
(is_new = 0). It is only after sampling that some rows appear with
is_new = 1, which should have been filtered out by the WHERE clause.
Hi,
Can you clean up the code a little bit more? It's hard to read what's going
on. You can use pastebin or gist to share the code.
On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren wrote:
>
> Hi,
>
> I am using Spark SQL on the 1.2.1 branch. The problem comes from the following
> 4-line code:
>
> *val t1: SchemaR
Hi,
I am using Spark SQL on the 1.2.1 branch. The problem comes from the following
4-line code:

val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
tb1.registerTempTable("t1_tmp")
(hiveContext sql "sele
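The fourth line is cut off in this archive, but the symptom can be checked against the temp table directly. A sketch (the `seed` parameter on `sample` makes runs deterministic, which helps when chasing a sampling bug; the column name `is_new` is taken from the message above):

```scala
// Fixing the sampler's seed makes any bad rows reproducible across runs.
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05, seed = 42L)
tb1.registerTempTable("t1_tmp")

// Count rows that slipped past the WHERE clause; non-zero means the bug hit.
val bad = hiveContext.sql("select count(*) from t1_tmp where is_new = 1")
bad.collect().foreach(println)
```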