Can you post the explain too?
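For reference, the plan can be pulled from the same Beeline/spark-sql session by prefixing the statement with EXPLAIN; EXTENDED also prints the parsed, analyzed, and optimized plans, not just the physical one. A minimal sketch against the query quoted below:

    -- EXPLAIN EXTENDED prints the plans instead of executing the query;
    -- the physical plan should show how the Map columns are read.
    explain extended
    select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as avg
    from test
    where date between '2014-04-01' and '2014-04-30'
    group by meta['is_bad'];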
On Tue, May 12, 2015 at 12:11, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> Hi,
>
> I have a SQL query on tables containing big Map columns (thousands of
> keys), and I found it to be very slow.
>
> select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as avg
> from test
> where date between '2014-04-01' and '2014-04-30'
> group by meta['is_bad']
>
> =>
>
> +---------+-----------+-----------------------+
> | is_bad  | count     | avg                   |
> +---------+-----------+-----------------------+
> | 0       | 17024396  | 0.16257395850742645   |
> | 1       | 179729    | -0.37626256661125485  |
> | 2       | 28128     | 0.11674427263203344   |
> | 3       | 116327    | -0.6398689187187386   |
> | 4       | 87715     | -0.5349632960030563   |
> | 5       | 169771    | 0.40812641191854626   |
> | 6       | 542447    | 0.5238256418341465    |
> | 7       | 160324    | 0.29442847034840386   |
> | 8       | 2099      | -0.9165701665162977   |
> | 9       | 3104      | 0.3845685004598235    |
> +---------+-----------+-----------------------+
> 10 rows selected (130.5 seconds)
>
> The total number of rows is less than 20M. Why is it so slow?
>
> I'm running on Spark 1.4.0-SNAPSHOT with 100 executors, each having 4 GB
> of RAM and 2 CPU cores.
>
> Looks like https://issues.apache.org/jira/browse/SPARK-5446 is still
> open; when can we have it fixed? :)
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
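While SPARK-5446 is still open, one possible stopgap is to materialize just the map entries the query touches into flat columns, so the scan no longer has to deserialize whole multi-thousand-key Maps per row. A minimal sketch, assuming a HiveContext that supports CREATE TABLE AS SELECT; the staging table name test_flat is made up for illustration:

    -- Hypothetical one-off flattening step: extract only the two entries used.
    create table test_flat as
    select date, meta['is_bad'] as is_bad, nvar['var1'] as var1
    from test;

    -- The aggregation then scans narrow scalar columns instead of full Maps.
    select is_bad, count(*) as count, avg(var1) as avg
    from test_flat
    where date between '2014-04-01' and '2014-04-30'
    group by is_bad;

Whether this helps depends on where the time actually goes, which is why the explain output requested above matters.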