I am working with Spark SQL and the Thrift server. I ran into an
interesting bug, and I am curious about what information or testing I can
provide to help narrow it down.
My setup is as follows:
Hive 0.12 with a table that has lots of columns (50+), stored as RCFile.
Spark-1.1.0-SNAPSHOT with Hive built in (and the Thrift server).
My query selects only one STRING column, but filters the rows on other
columns.
Types:
col1 = STRING
col2 = STRING
col3 = STRING
col4 = Partition Field (TYPE STRING)
Queries:
cache table table1;
--Run some other queries on other data
select col1 from table1
where col2 = 'foo' and col3 = 'bar' and col4 = 'foobar'
and col1 is not null limit 100
Fairly simple query.
When I run this in SQuirreL SQL I get no results. When I remove
'and col1 is not null' I get 100 rows of <null>.
When I run this in beeline (the one that ships with spark-1.1.0-SNAPSHOT) I
get no results, and when I remove 'and col1 is not null' I get 100 rows of
<null>.
Note: In both cases I ran CACHE TABLE TABLE1 first, before any queries, and
then ran some other queries (i.e. on other columns) before running this one.
That seemed interesting to me...
So I went to the spark-shell to determine whether it was a Spark issue or a
Thrift issue.
I ran:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
cacheTable("table1")
Then I ran the same "other" queries and got results, and then I ran the
query above and got results as expected.
Interestingly enough, if I don't cache the table through 'cache table table1'
in Thrift, I get results for all queries. If I uncache the table, I start
getting results again.
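For anyone who wants to reproduce the cached vs. uncached comparison from the
spark-shell, it could be scripted roughly as below. This is only a sketch and
not the exact session that was run: nullStats is a hypothetical helper, the
table and column names are the placeholders used in this post, and hql is
assumed to be the query entry point on this snapshot (other builds may use
sql instead).

```scala
// Sketch for spark-shell; `sc` is the SparkContext the shell provides.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Same filter as the query above, without the "col1 is not null" condition,
// so that any <null> values in col1 show up in the result.
val query =
  """select col1 from table1
    |where col2 = 'foo' and col3 = 'bar' and col4 = 'foobar'
    |limit 100""".stripMargin

// Hypothetical helper: returns (total rows, rows where col1 came back null).
def nullStats(): (Int, Int) = {
  // Assumption: hql is available on this build; swap in sql if it is not.
  val rows = hiveContext.hql(query).collect()
  (rows.length, rows.count(row => row(0) == null))
}

// Run once with the table cached...
hiveContext.cacheTable("table1")
val cached = nullStats()

// ...and once after dropping it from the cache.
hiveContext.uncacheTable("table1")
val uncached = nullStats()

println(s"cached:   rows=${cached._1}, null col1=${cached._2}")
println(s"uncached: rows=${uncached._1}, null col1=${uncached._2}")
```

If the two lines differ in the null count, the in-memory cache is the likely
culprit; if they match in the shell but not through beeline, that points more
towards the Thrift server path.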
I hope I was clear enough here; I am happy to help however I can.
John