Hello, Short description: A SQL query sent via Thrift server returns an inexplicable response. Running the same (exact same) query inside Apache Zeppelin or submitting a job returns the correct result. Furthermore, a similar table returns the correct response in both cases.
Details: I'm using Spark 2.0.0 on a Cloudera 5.7 distribution. I did not test in on Spark 2.1.0 but if helpful I can install it and test. I have 2 Hive metastore tables which are saved by a spark batch job. orders_2years is processed and saveAsTable() orders_30days is filtered from _2years based on a column containing date and saveAsTable() (same batch job creates both) I checked and both tables contain this record and it is the only record that satisfies all the conditions in the query: | customer_id | order_id | doc_id | category | vendor_id| +-------------+----------+---------+----------+----------+ | 916339 | 25144502 | 5596579 | 1455 | 1 | The exact same query is run using a PHP application connecting to Thrift and via Zeppelin/Spark using sqlContext.sql("") The query on table _30days, in Zeppelin > good result The query on table _30days, via Thrift > bad result The query on table _2years, via Thrift > good result Furthermore, changing _30days in _2years in b and leaving _30days to create a > good result. The query is: SELECT a.customer_id_custom, collect_list(b.id_comanda) AS b__orderIds, collect_list(b.doc_id) AS b__docIds, collect_list(b.category_id) AS b__cat, collect_list(b.vendor_id) AS b__vendor FROM ((SELECT customer_id AS customer_id_custom FROM orders_30days WHERE 1 = 1 AND category_id IN (1455) AND vendor_id IN (1) AND (fullDate > '2017-06-08 12:07' AND fullDate < '2017-07-07 12:07') GROUP BY customer_id HAVING count(1) >= 1 AND SUM(total_price_with_vat) BETWEEN 1 AND 999) a ) INNER JOIN orders_30days AS b ON b.customer_id = a.customer_id_custom AND 1 = 1 AND b.category_id IN (1455) AND b.doc_id IN (5596579) AND b.vendor_id IN (1) AND (b.fullDate > '2017-06-08 12:07' AND b.fullDate < '2017-07-07 12:07') WHERE 1 = 1 GROUP BY a.customer_id_custom If you're wondering why I'm so specific with my query, the original is a lot more complex. For example a is actually obtained by joining 5 tables but I tried simplifying it as much as I could while obtaining the same effect. Good result is: | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor | +--------------------+---------------+-----------+--------+-----------+ | 916339 | [25144502] | [5596579] | [1455] | [1] | Bad result: | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor | +--------------------+---------------+-----------+--------+-----------+ | 916339 | [null] | [1] | [null] | [1455] | Notice how some columns appear to be misplaced while others just return null