Query via Spark Thrift Server return wrong result.

Valentin Ursu Tue, 11 Jul 2017 10:19:06 -0700

Hello,

Short description: A SQL query sent via Thrift server returns an
inexplicable response. Running the same (exact same) query inside Apache
Zeppelin or submitting a job returns the correct result. Furthermore, a
similar table returns the correct response in both cases.


Details:
I'm using Spark 2.0.0 on a Cloudera 5.7 distribution. I did not test in on
Spark 2.1.0 but if helpful I can install it and test.

I have 2 Hive metastore tables which are saved by a spark batch job.
orders_2years is processed and saveAsTable()
orders_30days is filtered from _2years based on a column containing date
and saveAsTable() (same batch job creates both)

I checked and both tables contain this record and it is the only record
that satisfies all the conditions in the query:

| customer_id | order_id | doc_id | category | vendor_id|
+-------------+----------+---------+----------+----------+
| 916339 | 25144502 | 5596579 | 1455 | 1 |

The exact same query is run using a PHP application connecting to Thrift
and via Zeppelin/Spark using sqlContext.sql("")

The query on table _30days, in Zeppelin > good result
The query on table _30days, via Thrift > bad result
The query on table _2years, via Thrift > good result

Furthermore, changing _30days in _2years in b and leaving _30days to create
a > good result.

The query is:

SELECT a.customer_id_custom,
collect_list(b.id_comanda) AS b__orderIds,
collect_list(b.doc_id) AS b__docIds,
collect_list(b.category_id) AS b__cat,
collect_list(b.vendor_id) AS b__vendor
FROM ((SELECT customer_id AS customer_id_custom
FROM orders_30days
WHERE 1 = 1 AND category_id IN (1455) AND vendor_id IN (1) AND
(fullDate > '2017-06-08 12:07' AND
fullDate < '2017-07-07 12:07')
GROUP BY customer_id
HAVING count(1) >= 1 AND
SUM(total_price_with_vat) BETWEEN 1 AND 999) a )
INNER JOIN orders_30days AS b
ON b.customer_id = a.customer_id_custom AND
1 = 1 AND b.category_id IN (1455) AND
b.doc_id IN (5596579) AND b.vendor_id IN (1) AND
(b.fullDate > '2017-06-08 12:07' AND
b.fullDate < '2017-07-07 12:07')
WHERE 1 = 1
GROUP BY a.customer_id_custom

If you're wondering why I'm so specific with my query, the original is a
lot more complex. For example a is actually obtained by joining 5 tables
but I tried simplifying it as much as I could while obtaining the same
effect.

Good result is:

| customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
+--------------------+---------------+-----------+--------+-----------+
| 916339 | [25144502] | [5596579] | [1455] | [1] |

Bad result:

| customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
+--------------------+---------------+-----------+--------+-----------+
| 916339 | [null] | [1] | [null] | [1455] |

Notice how some columns appear to be misplaced while others just return null

Query via Spark Thrift Server return wrong result.

Reply via email to