Re: Bug in ElasticSearch and Spark SQL: Using SQL to query out data, from JSON documents is totally wrong!

Costin Leau Wed, 11 Feb 2015 01:46:57 -0800


Aris, if you encountered a bug, it's best to raise an issue with the 
es-hadoop/spark project, namely here [1].


When using SparkSQL, the underlying data needs to be present; this is
mentioned in the docs as well [2]. As for the order, that does look like a
bug and shouldn't occur. Note that the keys are re-arranged because, in JSON,
an object/map doesn't guarantee order. To give some predictability, the keys
are arranged alphabetically.

I suggest continuing this discussion in the issue tracker mentioned at [1].

[1] https://github.com/elasticsearch/elasticsearch-hadoop/issues
[2] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql

On 2/11/15 3:17 AM, Aris wrote:

I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and
Spark/SparkSQL 1.2.0, on Costin Leau's advice.

I want to query ElasticSearch for a bunch of JSON documents from within
SparkSQL, and then use a SQL query to simply select a column, which is
actually a JSON key. These are the normal things that SparkSQL does using the
SQLContext.jsonFile(filePath) facility. The difference is that I am using the
ElasticSearch container.

The big problem: when I do something like

SELECT jsonKeyA from tempTable;

I actually get the WRONG KEY out of the JSON documents! I discovered that if
I have JSON keys physically in the order D, C, B, A in the JSON documents, the
ElasticSearch connector discovers those keys BUT then sorts them
alphabetically as A, B, C, D. So when I SELECT A from tempTable, I actually
get column D (because the physical JSONs had key D in the first position).
This only happens when reading from ElasticSearch with SparkSQL.
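
The mismatch described above can be reproduced outside Spark. The following is
a minimal sketch in plain Python, not the connector's actual code. It assumes
that the schema is sorted alphabetically while each row keeps its values in
document order; that mechanism is a guess based on the symptoms, and the
document contents and helper names are hypothetical.

```python
import json

# Hypothetical document whose keys are physically stored in the order D, C, B, A.
doc = json.loads('{"D": "d-value", "C": "c-value", "B": "b-value", "A": "a-value"}')

# Assumption: the inferred schema lists the column names alphabetically ...
schema = sorted(doc.keys())   # A, B, C, D
# ... while the row keeps its values in the order they appear in the document.
row = list(doc.values())      # values for D, C, B, A


def select_by_position(column):
    """Resolve the column through its index in the schema (the suspected faulty path)."""
    return row[schema.index(column)]


def select_by_name(column):
    """Resolve the column by key, independent of the physical key order."""
    return doc[column]


wrong = select_by_position("A")   # picks the first value in the row, which belongs to D
right = select_by_name("A")       # picks the value that actually belongs to A
```

Under these assumptions, the positional lookup for A returns the value stored
under D, which matches the behaviour reported here.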

It gets much worse: when a key is missing from one of the documents and that
key should be NULL, the whole application actually crashes and gives me a
java.lang.IndexOutOfBoundsException -- the schema that is inferred is totally
screwed up.

In the above example with physical JSONs containing keys in the order
D, C, B, A, if one of the JSON documents is missing the key/column I am
querying for, I get that java.lang.IndexOutOfBoundsException.
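
The missing-key failure can be sketched the same way. This is again plain
Python under the same assumed mechanism (alphabetical schema, rows in document
order), with hypothetical data. Python raises IndexError where the JVM would
raise java.lang.IndexOutOfBoundsException. The last line shows one possible
way to avoid the failure: align each row to the schema by name and fill
missing keys with None.

```python
# Assumed alphabetical schema, inferred from documents that contain all four keys.
schema = ["A", "B", "C", "D"]

# Hypothetical document that lacks the key "D".
doc = {"C": "c-value", "B": "b-value", "A": "a-value"}

# The row only holds three values, but the schema expects four columns.
row = list(doc.values())


def select_by_position(column):
    """Resolve the column through its index in the schema."""
    return row[schema.index(column)]


try:
    select_by_position("D")   # index 3 does not exist in a three-element row
    failed = False
except IndexError:
    failed = True

# Aligning the row to the schema by name fills the missing column with None.
aligned = [doc.get(name) for name in schema]
```

With the aligned row, a positional lookup for D yields None (the NULL expected
here) instead of an out-of-bounds error.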

I am using the BUILD-SNAPSHOT, as Costin advised, because otherwise I couldn't
build the elasticsearch-spark project.

Any clues here? Any fixes?


--
Costin



