Thank you Costin. I wrote to the user list and got no replies there. I will take this exact message and file it on the GitHub issue tracker.
One quick clarification: I read the elasticsearch documentation thoroughly, and I saw the warning about structured vs. unstructured data, but it's a bit confusing. Normally, when SparkSQL sees a lot of JSON documents with the "same schema" but a few of those documents are missing a key X, a query that SELECTs X simply comes back as NULL for the JSON documents that don't have key X. Are you saying this behavior does not happen when I connect to ElasticSearch? Must every single JSON document contain every single key, or else the application crashes? So if a single JSON document in the elasticsearch data is missing key X, the application throws an Exception?

Thank you!
Aris

On Wed, Feb 11, 2015 at 1:21 AM, Costin Leau <costin.l...@elasticsearch.com> wrote:
> Aris, if you encountered a bug, it's best to raise an issue with the
> es-hadoop/spark project, namely here [1].
>
> When using SparkSQL the underlying data needs to be present - this is
> mentioned in the docs as well [2]. As for the order, that does look like
> a bug and shouldn't occur. Note the reason why the keys are re-arranged
> is that in JSON, the object/map doesn't guarantee order. To give some
> predictability, the keys are arranged alphabetically.
>
> I suggest continuing this discussion in the issue tracker mentioned at [1].
>
> [1] https://github.com/elasticsearch/elasticsearch-hadoop/issues
> [2] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql
>
> On 2/11/15 3:17 AM, Aris wrote:
>> I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and
>> Spark/SparkSQL 1.2.0, on Costin Leau's advice.
>>
>> I want to query ElasticSearch for a bunch of JSON documents from within
>> SparkSQL, and then use a SQL query to simply select a column, which is
>> actually a JSON key -- normal things that SparkSQL does using the
>> SQLContext.jsonFile(filePath) facility.
>> The difference is that I am using the ElasticSearch connector.
>>
>> The big problem: when I do something like
>>
>> SELECT jsonKeyA FROM tempTable;
>>
>> I actually get the WRONG KEY out of the JSON documents! I discovered that
>> if I have JSON keys physically in the order D, C, B, A in the JSON
>> documents, the elasticsearch connector discovers those keys BUT then sorts
>> them alphabetically as A, B, C, D -- so when I SELECT A FROM tempTable, I
>> actually get column D (because the physical JSONs had key D in the first
>> position). This only happens when reading from elasticsearch into SparkSQL.
>>
>> It gets much worse: when a key is missing from one of the documents and
>> that key should be NULL, the whole application actually crashes with a
>> java.lang.IndexOutOfBoundsException -- the schema that is inferred is
>> totally screwed up.
>>
>> In the above example, with physical JSONs containing keys in the order
>> D, C, B, A, if one of the JSON documents is missing the key/column I am
>> querying for, I get that java.lang.IndexOutOfBoundsException.
>>
>> I am using the BUILD-SNAPSHOT because otherwise I couldn't build the
>> elasticsearch-spark project -- Costin said so.
>>
>> Any clues here? Any fixes?
>
> --
> Costin
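[Editor's note: the semantics discussed in this thread can be illustrated without Spark or Elasticsearch at all. The following is a plain-Python sketch, not the connector's actual code: it infers a union schema from heterogeneous JSON documents, sorts the keys alphabetically (as Costin says the connector does), and contrasts name-based row lookup (what `SQLContext.jsonFile` effectively does: missing keys become NULL) with naive position-based mapping, which reproduces both symptoms Aris describes -- the wrong column and the out-of-bounds crash.]

```python
import json

# Three documents whose keys are physically in the order D, C, B, A;
# the last document is missing key "A" entirely.
docs_json = [
    '{"D": "d1", "C": "c1", "B": "b1", "A": "a1"}',
    '{"D": "d2", "C": "c2", "B": "b2", "A": "a2"}',
    '{"D": "d3", "C": "c3", "B": "b3"}',
]
docs = [json.loads(d) for d in docs_json]

# Schema inference: the union of all keys, arranged alphabetically
# (JSON objects are unordered, so sorting gives a predictable schema).
schema = sorted({k for d in docs for k in d})  # ['A', 'B', 'C', 'D']

# Name-based semantics: each value is looked up by key, so a missing
# key simply becomes None (NULL in SQL terms).
rows_by_name = [tuple(d.get(k) for k in schema) for d in docs]
print(rows_by_name[2])  # (None, 'b3', 'c3', 'd3')

# Position-based semantics (what the reported bug looks like): pair the
# alphabetical schema with values in their physical document order.
def row_by_position(d, schema):
    values = list(d.values())  # physical order: D, C, B, A
    return tuple(values[i] for i in range(len(schema)))

# "SELECT A" (index 0) now returns the value of key D...
print(row_by_position(docs[0], schema)[0])  # d1

# ...and the short document blows up, analogous to the reported
# java.lang.IndexOutOfBoundsException.
try:
    row_by_position(docs[2], schema)
except IndexError as e:
    print("short document raises", type(e).__name__)  # IndexError
```

Under name-based mapping the NULL behavior Aris expects falls out naturally; only a positional pairing of an alphabetized schema against physically ordered values produces both the column swap and the crash.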