Thank you Costin. I wrote to the user list and got no replies there. I will take this exact message and file it on the GitHub issue tracker.
One quick clarification: I read the elasticsearch documentation thoroughly, and I saw the warning about structured vs. unstructured data, but it's a bit confusing. Normally, when SparkSQL sees a lot of JSON documents with the "same schema" but a few of those documents are missing a key X, a query that SELECTs X simply comes back as NULL for the JSON documents that don't have key X. Are you saying this behavior does not happen when I connect to ElasticSearch? Must every single JSON document contain every single key, or else the application crashes? So if a single JSON document in the elasticsearch data is missing key X, the application throws an Exception?

Thank you!
Aris

On Wed, Feb 11, 2015 at 1:21 AM, Costin Leau <costin.l...@elasticsearch.com> wrote:
> Aris, if you encountered a bug, it's best to raise an issue with the
> es-hadoop/spark project, namely here [1].
>
> When using SparkSQL the underlying data needs to be present - this is
> mentioned in the docs as well [2]. As for the order, that does look like
> a bug and shouldn't occur. Note the reason why the keys are re-arranged
> is that in JSON, the object/map doesn't guarantee order. To give some
> predictability, the keys are arranged alphabetically.
>
> I suggest continuing this discussion in the issue tracker mentioned at [1].
>
> [1] https://github.com/elasticsearch/elasticsearch-hadoop/issues
> [2] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql
>
> On 2/11/15 3:17 AM, Aris wrote:
>> I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and
>> Spark/SparkSQL 1.2.0, on Costin Leau's advice.
>>
>> I want to query ElasticSearch for a bunch of JSON documents from within
>> SparkSQL, and then use a SQL query to simply select a column, which is
>> actually a JSON key -- normal things that SparkSQL does using the
>> SQLContext.jsonFile(filePath) facility.
>> The difference is that I am using the ElasticSearch connector.
>>
>> The big problem: when I do something like
>>
>> SELECT jsonKeyA FROM tempTable;
>>
>> I actually get the WRONG KEY out of the JSON documents! I discovered that
>> if I have JSON keys physically in the order D, C, B, A in the JSON
>> documents, the elasticsearch connector discovers those keys BUT then sorts
>> them alphabetically as A, B, C, D -- so when I SELECT A FROM tempTable, I
>> actually get column D (because the physical JSONs had key D in the first
>> position). This only happens when reading from elasticsearch into SparkSQL.
>>
>> It gets much worse: when a key is missing from one of the documents and
>> that key should be NULL, the whole application actually crashes with a
>> java.lang.IndexOutOfBoundsException -- the schema that is inferred is
>> totally screwed up.
>>
>> In the above example, with physical JSONs containing keys in the order
>> D, C, B, A, if one of the JSON documents is missing the key/column I am
>> querying for, I get that java.lang.IndexOutOfBoundsException.
>>
>> I am using the BUILD-SNAPSHOT because otherwise I couldn't build the
>> elasticsearch-spark project -- Costin said so.
>>
>> Any clues here? Any fixes?
>
> --
> Costin
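[Editor's note: the semantics discussed in this thread can be illustrated without Spark or Elasticsearch at all. The following is a plain-Python sketch, not the connector's actual code: it infers a union schema from heterogeneous JSON documents, sorts the keys alphabetically (as Costin says the connector does), and contrasts name-based row lookup (what `SQLContext.jsonFile` effectively does: missing keys become NULL) with naive position-based mapping, which reproduces both symptoms Aris describes -- the wrong column and the out-of-bounds crash.]

```python
import json

# Three documents whose keys are physically in the order D, C, B, A;
# the last document is missing key "A" entirely.
docs_json = [
    '{"D": "d1", "C": "c1", "B": "b1", "A": "a1"}',
    '{"D": "d2", "C": "c2", "B": "b2", "A": "a2"}',
    '{"D": "d3", "C": "c3", "B": "b3"}',
]
docs = [json.loads(d) for d in docs_json]

# Schema inference: the union of all keys, arranged alphabetically
# (JSON objects are unordered, so sorting gives a predictable schema).
schema = sorted({k for d in docs for k in d})  # ['A', 'B', 'C', 'D']

# Name-based semantics: each value is looked up by key, so a missing
# key simply becomes None (NULL in SQL terms).
rows_by_name = [tuple(d.get(k) for k in schema) for d in docs]
print(rows_by_name[2])  # (None, 'b3', 'c3', 'd3')

# Position-based semantics (what the reported bug looks like): pair the
# alphabetical schema with values in their physical document order.
def row_by_position(d, schema):
    values = list(d.values())  # physical order: D, C, B, A
    return tuple(values[i] for i in range(len(schema)))

# "SELECT A" (index 0) now returns the value of key D...
print(row_by_position(docs[0], schema)[0])  # d1

# ...and the short document blows up, analogous to the reported
# java.lang.IndexOutOfBoundsException.
try:
    row_by_position(docs[2], schema)
except IndexError as e:
    print("short document raises", type(e).__name__)  # IndexError
```

Under name-based mapping the NULL behavior Aris expects falls out naturally; only a positional pairing of an alphabetized schema against physically ordered values produces both the column swap and the crash.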