Hi All,

I connected pyspark under Zeppelin to my Elasticsearch DB and I am able to do this:

    %pyspark
    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={"es.resource": "logstash-uni-*"})
    es_rdd.toDF().registerTempTable("elk")

and then:

    %sql
    select * from elk

What I get back is a table with just two columns. One is some object ID, I guess, and the other is a string holding a mapping of all the fields in the ES record to their values ("Map(@timestamp -> 2016-03-16T14:31:12.861Z, host -> ...").

My question is: how do I create a Spark table, or even just a Python object (probably a dict), that lets me access each field separately?
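To illustrate the kind of access I'm after, here is a minimal sketch. It assumes the values in es_rdd come back as plain Python dicts once the LinkedMapWritable is converted (which seems to be what happens), and that the documents share the same set of fields; the field names used here (host, @timestamp) and the elk_flat table name are just examples:

    %pyspark
    from pyspark.sql import Row

    # Each element of es_rdd is (doc_id, fields), where fields looks like a
    # plain Python dict after the LinkedMapWritable conversion.
    first_id, first_fields = es_rdd.first()
    print(first_fields["host"])  # access a single field from the dict

    # Build a DataFrame with one column per field by turning each dict into
    # a Row. NOTE: Row(**d) assumes every document has the same fields;
    # documents with missing or extra fields would need the schema
    # reconciled first.
    flat_df = es_rdd.map(lambda kv: Row(**kv[1])).toDF()
    flat_df.registerTempTable("elk_flat")

If that works, something like %sql select `@timestamp`, host from elk_flat should then return one column per field (the backticks are needed because of the @ in the name). But I'd be happy to hear if there is a more direct way to do this.

Thanks,
Oren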
%pyspark es_rdd = sc.newAPIHadoopRDD( inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat", keyClass="org.apache.hadoop.io.NullWritable", valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf={ "es.resource" : "logstash-uni-*" }) es_rdd.toDF().registerTempTable("elk") and then %sql select * from elk And then what I get is a table with just two columns. One is some objectID, I guess and the other is a string with a mapping of all the fields in the ES record into values ( " Map(@timestamp -> 2016-03-16T14:31:12.861Z, host -> ..." ). My question is how do I create a spark table, or even just a python object ( probably a dict ), that will enable me to access each filed seperatly? Thanks, Oren