Hi All,

I connected PySpark under Zeppelin to my Elasticsearch DB, and I am able to do this:

%pyspark
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={ "es.resource" : "logstash-uni-*" })

es_rdd.toDF().registerTempTable("elk")

and then

%sql select * from elk

What comes back is a table with just two columns. One is some object ID, I guess, and the other is a string with a mapping of all the fields of the ES record to their values ("Map(@timestamp -> 2016-03-16T14:31:12.861Z, host -> ...").

My question is: how do I create a Spark table, or even just a Python object (probably a dict), that will let me access each field separately?
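
To make this concrete, here is the kind of thing I have in mind (an untested sketch: to_row, doc_id, and elk_flat are just names I made up, and I am assuming the values in es_rdd arrive on the Python side as dicts with the same top-level fields in every document):

%pyspark
from pyspark.sql import Row

def to_row(kv):
    # Each element of es_rdd is a (document_id, fields) pair.
    doc_id, fields = kv
    # Spread the ES fields into top-level columns; "doc_id" is a
    # made-up column name to keep the document id around.
    return Row(doc_id=doc_id, **dict(fields))

flat_df = es_rdd.map(to_row).toDF()
flat_df.registerTempTable("elk_flat")

and then something like

%sql select `@timestamp`, host from elk_flat

(backticks because of the @ in the field name). Is that the right direction, or is there a built-in way to get this?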

Thanks,

Oren
