filtering a SchemaRDD

Daniel, Ronald (ELS-SDG) Fri, 14 Nov 2014 21:24:33 -0800

Hi all,

I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one 
of which holds the text for a sentence from a document.


  # Load sentence data table
  sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
  sentenceRDD.take(3)
Out[20]: [Row(annotID=118, annotSet=u'ge', annotType=u'sentence', 
endOffset=20194, pii=u'0094576587900440', startOffset=20062, text=u'Paper 
IAF-86-85 presented at the 37th Congress of the International Astronautical 
Federation, Innsbruck, Austria, 4-11 October 1986.'), Row(annotID=163, 
annotSet=u'ge', annotType=u'sentence', endOffset=20249, 
pii=u'0094576587900440', startOffset=20194, text=u"The landsat sensors: Eosat's 
plans for landsats 6 and 7"), Row(annotID=190, annotSet=u'ge', 
annotType=u'sentence', endOffset=20342, pii=u'0094576587900440', 
startOffset=20334, text=u'Abstract')]

I have this registered as a table and can query it with SQL SELECT statements. I 
would also like to filter the RDD using text operations, such as regexps, that 
have greater capabilities than SQL's LIKE operator. However, the code below does 
not work. Instead I get a runtime error.

    openProbsRDD = sentenceRDD.filter(lambda row: "remains unknown" in row["text"])
    openProbsRDD.take(5)
...
TypeError: tuple indices must be integers, not str
...

If I use row[6] instead of row["text"] I get what I am looking for. However, 
finding the right numeric index could be a pain.
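
For reference, the same behaviour can be reproduced without Spark. The sketch 
below uses a plain Python namedtuple as a stand-in for a Row. That is an 
assumption on my part for illustration, not Spark's actual Row class, and the 
second row is made-up test data. It only shows that a tuple-like row accepts an 
integer index but raises TypeError on a string index.

```python
from collections import namedtuple

# Hypothetical stand-in for a SchemaRDD Row: a tuple subclass with the same
# 7 field names, in the order shown in the take(3) output above.
Row = namedtuple(
    "Row",
    ["annotID", "annotSet", "annotType", "endOffset", "pii", "startOffset", "text"],
)

rows = [
    # Taken from the sample output above.
    Row(190, "ge", "sentence", 20342, "0094576587900440", 20334, "Abstract"),
    # Made-up row so that the filter has something to match.
    Row(1, "ge", "sentence", 30, "hypothetical-pii", 0, "The cause remains unknown."),
]


def by_position(row):
    # Integer index: works, because the row behaves like a tuple.
    return "remains unknown" in row[6]


def by_name(row):
    # String index: a tuple cannot be indexed by a str, so this raises TypeError.
    return "remains unknown" in row["text"]


# Plain-Python equivalent of sentenceRDD.filter(...) using the positional index.
matches = [r for r in rows if by_position(r)]

# Record whether the by-name lookup fails the way it does in my Spark session.
try:
    by_name(rows[0])
    name_lookup_failed = False
except TypeError:
    name_lookup_failed = True
```

With this stand-in, the positional filter keeps only the made-up row, and the 
lookup with a string key fails with a TypeError, as in my Spark session.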

Can I access the fields in a Row of a SchemaRDD by name, so that I can map, 
filter, etc. without a trial and error process of finding the right int for the 
fieldname?

Thanks,
Ron Daniel
