Hi all,
I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one
of which holds the text for a sentence from a document.
# Load sentence data table
sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
sentenceRDD.take(3)
Out[20]: [Row(annotID=118, annotSet=u'ge', annotType=u'sentence',
endOffset=20194, pii=u'0094576587900440', startOffset=20062, text=u'Paper
IAF-86-85 presented at the 37th Congress of the International Astronautical
Federation, Innsbruck, Austria, 4-11 October 1986.'), Row(annotID=163,
annotSet=u'ge', annotType=u'sentence', endOffset=20249,
pii=u'0094576587900440', startOffset=20194, text=u"The landsat sensors: Eosat's
plans for landsats 6 and 7"), Row(annotID=190, annotSet=u'ge',
annotType=u'sentence', endOffset=20342, pii=u'0094576587900440',
startOffset=20334, text=u'Abstract')]
I have this registered as a table and can query it with SQL SELECT statements. I
would also like to filter the RDD using text operations such as regexps, which have
greater capabilities than SQL's LIKE operator. However, the code below does not
work. Instead I get a runtime error.
openProbsRDD = sentenceRDD.filter(lambda row: "remains unknown" in row["text"])
openProbsRDD.take(5)
...
TypeError: tuple indices must be integers, not str
...
If I use row[6] instead of row["text"] I get what I am looking for. However,
finding the right numeric index could be a pain.
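To make the workaround concrete, here is a rough plain-Python sketch of what I would otherwise have to do by hand. It uses ordinary tuples as stand-ins for Rows, so it runs without Spark. The FIELD_NAMES list, the FIELD_INDEX dict, the is_open_problem helper, and the first sample row are my own illustrative assumptions, not part of any Spark API. It also assumes the field order matches the take(3) output above.

```python
import re

# Field names in the order shown in the take(3) output above.
# Assumption: this order is stable for every Row in the RDD.
FIELD_NAMES = ["annotID", "annotSet", "annotType", "endOffset",
               "pii", "startOffset", "text"]

# Hypothetical lookup table: field name -> integer position in the tuple.
FIELD_INDEX = {name: i for i, name in enumerate(FIELD_NAMES)}

# Plain tuples standing in for Rows. The first row's text is made up
# for illustration; the second mirrors the 'Abstract' row above.
rows = [
    (118, u"ge", u"sentence", 20194, u"0094576587900440", 20062,
     u"The cause remains unknown."),
    (190, u"ge", u"sentence", 20342, u"0094576587900440", 20334,
     u"Abstract"),
]

# A regexp that SQL's LIKE could not express as a single pattern.
pattern = re.compile(r"remains\s+(unknown|unclear)")

def is_open_problem(row):
    # Resolve the field by name, then index the tuple by position.
    return pattern.search(row[FIELD_INDEX["text"]]) is not None

open_probs = [row for row in rows if is_open_problem(row)]
```

Presumably the same predicate could be passed to sentenceRDD.filter(...), but it still depends on a hand-maintained list of field names, which is what I would like to avoid.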
Can I access the fields in a Row of a SchemaRDD by name, so that I can map,
filter, etc. without a trial-and-error process of finding the right integer index
for each field name?
Thanks,
Ron Daniel