(sorry about the previous spam... Google Inbox didn't allow me to cancel
the miserable send action :-/)
So what I was about to say: it's a real PAIN in the ass to parse the
Wikipedia articles in the dump because of these multiline articles...
However, there is a way to manage that "quite" easily, although I found it
rather slow.
*1/ use XML reader*
Add the "org.apache.hadoop" % "hadoop-streaming" % "1.0.4" dependency,
which provides StreamXmlRecordReader.
*2/ configure the hadoop job*
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.streaming.StreamInputFormat

val jobConf = new JobConf()
// Treat everything from "<page" up to "</page>" as a single record.
jobConf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
FileInputFormat.addInputPaths(jobConf, s"hdfs://$master:9000/data.xml")
// Load documents (one <page> element per record; the XML is in the key).
val documents = sparkContext.hadoopRDD(jobConf,
  classOf[StreamInputFormat],
  classOf[Text],
  classOf[Text])
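
To make the begin/end settings above more concrete, here is a rough local
sketch of the idea: cut the input into records that start at a begin marker
and stop at the next end marker. This is NOT the Hadoop implementation;
RecordSplitSketch and splitRecords are hypothetical names made up for this
illustration, and it ignores everything about HDFS splits.

```scala
// Hypothetical helper, not part of Hadoop: illustrates begin/end marker
// record splitting on an in-memory string.
object RecordSplitSketch {
  def splitRecords(input: String, begin: String, end: String): List[String] = {
    @annotation.tailrec
    def loop(from: Int, acc: List[String]): List[String] = {
      val start = input.indexOf(begin, from)
      if (start < 0) acc.reverse
      else {
        val stop = input.indexOf(end, start + begin.length)
        if (stop < 0) acc.reverse // unterminated record: drop it
        else {
          val next = stop + end.length
          loop(next, input.substring(start, next) :: acc)
        }
      }
    }
    loop(0, Nil)
  }

  def main(args: Array[String]): Unit = {
    val dump =
      """<mediawiki>
        |<page>
        |  <title>A</title>
        |</page>
        |<page>
        |  <title>B</title>
        |</page>
        |</mediawiki>""".stripMargin
    // Each record spans several lines, which is why line-by-line
    // reading does not work for these dumps.
    val pages = splitRecords(dump, "<page", "</page>")
    pages.foreach(p => println(p.replace("\n", " ")))
  }
}
```

Each element of the result is a complete multi-line <page>...</page> string,
which is the shape of record the next step expects.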
*3/ use the result as XML doc*
import scala.xml.XML
val texts = documents.map(_._1.toString)
  .map { s =>
    val xml = XML.loadString(s)
    // "\" selects direct children only, so this is the page id,
    // not the revision id.
    val id = (xml \ "id").text.toDouble
    val title = (xml \ "title").text
    val text = (xml \ "revision" \ "text").text
    // Split on runs of non-word characters, keep tokens longer than 3 chars.
    val tknzed = text.split("\\W+").filter(_.size > 3).toList
    (id, title, tknzed)
  }
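
The tokenization step above can be tried without Spark or Hadoop. A minimal
sketch, assuming a made-up sample string; TokenizeSketch and tokenize are
hypothetical names used only for this illustration:

```scala
object TokenizeSketch {
  // Split on runs of non-word characters and keep tokens longer than
  // 3 characters, as in the map step above.
  def tokenize(text: String): List[String] =
    text.split("\\W+").filter(_.length > 3).toList

  def main(args: Array[String]): Unit = {
    // Made-up wiki-like markup, only for illustration.
    println(tokenize("'''Spark''' is a [[cluster computing]] framework."))
  }
}
```

Note that "\\W" is ASCII-only by default in Java regexes, so accented
letters are treated as separators and words containing them get split.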
HTH
andy
On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer <[email protected]> wrote:
> Hi,
>
> see https://www.mail-archive.com/[email protected]/msg03520.html for
> one solution.
>
> One issue with those XML files is that they cannot be processed line by
> line in parallel; plus you inherently need shared/global state to parse XML
> or check for well-formedness, I think. (Same issue with multi-line JSON, by
> the way.)
>
> Tobias
>
>