Hi everyone,
I'm processing XML files, around 500MB each, with each file
containing several documents. To the map() function I pass one
document from the XML file, which takes some time to process
depending on its size - I'm applying NER (named entity recognition)
to the texts.
Each document has a unique identifier, so I'm using that identifier
as the key and the result of parsing the document, collected into one
string, as the value. So at the end of the map() function I do:

    output.collect(new Text(identifier), new Text(outputString));

The outputString is usually around 1k-5k in size.
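
For context, my map() looks roughly like this (a simplified sketch
using the old org.apache.hadoop.mapred API, same as the reduce below;
parseIdentifier() and runNER() are placeholders for my actual parsing
and NER code, and the input key type depends on the InputFormat):

    public void map(LongWritable offset, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // 'value' holds one document extracted from the XML file
        String identifier = parseIdentifier(value.toString()); // placeholder
        String outputString = runNER(value.toString());        // placeholder NER step
        output.collect(new Text(identifier), new Text(outputString));
    }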
And the reduce() just passes each value through unchanged (the keys
are unique, so it's effectively an identity reducer):

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            // pass each value straight through to the output
            output.collect(key, values.next());
        }
    }
I did a test using only 1 machine with 8 cores and only 1 XML file;
it took around 3 hours to process all the maps and ~12 hours for the
reduces! The XML file has 139,945 documents, and I set the JobConf to
1000 map tasks and 200 reduce tasks.
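
In the driver that's just the following (note that setNumMapTasks()
is only a hint to the framework - the actual number of maps comes
from the InputFormat's splits - while setNumReduceTasks() is honored
exactly; MyNerJob is a placeholder for my job class):

    JobConf conf = new JobConf(MyNerJob.class);
    conf.setNumMapTasks(1000);    // a hint only
    conf.setNumReduceTasks(200);  // taken literally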
I took a look at the graphs on the web interface during the reduce
phase, and indeed it's the copy phase that's taking most of the time;
the sort and reduce phases finish almost instantly.
Why does the copy phase take so long? I understand that the copies
are made over HTTP and that the data comes in really small chunks
(1k-5k in size), but even so, with everything on the same physical
machine it should have been faster, no?
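
For reference, these are the copy-phase knobs I've come across so far
(0.20-era property names; I haven't verified yet whether tuning them
actually helps in my case):

    // number of parallel fetch threads each reducer runs (default 5)
    conf.setInt("mapred.reduce.parallel.copies", 20);
    // fraction of maps that must finish before reducers start copying (default 0.05)
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);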
Any suggestions on what might be causing the copies in the reduce
phase to take so long?
--
./david