Bjoern Hoehrmann wrote:
> * Khalida BEN SIDI AHMED wrote:
>> In the html code of a Wikipedia article how to recognise the
>> *first*sentence of this article?
>
> It's not marked up and probably differs among language versions. On the
> english version the first `p` child of a `mw-content-ltr` element is a
> good bet, as I pointed out earlier, to identify the first paragraph. It
> would then be necessary to find the full stop at the end of a sentence;
> criteria for that include that a space or the end of a paragraph follows
> and that it is not included in some nesting construct like parentheses;
> http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation discusses
> some of the problems and includes pointers to some solutions.
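For reference, the heuristic quoted above could be sketched roughly like this in Python, using only the standard library. This is an illustration under assumptions, not a tested extractor: the names FirstParagraph, first_paragraph_text and first_sentence are made up for this example, the markup is assumed to be well formed with explicit closing tags, and the sentence splitter only applies the two quoted criteria (a full stop followed by whitespace or the end of the paragraph, and not inside parentheses or brackets).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> that is a direct child of the
    first element whose class list contains "mw-content-ltr".
    Assumes well-formed markup with explicit closing tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0
        self.content_depth = None  # depth of the mw-content-ltr element
        self.in_p = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.depth += 1
        if self.done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.content_depth is None:
            if "mw-content-ltr" in classes:
                self.content_depth = self.depth
        elif (tag == "p" and not self.in_p
              and self.depth == self.content_depth + 1):
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if (self.in_p and tag == "p"
                and self.depth == self.content_depth + 1):
            self.in_p = False
            self.done = True
        self.depth -= 1

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def first_paragraph_text(html):
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def first_sentence(text):
    """Cut at the first full stop that is outside (...) or [...] and is
    followed by whitespace or the end of the text."""
    text = " ".join(text.split())  # normalise whitespace
    nesting = 0
    for i, ch in enumerate(text):
        if ch in "([":
            nesting += 1
        elif ch in ")]":
            nesting = max(0, nesting - 1)
        elif ch == "." and nesting == 0:
            if i + 1 == len(text) or text[i + 1].isspace():
                return text[:i + 1]
    return text  # no boundary found; return the whole paragraph
```

Calling first_sentence(first_paragraph_text(page_html)) on a rendered article page would give a first guess. It will still split wrongly on abbreviations outside parentheses (for example "Dr. Smith"), which is the kind of problem the sentence boundary disambiguation article discusses.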
I've found you have to be careful with the <p> trick. Sometimes geo coordinates in the article will mistakenly use a <p>, or a <p> will slip into an infobox or a hatnote. There are also other edge cases like disambiguation pages (where the first "sentence" often ends in a colon, at least on the English Wikipedia). I'm not sure if anyone has put together a comprehensive set of edge cases.

The real answer here seems to be switching to an architecture that makes the distinction explicit. I don't imagine that'll be happening any time soon, though.

MZMcBride

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
