Hi Chris,

Some aspects of USFM as used in practice are not specified in the *USFM User Reference*. They seem to be defined /de facto/ by UBS Paratext.
Note that the USFM standard specifies the syntax for verse tags as *\v_#*, where the underscore stands for a space, and shows nothing after the #. In fact, some white space is required there by all software that reads USFM files, but that white space could be a space, a tab, or one or both of the line end characters (carriage return, line feed).

A poorly documented fact about USFM is that within a text field, a line end is equivalent to a space, and multiple spaces (that aren't part of the markup) are equivalent to one space. Thus the following are equivalent:

\v 1 In the beginning, God ...

and

\v 1
In the beginning, God ...

Three poorly defined uses of USFM verse tags that we often encounter are as follows:

Spanned verses (translators use a verse range), thus

\v 7-11 Text for these five verses

or (worse still, a naughty use of the comma delimiter)

\v 3,4 Text for two consecutive verses

Split verses (translators use composite verse tags with parts a and b of the text), e.g.

\v 19a Text for the first part of verse nineteen
\v 19b Text for the second part of verse nineteen

and when this is done, there are sometimes extra USFM tags between parts a and b, e.g.

\v 19a Tarus, dia makang la jadi kuat kombali.
\s1 Saulus kasi tau Kabar Bae soal Yesus par orang-orang di Damsik
\sr 9:19b-25
\p
\v 19b Saulus tinggal deng orang-orang yang iko Yesus di kota Damsik kurang labe dua tiga hari bagitu.

That's a real world example that I just encountered.

Michael J. and I discussed these things recently (May 4). Further he writes:

Unlike OSIS, USFM leaves little wiggle room for interpretation, and when it does, the master reference implementation, Paratext, rules, for pragmatic reasons.

In general, I try to read USFM with reasonable tolerance, and write it with reasonable consistency. My USFM reader, for example, always accepts \mt as being equivalent to \mt1, whether or not \mt2 is present. When writing USFM, though, it is better practice to always include the "1".
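As a rough illustration of reading verse tags with that kind of tolerance, here is a minimal Python sketch. The names Verse, VERSE_TAG, normalize_text and parse_verses are made up for this example; they are not taken from usfm2osis.py or from Paratext. The regular expression covers only the variants described above (any white space after \v, ranges, the comma form, and part letters) and is an assumption about what a tolerant reader might accept, not an official USFM grammar.

```python
import re
from typing import List, NamedTuple


class Verse(NamedTuple):
    label: str  # the verse number exactly as written, e.g. "7-11" or "19a"
    start: int  # first verse covered
    end: int    # last verse covered (same as start for a single verse)
    part: str   # "a", "b", ... for a split verse, otherwise ""
    text: str   # whitespace-normalised content up to the next \v tag


# Hypothetical pattern for this sketch, not an official grammar.
VERSE_TAG = re.compile(
    r"\\v\s+"                          # \v, then space, tab, CR and/or LF
    r"(?P<start>\d+)(?P<part>[a-z]?)"  # verse number, optional part letter
    r"(?:[-,](?P<end>\d+)[a-z]?)?"     # optional "-11" range or ",4" form
    r"(?=\s|\\|$)"                     # white space, another marker, or end
)


def normalize_text(text: str) -> str:
    """Treat line ends and runs of white space as a single space."""
    return re.sub(r"\s+", " ", text).strip()


def parse_verses(usfm: str) -> List[Verse]:
    """Split USFM text at \\v tags and describe each tag that was found.

    The text of each verse runs up to the next \\v tag, so it may still
    contain other markers (for example \\s1 or \\p between 19a and 19b).
    """
    matches = list(VERSE_TAG.finditer(usfm))
    verses = []
    for i, m in enumerate(matches):
        stop = matches[i + 1].start() if i + 1 < len(matches) else len(usfm)
        start = int(m.group("start"))
        end = int(m.group("end") or start)
        label = m.group(0).split(None, 1)[1]  # drop the leading "\v"
        text = normalize_text(usfm[m.end():stop])
        verses.append(Verse(label, start, end, m.group("part"), text))
    return verses
```

Calling parse_verses on a chapter gives one Verse per \v tag, so a converter can decide separately how to map a range such as 7-11 or a split verse such as 19a/19b onto its own reference scheme.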
The examples of *\v_#* in the USFM reference manual all include a space after the numeric verse number.

Empty verses (with no verse text at all): Paratext generates them with no trailing space when creating an empty verse template. Some software that reads USFM expects to find a trailing space, because the examples in the USFM user reference all have real text. That is something else for your Python script to be flexible about.

Last week, I came across an unexpected use of the \imt tag in the same line as the \mt tag text:

\mt Paulus pung Surat Kadua par Jamaat di Tesalonika\imt Kata-Kata Partama

If the examples in the user reference were definitive, there would be a line break before the \imt, yet apparently Paratext had not complained when there wasn't.

This sort of thing does seem to crop up quite often, and I've no idea whether all such cases would be detected by the USFM <==> USX processes that are now done 'under the hood' by Paratext.

By the way, Peter and I have collected a substantial body of real world USFM suites which you could probably use for testing your conversion script.

Best regards,
David

--
View this message in context: http://sword-dev.350566.n4.nabble.com/USFM-conformance-in-usfm2osis-py-tp4650705p4650707.html