I've added a large number of new classes over the last few days and made a few other minor adjustments.
ThML option filters were added (ThMLFootnotes, ThMLStrongs, ThMLHeadings, ThMLMorph, ThMLLemma, ThMLScripref). They act just like their GBF counterparts but work with ThML.

SWModule and all of its descendants now take a language value passed to their constructors. You can call the Lang() method to retrieve the value. We needed this for BibleCS because WinNT & Win9x handle right-to-left text differently depending on the language, but there are other good uses for language information, like sorting/filtering by language. The SWModule constructor is getting quite large, and I'm ready to suggest we start passing a module info struct instead of separate arguments so that new information can be added to the module less painfully. (But we should retain the current constructor for backwards compatibility.)

The other four classes I added all require ICU and are all SWFilter descendants:

UTF8NFC normalizes text according to Normalization Form C (NFC), which should turn text into its most composed form. In other words, combining accents will compose with the letters they follow: an "a" followed by a combining "umlaut" will turn into a single "a-umlaut" character. Since this is how all our texts should be distributed anyway, it may not be that useful to anyone.

UTF8NFKD normalizes text according to Normalization Form KD (NFKD), which is compatibility decomposition. That means an "a-umlaut" turns into an "a" followed by a combining "umlaut". This filter should be used as a strip filter when performing searches, because searches are best performed on strings in NFD or NFKD.

UTF8BiDiReordering will reorder text into visual order. So passing it Hebrew, Arabic, or Syriac should return a reversed string. Passing it English should return the same string. (And it should be able to handle any reasonable mix of scripts/directionalities.)

UTF8arShaping will perform Arabic shaping on a string. Arabic text is encoded with abstract characters, each of which is usually represented in fonts by its isolated form glyph. This class will convert each abstract character codepoint to a codepoint in the Arabic presentation forms area corresponding to the initial, medial, final, or isolated form of the glyph, depending on its position in the word.

--Chris
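
For anyone who wants to see the difference between the two normalization forms described above, here is a minimal sketch. It uses Python's standard unicodedata module, not ICU or the SWORD filters themselves, so it only illustrates the composing and decomposing behaviour. The search_key helper is a hypothetical name made up for this example.

```python
import unicodedata

composed = "\u00e4"      # "a-umlaut" as a single precomposed code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

# NFC: the combining mark composes with the letter it follows.
nfc = unicodedata.normalize("NFC", decomposed)

# NFKD: compatibility decomposition splits the precomposed letter apart.
nfkd = unicodedata.normalize("NFKD", composed)


def search_key(text):
    """Hypothetical helper: put text into NFKD before comparing."""
    return unicodedata.normalize("NFKD", text)


# The two spellings differ as raw strings but match once both are in NFKD.
same_raw = composed == decomposed
same_normalized = search_key(composed) == search_key(decomposed)

print(nfc == composed, nfkd == decomposed, same_raw, same_normalized)
```

The last comparison is the reason a decomposing filter helps with searching: the search term and the module text end up in the same form no matter how each was originally typed or encoded.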