[libreoffice-l10n] Re: Develop hyphenation extension
Dear list members,

On Fri, Apr 1, 2016 at 10:05 AM, Aleksandr Andreev wrote:
> Now, it's not clear to me what I need to do with the resulting .dic
> file. The documentation says I need to bundle it as a Dictionary
> Extension. But is there documentation on how that needs to be done?

To update: after some hacking around, I was able to bundle and install the hyphenation patterns. It involved creating a dictionaries.xcu file and a description.xml file. The simplest way appears to be to crack open an existing extension with an archive manager and take a look at how it is organized.

> Second question. The character commonly used in Church Slavonic for
> hyphenation is the underscore, not the hyphen (e.g., hyphe_nation). In
> TeX, I can simply set the hyphenchar to be _. Is this possible in
> LibreOffice? If yes, where do I specify it?

Does anyone know the answer to this question? Can I set the hyphenation character to be _ instead of -? Maybe in the locale data files? If not, is this a bug against LO or a bug against Hunspell?

Thanks,
Aleksandr

-- 
To unsubscribe e-mail to: l10n+unsubscr...@global.libreoffice.org
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/l10n/
All messages sent to this list will be publicly archived and cannot be deleted
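The bundling step described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact files used in the thread: the locale tag "cu-RU", the file name hyph_cu.dic, the identifier, the version and the display name are all made up for the example, and the META-INF/manifest.xml entry is included because existing dictionary extensions appear to carry one, although the message itself only mentions dictionaries.xcu and description.xml. Comparing the result against an existing extension, as suggested above, is still the safest check.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Every value below is an illustrative assumption, not taken from the thread.
LOCALE = "cu-RU"                          # assumed locale tag for Church Slavic
HYPH_FILE = "hyph_cu.dic"                 # hypothetical pattern file name
IDENTIFIER = "org.example.dict.hyph.cu"   # hypothetical extension identifier

# Registers the pattern file as a hyphenation dictionary for LOCALE.
DICTIONARIES_XCU = f"""<?xml version="1.0" encoding="UTF-8"?>
<oor:component-data xmlns:oor="http://openoffice.org/2001/registry"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    oor:name="Linguistic" oor:package="org.openoffice.Office">
  <node oor:name="ServiceManager">
    <node oor:name="Dictionaries">
      <node oor:name="HyphDic_{LOCALE}" oor:op="fuse">
        <prop oor:name="Locations" oor:type="oor:string-list">
          <value>%origin%/{HYPH_FILE}</value>
        </prop>
        <prop oor:name="Format" oor:type="xs:string">
          <value>DICT_HYPH</value>
        </prop>
        <prop oor:name="Locales" oor:type="oor:string-list">
          <value>{LOCALE}</value>
        </prop>
      </node>
    </node>
  </node>
</oor:component-data>
"""

# Minimal extension metadata: identifier, version and a display name.
DESCRIPTION_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<description xmlns="http://openoffice.org/extensions/description/2006">
  <identifier value="{IDENTIFIER}"/>
  <version value="0.1"/>
  <display-name>
    <name lang="en">Church Slavic hyphenation patterns (example)</name>
  </display-name>
</description>
"""

# Tells the extension manager that dictionaries.xcu is configuration data.
MANIFEST_XML = """<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
  <manifest:file-entry manifest:full-path="dictionaries.xcu"
      manifest:media-type="application/vnd.sun.star.configuration-data"/>
</manifest:manifest>
"""


def build_oxt_bytes(hyph_patterns: bytes) -> bytes:
    """Return the bytes of a zip archive laid out like a dictionary extension."""
    members = {
        "dictionaries.xcu": DICTIONARIES_XCU.encode("utf-8"),
        "description.xml": DESCRIPTION_XML.encode("utf-8"),
        "META-INF/manifest.xml": MANIFEST_XML.encode("utf-8"),
    }
    # Fail early if any of the XML templates is not well-formed.
    for data in members.values():
        ET.fromstring(data)
    members[HYPH_FILE] = hyph_patterns

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()
```

To try it, read the pattern file as bytes, pass it to build_oxt_bytes(), and write the returned bytes to a file with an .oxt suffix.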
Re: [libreoffice-l10n] Re: Develop hyphenation extension
Hello Aleksandr,

2016-04-03 12:16, Aleksandr Andreev wrote:
>> Second question. The character commonly used in Church Slavonic for
>> hyphenation is the underscore, not the hyphen (e.g., hyphe_nation). In
>> TeX, I can simply set the hyphenchar to be _. Is this possible in
>> LibreOffice? If yes, where do I specify it?
> Does anyone know the answer to this question? Can I set the
> hyphenation character to be _ instead of -? Maybe in the locale data
> files? If not, is this a bug against LO or a bug against Hunspell?

I think this belongs either to locale data, as you are suggesting, or perhaps even to the actual fonts you're using.

The reason why I suspect it might belong to fonts is because there is only one Unicode codepoint I know of serving this exact purpose (U+00AD SOFT HYPHEN), and OpenType has a feature called "Localized forms", which is designed exactly for cases like this (where glyph representation in a particular language is supposed to be different than usual). In combination, these features seem to provide the means to solve your problem.

Rimas
Re: [libreoffice-l10n] Re: Develop hyphenation extension
Dear Rimas,

On Sun, Apr 3, 2016 at 2:52 PM, Rimas Kudelis wrote:
> The reason why I suspect it might belong to fonts is because there is
> only one Unicode codepoint I know of serving this exact purpose (U+00AD
> SOFT HYPHEN), and OpenType has a feature called "Localized forms", which
> is designed exactly for cases like this (where glyph representation in
> a particular language is supposed to be different than usual). In
> combination, these features seem to provide the means to solve your problem.

You may be right that it belongs to the realm of Font Features (although that sounds like a terrible design flaw IMHO, given that LO has no mechanism currently to turn simple OpenType features on and off IIUC). But it certainly has nothing to do with the Soft Hyphen.

According to the Unicode documentation (p. 268),

    Despite its name, U+00AD soft hyphen is not a hyphen, but rather an
    invisible format character used to indicate optional intraword breaks.

And on p. 812 of the Standard:

    U+00AD soft hyphen (SHY) indicates an intraword break point, where a
    line break is preferred if a word must be hyphenated or otherwise
    broken across lines. Such break points are generally determined by an
    automatic hyphenator. SHY can be used with any script, but its use is
    generally limited to situations where users need to override the
    behavior of such a hyphenator.

So, the SHY:
* has no visible glyph, despite what some font manufacturers are doing;
* is not a graphic character, but rather a format control character;
* is not supposed to be used by an automatic hyphenator for hyphenation;
* is supposed to be used by a user to *override* the behavior of an automatic hyphenator.

Cordially,
Aleksandr
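The point that the SHY is a format control character rather than a graphic one can be checked against the Unicode Character Database. A small sketch using Python's standard unicodedata module; the helper name describe() is made up for this example, and the two comparison characters are the ones discussed in the thread (the ASCII "-" and the underscore):

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


# SOFT HYPHEN, the ASCII "-" and the underscore used in Church Slavonic.
for ch in ("\u00ad", "\u002d", "\u005f"):
    print(*describe(ch))

# Prints:
# U+00AD SOFT HYPHEN Cf
# U+002D HYPHEN-MINUS Pd
# U+005F LOW LINE Pc
```

"Cf" is the general category for format characters, while "Pd" (dash punctuation) and "Pc" (connector punctuation) are categories of visible punctuation.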
Re: [libreoffice-l10n] Re: Develop hyphenation extension
Hi Aleksandr,

2016-04-03 16:17, Aleksandr Andreev wrote:
> On Sun, Apr 3, 2016 at 2:52 PM, Rimas Kudelis wrote:
>> The reason why I suspect it might belong to fonts is because there is
>> only one Unicode codepoint I know of serving this exact purpose (U+00AD
>> SOFT HYPHEN), and OpenType has a feature called "Localized forms", which
>> is designed exactly for cases like this (where glyph representation in
>> a particular language is supposed to be different than usual). In
>> combination, these features seem to provide the means to solve your problem.
>>
> You may be right that it belongs to the realm of Font Features
> (although that sounds like a terrible design flaw IMHO, given that LO
> has no mechanism currently to turn simple OpenType features on and off
> IIUC). But it certainly has nothing to do with the Soft Hyphen.
>
> According to the Unicode documentation (p. 268),
>
> Despite its name, U+00AD soft hyphen is not a hyphen, but rather an
> invisible format character used to indicate optional intraword breaks.
>
> And on p. 812 of the Standard:
>
> U+00AD soft hyphen (SHY) indicates an intraword break point, where a
> line break is preferred if a word must be hyphenated or otherwise
> broken across lines. Such break points are generally determined by an
> automatic hyphenator. SHY can be used with any script, but its use is
> generally limited to situations where users need to override the
> behavior of such a hyphenator.
>
> So, the SHY:
> * has no visible glyph, despite what some font manufacturers are doing;
> * is not a graphic character, but rather a format control character;
> * is not supposed to be used by an automatic hyphenator for hyphenation;
> * is supposed to be used by a user to *override* the behavior of an
> automatic hyphenator.

I see you've done your homework and a bit more research than I did. Great! :)

With all the data you shared, I'm even more certain that this belongs to the locale data, much like quotation characters and number formatting characters. I'm not sure if this locale property is readily available for inclusion in locale data though. It might be that Slavonic is a very rare exception to the common rule of using hyphens for that, and that this hasn't been accounted for anywhere. At least I couldn't find anything about this either in the LDML standard or in our DTD for locale definition files (https://cgit.freedesktop.org/libreoffice/core/plain/i18npool/source/localedata/data/locale.dtd).

Regards,
Rimas
Re: [libreoffice-l10n] Re: Develop hyphenation extension
Dear Rimas,

On Sun, Apr 3, 2016 at 11:55 PM, Rimas Kudelis wrote:
> With all the data you shared, I'm even more certain that this belongs to
> the locale data, much like quotation characters and number formatting
> characters. I'm not sure if this locale property is readily available
> for inclusion in locale data though.

A while ago, I created the LO locale XML files for Church Slavonic ("Church Slavic"). Please see here: https://gerrit.libreoffice.org/#/c/15540/

There was nothing there about hyphenation characters, AFAICT.

> It might be that Slavonic is a very
> rare exception to the common rule of using hyphens for that, and that
> this hasn't been accounted for anywhere.

I don't think it's that uncommon. Unicode includes a number of script-specific hyphenation characters, for example U+058A Armenian Hyphen, U+1400 Canadian Syllabics Hyphen, etc. How are users supposed to use those? Also, some Indic scripts, IIUC, do not use a hyphen character at all; they just split a word across lines. What if users are using a legacy codepage where the Hyphen is encoded somewhere other than U+002D? (BTW, strictly speaking U+002D is *not* a hyphen, and LO should really be using U+2010 for hyphenation). Or the user wants to set some decorative character to be a hyphen.

IMHO, a hyphenation character should be settable from the user interface, for example, together with the "Characters at line end" and "Characters at line begin" in Format -> Paragraph -> Text Flow. It should not involve having to hack an XML file and rebuild LO from source.

BTW, despite setting LEFTHYPHMIN and RIGHTHYPHMIN in the hyphenation dictionary, "Characters at line end" and "Characters at line begin" cannot be set lower than 2. But Church Slavic uses LEFTHYPHMIN = 1 (Ancient Greek uses both LEFTHYPHMIN and RIGHTHYPHMIN = 1). Is this a bug? Or a feature?

> At least I couldn't find
> anything about this either in the LDML standard or in our DTD for
> locale definition files
> (https://cgit.freedesktop.org/libreoffice/core/plain/i18npool/source/localedata/data/locale.dtd).

So, I guess as a first step, LO should support changing the hyphen character in the XML locale files. Or, is this really a Hunspell issue, and it should be specified from the hyphenation dictionary extension? Could someone confirm this?

Cordially,
Aleksandr
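To make the LEFTHYPHMIN / RIGHTHYPHMIN question concrete, here is a toy sketch of how such minimums restrict break points and how a configurable hyphenation character would appear at the line end. It is not LibreOffice's or the hyphenation library's implementation: the function name, its parameters and the break positions are invented for illustration, and the word is the "hyphe_nation" example from earlier in the thread.

```python
def line_end_fragments(word, breaks, left_min=2, right_min=2, hyphen_char="-"):
    """Return the fragments that could end a line when `word` is split.

    `breaks` holds indices i at which the word may be divided into
    word[:i] and word[i:]. A break is kept only if at least `left_min`
    characters stay on the first line and at least `right_min`
    characters move to the next line.
    """
    fragments = []
    for i in sorted(breaks):
        if i >= left_min and len(word) - i >= right_min:
            fragments.append(word[:i] + hyphen_char)
    return fragments


# Hypothetical break points, chosen only to exercise both limits.
candidate_breaks = {1, 5, 10}

# With a minimum of 2 on both sides, the breaks after 1 and after 10
# characters are rejected.
print(line_end_fragments("hyphenation", candidate_breaks))
# ['hyphe-']

# With both minimums at 1 and "_" as the hyphenation character,
# all three candidates survive.
print(line_end_fragments("hyphenation", candidate_breaks, 1, 1, "_"))
# ['h_', 'hyphe_', 'hyphenatio_']
```

Under this reading, a user-interface floor of 2 would silently discard one-character breaks that the dictionary's own minimum of 1 allows, which is the mismatch raised in the message above.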