Mark, Thank you so much!!!!
On 12/28/2017 12:45 PM, Mark Waddingham via use-livecode wrote:
> On 2017-12-19 19:43, Mark Waddingham via use-livecode wrote:
>> I'm pretty sure it would be possible to write a handler which takes
>> the styledText array of a field in 6.7.11 and a list of old indices,
>> returning a list of new char indices... Would that help?
>
> Paul expressed an interest in how this might work - and he provided
> some more background:
>
> -*-
>
> Our main application, HyperRESEARCH, a tool for academics and others
> doing qualitative research, relies completely on chunk ranges. It is
> essentially a bookmarking tool: users select some content from a
> document, the character position (chunk) is grabbed, and the user
> gives it a text label. HyperRESEARCH remembers that the label "Early
> Childhood Behavior X" points to char S to T of document "ABC". All
> documents - native text, unicode (utf8 or utf16), rtf, docx, odt,
> etc. - are read into a LiveCode field, from which the selection is
> made and the chunk obtained. HyperRESEARCH saves a "Study" file that
> contains a LOT of these labels, chunks and document names.
>
> As part of our migration from LC464, which is what the current
> release of HyperRESEARCH is based on, we needed a way to convert a
> character range created under LC4.6.4 to a range under LC6.7.11 that
> points to the exact same string for the same file. Curry Kenworthy,
> whose libraries we license for reading MS-Word and Open Office
> documents, built a library based on an algorithm I came up with:
> send the original LC464 ranges to a helper application using sockets
> or IPC. The helper application retrieves the strings associated with
> each chunk, strips white space and passes the strings back to the
> LC6.7.11 version of the main app, which then finds the
> whitespace-stripped strings in the same file loaded under LC6.7.11,
> with an indexing mechanism to adjust the positions for the stripped
> whitespace.
> It is a bit complicated, but it works reliably.
>
> -*-
>
> From this I infer the following:
>
> 1) The study file is a list of triples - label, char chunk, document
> filename
>
> 2) When using the study file, the original document is loaded into a
> field, and the char chunks are used to display labels which the user
> can jump to.
>
> 3) The char chunks are old-style (pre-5.5) byte indices, not
> codeunit indices
>
> The crux of the problem Paul is having comes down to (3), which
> needs some background to explain.
>
> Before 7.0, the field was the only part of the engine which
> naturally handled Unicode. In these older versions the field stored
> text as a mixed sequence of style runs of either single bytes
> (native text) or double bytes (unicode text).
>
> Between 5.5 and 7.0, the indices used when referencing chars in
> fields corresponded to codeunits - this meant that the indices were
> independent of the encoding of the runs in the field. In this case
> char N referred to the Nth codeunit in the field, whether the text
> up until that point was all unicode, all native or a mixture of
> both.
>
> Before 5.5, the indices used when referencing chars in fields
> corresponded to bytes - this meant that you had to take into account
> the encoding of the runs in the field. In this case, char N referred
> to the Nth byte in the field. So if your field had:
>
> abcXYZabc (where XYZ are two-byte unicode chars)
>
> Then char 4 would refer to the first byte of the X unicode char and
> *not* the two bytes it actually took up.
>
> Now, importantly, the internal structure of the field did not change
> between 4.6.4 and 5.5, just how the 'char' chunk worked - in 6.7.11,
> the internal structure of the field is still the mixed runs of
> unicode/native bytes just as it was in 4.6.4 - the only difference
> is what happens when you reference char X to Y of the field.
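[Editorial aside: the byte-versus-codeunit distinction can be sketched with a small Python model - purely an illustration, not LiveCode itself - by treating each native char as 1 byte and each unicode char as 2 bytes, as the pre-7.0 field storage did:]

```python
# Model the field "abcXYZabc" where X, Y, Z stand in for two-byte
# unicode chars and the rest are one-byte native chars.
text = "abcXYZabc"
unicode_chars = set("XYZ")
widths = [2 if c in unicode_chars else 1 for c in text]

# 5.5-7.0 view: char 4 is simply the 4th codeunit, i.e. X itself.
assert text[4 - 1] == "X"

# Pre-5.5 view: flatten to bytes; char 4 is then the 4th *byte*,
# which is only the first of the two bytes X occupies (byte 5 is
# the second).
byte_owner = [c for c, w in zip(text, widths) for _ in range(w)]
assert byte_owner[4 - 1] == "X" and byte_owner[5 - 1] == "X"
assert len(byte_owner) == 12   # 6 native bytes + 3 chars x 2 bytes
```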
>
> So solving this problem comes down to finding a means to 'get at'
> the internal encoding style runs of a field in 6.7.11. We want a
> handler:
>
> mapByteRangeToCharRange(pFieldId, pByteFrom, pByteTo)
>
> returning a pair pCharFrom, pCharTo - where pByteFrom, pByteTo are a
> char X to Y range from 4.6.4 and pCharFrom, pCharTo are the char X
> to Y range *for the same text* in 6.7.11.
>
> -*-
>
> Before going into the details, an easy way to see the internal mixed
> encoding of a field containing unicode in 6.7.11 is to put some text
> which is a mixture of native text and unicode text into a field and
> then look at its 'text' property. Putting:
>
> Лорем ипсум Lorem ipsum dolor sit amet, pr долор сит амет, вел татион
> игнота сцрибентур еи. Вих еа феугиат doctus necessitatibus ассентиор
> пхилосопхиа. Феугаитconsulatu disputando comprehensam вивендум вис
> ет, мел еррем малорум ат. Хас но видерер лобортис, suscipit detraxit
> interesset eum аппетере инсоленс салутатус усу не. Еи дуо лудус
> яуаеяуе, ет елитр цорпора пер.
>
> into a 6.7.11 field and then doing 'put the text of field 1' gives:
>
> ????? ????? Lorem ipsum dolor sit amet, pr ????? ??? ????, ??? ??????
> ?????? ?????????? ??. ??? ?? ??????? doctus necessitatibus ?????????
> ???????????. ???????consulatu disputando comprehensam ???????? ???
> ??, ??? ????? ??????? ??. ??? ?? ??????? ????????, suscipit detraxit
> interesset eum ???????? ???????? ????????? ??? ??. ?? ??? ?????
> ???????, ?? ????? ??????? ???.
>
> Here we see some of how 6.7.11 fields handled unicode. The '?'s
> indicate that the 'char' being fetched at that index is a unicode
> codeunit (i.e. not representable in the native encoding). It is
> relatively easy to see by inspection that these match up - for each
> cyrillic letter there is a '?', and the roman letters come through
> directly.
>
> In contrast, if I do the same thing with 4.6.4, I get this:
>
> ??>?@?5?<? 8???A?C?<? Lorem ipsum dolor sit amet, pr 4?>?;?>?@? A?8?B?
> 0?<?5?B?, 2?5?;? B?0?B?8?>?=? 8?3?=?>?B?0? A?F?@?8?1?5?=?B?C?@? 5?8?.
> ??8?E? 5?0? D?5?C?3?8?0?B? doctus necessitatibus 0?A?A?5?=?B?8?>?@?
> ??E?8?;?>?A?>???E?8?0?. $?5?C?3?0?8?B?consulatu disputando
> comprehensam 2?8?2?5?=?4?C?<? 2?8?A? 5?B?, <?5?;? 5?@?@?5?<?
> <?0?;?>?@?C?<? 0?B?. %?0?A? =?>? 2?8?4?5?@?5?@? ;?>?1?>?@?B?8?A?,
> suscipit detraxit interesset eum 0?????5?B?5?@?5? 8?=?A?>?;?5?=?A?
> A?0?;?C?B?0?B?C?A? C?A?C? =?5?. ??8? 4?C?>? ;?C?4?C?A? O?C?0?5?O?C?5?,
> 5?B? 5?;?8?B?@? F?>?@???>?@?0? ??5?@?.
>
> In order to make sure this came through vaguely sanely, I've
> replaced all bytes < 32 with ?. If you compare with the 6.7.11
> output you can see that for each '?' present in 'the text' of the
> 6.7.11 field, there are *two* chars in the 4.6.4 output:
>
> Лорем (original) -> ????? (6.7.11) -> ??>?@?5?<? (4.6.4)
>
> This shows quite clearly the difference between 4.6.4 and 6.7.11 in
> handling text/char ranges - in 6.7.11, whilst internally each
> unicode codeunit takes up two bytes, you don't see that; instead you
> see only a single 'char'. In comparison, in 4.6.4 all the gory
> details are laid bare - you see the individual bytes making up the
> unicode codeunits.
>
> -*-
>
> Now, the above is only a rough way to see the internals of the field
> - the ? char in any one place in the text could be an actual '?' or
> a '?' which comes about because there is a non-native codeunit
> there. However, you can tell the encoding of any one char in a field
> by looking at the 'encoding' property of the char:
>
> put the encoding of char 1 of field 1 -> unicode
> put the encoding of char 30 of field 1 -> native
>
> We can use this information (in 6.7.11) to implement the required
> handler (which uses an auxiliary handler to map one index):
>
> -- Map a 4.6.4 char (byte) range to a 5.5+ char range.
> function mapByteRangeToCharRange pFieldId, pByteFrom, pByteTo
>    -- Convert the index of the beginning of the range.
>    local tCharFrom
>    put mapByteIndexToCharIndex(pFieldId, pByteFrom) into tCharFrom
>
>    -- Convert the index of the end of the range. We add 1 to the end
>    -- offset here so that we find the index of the char after the
>    -- end char. We need to do this as the byte range of a single
>    -- unicode char is always 2 bytes long.
>    local tCharTo
>    put mapByteIndexToCharIndex(pFieldId, pByteTo + 1) into tCharTo
>
>    -- If the range is a singleton, tCharFrom and tCharTo will be the
>    -- same.
>    if tCharFrom is tCharTo then
>       return tCharFrom, tCharTo
>    end if
>
>    -- Otherwise it is a multi-char range, and tCharTo will actually
>    -- be the char after the end of the range (due to the adjustment
>    -- above).
>    return tCharFrom, tCharTo - 1
> end mapByteRangeToCharRange
>
> -- Map a 4.6.4 char (byte) offset to a 5.5+ char offset.
> private function mapByteIndexToCharIndex pFieldId, pByteIndex
>    -- Char indices start from 1.
>    local tCharIndex
>    put 1 into tCharIndex
>
>    -- We iterate over the 5.5+ notion of chars until the original
>    -- 4.6.4 byte index is exhausted.
>    repeat while pByteIndex > 1
>       -- If the encoding of the char at the 5.5+ index is native,
>       -- then it will have required 1 byte in 4.6.4; otherwise it
>       -- will have required 2 bytes in 4.6.4.
>       if the encoding of char tCharIndex of pFieldId is "native" then
>          subtract 1 from pByteIndex
>       else
>          subtract 2 from pByteIndex
>       end if
>       -- We've consumed a single 5.5+ char, and either 1 or 2 4.6.4
>       -- bytes at this point.
>       add 1 to tCharIndex
>    end repeat
>
>    -- The final char index we computed is the char corresponding to
>    -- the byte index in 4.6.4.
>    return tCharIndex
> end mapByteIndexToCharIndex
>
> Now, this isn't the most efficient method of doing it - for example,
> you could scan from the start offset to the end offset rather than
> starting from the beginning again; or use the styledText array of
> the field, which gives you the encoding of each style run in the
> field - this would save the by-char lookup.
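[Editorial aside: the handler logic can be checked outside LiveCode with a small Python model. This is a sketch, not the shipping code - it assumes the field has been reduced to a per-char list of encodings, i.e. what repeated queries of 'the encoding of char N' would report in 6.7.11:]

```python
def map_byte_index_to_char_index(encodings, byte_index):
    """Map a 4.6.4 byte offset to a 5.5+ char offset.

    encodings: list of "native" / "unicode", one entry per 5.5+ char.
    """
    char_index = 1
    while byte_index > 1:
        # A native char took 1 byte in 4.6.4; a unicode char took 2.
        if encodings[char_index - 1] == "native":
            byte_index -= 1
        else:
            byte_index -= 2
        char_index += 1
    return char_index

def map_byte_range_to_char_range(encodings, byte_from, byte_to):
    """Map a 4.6.4 char (byte) range to a 5.5+ char range."""
    char_from = map_byte_index_to_char_index(encodings, byte_from)
    # +1 so we find the char *after* the end char of the range.
    char_to = map_byte_index_to_char_index(encodings, byte_to + 1)
    if char_from == char_to:
        return char_from, char_to
    # char_to is one past the end of the range, so step back.
    return char_from, char_to - 1

# "abcXYZabc" where X, Y, Z are unicode chars: in 4.6.4 terms they
# occupy bytes 4-9; in 5.5+ terms they are chars 4-6.
encs = ["native"] * 3 + ["unicode"] * 3 + ["native"] * 3
assert map_byte_range_to_char_range(encs, 4, 9) == (4, 6)   # XYZ
assert map_byte_range_to_char_range(encs, 1, 3) == (1, 3)   # abc
assert map_byte_range_to_char_range(encs, 4, 5) == (4, 4)   # X alone
```

For "abcXYZabc" the model confirms the adjustment logic: the 4.6.4 byte range 4 to 9 maps to the 6.7.11 char range 4 to 6, i.e. exactly XYZ.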
> Perhaps an interesting exercise to see how fast it can be made?
>
> -*-
>
> So this is the solution for 4.6.4 -> 6.7.11. In 7+ the internal
> structure of the field *did* change: it moved to using a string for
> each paragraph rather than a mixed style-run approach - i.e. the
> internal data structure for each paragraph is either a unicode
> string or a native string (although you can't tell the difference in
> 7, as that's an internal detail). In order for the approach to work
> in 7.x, the 4.6.4 internal structure would need to be recreated from
> the text of the field. This is definitely possible to do -
> basically, the approach 4.6.4 used was to convert all chars it could
> to native, leaving the rest as unicode. So:
>
> xxxXyZwww (uppercase are unicode-only chars, lowercase are
> can-be-native unicode chars)
>
> would end up with:
>
> xxx - native
> X - unicode
> y - native
> Z - unicode
> www - native
>
> Once split up like this, rather than accessing the encoding property
> of the field, you would use the encoding derived by splitting up the
> text content of the field in the above manner.
>
> -*-
>
> Of course, having said that (and testing in 7.0) - the encoding
> property of char ranges in the field should probably return
> 'unicode' for unicode-only chars, and 'native' for can-be-native
> chars. I'd need to look into why it doesn't currently - but if it
> did, I *think* the above code would work in 7+ as well as 5.5+.
> (I've filed http://quality.livecode.com/show_bug.cgi?id=20811 so I
> don't forget to have a look!)
>
> Warmest Regards,
>
> Mark.

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode