The following sample from a git patch shows one way in which the current counting method comes up with fewer words than other methods do:

+1747,9
1.7.0.4

That is 14 characters on two lines, and either 2, 3 or 6 words depending on how you count.
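To make the 2 / 3 / 6 spread concrete, here is a minimal Python sketch of three counting rules applied to that sample. It is not the gedit or LibOdev code; the function names are made up for illustration. The rule that yields 3 is only my assumption (split on whitespace and commas), since nothing here establishes which method actually gives 3.

```python
import re

# The two-line sample from the patch, joined by a single newline.
SAMPLE = "+1747,9\n1.7.0.4"


def count_whitespace_words(text):
    """Every run of non-whitespace characters is one word."""
    return len(text.split())


def count_comma_split_words(text):
    """ASSUMPTION: whitespace and commas separate words, periods do not."""
    return len([w for w in re.split(r"[\s,]+", text) if w])


def count_alnum_words(text):
    """Every run of letters or digits is one word; punctuation breaks words."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


def count_chars_no_spaces(text):
    """Characters excluding all whitespace (spaces, tabs, newlines)."""
    return sum(1 for ch in text if not ch.isspace())


if __name__ == "__main__":
    print("whitespace-delimited words:", count_whitespace_words(SAMPLE))
    print("whitespace/comma words:    ", count_comma_split_words(SAMPLE))
    print("alphanumeric-run words:    ", count_alnum_words(SAMPLE))
    print("chars incl. newline:       ", len(SAMPLE))
    print("chars excl. whitespace:    ", count_chars_no_spaces(SAMPLE))
```

Whichever word rule is used, the non-whitespace character count does not depend on where words are broken, which would fit with it being the one figure that always agrees.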
Gedit says: 2 lines, 6 words, 15 chars, 14 chars (no spaces).
LibOdev says: 2 words, 14 chars, 14 chars excl. spaces (there is no stat line for lines, though it has paragraph counts).

Gedit takes each number as a word, breaking the words on punctuation.
Gedit also counts the newline as whitespace.
LibOdev counts any block of contiguous characters as a word.
LibOdev's in-node word counter never sees the newline.

Over the diff part (from qgit) of Mattias' part 1 sw patch file, showing gedit / LibOdev:

Words: 2418 / 2414
Chars: 24241 / 24241
Chars (excl. spaces): 16830 / 16830

Now there is a near match in words and a perfect match on chars excl. spaces.

Testing with a different, entire patch file, the major difference is in words: 1338 versus 1533, or ~200 out of 1400 words. The total chars and chars excl. spaces agree completely, at 13 459 and 10 157.

Taking into account the different word handling (top) and the way the counts match and then don't match, I suspect a second difference in the counting method between gedit and LibOdev, plus differences in the line breaks in the files after cut and paste. So far gedit and LibOdev agree completely ONLY on the non-space counts.

I didn't check results on your reference odt because gedit won't open odt, and cut and paste just dumps the XML into the text...

Words: 3997 / 18
Chars: 33429 / 125
Chars (excl. spaces): 28469 / 107

where the second, smaller numbers are a page footer's counts.

AFAIR, LibOdev doesn't count the footer content, and that might be the difference. There are 20+ pages, so that's 360+ words and ~2500 chars in the footers.

I also saw that the LibOdev count is zero at load of the odt. Perhaps the count is made somewhere else and saved on the doc without this code, or it is stored in the doc and loaded. Either way the word count is marked clean, so it is not re-counted when the dialog box calls updateStats, and the excl. spaces count remains zero. Just clicking in the document causes a full recount though, and that seems too busy somehow... <-- more than enough guessing there.
All these tests are with the aScanner.GetLen() > 1 check in place. With that Len >= 2 check, the new counting routine has no problem with single-character words like "A", "a", "1", "-", or just ",".

It is puzzling that Mattias removed the check to handle single-char words on his machine, but a build out of master/LibOdev works (at least for me) with that same check in... I will test changing back to Mattias' simpler submission (building now).

I must note that the block immediately after this count area word-counts the outline numbers (and counts the bullets as words!?!). It does not have any such length check at all.

I think all the len=1 strings that the scanner might give back are just CH_TXTATR_BREAKWORD = 0x01, and they are probably Scanner's zero-length string. Scanner's GetEnd points one slot past the end of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen() (no -1 there), and that end spot likely has a break marker.

Again, gedit and LibOdev agree completely ONLY on the non-space counts.

--
View this message in context: http://nabble.documentfoundation.org/PATCH-Fix-for-bug-feature-request-30550-Character-count-without-spaces-tp1778667p1782965.html
Sent from the Dev mailing list archive at Nabble.com.