Hi Jens,

> OK, I attached my script. It's really far, far away from being perfect
> and every perl hacker knows better ways to do it, but nevertheless I
> found it very useful.

You probably haven't seen my scripts yet :)

> I implemented three tests:
> * swapped characters (helol)
> * duplicated character (helllo)
> * removed characters (helo)
> (a check for doubled words is missing missing)

Cool. The spellchecker uses aspell for spotting such typos, but obviously
only for the languages that have an aspell dictionary; for the other
levels, the spellchecker can still be useful, but not for syntax checking,
so your script might fill that gap!

>
> Since all words which were found in a subdirectory (single files cannot
> be tested, sorry) are considered, there is no need for a wordlist or a
> special file format -- text, HTML or XML all work.
>
> The words which occur most often were checked for one of the specified
> kinds of typo.
>
> Please apply it using
> $ ./check_typos.pl -d directory -t test-number
> (Test 3 has many false positives).
> The script doesn't modify anything; it just outputs found typos,
> similar to:
>
> bseoin (1) ==> besoin (990)
>
> This means bseoin was found once but besoin was found 990 times, so it's
> likely that the first is a typo. Now I search for bseoin using grep -rw.
>
> This script is much more efficient than aspell or other spell checkers.
> It also finds typos in names and URLs (Meyer vs. Mayer,
> php382&tzd_d vs. php381&tzd_d)
>
> I created my last patch by running my script against the full packages/po/
> directory. This has the advantage that strings in msgids and msgstrs
> are compared at the same time and I was able to find even consistent
> typos across a language file, such as etx2 and boostrap.
> Nevertheless it's also suggested to restrict tests to only one language.
> That's why I usually extract msgid strings using (is there really no
> msg* command to do this??)
>
> cat packages/po/de.po | msgconv | \
>   awk '/^msgstr/ {t=1};
>        /^msgid/ {t=0}; {
>          if (t==1 && index($0, "#")==0) {
>            gsub("^msgstr ", "");
>            gsub("^\"", "");
>            gsub("\"$", "");
>            gsub("\\\\n", " ");
>            print
>          }
>        }' > /tmp/check/de
> (not yet tested with plural forms of PO files).
>

Well, my scripts take care of stripping all the unneeded stuff, so I can
use text files containing *only* translated strings (look at any of the
files in the "messages" column at http://d-i.alioth.debian.org/spellcheck/).

> Attention: Since I do not know perl well enough, I explicitly wrote the
> word separators into the code (\W seems not to be locale specific). So I
> suggest you add common accents for other languages to the script, line 48.
> Do you know a solution for this?
>

I don't know perl, but I'll have a look at this, and in the worst case I'm
sure somebody will be happy to take a look at it.

> Davide, you still have to iterate across all languages and do other
> stuff. But I'm sure you know the required shell snippets, right?
>

I'll take a look in the next few days; I don't think I'll have any problem
with this.

> PS: Another script I run once per year is pattern-match
> http://alioth.debian.org/snippet/detail.php?type=snippet&id=2 This
> script checks for matching patterns in the specified file. It was
> written mainly to revise parentheses, braces, brackets, ... in my math
> documents.
>
> Examples:
> "([x])", "{\|x\||y|}", ... correct
> "([x)]", "\|||", "{()", ... incorrect

Oh yes! I tried it a while back and thought about integrating it with the
spellchecker; I think I'd like to focus more on the syntax before taking
care of such specific stuff, but I'm sure sooner or later it'll be very
useful.

ciao
Davide
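
PS: For anyone reading along without the attachment, here is a rough
sketch of the heuristic as I understand it from the description above:
count every word, then for each frequent word generate the swapped,
duplicated and removed-character variants and report any variant that
also occurs, but far less often. This is Python, not the attached
check_typos.pl, and everything in it is an assumption made for
illustration: the function names, the test numbering 1 to 3, and the
min_ratio threshold. The word pattern relies on Python 3 regular
expressions being Unicode-aware for str input, which is one possible way
around hard-coding accented characters as word separators.

```python
import re
from collections import Counter


def count_words(text):
    """Count words, where a word is a run of Unicode letters."""
    # [^\W\d_]+ = word characters minus digits and underscore, i.e. letters,
    # so accented words such as "café" stay in one piece.
    return Counter(re.findall(r"[^\W\d_]+", text))


def variants(word, test):
    """Return the possible misspellings of `word` for one test (hypothetical numbering)."""
    n = len(word)
    if test == 1:    # two adjacent characters swapped (hello -> helol)
        out = {word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)}
    elif test == 2:  # one character duplicated (hello -> helllo)
        out = {word[:i] + word[i] + word[i:] for i in range(n)}
    elif test == 3:  # one character removed (hello -> helo)
        out = {word[:i] + word[i + 1:] for i in range(n)}
    else:
        raise ValueError("test must be 1, 2 or 3")
    out.discard(word)  # swapping two equal letters gives the word itself
    return out


def find_typos(counts, test, min_ratio=10):
    """List (typo, typo_count, word, word_count) for rare variants of frequent words."""
    hits = []
    for word, n in counts.items():
        for cand in variants(word, test):
            m = counts.get(cand, 0)
            # min_ratio is an assumed cutoff: the frequent word must be at
            # least this many times more common than the suspected typo.
            if m and n >= min_ratio * m:
                hits.append((cand, m, word, n))
    return sorted(hits)


if __name__ == "__main__":
    sample = "besoin " * 990 + "bseoin"
    for typo, m, word, n in find_typos(count_words(sample), test=1):
        print(f"{typo} ({m}) ==> {word} ({n})")
```

Run on that sample, it prints "bseoin (1) ==> besoin (990)". As in the
description above, test 3 is the noisy one, because removing a letter
from a common word often gives another real word.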