On Sun, Nov 29, 2009 at 12:00:33AM +0200, Dotan Cohen wrote:
> > ISTM that because the output of strings is not a discrete list of
> > potential words, but is instead a long list of concatenated
> > characters, this problem is really rather daunting. The output should
> > probably be first broken up into something resembling words by perhaps
> > breaking on non-alphabetic characters. That should do two things: 1)
> > get you something that resembles words to actually test and 2) a somewhat
> > smaller set of "stuff" to check.
> >
> > This won't necessarily handle "compound" words though where two
> > word-like things are jammed together, or an actual word is embedded
> > within a string of nonsense.
> >
> > I think this problem is potentially rather harder than I thought when
> > I saw OP's original question.
> >
>
> It does not need to be comprehensive. Would it be possible to only
> show lines that have "words" (continuous strings) of alpha characters
> that are all lowercase except for the first character? That would
> handle about 90% of the work by eliminating lines like these:
> pDuf
> #k0H}g)
> GoV5
> rLeY1
> TMlq,*

Well, something simple in sed would help:

    sed 's/[^a-zA-Z]\+/\n/g'

splits "words" at non-alphas and inserts a newline to make each a
separate line. Or leave out the '\n' to leave the "line" structure
as it is.

Then you can grep with something like:

    grep '^[A-Z]'

which will get the ones that start with capital alphas. If you want
initial caps *only*, then:

    grep '^[A-Z][a-z]*$'

would match those.

I'm sure someone can do better. But that gets you down to maybe a
very truncated dataset, then you can somehow look each of those up
in aspell.

A
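
Put together, the split-then-filter idea above might look like the sketch below. It is only an illustration: the function name initial_caps_words is made up here, and tr -cs is swapped in for the sed substitution because \+ and \n in a sed replacement are GNU extensions. The sort -u step is an extra, added so fewer candidates reach the spell checker.

```shell
#!/bin/sh
# Sketch only: reduce `strings`-style output to initial-capital "words".
# initial_caps_words is a hypothetical helper name, not from this thread.
initial_caps_words() {
    # -c complements the set and -s squeezes repeats, so every run of
    # non-letters becomes one newline and each alpha run gets its own line.
    LC_ALL=C tr -cs 'A-Za-z' '\n' |
        # Keep lines made of one capital followed only by lowercase letters.
        LC_ALL=C grep '^[A-Z][a-z]*$' |
        # Drop duplicates before handing the list to a spell checker.
        LC_ALL=C sort -u
}

# Possible use (the file name is a placeholder):
#   strings somefile | initial_caps_words
```

Note that [a-z]* also accepts a lone capital such as "H"; using [a-z][a-z]* instead would require at least two letters. For the aspell step, piping the result into "aspell list" prints the candidates aspell does not recognise, so the real words are whatever is left over.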