On Monday 21 December 2015 15:22, Chris Angelico wrote:

> On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <st...@pearwood.info>
> wrote:
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
[...]
> The first thing that comes to my mind is poking the string into a
> search engine and seeing how many results come back. You might need to
> do some preprocessing to recognize multi-word forms (maybe a handful
> of recognized cases like snake_case, CamelCase,
> CamelCasewiththeLittleWordsLeftUnchanged, etc),

I could possibly split the string into "words", based on CamelCase, spaces,
hyphens or underscores. That would cover most of the cases.

> How many of these keywords would you be looking up, and would a
> network transaction (a search engine API call) for each one be too
> expensive?

Tens or hundreds of thousands of strings, and yes, a network transaction
probably would be a bit much. I'd rather not have Google or Bing be a
dependency :-)

-- 
Steve
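
For illustration only, here is a rough sketch of the word-splitting idea
mentioned above: break on spaces, hyphens and underscores first, then on
CamelCase boundaries. The name split_words and both regular expressions are
assumptions made for this example, not code from the thread, and it only
handles ASCII letters and digits.

```python
import re

# Hypothetical helper, not from the original post.
# Step 1: runs of whitespace, underscores or hyphens separate chunks.
_SEPARATORS = re.compile(r"[\s_\-]+")

# Step 2: inside a chunk, pick out CamelCase pieces:
#   - an acronym that is followed by a capitalised word (HTTP in HTTPServer)
#   - an optionally capitalised lowercase run (Camel, case)
#   - any remaining run of capitals
#   - a run of digits
# Anything else (dots, non-ASCII letters, ...) is silently skipped.
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_words(name):
    """Return the list of word-like pieces found in name."""
    words = []
    for chunk in _SEPARATORS.split(name):
        words.extend(_CAMEL.findall(chunk))
    return words


if __name__ == "__main__":
    for sample in ("snake_case", "CamelCase", "HTTPServer", "my-file name"):
        print(sample, "->", split_words(sample))
```

This runs entirely locally, so it involves no network lookups, which matters
with tens or hundreds of thousands of strings. It does not help with the
CamelCasewiththeLittleWordsLeftUnchanged form: the lowercase run "Casewiththe"
comes back as a single piece, and separating it further would need a word
list or some other heuristic.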