Hello, I’m writing a word stemmer for a non-English language. A stemmer parses a word into its parts: prefixes, roots, and suffixes. The input word is at least a root word (an English example would be ‘cloud’), but it can be any combination of prefix(es) and a root (e.g., 'pre-nuptial'), a root and suffix(es) (‘cloudy’), or all three ('unidirection'). A sequence of more than one prefix in a word is considered one occurrence of a prefix, and similarly for complex suffixes; thus, ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The prefixes, roots, and suffixes are each stored in their own set.
The approach I am pursuing is to build the set of potential suffixes that the input word ends with. Assume, for simplicity, that the suffix set consists of #{-or, -er, -al, -ion, -ional, -able}. The input ‘directional’ would have the candidate suffix set #{-al, -ional}. Now, drop the longest candidate suffix (‘-ional’) from the input and check whether the remaining string (‘direct’) is a root; if it is, done. If not, try the next-longest suffix (‘-al’) in the candidate set. Prefixes will be processed similarly. Input words with both prefixes and suffixes will be fun to do ;) I’m having a hard time thinking through how to generate the candidate suffix set using set forms, and I’m beginning to think I have chosen an arduous path (for me). Thoughts? Thanks.
Tuba
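
For concreteness, the suffix steps described above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not a finished stemmer: the sample `suffixes` and `roots` sets and the function names `candidate-suffix-set`, `longest-first`, and `split-suffix` are hypothetical, and affixes are assumed to be stored as plain strings without the leading hyphen. It uses `clojure.set/select` to keep the candidates in a set, then orders them by length before trying each one.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

;; Hypothetical sample data; the real sets would hold the target
;; language's suffixes and roots.
(def suffixes #{"or" "er" "al" "ion" "ional" "able"})
(def roots    #{"direct" "cloud"})

(defn candidate-suffix-set
  "Returns the subset of `suffixes` that `word` ends with."
  [suffixes word]
  (set/select #(str/ends-with? word %) suffixes))

(defn longest-first
  "Orders a collection of strings from longest to shortest."
  [strings]
  (sort-by count > strings))

(defn split-suffix
  "Returns [root suffix] for `word`, trying candidate suffixes longest
  first. Returns [word nil] if the word is itself a root, and nil if no
  candidate leaves a known root."
  [roots suffixes word]
  (if (contains? roots word)
    [word nil]
    (some (fn [suffix]
            ;; Drop the suffix from the end and test the remainder.
            (let [stem (subs word 0 (- (count word) (count suffix)))]
              (when (contains? roots stem)
                [stem suffix])))
          (longest-first (candidate-suffix-set suffixes word)))))
```

With the sample data, `(candidate-suffix-set suffixes "directional")` gives `#{"al" "ional"}` and `(split-suffix roots suffixes "directional")` gives `["direct" "ional"]`. Prefixes could presumably be handled the same way with `str/starts-with?` and by dropping characters from the front of the word instead.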