Bengt Richter wrote:
> On Wed, 10 Aug 2005 16:51:55 +0200, Paolino <[EMAIL PROTECTED]> wrote:
>
>>I have a self organizing net which aim is clustering words.
>>Let's think the clustering is about their 2-grams set.
>>Words then are instances of this class.
>>
>>class clusterable(str):
>>  def __abs__(self):# the set of q-grams (to be calculated only once)
>>    return set([(self+self[0])[n:n+2] for n in range(len(self))])
>>  def __sub__(self,other): # the q-grams distance between 2 words
>>    set1=abs(self)
>>    set2=abs(other)
>>    return len(set1|set2)-len(set1&set2)
>>
>>I'm looking for the medium of a set of words, as the word which
>>minimizes the sum of the distances from those words.
>>
>>Aka:sum([medium-word for word in words])
>>
>>
>>Thanks for ideas, Paolino
>>
>
> Just wondering if this is a desired result:
>
>  >>> clusterable('banana')-clusterable('bananana')
>  0

Yes. The clustering is the main filter; it should (I hope) cut the space of
words down by one or two orders of magnitude. The final choices must then be
made with the expensive Levenshtein distance, or another edit-type distance.

Right now I'm using an empirical solution: I assume the best set has length L
equal to the average of the lengths, and then I take the L most frequent
2-grams from the frequency distribution of 2-grams. I have no idea whether
this is the right set, and I'm sure that set is not a word, since there is no
way to chain those 2-grams to form one.

Thanks for comments

Paolino
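
For reference, a rough sketch of the heuristic described above, next to the brute-force alternative of picking the best actual word. The function names are made up for illustration, "the lengths" is read here as the word lengths, and L is rounded with Python's round(); all of these are assumptions, not necessarily what the real code does.

```python
from collections import Counter


def qgrams(word, q=2):
    """Set of circular q-grams; for q=2 this matches clusterable.__abs__."""
    padded = word + word[:q - 1]
    return {padded[n:n + q] for n in range(len(word))}


def qgram_distance(a, b):
    """len(A | B) - len(A & B), i.e. the size of the symmetric difference."""
    return len(qgrams(a) ^ qgrams(b))


def heuristic_center(words):
    """The L most frequent 2-grams, with L = rounded mean word length (assumed)."""
    target_len = round(sum(len(w) for w in words) / len(words))
    counts = Counter(g for w in words for g in qgrams(w))
    return {g for g, _ in counts.most_common(target_len)}


def total_distance(center, words):
    """Sum of the distances from a 2-gram set to every word."""
    return sum(len(center ^ qgrams(w)) for w in words)


def medoid(words):
    """The input word with the smallest summed distance to all the words."""
    return min(words, key=lambda w: total_distance(qgrams(w), words))
```

The set returned by heuristic_center() is generally not the 2-gram set of any word, which is the problem mentioned above. medoid() always returns a real word, but it compares every word with every other word, so it is quadratic in the number of words.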