On May 22, 2013, at 7:08 PM, Karl Wettin <karl.wet...@kodapan.se> wrote:

>> * Use a filter after ASCIIFoldingFilter that discriminates all uses of ae, oe, 
>> oo, and other combinations of double vowels, just keeping the first one.
> 
> I ended up with that solution.
> 
> https://issues.apache.org/jira/browse/LUCENE-5013

Interesting problem. Perhaps you could generalize your solution a bit. In 
German, for example, one can write 'ue' in place of 'ü', and so on. So it looks 
like what you are after is folding double vowels, irrespective of how they 
got there.

So, assuming something along the lines of Sean M. Burke's Unidecode [1] for the 
ASCII transliteration, all that is left is to fold double vowels, e.g.:

-- Assumes a Unidecode( aString ) function is in scope, e.g. a Lua port of [1].
local function fold( aString )
  -- Transliterate to ASCII, lowercase, then keep the first vowel of each pair.
  -- The outer parentheses drop gsub's second result (the substitution count).
  return ( Unidecode( aString ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) )
end

print( 1, fold( 'blåbærsyltetøj' ) )
print( 2, fold( 'blåbärsyltetöj' ) )
print( 3, fold( 'blaabaarsyltetoej' ) )
print( 4, fold( 'blabarsyltetoj' ) )
print( 5, fold( 'Räksmörgås' ) )
print( 6, fold( 'Göteborg' ) )
print( 7, fold( 'Gøteborg' ) )
print( 8, fold( 'Über' ) )
print( 9, fold( 'ueber' ) )
print( 10, fold( 'uber' ) )
print( 11, fold( 'uuber' ) )

> 1     blabarsyltetoj
> 2     blabarsyltetoj
> 3     blabarsyltetoj
> 4     blabarsyltetoj
> 5     raksmorgas
> 6     goteborg
> 7     goteborg        
> 8     uber    
> 9     uber    
> 10    uber    
> 11    uber    
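
For anyone without a Unidecode port at hand, here is a minimal, self-contained
sketch of the same idea in plain Lua. The TRANSLITERATION table is a
hypothetical stand-in for Unidecode that covers only the characters used above,
and transliterate and foldVowels are names made up for this example. It also
swaps in the pattern '([aeiou])[aeiou]+', which collapses a vowel run of any
length to its first vowel, whereas the pattern above works on pairs. The source
file is assumed to be saved as UTF-8.

```lua
-- Hypothetical stand-in for Unidecode: a tiny hand-written table that covers
-- only the characters used in this thread (assumption, not the real library).
local TRANSLITERATION = {
  ['å'] = 'a', ['ä'] = 'a', ['æ'] = 'ae',
  ['ö'] = 'o', ['ø'] = 'o', ['ü'] = 'u',
  ['Å'] = 'A', ['Ä'] = 'A', ['Æ'] = 'AE',
  ['Ö'] = 'O', ['Ø'] = 'O', ['Ü'] = 'U',
}

-- Look up each multi-byte UTF-8 sequence in the table. gsub keeps the
-- original text wherever the table has no entry for the match.
local function transliterate( aString )
  return ( aString:gsub( '[\194-\244][\128-\191]*', TRANSLITERATION ) )
end

-- Transliterate, lowercase, then collapse each run of two or more ASCII
-- vowels to its first vowel. The parentheses drop gsub's match count.
local function foldVowels( aString )
  return ( transliterate( aString ):lower():gsub( '([aeiou])[aeiou]+', '%1' ) )
end

print( foldVowels( 'blåbærsyltetøj' ) )
print( foldVowels( 'blaabaarsyltetoej' ) )
print( foldVowels( 'Über' ) )
print( foldVowels( 'ueber' ) )
```

Note that this folds every run of vowels, whether or not it came from a
transliterated character, so 'uuber' and 'ueber' end up as the same term as
'Über'.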



[1] http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm


