Hi Jeff,

You're right about that point. Let me redefine: I would like to drop all tokens 
which are neither the stemmed nor the unstemmed version of a known word. Would 
it be possible to put a wordlist as a filter ahead of the stemming? Or do you 
know of a good English lexeme list that could be used as a filter after 
stemming?
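
For concreteness, here is a rough sketch of what I have in mind, assuming 
Hunspell-style files en_us.dict and en_us.affix are available in the server's 
tsearch_data directory (the file names are only placeholders):

-- Sketch: an Ispell dictionary built from a known-word list plus affix file.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = en_us,
    AffFile   = en_us,
    StopWords = english
);

-- Map plain-word token types to the Ispell dictionary only. Tokens that the
-- dictionary does not recognize are then discarded, because no later
-- dictionary in the chain is left to accept them.
ALTER TEXT SEARCH CONFIGURATION english_led
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_ispell;

If I understand the dictionary chaining correctly, this normalizes to Ispell 
lemmas rather than snowball stems, so it is not exactly a filter around 
english_stem, but it would drop every token the wordlist does not know.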

Thanks,
Christoph 

> On 23. Nov 2019, at 16:27, Jeff Janes <jeff.ja...@gmail.com> wrote:
> 
> On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <m...@goessmann.io> wrote:
> Hi everybody,
> 
> I am trying to get all the lexemes for a text using to_tsvector(). But I want 
> only words that english_stem -- the built-in snowball dictionary -- can 
> handle to appear in the final tsvector. Since snowball dictionaries only 
> remove stop words, but keep the words that they cannot stem, I don't see an 
> easy option to do this. Do you have any ideas?
> 
> I went ahead with creating a new configuration:
> 
> -- add new configuration english_led
> CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = 
> pg_catalog.english);
> 
> -- dropping any words that contain numbers already in the parser
> ALTER TEXT SEARCH CONFIGURATION english_led
>     DROP MAPPING FOR numword;
> 
> EXAMPLE:
> 
> SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt somejnk words');
>                    to_tsvector                    
> --------------------------------------------------
>  'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
> 
> In this tsvector, I would like 'somejnk' and 'tt' not to be included.
> 
> I don't think the question is well defined. It will happily stem 
> 'somejnking' to 'somejnk'; doesn't that mean that it **can** handle it? The 
> fact that 'somejnk' itself wasn't altered during stemming doesn't mean it 
> wasn't handled, just like 'test' wasn't altered during stemming.
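> 
> For instance, with the stock 'english' configuration, something like this 
> should show the same stemming:
> 
> SELECT to_tsvector('english', 'somejnking');
> -- expected, per the behaviour above: 'somejnk':1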
> 
> Cheers,
> 
> Jeff
