I think it'd be interesting to also investigate using TypeAttribute [1] together with TypeTokenFilter [2].
Regards,
Tommaso

[1] : https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
[2] : https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html

On Wed, Jun 14, 2017 at 11:33 PM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello Erick, no worries, I recognize you two.
>
> I will take a look at your references tomorrow. Although I am still fine
> with eight bits, I cannot spare more than one. If Lucene allows us to pass
> longer bitsets to the BytesRef, it would be awesome and easy to encode.
>
> Thanks!
> Markus
>
> -----Original message-----
> > From: Erick Erickson <erickerick...@gmail.com>
> > Sent: Wednesday 14th June 2017 23:29
> > To: java-user <java-user@lucene.apache.org>
> > Subject: Re: Using POS payloads for chunking
> >
> > Markus:
> >
> > I don't believe that payloads are limited in size at all. LUCENE-7705
> > was done in part because there _was_ a hard-coded 256 limit for some
> > of the tokenizers. Payloads (at least in recent versions) are just
> > some bytes, and (with LUCENE-7705) can be arbitrarily long.
> >
> > Of course, if you put anything other than a number in there, you have
> > to provide your own decoders and the like to make sense of your
> > payload....
> >
> > Best,
> > Erick (Erickson, not Hatcher)
> >
> > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> > > Hello Erik,
> > >
> > > Using Solr, though most of the moving parts are actually Lucene, we
> > > have a CharFilter that adds treebank tags to whitespace-delimited
> > > words using a delimiter; further down the chain, a TokenFilter
> > > receives these tokens with the delimiter and the POS tag. It won't
> > > work with some Tokenizers, and if you put it before WDF, it'll split
> > > the token, as you know. That TokenFilter is configured with a
> > > tab-delimited mapping config containing <POS-tag>\t<bitset>, and
> > > there the bitset is encoded as payload.
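A minimal sketch of this kind of single-byte bitset encoding (the tag ids, bit layout, and class name here are hypothetical illustrations, not Markus's actual mapping):

```java
import java.util.Map;

// Hypothetical payload layout: the low 6 bits hold a reduced-treebank tag id,
// the two high bits are flags (e.g. compound/subword, stemmed/unstemmed).
public class PosPayload {
    static final int FLAG_SUBWORD = 1 << 6;
    static final int FLAG_STEMMED = 1 << 7;

    // Invented tag ids, as might come from a <POS-tag>\t<bitset> config file.
    static final Map<String, Integer> TAGS = Map.of("NN", 1, "JJ", 2, "DT", 3, "IN", 4);

    static byte encode(String tag, boolean subword, boolean stemmed) {
        int bits = TAGS.get(tag);
        if (subword) bits |= FLAG_SUBWORD;
        if (stemmed) bits |= FLAG_STEMMED;
        return (byte) bits; // this byte would become the token's payload
    }

    // Decoding: mask off the flag bits to recover the tag id.
    static int tagBits(byte payload) { return payload & 0x3F; }
    static boolean isSubword(byte payload) { return (payload & FLAG_SUBWORD) != 0; }
    static boolean isStemmed(byte payload) { return (payload & FLAG_STEMMED) != 0; }
}
```

In a real analysis chain the byte would be wrapped in a BytesRef and set on the token's PayloadAttribute; the sketch only shows the bit twiddling.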
> > >
> > > Our edismax extension rewrites queries to payload-supported
> > > equivalents; this is quite trivial, except for all those API changes
> > > in Lucene you have to put up with. Finally, a BM25 extension has,
> > > amongst others, a mapping of bitset to score. Nouns get a bonus,
> > > prepositions and other useless pieces get a punishment, etc.
> > >
> > > Payloads are really great things to use! We also use them to
> > > distinguish between compounds and their subwords; among others, we
> > > serve Dutch- and German-speaking countries. And stemmed words and
> > > non-stemmed words. Although the latter also benefit from IDF
> > > statistics, payloads just help to control boosting more precisely,
> > > regardless of your corpus.
> > >
> > > I still need to take a look at your recent payload QParsers for Solr
> > > and see how different, probably better, they are compared to our
> > > older implementations. Although we don't use a PayloadTermQParser
> > > equivalent for regular search, we do use it for scoring
> > > recommendations via delimited multi-valued fields. Payloads are
> > > versatile!
> > >
> > > The downside of payloads is that they are limited to 8 bits. Although
> > > we can easily fit our reduced treebank in there, we also use single
> > > bits to signal compound/subword, stemmed/unstemmed, and some others.
> > >
> > > Hope this helps.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Erik Hatcher <erik.hatc...@gmail.com>
> > > > Sent: Wednesday 14th June 2017 23:03
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: Using POS payloads for chunking
> > > >
> > > > Markus - how are you encoding payloads as bitsets and using them
> > > > for scoring? Curious to see how folks are leveraging them.
> > > >
> > > > Erik
> > > >
> > > > > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > We use POS-tagging too, and encode the tags as payload bitsets
> > > > > for scoring, which is, as far as I know, the only possibility
> > > > > with payloads.
> > > > >
> > > > > So, instead of encoding them as payloads, why not index your
> > > > > treebank POS tags as tokens on the same position, like synonyms?
> > > > > If you do that, you can use span and phrase queries to find
> > > > > chunks of multiple POS tags.
> > > > >
> > > > > This would be the first approach I can think of. Treating them
> > > > > as regular tokens enables you to use regular search for them.
> > > > >
> > > > > Regards,
> > > > > Markus
> > > > >
> > > > >
> > > > > -----Original message-----
> > > > > > From: José Tomás Atria <jtat...@gmail.com>
> > > > > > Sent: Wednesday 14th June 2017 22:29
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Using POS payloads for chunking
> > > > > >
> > > > > > Hello!
> > > > > >
> > > > > > I'm not particularly familiar with Lucene's search API (as I've
> > > > > > been using the library mostly as a dumb index rather than a
> > > > > > search engine), but I am almost certain that, using its payload
> > > > > > capabilities, it would be trivial to implement a regular
> > > > > > chunker to look for patterns in sequences of payloads.
> > > > > >
> > > > > > (Trying not to be too pedantic: a regular chunker looks for
> > > > > > 'chunks' based on part-of-speech tags, e.g. noun phrases can be
> > > > > > searched for with patterns like "(DT)?(JJ)*(NN|NP)+", that is,
> > > > > > an optional determiner and zero or more adjectives preceding a
> > > > > > bunch of nouns, etc.)
> > > > > >
> > > > > > Assuming my index has POS tags encoded as payloads for each
> > > > > > position, how would one search for such patterns, irrespective
> > > > > > of terms? I started studying the spans search API, as this
> > > > > > seemed like the natural place to start, but I quickly got lost.
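Outside of Lucene, the chunking pattern itself is easy to prototype over a plain tag sequence; a sketch, assuming tags arrive as one space-separated string (class name and tag inventory are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Prototype of the regular chunker from the question: find noun phrases in a
// space-separated POS tag sequence with the pattern (DT)?(JJ)*(NN|NP)+.
public class NounPhraseChunker {
    // Tags are matched as whole space-delimited tokens; the trailing group
    // allows one or more nouns, mirroring the (NN|NP)+ in the pattern.
    static final Pattern NP = Pattern.compile("(DT )?(JJ )*(NN|NP)( (NN|NP))*");

    static List<String> chunks(String tagSequence) {
        List<String> out = new ArrayList<>();
        Matcher m = NP.matcher(tagSequence);
        while (m.find()) {
            out.add(m.group().trim());
        }
        return out;
    }
}
```

For example, `chunks("DT JJ NN IN DT NN")` finds two noun-phrase chunks. A real implementation over an index would of course have to walk positions (via spans or term vectors) rather than a string, but the automaton side of the problem is just this.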
> > > > > >
> > > > > > Any tips would be extremely appreciated. (Or references to
> > > > > > this kind of thing; I'm sure someone must have tried something
> > > > > > similar before...)
> > > > > >
> > > > > > thanks!
> > > > > > ~jta
> > > > > > --
> > > > > >
> > > > > > sent from a phone. please excuse terseness and tpyos.
> > > > > >
> > > > > > enviado desde un teléfono. por favor disculpe la parquedad y
> > > > > > los erroers.
> > > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
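The bitset-to-score mapping Markus describes for his BM25 extension (nouns boosted, prepositions punished) can be sketched as a plain lookup; the tag ids, boost values, and class name here are invented, and a real implementation would hook this into a Lucene Similarity rather than a standalone method:

```java
import java.util.Map;

// Sketch of a payload-to-boost lookup in the spirit of the BM25 extension
// described above: content words get a bonus, function words a punishment.
public class PayloadBoost {
    // Hypothetical reduced-treebank tag ids.
    static final int NOUN = 1, ADJECTIVE = 2, DETERMINER = 3, PREPOSITION = 4;

    // Hypothetical boost table: >1 rewards, <1 punishes, unknown tags neutral.
    static final Map<Integer, Float> BOOSTS =
            Map.of(NOUN, 1.5f, ADJECTIVE, 1.1f, DETERMINER, 0.7f, PREPOSITION, 0.5f);

    static float boostFor(byte payload) {
        // Assume the low 6 bits of the single payload byte carry the tag id.
        int tag = payload & 0x3F;
        return BOOSTS.getOrDefault(tag, 1.0f);
    }
}
```

The returned factor would multiply the term's BM25 score for that position, which is what makes the boosting corpus-independent, unlike IDF.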