Hi Markus, thanks for your response! Now I feel stupid: that is clearly a much simpler approach, and it has the added benefit that it would not require me to meddle with the scoring process, which I'm still a bit terrified of. Thanks for the tip.

I guess the question is still valid, though: how would one take payloads into account when scoring entire spans? Does this make sense at all? Any links to a more-or-less straightforward example?

On the length of payloads: I understood that you have other restrictions, but payloads take a BytesRef as their value, so you can encode arbitrary data in them as long as you encode and decode it properly. E.g. you could encode the long array that backs a FixedBitSet as a BytesRef and pass that, though I'm not sure it would be efficient unless you have at least 64 flags.

thanks!
jta

On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello,
>
> We use POS-tagging too, and encode them as payload bitsets for scoring,
> which is, as far as I know, the only possibility with payloads.
>
> So, instead of encoding them as payloads, why not index your treebank's
> POS-tags as tokens on the same position, like synonyms. If you do that, you
> can use spans and phrase queries to find chunks of multiple POS-tags.
>
> This would be the first approach I can think of. Treating them as regular
> tokens enables you to use regular search for them.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: José Tomás Atria <jtat...@gmail.com>
> > Sent: Wednesday 14th June 2017 22:29
> > To: java-user@lucene.apache.org
> > Subject: Using POS payloads for chunking
> >
> > Hello!
> >
> > I'm not particularly familiar with lucene's search api (as I've been using
> > the library mostly as a dumb index rather than a search engine), but I am
> > almost certain that, using its payload capabilities, it would be trivial to
> > implement a regular chunker to look for patterns in sequences of payloads.
> >
> > (trying not to be too pedantic, a regular chunker looks for 'chunks' based
> > on part-of-speech tags, e.g. noun phrases can be searched for with patterns
> > like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and zero or
> > more adjectives preceding a bunch of nouns, etc)
> >
> > Assuming my index has POS tags encoded as payloads for each position, how
> > would one search for such patterns, irrespective of terms? I started
> > studying the spans search API, as this seemed like the natural place to
> > start, but I quickly got lost.
> >
> > Any tips would be extremely appreciated. (or references to this kind of
> > thing, I'm sure someone must have tried something similar before...)
> >
> > thanks!
> > ~jta
> > --
> >
> > sent from a phone. please excuse terseness and tpyos.
> >
> > enviado desde un teléfono. por favor disculpe la parquedad y los erroers.

--
sent from a phone. please excuse terseness and tpyos.
enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
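
PS: for what it's worth, here is a rough sketch of the encode/decode step I had in mind for packing the long array behind a bitset into payload bytes. It uses only the JDK: java.util.BitSet stands in for Lucene's bitset, and the class and method names (PayloadBits, encode, decode) are made up for illustration. In Lucene I assume one would take the words from the bitset and wrap the resulting byte[] in a BytesRef, but I have not tested that part.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

class PayloadBits {
    // Pack the words backing a bitset into bytes: 8 bytes per long, big-endian.
    public static byte[] encode(long[] words) {
        ByteBuffer buf = ByteBuffer.allocate(words.length * Long.BYTES);
        for (long w : words) {
            buf.putLong(w);
        }
        return buf.array();
    }

    // Read the words back from a slice of a byte array. The offset/length
    // parameters mirror how a payload is usually a view into a larger buffer.
    public static long[] decode(byte[] bytes, int offset, int length) {
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, length);
        long[] words = new long[length / Long.BYTES];
        for (int i = 0; i < words.length; i++) {
            words[i] = buf.getLong();
        }
        return words;
    }

    public static void main(String[] args) {
        BitSet flags = new BitSet();
        flags.set(3);   // hypothetical flag numbers, e.g. one bit per POS tag
        flags.set(70);  // forces a second 64-bit word
        byte[] payload = encode(flags.toLongArray());
        BitSet back = BitSet.valueOf(decode(payload, 0, payload.length));
        System.out.println(payload.length + " bytes, round trip ok: " + flags.equals(back));
    }
}
```

Each word costs 8 bytes per position, so for a handful of flags a single byte or int would be smaller.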