José:

Do note that, while the byte array itself isn't limited, prior to LUCENE-7705 most of the tokenizers you would use capped each incoming token at 256 characters. This is not a _Lucene_ limitation at a low level; rather, if you're indexing data with a delimited payload (say abc|your_payload_here), the tokenizer would chop the token off once the whole thing (term, delimiter and payload together) reached 256 chars.
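A minimal sketch of that truncation, in plain Python rather than Lucene. `truncate_token`, `split_payload` and `MAX_TOKEN_LEN` are hypothetical names, and the 256-character cap is simply the figure quoted above, not a value read from any Lucene class.

```python
from typing import Tuple

MAX_TOKEN_LEN = 256  # assumed cap, taken from the figure quoted above


def truncate_token(token: str, max_len: int = MAX_TOKEN_LEN) -> str:
    """Mimic a tokenizer that cuts every token at max_len characters."""
    return token[:max_len]


def split_payload(token: str, delimiter: str = "|") -> Tuple[str, str]:
    """Split 'term|payload' into its term and payload parts."""
    term, _, payload = token.partition(delimiter)
    return term, payload


# "abc|" is 4 characters, followed by 512 characters of payload.
raw = "abc|" + "x" * 512
term, payload = split_payload(truncate_token(raw))
print(term, len(payload))  # abc 252
```

The payload loses everything past character 252 because the term and the delimiter count against the same cap.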
Hmmm, still confusing. Say the input to the analysis chain was

abc|512_bytes_of_payload_data

The tokenizer would give you

abc|first_252_bytes

But if you're using lower-level Lucene calls directly, that limit doesn't apply.

Best,
Erick

On Thu, Jun 15, 2017 at 8:21 AM, José Tomás Atria <jtat...@gmail.com> wrote:
> Hi Markus, thanks for your response!
>
> Now I feel stupid, that is clearly a much simpler approach and it has the
> added benefits that it would not require me to meddle into the scoring
> process, which I'm still a bit terrified of. Thanks for the tip.
>
> I guess the question is still valid though? i.e. how would one take into
> account payloads for scoring entire spans? Does this make sense at all? Any
> links to a more-or-less straightforward example?
>
> On the length of payloads: I understood that you have other restrictions,
> but payloads take a bytesref as value, so you can encode arbitrary data in
> them as long as you encode and decode properly. E.g. you could encode the
> long array that backs a fixed bitset as a bytesref and pass that, though
> I'm not sure it would be efficient unless you have at least 64 flags.
>
> thanks!
> jta
>
> On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> Hello,
>>
>> We use POS-tagging too, and encode them as payload bitsets for scoring,
>> which is, as far as I know, the only possibility with payloads.
>>
>> So, instead of encoding them as payloads, why not index your treebank's
>> POS-tags as tokens on the same position, like synonyms? If you do that, you
>> can use spans and phrase queries to find chunks of multiple POS-tags.
>>
>> This would be the first approach I can think of. Treating them as regular
>> tokens enables you to use regular search for them.
>>
>> Regards,
>> Markus
>>
>> -----Original message-----
>> > From: José Tomás Atria <jtat...@gmail.com>
>> > Sent: Wednesday 14th June 2017 22:29
>> > To: java-user@lucene.apache.org
>> > Subject: Using POS payloads for chunking
>> >
>> > Hello!
>> >
>> > I'm not particularly familiar with lucene's search api (as I've been
>> > using the library mostly as a dumb index rather than a search engine),
>> > but I am almost certain that, using its payload capabilities, it would
>> > be trivial to implement a regular chunker to look for patterns in
>> > sequences of payloads.
>> >
>> > (trying not to be too pedantic, a regular chunker looks for 'chunks'
>> > based on part-of-speech tags, e.g. noun phrases can be searched for
>> > with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional
>> > determiner and zero or more adjectives preceding a bunch of nouns, etc)
>> >
>> > Assuming my index has POS tags encoded as payloads for each position,
>> > how would one search for such patterns, irrespective of terms? I started
>> > studying the spans search API, as this seemed like the natural place to
>> > start, but I quickly got lost.
>> >
>> > Any tips would be extremely appreciated. (or references to this kind of
>> > thing, I'm sure someone must have tried something similar before...)
>> >
>> > thanks!
>> > ~jta
>> > --
>> >
>> > sent from a phone. please excuse terseness and tpyos.
>> >
>> > enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
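
For reference, the chunk pattern from the original question can be tried outside Lucene. The following is a minimal sketch, assuming the POS tags of a sentence are already available as a plain Python list. It does not use payloads or the spans API, and `find_chunks` and the sample sentence are made up for illustration. The tags are written as a space-terminated string so that an ordinary regular expression can only match whole tags, and character offsets are then converted back to token positions.

```python
import re
from typing import List, Tuple

# "(DT)?(JJ)*(NN|NP)+" rewritten over space-terminated tags.
# (?<!\S) forces a match to start at a tag boundary, so "DT " inside
# "PDT " cannot start a match; the trailing space after each tag keeps
# "NN" from matching the front of "NNS".
NOUN_PHRASE = re.compile(r"(?<!\S)(?:DT )?(?:JJ )*(?:(?:NN|NP) )+")


def find_chunks(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token positions, end exclusive, of each chunk."""
    text = "".join(tag + " " for tag in tags)
    chunks = []
    for match in NOUN_PHRASE.finditer(text):
        # Every tag contributes exactly one space, so counting spaces
        # before an offset gives the number of tags before it.
        start = text.count(" ", 0, match.start())
        end = text.count(" ", 0, match.end())
        chunks.append((start, end))
    return chunks


tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
tags = ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"]

for start, end in find_chunks(tags):
    print(start, end, " ".join(tokens[start:end]))
# 0 4 the quick brown fox
# 6 9 the lazy dog
```

Inside Lucene the same pattern would have to be expressed against the index, for example with span queries over POS tags indexed as tokens at the same position, as suggested in the thread; this sketch only shows what the pattern is meant to select.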