Re: Using POS payloads for chunking

Erick Erickson Thu, 15 Jun 2017 09:10:26 -0700

José:

Do note that, while the byte array itself isn't limited, prior to
LUCENE-7705 most of the tokenizers you would use limited the incoming
token to 256 characters at most. This is not at all a _Lucene_
limitation at a low level; rather, if you're indexing data with a
delimited payload (say abc|your_payload_here), the tokenizer would chop
it off when the whole thing reached 256 chars.


Hmmm, still confusing. Say the input to the analysis chain was
abc|512_bytes_of_payload_data
The tokenizer would give you

abc|first_252_bytes

But if you're using lower-level Lucene calls directly, that limit doesn't apply.
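
To make the arithmetic concrete, here is a minimal sketch in plain Python (not Lucene code) that models only the cut-off described above. The 256-character limit and the | delimiter come from the description; the function names are made up for illustration, and a real tokenizer may behave differently in its details.

```python
MAX_TOKEN_LEN = 256  # assumed limit, taken from the description above
DELIMITER = "|"      # delimiter between the term and its payload


def tokenize_with_limit(raw, max_len=MAX_TOKEN_LEN):
    """Model a tokenizer that cuts a token off at max_len characters."""
    return raw[:max_len]


def split_payload(token, delimiter=DELIMITER):
    """Split 'term|payload' into its two parts, as a delimited-payload step would."""
    term, _, payload = token.partition(delimiter)
    return term, payload


raw = "abc" + DELIMITER + "x" * 512   # abc|512_bytes_of_payload_data
token = tokenize_with_limit(raw)      # cut off at 256 characters in total
term, payload = split_payload(token)

print(term, len(payload))  # abc 252
```

The term and the delimiter use up 4 of the 256 characters, which is why only 252 characters of payload survive.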

Best,
Erick

On Thu, Jun 15, 2017 at 8:21 AM, José Tomás Atria <jtat...@gmail.com> wrote:
> Hi Markus, thanks for your response!
>
> Now I feel stupid, that is clearly a much simpler approach and it has the
> added benefit that it would not require me to meddle with the scoring
> process, which I'm still a bit terrified of. Thanks for the tip.
>
> I guess the question is still valid though? i.e. how would one take into
> account payloads for scoring entire spans? Does this make sense at all? Any
> links to a more-or-less straightforward example?
>
> On the length of payloads: I understood that you have other restrictions,
> but payloads take a BytesRef as value, so you can encode arbitrary data in
> them as long as you encode and decode properly. E.g. you could encode the
> long array that backs a fixed bitset as a BytesRef and pass that, though
> I'm not sure it would be efficient unless you have at least 64 flags.
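
A minimal sketch of the encode/decode round trip described here, in plain Python rather than against the Lucene API. The tag inventory and the function names are hypothetical; the only point is that a set of flags can be packed into a few bytes and recovered exactly.

```python
POS_TAGS = ["DT", "JJ", "NN", "NP", "VB"]          # hypothetical tag inventory
BIT = {tag: i for i, tag in enumerate(POS_TAGS)}   # one bit per tag


def encode_flags(tags):
    """Pack a set of tags into a fixed-width byte string (one bit per tag)."""
    mask = 0
    for tag in tags:
        mask |= 1 << BIT[tag]
    n_bytes = (len(POS_TAGS) + 7) // 8   # round up to whole bytes
    return mask.to_bytes(n_bytes, "big")


def decode_flags(payload):
    """Recover the set of tags from the packed bytes."""
    mask = int.from_bytes(payload, "big")
    return {tag for tag, i in BIT.items() if mask & (1 << i)}


payload = encode_flags({"JJ", "NN"})
print(payload, decode_flags(payload) == {"JJ", "NN"})  # b'\x06' True
```

With only five flags the whole set fits in a single byte, so a 64-bit long per position would mostly be padding.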
>
> thanks!
> jta
>
>
>
> On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
>> Hello,
>>
>> We use POS-tagging too, and encode the tags as payload bitsets for scoring,
>> which is, as far as I know, the only possibility with payloads.
>>
>> So, instead of encoding them as payloads, why not index your treebank's
>> POS tags as tokens at the same position, like synonyms? If you do that, you
>> can use spans and phrase queries to find chunks of multiple POS tags.
>>
>> This would be the first approach I can think of. Treating them as regular
>> tokens enables you to use regular search for them.
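
The idea of stacking tags on the same position as the words can be sketched with a toy positional index in plain Python. This is an illustration only, not Lucene code: the "pos:" prefix, the function names and the sample sentence are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(tagged):
    """Map each token (word or POS tag) to the positions where it occurs."""
    index = defaultdict(set)
    for pos, (word, tag) in enumerate(tagged):
        index[word.lower()].add(pos)
        index["pos:" + tag].add(pos)   # tag shares the word's position, like a synonym
    return index


def phrase_positions(index, terms):
    """Return start positions where the terms occur at consecutive positions."""
    starts = index.get(terms[0], set())
    return sorted(
        s for s in starts
        if all(s + i in index.get(t, set()) for i, t in enumerate(terms))
    )


sentence = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
            ("fox", "NN"), ("jumps", "VB")]
index = build_index(sentence)

print(phrase_positions(index, ["pos:DT", "pos:JJ", "pos:JJ", "pos:NN"]))  # [0]
print(phrase_positions(index, ["pos:JJ", "fox"]))                         # [2]
```

Because words and tags live in the same position space, a single phrase can mix them freely, as the second query shows.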
>>
>> Regards,
>> Markus
>>
>>
>>
>> -----Original message-----
>> > From: José Tomás Atria <jtat...@gmail.com>
>> > Sent: Wednesday 14th June 2017 22:29
>> > To: java-user@lucene.apache.org
>> > Subject: Using POS payloads for chunking
>> >
>> > Hello!
>> >
>> > I'm not particularly familiar with Lucene's search API (as I've been using
>> > the library mostly as a dumb index rather than a search engine), but I am
>> > almost certain that, using its payload capabilities, it would be trivial to
>> > implement a regular chunker to look for patterns in sequences of payloads.
>> >
>> > (trying not to be too pedantic, a regular chunker looks for 'chunks' based
>> > on part-of-speech tags, e.g. noun phrases can be searched for with patterns
>> > like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and zero or
>> > more adjectives preceding one or more nouns, etc.)
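
For readers unfamiliar with regular chunkers, a minimal sketch of the pattern above applied to an in-memory list of tags, in plain Python. It does not touch an index; the <TAG> encoding and the function name are assumptions chosen so that an ordinary regular expression can run over a tag sequence.

```python
import re

# The noun-phrase pattern from the message, with each tag wrapped in <...>
# so that tag boundaries stay unambiguous inside one string.
PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>|<NP>)+")


def chunk(tags):
    """Return (start, end) token spans, end exclusive, matching the pattern."""
    encoded = "".join(f"<{t}>" for t in tags)
    # Map the character offset where each tag starts to its token index.
    starts = {}
    offset = 0
    for i, t in enumerate(tags):
        starts[offset] = i
        offset += len(t) + 2          # tag plus the two angle brackets
    starts[offset] = len(tags)        # offset just past the last tag
    return [(starts[m.start()], starts[m.end()]) for m in PATTERN.finditer(encoded)]


tags = ["DT", "JJ", "JJ", "NN", "VB", "DT", "NN", "NN"]
print(chunk(tags))  # [(0, 4), (5, 8)]
```

Each span covers one noun phrase: tokens 0 to 3 (DT JJ JJ NN) and tokens 5 to 7 (DT NN NN).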
>> >
>> > Assuming my index has POS tags encoded as payloads for each position, how
>> > would one search for such patterns, irrespective of terms? I started
>> > studying the spans search API, as this seemed like the natural place to
>> > start, but I quickly got lost.
>> >
>> > Any tips would be extremely appreciated. (or references to this kind of
>> > thing, I'm sure someone must have tried something similar before...)
>> >
>> > thanks!
>> > ~jta
>> > --
>> >
>> > sent from a phone. please excuse terseness and tpyos.
>> >
>> > enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>

