[ 
https://issues.apache.org/jira/browse/LUCENE-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724976#comment-14724976
 ] 

Adrien Grand commented on LUCENE-6764:
--------------------------------------

bq. Payloads should be something small like a byte or two. I don't even think 
they should be variable length: it's a trap that adds additional per-position 
noise. We should not encourage putting the contents of Moby Dick per position 
nor should we suffer the complexity hassles.

Of course you want payloads to be small. My point was that there is probably 
only a small set of unique payloads, so we could likely store each payload in 
a couple of _bits_ instead of one or two entire _bytes_.
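
To make the bits-versus-bytes point concrete, here is a minimal sketch of dictionary encoding. It is not Lucene code: the class {{PayloadDictionary}} and its methods are hypothetical names chosen for illustration, and payloads are modelled as strings for simplicity. Each distinct payload gets a dense ordinal, and the number of bits needed per position depends only on how many distinct payloads exist.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not a Lucene API): map payloads to small dense ordinals. */
final class PayloadDictionary {
    // Distinct payload -> ordinal assigned in order of first appearance.
    private final Map<String, Integer> ordinals = new HashMap<>();
    // Ordinal -> payload, used for decoding.
    private final List<String> values = new ArrayList<>();

    /** Returns the ordinal for this payload, assigning a new one on first sight. */
    int encode(String payload) {
        Integer ord = ordinals.get(payload);
        if (ord == null) {
            ord = values.size();
            ordinals.put(payload, ord);
            values.add(payload);
        }
        return ord;
    }

    /** Returns the payload that was assigned the given ordinal. */
    String decode(int ordinal) {
        return values.get(ordinal);
    }

    /** Bits needed per position to address every distinct payload seen so far. */
    int bitsPerValue() {
        int n = values.size();
        // ceil(log2(n)); a single distinct payload needs no bits at all.
        return n <= 1 ? 0 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

For example, if the payloads seen are the tags b, h1, b, em, h1, title, b, there are four distinct values, so {{bitsPerValue()}} returns 2 and each position could be stored in 2 bits plus a small shared dictionary.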

> Payloads should be compressed
> -----------------------------
>
>                 Key: LUCENE-6764
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6764
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> I think we should at least try to do something simple, e.g. deduplicate or 
> apply simple LZ77 compression. For instance, if you use enclosing HTML tags to 
> give different weights to individual terms, there might be lots of 
> repetitions, as there are not that many unique HTML tags.


