Hi guys, I have just recently (re-)joined the list. I have an issue with indexing; I hope someone can help me with it.
The use case is that some of the fields in my documents are made up of term:frequency pairs. What I am doing right now is to expand these with a TokenFilter, so that e.g. for "dog:3 cat:2" I return "dog dog dog cat cat", and index that (a simplified sketch of the filter is at the end of this mail). The problem is that when these fields contain real data (anchor text, references, etc.), the resulting field text for some documents can be really huge; so huge, in fact, that I get OutOfMemory exceptions.

I would be grateful if someone could tell me how this issue could be solved. I could circumvent the problem by capping the frequency I allow, or by using its logarithm instead, but it would be nice to know whether there is a proper solution. I have had a look at the code, but got lost among all the different Consumers. Here are a few questions I have come up with, though the real solution might be something entirely different:

1. Is there information on how much using payloads (and hence positions) slows down querying?
2. Provided that I do not want payloads, can I extend something (perhaps a Consumer) to achieve what I want?
3. Is there documentation somewhere that describes how indexing works, i.e. which Consumer, Writer, etc. is invoked when?
4. Am I better off just post-processing the index: writing the frequency to a payload during indexing (see the second sketch below), then running through the index, removing the payloads and positions, and writing the posting lists myself?

Thank you very much.

Best,
Dávid Nemeskey
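
P.S. To make the above concrete, this is roughly what my expanding filter does (a heavily simplified sketch; the class name is made up, and I am assuming the CharTermAttribute-based analysis API):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Expands "term:freq" tokens by emitting the term freq times. */
public final class TermFreqExpandingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);

  private String currentTerm;
  private int remaining = 0;  // copies of the current term still to emit

  public TermFreqExpandingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (remaining > 0) {
      // Emit another copy of the previous term at the next position,
      // so "dog:3" ends up as "dog dog dog".
      termAtt.setEmpty().append(currentTerm);
      posIncrAtt.setPositionIncrement(1);
      remaining--;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    final String text = termAtt.toString();
    final int colon = text.lastIndexOf(':');
    if (colon < 0) {
      return true;  // no frequency attached, pass the token through
    }
    // Split "dog:3" into the term and its frequency (assumed >= 1).
    currentTerm = text.substring(0, colon);
    final int freq = Integer.parseInt(text.substring(colon + 1));
    termAtt.setLength(colon);  // this call emits the first copy
    remaining = freq - 1;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    remaining = 0;
  }
}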
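
And this is the kind of payload-writing variant I have in mind for question 4 (again only a sketch, not tested; I am assuming 4.x-style BytesRef payloads and use PayloadHelper just to encode the frequency):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

/** Emits each "term:freq" token once, storing freq in the payload. */
public final class TermFreqPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public TermFreqPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    final String text = termAtt.toString();
    final int colon = text.lastIndexOf(':');
    if (colon < 0) {
      return true;  // no frequency attached, pass the token through
    }
    final int freq = Integer.parseInt(text.substring(colon + 1));
    termAtt.setLength(colon);  // keep only the term part
    // Store the frequency as a float payload; decode later with PayloadHelper.decodeFloat().
    payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(freq)));
    return true;
  }
}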