On 12/1/2014 3:10 PM, Avishai Ish-Shalom wrote:
> I have very large documents (as big as 1GB) which I'm indexing and
> planning to store in Solr in order to use highlighting snippets. I am
> concerned about possible performance issues with such large fields --
> does storing the fields require additional RAM over what is required
> to index/fetch/search? I'm assuming Solr reads only the required data
> by offset from the storage and not the entire field. Am I correct in
> this assumption?
>
> Does anyone on this list have experience to share with such large
> documents?
You've gotten some excellent replies already; I just wanted to mention compression.

Short answer to the question about RAM: you might need a fair amount of extra memory for the Java heap, and because of the potential for a large index size, you'll also want a large amount of memory beyond the heap for caching.

More detailed info:

The response that gets built to send to the user, if the fl parameter contains the field with that large data in it, will require memory to hold that data, up to the number of records in the "rows" parameter on the query. If it's a distributed index, some of that data might cross the network twice -- once from the server that stores it to the server coordinating the request, and again from there to the client.

In Solr 4.1 and later, stored fields are compressed, with no way to turn compression off. With very large stored fields, there may be performance and memory implications for both indexing (compression) and queries (decompression). Term vectors (which Michael Sokolov mentioned in his reply) have been compressed since version 4.2.

More memory will probably be required for "ramBufferSizeMB" -- a temporary storage area in RAM used during indexing. That defaults to 100MB in recent Solr versions, which is normally enough for several hundred or several thousand typical documents, but just one of your documents may not fit. This will increase your heap requirements.

As for whether there is a way to retrieve only specific data from the compressed information without uncompressing all of it, that I do not know. The compression is handled by the Lucene layer, not by Solr itself.

https://issues.apache.org/jira/browse/LUCENE-4226

Thanks,
Shawn
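A footnote on the fl/rows point: if clients don't always need the huge field, leaving it out of fl keeps it out of the response entirely, so the heap never has to hold rows-many copies of it. A sketch -- the field names here are made up for illustration:

```
# Hypothetical select-handler parameters: return only small fields,
# excluding the very large stored field (call it "body")
q=*:*
fl=id,title,score
rows=10
```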
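And on ramBufferSizeMB: it is set in the indexConfig section of solrconfig.xml. A sketch -- the 2048 value is only illustrative; for documents approaching 1GB you would need to size it (and the heap) to actually fit a document:

```xml
<!-- solrconfig.xml: enlarge the indexing RAM buffer so a single
     very large document can fit. The default is 100 (MB); the
     value below is illustrative, not a recommendation. -->
<indexConfig>
  <ramBufferSizeMB>2048</ramBufferSizeMB>
</indexConfig>
```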
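On the compression point: Lucene does the stored-field compression at the codec level (LZ4 by default), which I can't demonstrate directly here, but as a rough illustration that compression and decompression cost scales with the size of the stored data, here is a sketch using Python's zlib -- which is NOT what Lucene uses, it's only a stand-in:

```python
import zlib

# Stand-in for a large stored field (~3 MB of repetitive text).
# Lucene compresses stored fields with LZ4, not zlib; this only
# illustrates that compress/decompress work grows with field size.
field = b"some repetitive document text " * 100_000

compressed = zlib.compress(field)       # cost paid at indexing time
restored = zlib.decompress(compressed)  # cost paid at query time

assert restored == field
print(len(field), len(compressed))
```

With a 1GB field, both steps happen on every index and every fetch of that document, which is where the performance and memory implications come from.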