Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread markharw00d
So this is just the old problem of avoiding reading large, less frequently accessed fields when you are trying to read just the smaller, more frequently accessed fields, e.g. titles. You can achieve this by: a) Modifying Lucene using something like the code I originally posted which stops reading

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread Kelvin Tan
On Tue, 8 Mar 2005 18:10:26 +0000 (GMT), mark harwood wrote:  "to be able" != "able to be" > OK, I thought you wanted to count terms within the > title field. If you want to group counts on the whole > field value change the loop in my last post to this: > > for(int i=0;i { > String fiel

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread mark harwood
>>> "to be able" != "able to be" OK, I thought you wanted to count terms within the title field. If you want to group counts on the whole field value change the loop in my last post to this: for(int i=0;i

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread Kelvin Tan
Hey Mark, thanks for the code sample. I did look into this, but for a book's title field, for example, "to be able" != "able to be" and "java programmer" != "programmer (java)" - tokenizer will remove the parentheses so in my use case at least, a field value isn't simply an array of its terms.

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread mark harwood
Your requirement was clear but I guess my suggested solution wasn't. Here it is in detail: public class CountTest { public static void main(String[] args) throws Exception { RAMDirectory tempDir = new RAMDirectory(); Analyzer analyzer=new WhitespaceAnalyze

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread Kelvin Tan
Ah, I apologize. My use of the word "frequency" was misleading. By that I meant the number of hits/documents whose fields have that value. Once again: doc a=title:1,keyword:a,contents:somelongmemoryhoggingstring doc b=title:1,keyword:a,contents:somelongmemoryhoggingstring doc c=title:1,keyword

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread mark harwood
The new TermFreqVector code sounds like what you need here. This gives you fast access to precomputed totals of term frequencies for each document. See IndexReader.getTermFreqVector.

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread Kelvin Tan
Neither. :-) 4) Top 10 fieldvalues (for some fields) returned in search results So, let's say the results of a search were: doc a=title:1,keyword:a,contents:somelongmemoryhoggingstring doc b=title:1,keyword:a,contents:somelongmemoryhoggingstring doc c=title:1,keyword:b,contents:somelongmemoryhog
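
A minimal sketch of the "top 10 field values" requirement as described in this thread: count how many hits carry each whole field value, then keep the most common ones. This is plain Java, not Lucene code. The class `TopFieldValues` and the method `topValues` are hypothetical names, and the sketch assumes the field value of each hit has already been read into a list (how to do that cheaply is the open question in the thread).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
public class TopFieldValues {

    // Counts how many hits carry each whole (untokenized) field value
    // and returns the n most common values, highest count first.
    public static List<Map.Entry<String, Integer>> topValues(List<String> valuePerHit, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : valuePerHit) {
            counts.merge(value, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count; break ties alphabetically so the order is stable.
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Keyword values of docs a, b and c from the example above.
        List<String> keywords = Arrays.asList("a", "a", "b");
        for (Map.Entry<String, Integer> e : topValues(keywords, 10)) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Because the whole field value is the map key, "to be able" and "able to be" are counted separately, which is the distinction raised later in the thread.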

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread mark harwood
Not sure I get what the requirement is yet: >>Here's my requirement, ..I need to perform a simple >>"Top 10 most frequent occurring " from a search. Does this mean: 1) Top 10 fieldnames present in each of your matching documents? 2) Top 10 most frequent terms found in a choice of field? 3) Top 10

Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread Kelvin Tan
Mark, On Tue, 8 Mar 2005 09:56:37 +0000 (GMT), mark harwood wrote: >> But I suppose for Document >> has to be further subclassed so that the other >> non-initialized fields can be obtained as well, or >> > I don't think Document would be the right place for > this - as a design pattern it is cast