Re: Why does this query slow down Lucene?

2012-08-15 Thread Trejkaz
On Thu, Aug 16, 2012 at 11:27 AM, zhoucheng2008 wrote: > > +(title:21 title:a title:day title:once title:a title:month) Looks like you have a fairly big boolean query going on here, and some of the terms you're using are really common ones like "a". Are you using AND or OR for the default operat

Re: Reply: Why does this query slow down Lucene?

2012-08-15 Thread Li Li
and also try jmap -heap pid to check whether it runs out of memory or jstat -gcutil pid 1000 On Thu, Aug 16, 2012 at 10:09 AM, zhoucheng2008 wrote: > The query has been stuck for more than an hour. The total size is less than > 1G, and the number of docs is around 100,000. Hardware is ok as it w

Re: Reply: Why does this query slow down Lucene?

2012-08-15 Thread Li Li
use jstack pid to check any deadlock. On Thu, Aug 16, 2012 at 10:09 AM, zhoucheng2008 wrote: > The query has been stuck for more than an hour. The total size is less than > 1G, and the number of docs is around 100,000. Hardware is ok as it works well > with other much more demanding projects. >
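
For readers who cannot attach jstack to the stuck process, a similar deadlock check can be done from inside the JVM with the standard java.lang.management API. The following is a minimal sketch, not something posted in the thread: the class name DeadlockCheck is hypothetical, and unlike jstack it only inspects the JVM it runs in.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (not from the thread): an in-process counterpart
// to running "jstack pid" and looking for reported deadlocks.
class DeadlockCheck {

    // Returns the ids of threads that are deadlocked in the current JVM,
    // or an empty array when no deadlock is detected.
    static long[] deadlockedThreadIds() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() returns null when there is no deadlock.
        long[] ids = threads.findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadIds().length);
    }
}
```

An empty result only rules out a lock cycle; a query that is merely slow, or a JVM busy with garbage collection, would still need the jmap and jstat checks mentioned in the other reply.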

Reply: Why does this query slow down Lucene?

2012-08-15 Thread zhoucheng2008
The query has been stuck for more than an hour. The total size is less than 1G, and the number of docs is around 100,000. Hardware is ok as it works well with other much more demanding projects. -- From: "Li Li"; Sent: 2012-08-16 (

Re: Why does this query slow down Lucene?

2012-08-15 Thread Li Li
How slow is it? Are all your searches slow, or only that query? How many docs are indexed, and what is the size of the indexes? What's the hardware configuration? You should describe it clearly to get help. On 2012-8-16 at 9:28 AM, "zhoucheng2008" wrote: > Hi, > > > I have the string "$21 a Day Once a Month" to

Re: easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
ok, I have no problem with filter/copy to new index, but that seems like a good start point. Would need to figure out how to extend that class correctly, but at least gives me a good starting point. On 08/15/2012 02:48 PM, Uwe Schindler wrote: You cannot modify the term dictionary of an inde

RE: easy way to figure out most common tokens?

2012-08-15 Thread Uwe Schindler
You cannot modify the term dictionary of an index; see my other eMail. You have to filter it by copying to a new index or reindexing. Document modifications are not supported in Lucene and other inverted indexes. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMai

RE: easy way to figure out most common tokens?

2012-08-15 Thread Uwe Schindler
If you found the terms to remove with e.g. HighFreqTerms, you can use the abstract class FilterIndexReader (FilterAtomicReader in Lucene 4.0) to code a filter for the term dictionary (just return a filtered TermEnum) on merging. Just wrap an IndexReader with this FilterIndexReader that hides the te
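
The wrap-and-hide idea described here can be illustrated without any Lucene classes. The sketch below is plain Java and deliberately does NOT use FilterIndexReader or TermEnum; HidingTermIterator is a hypothetical name, and it assumes the terms are non-null strings. It shows only the pattern: a wrapper that passes through every term from the underlying enumeration except those in a set of terms to hide.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

// Plain-Java illustration of "wrap the reader and hide some terms".
// Hypothetical class; it does not use or mirror Lucene's actual API.
class HidingTermIterator implements Iterator<String> {
    private final Iterator<String> delegate; // the unfiltered term enumeration
    private final Set<String> hidden;        // terms that must not be exposed
    private String next;                     // look-ahead; null when exhausted

    HidingTermIterator(Iterator<String> delegate, Set<String> hidden) {
        this.delegate = delegate;
        this.hidden = hidden;
        advance();
    }

    // Moves the look-ahead to the next term that is not hidden.
    private void advance() {
        next = null;
        while (delegate.hasNext()) {
            String candidate = delegate.next();
            if (!hidden.contains(candidate)) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }
}
```

Anything that consumes the wrapper, such as a copy into a new index, never sees the hidden terms, which is why the filtering has to happen while writing a new index rather than by editing the existing one.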

Re: easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
On 08/15/2012 02:34 PM, Ahmet Arslan wrote: Is there an easy way to figure out the most common tokens and then remove those tokens from the documents. Probably this : http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html ah, that's a good part 1. Then the Q w

Re: easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
On 08/15/2012 02:29 PM, Erick Erickson wrote: > I don't see how you could without indexing everything first since you can't know what the most frequent terms are until you've processed all your documents exactly > If you know these terms in advance, it seems like you could just call them stopword

Re: easy way to figure out most common tokens?

2012-08-15 Thread Ahmet Arslan
> Is there an easy way to figure out > the most common tokens and then remove those tokens from the > documents. Probably this: http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html

Re: easy way to figure out most common tokens?

2012-08-15 Thread Erick Erickson
I don't see how you could without indexing everything first since you can't know what the most frequent terms are until you've processed all your documents. If you know these terms in advance, it seems like you could just call them stopwords and use the common stopword processing. If you have to e
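
The point that every document must be processed before the most frequent terms are known can be made concrete with a small counting pass. This is a hedged sketch in plain Java, not Lucene's HighFreqTerms: the class TokenFrequency and its method are hypothetical, documents are assumed to be already tokenized, and frequency here means the number of documents a token appears in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: one full pass over all
// documents is needed before the most common tokens can be reported.
class TokenFrequency {

    // Returns up to n tokens ordered by descending document frequency;
    // ties are broken alphabetically so the result is deterministic.
    static List<String> mostCommon(List<List<String>> docs, int n) {
        final Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each token at most once per document.
            for (String token : new HashSet<>(doc)) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        List<String> tokens = new ArrayList<>(docFreq.keySet());
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return new ArrayList<>(tokens.subList(0, Math.min(n, tokens.size())));
    }
}
```

The resulting list could then serve as the stopword set suggested above, at the cost of a second indexing pass.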

easy way to figure out most common tokens?

2012-08-15 Thread Shaya Potter
Is there an easy way to figure out the most common tokens and then remove those tokens from the documents? Use case: imagine one is indexing a mailing list (such as this java-user) and is extracting all e-mail addresses in the messages and adding them to a doc. What that means is that one wi

LuceneIndex export to SQL-database

2012-08-15 Thread ANNO61
I am using Lucene to produce several indexes from HTML sites. To work with them I convert the Lucene database into SQL via a small program. The main problem is that I take a small part of the collected data fields (datasource, plainTextContent, title, description and keyword). But there are in mos

RE: howto run CheckIndex on huge index size

2012-08-15 Thread Uwe Schindler
Problem not fixed! I contacted infra on IRC already. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Wednesday, August 15, 2012 4:26 PM > To: java-user@luce

RE: howto run CheckIndex on huge index size

2012-08-15 Thread Uwe Schindler
I hope the problem is fixed now; this mail is just to check! It was hard to unsubscribe because of the strange eMail. Have no idea at all... Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schin

RE: howto run CheckIndex on huge index size

2012-08-15 Thread Uwe Schindler
I got it, too. As a moderator of this list, I will look into finding the root cause and forcefully unsubscribe the failing address! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Bernd Fehling [mailto:b

Re: howto run CheckIndex on huge index size

2012-08-15 Thread Bernd Fehling
I guess that ulimit could be a default setting of XenServer from when it was first set up. We started with about 27G. I already raised ulimit -n when setting up XenServer because this was also limited. By the way, am I the only one getting this nasty DELIVERY FAILURE message from one on this li

RE: howto run CheckIndex on huge index size

2012-08-15 Thread Uwe Schindler
So my blog post, last section, helped? I think the ulimits came from there. What distribution do you use that ulimit was actually limited - or was it some sysadmin doing this? :-) We should maybe refer to this blog post from docs or create a copy of the page inside lucene's distribution! Uwe ---

Re: howto run CheckIndex on huge index size

2012-08-15 Thread Bernd Fehling
Hi Uwe, index size is: -rw-r--r-- 1 solr users 82G 15. Aug 07:50 _2rhe.fdt -rw-r--r-- 1 solr users 303M 15. Aug 07:50 _2rhe.fdx -rw-r--r-- 1 solr users 1,2k 15. Aug 07:36 _2rhe.fnm -rw-r--r-- 1 solr users 39G 15. Aug 09:04 _2rhe.frq -rw-r--r-- 1 solr users 757M 15. Aug 09:05 _2rhe.nrm -rw-r--r--

RE: howto run CheckIndex on huge index size

2012-08-15 Thread Uwe Schindler
You don't get a heap-related OOM in your stack trace; it is "Map failed", caused by MMapDirectory. You don't have enough virtual memory to map the index into address space. I think your heap is way too much (-Xmx25g is way too big for any existing index and drives GC crazy). How big is your index?

howto run CheckIndex on huge index size

2012-08-15 Thread Bernd Fehling
I'm trying to run CheckIndex as a separate tool on a large index to get useful info about the number of terms, number of tokens, ... but I always get an OOM exception. I already have JAVA_OPTS -d64 -Xmx25g -Xms25g -Xmn6g. Any idea how to use CheckIndex on a huge index? Opening index @ /srv/www/solr/sol