Searching with too many clauses + Out of Memory

2007-08-01 Thread Harini Raghavan
 Hi Everyone,

I am using Compass 1.1 M2, which supports Lucene 2.2, to store and search a huge
amount of company, executive, and employment data. There are some use cases
where I need to search for executives/employments over the result set of a
company search. But when I try to create a Compass query to search for
executives across over 100,000 (1 lakh) company ids, it runs out of memory because
the query is huge. Here is the exception stack trace:

java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:342)
at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:435)
at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:428)
at org.apache.lucene.index.MultiTermDocs.read(MultiReader.java:393)
at org.apache.lucene.search.TermScorer.next(TermScorer.java:106)
at org.apache.lucene.util.ScorerDocQueue.topNextAndAdjustElsePop(ScorerDocQueue.java:116)
at org.apache.lucene.search.DisjunctionSumScorer.advanceAfterCurrent(DisjunctionSumScorer.java:175)
at org.apache.lucene.search.DisjunctionSumScorer.next(DisjunctionSumScorer.java:146)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:124)
at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:232)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
at org.apache.lucene.search.Hits.<init>(Hits.java:61)
at org.apache.lucene.search.Searcher.search(Searcher.java:55)
at org.compass.core.lucene.engine.transaction.ReadCommittedTransaction.findByQuery(ReadCommittedTransaction.java:469)
at org.compass.core.lucene.engine.transaction.ReadCommittedTransaction.doFind(ReadCommittedTransaction.java:426)
at org.compass.core.lucene.engine.transaction.AbstractTransaction.find(AbstractTransaction.java:91)
at org.compass.core.lucene.engine.LuceneSearchEngine.find(LuceneSearchEngine.java:379)
at org.compass.core.lucene.engine.LuceneSearchEngineQuery.hits(LuceneSearchEngineQuery.java:151)
at org.compass.core.impl.DefaultCompassQuery.hits(DefaultCompassQuery.java:133)
at org.compass.core.support.search.CompassSearchHelper.performSearch(CompassSearchHelper.java:144)
at org.compass.core.support.search.CompassSearchHelper$1.doInCompass(CompassSearchHelper.java:89)
at org.compass.core.CompassTemplate.execute(CompassTemplate.java:137)
at org.compass.core.support.search.CompassSearchHelper.search(CompassSearchHelper.java:86)

It looks like this error is actually in the Lucene code. It would be great
if anyone in this group has dealt with this kind of use case and has some
suggestions.
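
For readers picturing the query: below is a minimal sketch (not from the original
post; the field name "companyId" and the use of plain Lucene calls instead of the
Compass query builder are assumptions) of the kind of many-clause disjunction
being described.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class HugeDisjunctionSketch {
    public static BooleanQuery buildCompanyIdQuery(String[] companyIds) {
        // BooleanQuery allows at most 1024 clauses by default; raising the limit
        // lets a 100,000-clause query be built, but at search time each clause
        // still gets its own TermScorer/TermDocs, which is roughly where the
        // memory pressure in the stack trace above comes from.
        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < companyIds.length; i++) {
            query.add(new TermQuery(new Term("companyId", companyIds[i])),
                      BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}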

Thanks,
Harini


Re: Problem Search using lucene

2007-08-01 Thread Michael Wechner

Chhabra, Kapil wrote:


You just have to make sure that what you are searching is indexed (and
esp. in the same format/case).
Use Luke (http://www.getopt.org/luke/) to browse through your index.
 



Does Luke also work with Nutch?

Thanks

Michael


This might give you an insight into what you have indexed and what you are
searching for.

Regards,
kapilChhabra

-Original Message-
From: masz-wow [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 01, 2007 12:13 PM

To: java-user@lucene.apache.org
Subject: Re: Problem Search using lucene


Thanks Joe

I'm using this function as my analyzer

public static Analyzer getDefaultAnalyzer() {
    PerFieldAnalyzerWrapper perFieldAnalyzer =
        new PerFieldAnalyzerWrapper(new StopAnalyzer());
    perFieldAnalyzer.addAnalyzer("contents", new StopAnalyzer());
    perFieldAnalyzer.addAnalyzer("fileID", new WhitespaceAnalyzer());
    perFieldAnalyzer.addAnalyzer("path", new KeywordAnalyzer());
    return perFieldAnalyzer;
}

StopAnalyzer builds an analyzer which removes the words in ENGLISH_STOP_WORDS.
That is probably why I cannot search for words such as 'and' and 'to'.

BUT

I'm still having problems when I search for words other than English words,
such as names (e.g. Ghazat) or strings of numbers (e.g. 45600).
 




--
Michael Wechner
Wyona  -   Open Source Content Management - Yanel, Yulup
http://www.wyona.com
[EMAIL PROTECTED], [EMAIL PROTECTED]
+41 44 272 91 61





RE: Searching with too many clauses + Out of Memory

2007-08-01 Thread Chandan Tamrakar
What is the size of the heap you are allocating for your app?

-Original Message-
From: Harini Raghavan [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 01, 2007 2:29 PM
To: java-user@lucene.apache.org
Subject: Searching with too many clauses + Out of Memory




Crawling in Nutch

2007-08-01 Thread Srinivasarao Vundavalli
Hi,
 In which field does Nutch store the content of a document while indexing?
I am using this Nutch index to search with Lucene, so I want to know the
field in which the content of the document is present.

Thank You


IndexReader deletes more than expected

2007-08-01 Thread Ridwan Habbal
Hi, I got unexpected behavior while testing Lucene. To state the problem
briefly: using an IndexWriter I add docs with a field named ID in consecutive
order (1, 2, 3, 4, etc.), then close the writer. I get a new IndexReader and
call IndexReader.deleteDocuments(Term); the term is simply new Term("ID", "1").
Then I call close() on the IndexReader. Things work out fine.

But consider this instead: I add docs using the IndexWriter and close the
writer, then create a new IndexReader to delete one of the docs already
inserted, but without closing the reader. While the IndexReader that performs
the deletion is still open, I add more docs and then commit the IndexWriter.
When I search at that point I get all the docs added in both phases (before
and after calling deleteDocuments() on the IndexReader), because I haven't
closed the IndexReader, only the IndexWriter. I then close the IndexReader and
query the index again, and it turns out to have deleted all the docs added
between opening and closing the reader, in addition to the doc specified in
the Term (in this test case ID=1). I know I can avoid this by closing the
IndexReader directly after deleting docs, but what about running it in a
multi-threaded app like a web application? Here is the code:
IndexSearcher indexSearcher = new IndexSearcher(this.indexDirectory);
Hits hitsB4InsertAndClose = null;
hitsB4InsertAndClose = getAllAsHits(indexSearcher);
int beforeInsertAndClose = hitsB4InsertAndClose.length();

indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.close();

IndexSearcher indexSearcherDel = new IndexSearcher(this.indexDirectory);
indexSearcherDel.getIndexReader().deleteDocuments(new Term("ID", "1"));

indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.close();

Hits hitsAfterInsertAndClose = getAllAsHits(indexSearcher);
int AfterInsertAndClose = hitsAfterInsertAndClose.length(); // This is 14

indexWriter.addDocument(getNewElement());
indexWriter.close();
Hits hitsAfterInsertAndAfterCloseb4Delete = getAllAsHits(indexSearcher);
int hitsAfterInsertButAndAfterCountb4Delete =
    hitsAfterInsertAndAfterCloseb4Delete.length(); // This is 15

indexSearcherDel.close();
Hits hitsAfterInsertAndAfterClose = getAllAsHits(indexSearcher);
int hitsAfterInsertButAndAfterCount =
    hitsAfterInsertAndAfterClose.length(); // This is 2

The two methods I use:
private Hits getAllAsHits(IndexSearcher indexSearcher) {
    try {
        Analyzer analyzer = new StandardAnalyzer();
        String defaultSearchField = "all";
        QueryParser parser = new QueryParser(defaultSearchField, analyzer);
        indexSearcher = new IndexSearcher(this.indexDirectory);
        Hits hits = indexSearcher.search(parser.parse("+alias:mydoc"));
        indexSearcher.close();
        return hits;
    } catch (IOException ex) {
        throw new RuntimeException(ex);
    } catch (org.apache.lucene.queryParser.ParseException ex) {
        throw new RuntimeException(ex);
    }
}

private Document getNewElement() {
    Map map = new HashMap();
    map.put("ID", new Integer(insertCounter).toString());
    map.put("name", "name" + insertCounter);
    insertCounter++;
    Document document = new Document();
    for (Iterator iter = map.keySet().iterator(); iter.hasNext();) {
        String key = (String) iter.next();
        // cast added: a raw Map returns Object, and Field expects a String value
        document.add(new Field(key, (String) map.get(key), Store.YES, Index.TOKENIZED));
    }
    document.add(new Field("alias", "mydoc", Store.YES, Index.UN_TOKENIZED));
    return document;
}

Any clue why it works that way? I expected it to delete only one doc.

More IP/MAC indexing questions

2007-08-01 Thread Joe Attardi
Hi again, everyone. First of all, I want to thank everyone for their
extremely helpful replies so far.
Also, I just started reading the book "Lucene in Action" last night. So far
it's an awesome book, so a big thanks to the authors.

Anyhow, on to my question. As I've mentioned in several of my previous
messages, I am indexing different pieces of information about servers - in
particular, my question is about indexing the IP address and MAC address.

Using the StandardAnalyzer, an IP is kept as a single token ("192.168.1.100"),
and a MAC is broken up into one token per octet ("00", "17", "fd", "14",
"d3", "2a"). Many searches will be for partial IPs or MACs ("192.168",
"00:17:fd", etc).

Are either of these methods of indexing the addresses (single token vs
per-octet token) more or less efficient than the other when indexing large
numbers of these?

-- 
Joe Attardi
[EMAIL PROTECTED]
http://thinksincode.blogspot.com/


RE: IndexReader deletes more than expected

2007-08-01 Thread Steven Parkes
If I'm reading this correctly, there's something a little wonky here. In
your example code, you close the IndexWriter and then, without creating
a new IndexWriter, you call addDocument again. This shouldn't be
possible (what version of Lucene are you using?)

Assuming for the time being that you are creating the IndexWriter again,
the other issue here is that you shouldn't be able to have a reader and
a writer changing an index at the same time. There should be a lock
failure. This should occur either in the Index 

Might you be creating your IndexWriters (which you don't show) with the
create flag always set to true? That will wipe your index each time,
ignoring the locks and causing all sorts of weird results.
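
A minimal sketch (an assumption about the intended sequencing, not the poster's
actual code) of the reader/writer handling described above: recreate the writer
after closing it, keep the create flag false so the index is not wiped, and close
the deleting reader before a writer touches the index again.

// indexDirectory and getNewElement() stand in for the poster's own directory
// and document-building code; classes are from org.apache.lucene.index and
// org.apache.lucene.analysis.standard.
IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer(), false); // create=false keeps the existing index
writer.addDocument(getNewElement());
writer.close();                                 // releases the write lock

IndexReader reader = IndexReader.open(indexDirectory);
reader.deleteDocuments(new Term("ID", "1"));
reader.close();                                 // commits the delete and releases the lock

writer = new IndexWriter(indexDirectory, new StandardAnalyzer(), false);             // a fresh writer instance
writer.addDocument(getNewElement());
writer.close();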

-Original Message-
From: Ridwan Habbal [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 01, 2007 8:48 AM
To: java-user@lucene.apache.org
Subject: IndexReader deletes more than expected


Re: IndexReader deletes more than expected

2007-08-01 Thread Mark Miller
On 8/1/07, Ridwan Habbal <[EMAIL PROTECTED]> wrote:
>
> but what about running it on a multi-threaded app like a web application?
> Here is the code:


If you are targeting a multi-threaded webapp, then I strongly suggest you
look into using either Solr or the LuceneIndexAccessor code. You will want
to use some form of reference counting to manage your Readers and Writers.

Also, you can now use IndexWriter (Lucene 2.0 and greater I think) to
delete. This allows for efficient mixing of deletes and adds by buffering
the deletes, and then opening an IndexReader to commit them later. This is
much more efficient than IndexModifier.
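
A small sketch of what that looks like, assuming a Lucene release whose
IndexWriter exposes deleteDocuments(Term); the directory, analyzer, and helper
names are placeholders.

IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer(), false);
writer.addDocument(getNewElement());
writer.deleteDocuments(new Term("ID", "1")); // buffered delete, no separate IndexReader needed
writer.addDocument(getNewElement());
writer.close();                              // buffered deletes are applied when the writer flushes/closes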

- Mark


Re: More IP/MAC indexing questions

2007-08-01 Thread Erick Erickson
First, consider using your own analyzer and/or breaking the IP addresses
up by substituting ' ' for '.' upon input. Otherwise, you'll have endless
issues as time passes.

But on to your question. Please post what you mean by
"a large number". 10,000? 1,000,000,000? We have no clue
from your posts so far...

That said, efficiency is hugely overrated at this stage of your
design. I'd personally use whatever is easiest and run some
tests.

Just index them as single (unbroken) tokens to start and search
your partial address with PrefixQuery. Or index them as
individual tokens and create a SpanFirstQuery. Or...

And measure.

Best
Erick

On 8/1/07, Joe Attardi <[EMAIL PROTECTED]> wrote:
>
> Hi again, everyone. First of all, I want to thank everyone for their
> extremely helpful replies so far.
> Also, I just started reading the book "Lucene in Action" last night. So
> far
> it's an awesome book, so a big thanks to the authors.
>
> Anyhow, on to my question. As I've mentioned in several of my previous
> messages, I am indexing different pieces of information about servers - in
> particular, my question is about indexing the IP address and MAC address.
>
> Using the StandardAnalyzer, an IP is kept as a single token ("
> 192.168.1.100"),
> and a MAC is broken up into one token per octet ("00", "17", "fd", "14",
> "d3", "2a"). Many searches will be for partial IPs or MACs ("192.168",
> "00:17:fd", etc).
>
> Are either of these methods of indexing the addresses (single token vs
> per-octet token) more or less efficient than the other when indexing large
> numbers of these?
>
> --
> Joe Attardi
> [EMAIL PROTECTED]
> http://thinksincode.blogspot.com/
>


Re: More IP/MAC indexing questions

2007-08-01 Thread Joe Attardi
Hi Erick,

First, consider using your own analyzer and/or breaking the IP addresses
> up by substituting ' ' for '.' upon input.

Do you mean breaking the IP up into one token for each segment, like ["192",
"168", "1", "100"] ?



> But on to your question. Please post what you mean by
> "a large number". 10,000? 1,000,000,000? we have no clue
> from your posts so far...

I apologize for the lack of details. A large part of the data will be
wireless MAC addresses detected over the air, so it depends on the site. But
I suppose, worst case, we're looking at thousands or tens of thousands.
Comparatively speaking, then, I guess it's not such a large number compared
to some of the other questions discussed on the list.

That said, efficiency is hugely overrated at this stage of your
> design. I'd personally use whatever is easiest and run some
> tests.
>
> Just index them as single (unbroken) tokens to start and search
> your partial address with PrefixQuery.

This is what I was thinking originally, too. Although there could be times
where they are searching for a piece at the end of the address, which is why
my original post had me building a WildcardQuery.

The system will be searching log messages, too, and for that I'll use the
more normal StandardAnalyzer/QueryParser approach.

So what I am thinking of doing going forward is creating a custom query
parser class that basically has special cases (IP addresses, MAC addresses)
where the query must be more customized, and that falls through to the
standard QueryParser class in the other cases. Does this sound like a good idea?

Thanks again for your continued help!


Re: More IP/MAC indexing questions

2007-08-01 Thread Erick Erickson
Think of a custom analyzer class rather than a custom query parser. The
QueryParser uses your analyzer, so it all just "comes along".

Here's the approach I'd try first, off the top of my head.

Yes, break the IP (and the MAC, etc.) up into octets and index them
tokenized.
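
One possible shape for such an analyzer, as a sketch (it assumes splitting on
'.' and ':' is acceptable for your data; it is not a drop-in recommendation):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class OctetAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Emit one token per octet: '.', ':' and whitespace end a token.
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return c != '.' && c != ':' && !Character.isWhitespace(c);
            }
            protected char normalize(char c) {
                return Character.toLowerCase(c); // so "FD" and "fd" match
            }
        };
    }
}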

Use a SpanNearQuery with a slop of 0 and specify true for ordering.
What that will do is require that the segments you specify must appear
in order with no gaps. You have to construct this yourself since there's
no support for SpanQueries in the QueryParser yet. This'll avoid having
to deal with Wildcards, which have their own issues (try searching on
a thread "I just don't understand wildcards at all" for an exposition from
"the guys" on this).

Best
Erick

On 8/1/07, Joe Attardi <[EMAIL PROTECTED]> wrote:
>
> Hi Erick,
>
> First, consider using your own analyzer and/or breaking the IP addresses
> > up by substituting ' ' for '.' upon input.
>
> Do you mean breaking the IP up into one token for each segment, like
> ["192",
> "168", "1", "100"] ?
>
>
>
> > But on to your question. Please post what you mean by
> > "a large number". 10,000? 1,000,000,000? we have no clue
> > from your posts so far...
>
> I apologize for the lack of details. A large part of the data will be
> wireless MAC addresses detected over the air, so it depends on the site.
> But
> I suppose, worst case, we're looking at thousands or tens of thousands.
> Comparatively speaking, then, I guess it's not such a large number
> compared
> to some of the other questions discussed on the list.
>
> That said, efficiency is hugely overrated at this stage of your
> > design. I'd personally use whatever is easiest and run some
> > tests.
> >
> > Just index them as single (unbroken) tokens to start and search
> > your partial address with PrefixQuery.
>
> This is what I was thinking originally, too. Although there could be times
> where they are searching for a piece at the end of the address, which is
> why
> my original post had me building a WildcardQuery.
>
> The system will be searching log messages, too, and for that I'll use the
> more normal StandardAnalyzer/QueryParser approach.
>
> So what I am thinking of doing going forward is creating a custom query
> parser class, that basically has special cases (IP addresses, MAC
> addresses)
> where the query must be more customized, and in the other cases fall
> through
> to the standard QueryParser class. Does this sound like a good idea?
>
> Thanks again for your continued help!
>


Re: More IP/MAC indexing questions

2007-08-01 Thread Joe Attardi
On 8/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> Use a SpanNearQuery with a slop of 0 and specify true for ordering.
> What that will do is require that the segments you specify must appear
> in order with no gaps. You have to construct this yourself since there's
> no support for SpanQueries in the QueryParser yet. This'll avoid having
> to deal with Wildcards, which have their own issues (try searching on
> a thread "I just don't understand wildcards at all" for an exposition from
> "the guys" on this.


Thanks Erick, I'll try this. My only other question here, though, is what if
they specify an incomplete octet of an address? For example, I want
'192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this
without wildcards? Is there a way to put a PrefixQuery into the SpanQuery?

Sorry if I don't make any sense


Re: Size of field?

2007-08-01 Thread Eduardo Botelho
Hi Erick!!

You're right, I just used setMaxFieldLength() and now everything works fine.

You saved my life, thanks! (y)
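
(For reference, a minimal sketch of the kind of call being discussed; the
analyzer and the raised limit are assumptions, not the actual values used:)

IndexWriter writer = new IndexWriter(indexDirectory, new BrazilianAnalyzer(), false);
writer.setMaxFieldLength(Integer.MAX_VALUE); // the default is 10,000 terms per field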

On 7/30/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default
> max field length, last I knew, was 10,000. But this sounds like
> it might relate to your issue.
>
> Best
> Erick
>
> On 7/27/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote:
> >
> > Hi guys,
> >
> > I would like to know if exist some limit of size for the fields of a
> > document.
> >
> > I'm with the following problem:
> > When a term is after a certain amount of characters (approximately
> 87300)
> > in
> > a field, the search does not find de occurrency.
> > If I divide my field in pages, the terms are found normally.
> > This problem occours when I make an exact query (query between quotes)
> >
> > What can be happening?
> >
> > I'm using BrazilianAnalyzer and StandardAnalyzer(for tests only) for
> both,
> > search and indexation.
> >
> > thanks...
> >
> > Sorry for my poor english...
> >
>


Re: More IP/MAC indexing questions

2007-08-01 Thread Erick Erickson
I suspect you're going to have to deal with wildcards if you really want
this functionality.

Erick

On 8/1/07, Joe Attardi <[EMAIL PROTECTED]> wrote:
>
> On 8/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > Use a SpanNearQuery with a slop of 0 and specify true for ordering.
> > What that will do is require that the segments you specify must appear
> > in order with no gaps. You have to construct this yourself since there's
> > no support for SpanQueries in the QueryParser yet. This'll avoid having
> > to deal with Wildcards, which have their own issues (try searching on
> > a thread "I just don't understand wildcards at all" for an exposition
> from
> > "the guys" on this.
>
>
> Thanks Erick, I'll try this. My only other question here though, is what
> if
> they specify an incomplete octet of an address? For example, I want '
> 192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this
> without wildcards, is there a way to put a PrefixQuery into the Span
> Query?
>
> Sorry if I don't make any sense
>


Re: More IP/MAC indexing questions

2007-08-01 Thread Mike Klaas


On 1-Aug-07, at 11:34 AM, Joe Attardi wrote:


On 8/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:


Use a SpanNearQuery with a slop of 0 and specify true for ordering.
What that will do is require that the segments you specify must appear
in order with no gaps. You have to construct this yourself since there's
no support for SpanQueries in the QueryParser yet. This'll avoid having
to deal with Wildcards, which have their own issues (try searching on
a thread "I just don't understand wildcards at all" for an exposition from
"the guys" on this).

Thanks Erick, I'll try this. My only other question here though, is what if
they specify an incomplete octet of an address? For example, I want
'192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this
without wildcards, is there a way to put a PrefixQuery into the Span Query?

If 192 168 10 1 are separate tokens, then a phrase query on "192 168 10"
will find it. If it is a single token, then a wildcard or regex query is
necessary.
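
As a sketch, assuming the octets were indexed as separate tokens in a field
called "ip":

PhraseQuery partialIp = new PhraseQuery(); // default slop 0: terms must be adjacent and in order
partialIp.add(new Term("ip", "192"));
partialIp.add(new Term("ip", "168"));
partialIp.add(new Term("ip", "10"));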


-Mike




Re: Size of field?

2007-08-01 Thread Erick Erickson
Glad it worked out for you. Did you ever have any insight into what
was magical about 87,300? Although now that I re-read your mail, that
was the number of characters, so I can imagine that your corpus
averaged 8.73 characters/word.

Best
Erick

On 8/1/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote:
>
> Hi Erick!!
>
> You're right, I just use setMaxFieldLength() and all work fine.
>
> You save my life, thanks! (y)
>
> On 7/30/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default
> > max field length, last I knew, was 10,000. But this sounds like
> > it might relate to your issue.
> >
> > Best
> > Erick
> >
> > On 7/27/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi guys,
> > >
> > > I would like to know if exist some limit of size for the fields of a
> > > document.
> > >
> > > I'm with the following problem:
> > > When a term is after a certain amount of characters (approximately
> > 87300)
> > > in
> > > a field, the search does not find de occurrency.
> > > If I divide my field in pages, the terms are found normally.
> > > This problem occours when I make an exact query (query between quotes)
> > >
> > > What can be happening?
> > >
> > > I'm using BrazilianAnalyzer and StandardAnalyzer(for tests only) for
> > both,
> > > search and indexation.
> > >
> > > thanks...
> > >
> > > Sorry for my poor english...
> > >
> >
>


Re: High CPU usage duing index and search

2007-08-01 Thread karl wettin

It sounds like you have a fairly busy system; perhaps 100% load on the
process is not that strange, at least not during short periods of time.

A simpler solution would be to nice the process a little bit in order to
give your background jobs some more time to think.

Running a profiler is still the best advice I can think of. It should
clearly show you what is going on when you run out of CPU.

--  
karl


1 Aug 2007, at 04:29, Chew Yee Chuang wrote:


Hi,

Thanks for the links provided; I actually went through those articles when
developing the index and search functions for my application. I haven't
tried a profiler yet, but I monitor the CPU usage and notice that whenever
indexing or searching is performed, the CPU usage rises to 100%. Below I
will try to elaborate on what my application is doing and how I index and
search.

There are many concurrent processes running. First, the application writes
the records it receives into a text file, with the fields separated by tabs.
The application points to a new file every 10 minutes and starts writing to
it, so every file contains only 10 minutes of records, approximately 600,000
records per file. Then the indexing process checks whether there is a text
file to be indexed; if there is, the thread wakes up and starts indexing.

The indexing process first adds documents to a RAMDir, then later adds the
RAMDir into the FSDir by calling addIndexesNoOptimize() when there are
100,000 documents (32 fields per doc) in the RAMDir. Only one IndexWriter
(FSDir) is created, but a few IndexWriters (RAMDir) are created during the
whole process. Below is the configuration for the IndexWriters I mentioned:

IndexWriter (RAMDir)
- SimpleAnalyzer
- setMaxBufferedDocs(1)
- Field.Store.YES
- Field.Index.NO_NORMS

IndexWriter (FSDir)
- SimpleAnalyzer
- setMergeFactor(20)
- addIndexesNoOptimize()
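
(A rough sketch of the two-writer arrangement described above; the index path
and the exact point at which the RAM buffer is flushed are placeholders, not
values from the original message:)

// Classes are from org.apache.lucene.index, org.apache.lucene.analysis and org.apache.lucene.store.
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
// ... ramWriter.addDocument(doc) until roughly 100,000 documents are buffered ...
ramWriter.close();

IndexWriter fsWriter = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
                                       new SimpleAnalyzer(), false);
fsWriter.setMergeFactor(20);
fsWriter.addIndexesNoOptimize(new Directory[] { ramDir }); // merge the RAM index into the disk index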

For the searching: there are many queries (20,000) run continuously to
generate the aggregate table for reporting purposes. All these queries run
in a nested loop, and only one Searcher is created. I tried a searcher and
a filter as well; the filter gives me better results, but both also use a
lot of CPU resources.

Hope this info helps, and sorry for my bad English.

Thanks
eChuang, Chew

-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 31, 2007 5:54 PM
To: java-user@lucene.apache.org
Subject: Re: High CPU usage duing index and search


31 Jul 2007, at 05:25, Chew Yee Chuang wrote:

But I just noticed that when Lucene performs a search or indexes, the CPU
usage on my machine rises to 100%. Because of this issue, some of my other
backend processes eventually slow down. I just want to know: has anyone
faced this problem before, and is there any idea on how to overcome this
problem?


Did you run a profiler to see what it is that consume all the  
resources?

It is very hard to guess based on the information you supplied. Start
here:

http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed


--
karl




Re: Can I do boosting based on term postions?

2007-08-01 Thread Cedric Ho
Thanks for the quick response =)

On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> Yes, it is easily doable through the "Payload" facility. During the indexing
> process (mainly tokenization), you need to push this extra information into
> each token. Then you can use BoostingTermQuery to make the Payload value
> contribute to the score. You also need to implement Similarity for this
> (mainly the scorePayload method).
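
As a minimal illustration of the indexing-side step described above (the class
name and the position-to-boost rule are made up for illustration; the scoring
side would plug in through BoostingTermQuery and a custom Similarity, as noted):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

public class PositionBoostFilter extends TokenFilter {
    private int position = 0;

    public PositionBoostFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token != null) {
            // boost tokens near the start of the field; later tokens get a neutral byte
            byte boost = (byte) (position < 100 ? 2 : 1);
            token.setPayload(new Payload(new byte[] { boost }));
            position++;
        }
        return token;
    }
}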

If I store, say, a custom boost factor as a Payload, does it mean that I
will store one more byte per term per document in the index file? So the
index file would be much larger?

>
> Other way can be to extend SpanTermQuery, this already calculates the
> position of match. You just need to do something to use this position value
> in the score calculation.

I see that SpanTermQuery takes a TermPositions from the IndexReader, and I
can get the term position from there. However, I am not sure how to
incorporate it into the score calculation. Would you mind giving a
little more detail on this?

>
> One possible advantage of SpanTermQuery approach is that you can play
> around, without re-creating indices everytime.
>
> Thanks,
> Shailendra Sharma,
> CTO, Ver se' Innovation Pvt. Ltd.
> Bangalore, India
>
> On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> >
> > Hi all,
> >
> > I was wondering if it is possible to do boosting by search terms'
> > position in the document.
> >
> > for example:
> > search terms appear in the first 100 words, or first 10% words, or in
> > first two paragraphs would be given higher score.
> >
> > Is it achievable through using the new Payload function in lucene 2.2?
> > Or are there any easier ways to achieve these ?
> >
> >
> > Regards,
> > Cedric
> >
> >
> >
>

Thanks,
Cedric




Solr's NumberUtils doesn't work

2007-08-01 Thread Mohammad Norouzi
Hi
I am using NumberUtils to encode and decode numbers while indexing and
searching. When I try to decode a number retrieved from an index, it
throws an exception for some fields. The exception message is:

Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(Unknown Source)
at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:125)
at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:37)
at com.payvand.lucene.util.ExtendedNumberUtils.decodeInteger(ExtendedNumberUtils.java:123)


I don't know why this happens; I am wondering if it has something to do with
character encoding. Have you had such a problem?

thanks

-- 
Regards,
Mohammad Norouzi
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/


je-analysis.jar

2007-08-01 Thread Jun.Chen

Dear All,

Who has the je-analysis.jar?

If somebody has it, can you send it to me? I don't have access to
download anything on my computer right now.

Thank you very much!

Yours truly,
Daniel






Using Nutch APIs in Lucene

2007-08-01 Thread Srinivasarao Vundavalli
How can we use Nutch APIs in Lucene? For example, using FetchedSegments we
can get ParseText, from which we can get the content of the document. So can
we use these classes (FetchedSegments, ParseText) in Lucene? If so, how do
we use them?
Thank You