Re: Tomcat Threads are BLOCKED after some time

2009-03-05 Thread Varun Dhussa

Hi,

I think it might be a case of hitting the OS limit on allowed open files.
Try setting a higher ulimit and re-running the program. Also, what GC
parameters have you set on the JVM?


Regards

Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com



damu_verse wrote:

Hi, thanks for the reply.
   We have not tested this against the versions
mentioned (both java-1.6.12 and lucene-2.4), and moreover we cannot move to
those versions right away, so we need a solution for this particular
version only.

Thanks & regards,
damu

damu_verse wrote:
  

Hi All,

 We have used Lucene as our search engine; all our applications
are deployed on Tomcat and run with a thread-pool size of 200.

Java Version - 1.6.0-rc
Lucene Version - 2.3.2
Tomcat Version - 6.0.14
OS - Red Hat Enterprise Linux ES release 4 (Nahant Update 5)
kernel - 2.6.9-55.0.2.ELsmp
RAM - 4 GB
Tomcat Memory - 1.5 GB
Index Size -  2 GB


 After 10-12 hours of Tomcat running, Tomcat
becomes unresponsive. After taking a core dump of the Tomcat process, we
observed that all Tomcat threads are blocked (thread-pool size 200); none of
the Tomcat threads is in a runnable state.

Each thread at the time of the core dump is in the BLOCKED state. The
following are the stack traces of the blocked threads.

"MultiSearcher thread #3" daemon prio=10 tid=0x337ddc00 nid=0x4827 waiting
for monitor entry [0x2f2f..0x2f2f0ea0]
   java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:235)
- waiting to lock <0x45d49d88> (a
org.apache.lucene.store.FSDirectory$FSIndexInput)
at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
at
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
at 
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:123)
at
org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:154)
at
org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:668)
at
org.apache.lucene.search.ConstantScoreTermQuery$TermWeight.scorer(ConstantScoreTermQuery.java:63)
at
org.apache.lucene.search.VBooleanQuery$BooleanWeight.scorer(VBooleanQuery.java:276)
at
org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:232)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:124)
at
org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250)



"http-8080-194" daemon prio=10 tid=0x08927800 nid=0x128d waiting for
monitor entry [0x2e188000..0x2e189e20]
   java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:235)
- waiting to lock <0x45d49d88> (a
org.apache.lucene.store.FSDirectory$FSIndexInput)
at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
at
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readVLong(IndexInput.java:96)
at
org.apache.lucene.index.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:196)
at
org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:97)
at
org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164)
at in.verse.search.query.spans.TermSpans.skipTo(TermSpans.java:85)
at in.verse.search.query.spans.SpanScorer.skipTo(SpanScorer.java:70)
at
org.apache.lucene.search.VConjunctionScorer.doNext(VConjunctionScorer.java:78)
at
org.apache.lucene.search.VConjunctionScorer.next(VConjunctionScorer.java:71)
at
org.apache.lucene.search.VBooleanScorer2.next(VBooleanScorer2.java:456)
at
org.apache.lucene.search.VConjunctionScorer.init(VConjunctionScorer.java:136)
at
org.apache.lucene.search.VConjunctionScorer.next(VConjunctionScorer.java:65)
at
org.apache.lucene.search.VBooleanScorer2.score(VBooleanScorer2.java:412)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at
org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:173)
at org.apache.lucene.search.Searcher.search(Searcher.java:118)
at org.apache.lucene.search.Searcher.search(Searcher.java:97)
 

Re: crawler questions..

2009-03-05 Thread adasal
That's interesting.
I've been working in Python recently, though not on crawling.
But, as ever, the more you get into it the more curious you get.
Did you come up with a solution to the node-error question?
Are you really talking about a broken link, or are you just saying the
bottom of the tree has been reached?
Presumably the latter would be when every link on every page has been
followed, which means you have to track which pages have been crawled and
find a way of uniquely and correctly identifying them internally. I think
the problem is that while a URL might be unique, there can be more than one
URL pointing to the same content - for instance in Struts, where action a and
action b are appended to a URL but produce the same result. I believe I am
right about this.
On the site that I am working on, Google has told us they are unable to
crawl the whole site because some URLs result in a loop - another problem.
It would be cool if you have solved these sorts of problems, or rather can
identify where they occur on a site in a quick and easy way.
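
For the duplicate-URL part, a common starting point is to normalize every URL
before checking it against a visited set. A minimal Java sketch (the
normalization rules here are illustrative assumptions, and content-level
duplicates like the Struts case would still need something like content
hashing on top):

import java.net.URI;
import java.util.HashSet;
import java.util.Set;

public class VisitedUrls {
    private final Set<String> seen = new HashSet<String>();

    // Returns true only the first time a (normalized) URL is offered.
    public boolean markVisited(String url) throws Exception {
        URI u = new URI(url).normalize();
        // Illustrative normalization: lowercase host, drop the fragment,
        // and treat "/index.html" and "/" as the same page.
        String path = (u.getPath() == null || u.getPath().length() == 0)
                ? "/" : u.getPath();
        if (path.endsWith("/index.html")) {
            path = path.substring(0, path.length() - "index.html".length());
        }
        String key = u.getScheme() + "://" + u.getHost().toLowerCase() + path
                + (u.getQuery() == null ? "" : "?" + u.getQuery());
        return seen.add(key);
    }
}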

Best,
Adam

2009/3/4 bruce 

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in python to crawl the
> site. I crawl at the top level, and work my way down to getting the required
> course/class schedule. The app works. I can consistently run it and extract
> the information.
>
> My issue is now that I have a "basic" app that works, I need to figure out
> how to guarantee that I'm correctly crawling the site. How do I know when
> I've got an error at a given node/branch, so that the app knows not to
> fetch the underlying branches/nodes of the tree?
>
> How do I know when I have a complete "tree"?
>
> I'm looking for someone, or some group/prof that I can talk to about these
> issues. My goal is to eventually look at using nutch/lucene if at all
> applicable.
>
> Any pointers, or people, or papers, etc... would be helpful.
>
> Thanks
>
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


IndexSearcher

2009-03-05 Thread liat oren
Hi,

I would like to do a search that will return documents that contain a given
word.
For example, I created the following index:

IndexWriter writer = new IndexWriter("C:/TryIndex", new StandardAnalyzer());
Document doc = new Document();
doc.add(new Field(WordIndex.FIELD_WORLDS, "111 222 333", Field.Store.YES,
    Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field(WordIndex.FIELD_WORLDS, "111", Field.Store.YES,
    Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field(WordIndex.FIELD_WORLDS, "222 333", Field.Store.YES,
    Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

Now I want to get all the documents that contain the word "222".

I tried to run the following code, but it doesn't return any docs:

IndexSearcher searcher = new IndexSearcher(indexPath);

// TermQuery mapQuery = new TermQuery(new Term(FIELD_WORLDS,
// worldNum)); - this one also didn't work
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser(FIELD_WORLDS, analyzer);
Query query = parser.parse(worldNum);
Hits mapHits = searcher.search(query);


Thanks a lot,
Liat


Re: IndexSearcher

2009-03-05 Thread Erick Erickson
I think your root problem is that you're indexing UN_TOKENIZED, which
means that the tokens you're adding to your index are NOT run through
the analyzer.

So your terms are exactly "111", "222 333" and "111 222 333", none of which
match "222". I expect you wanted your tokens to be "111", "222", and "333",
each appearing twice in your index.

Try indexing them tokenized. Although note that I don't remember what
StandardAnalyzer does with numbers. WhitespaceAnalyzer does the
more intuitive thing, but beware that it doesn't fold case. But it might be
an easier place for you to start until you get more comfortable with what
various analyzers do.
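
A minimal sketch of the tokenized variant (assuming the Lucene 2.3-era API
and a WhitespaceAnalyzer, per the caveats above; the field name here stands
in for WordIndex.FIELD_WORLDS):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TokenizedExample {
    public static void main(String[] args) throws Exception {
        String field = "worlds"; // stands in for WordIndex.FIELD_WORLDS
        IndexWriter writer = new IndexWriter("C:/TryIndex",
                new WhitespaceAnalyzer(), true);
        Document doc = new Document();
        // TOKENIZED: the analyzer splits "111 222 333" into three terms.
        doc.add(new Field(field, "111 222 333",
                Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        IndexSearcher searcher = new IndexSearcher("C:/TryIndex");
        Hits hits = searcher.search(new TermQuery(new Term(field, "222")));
        System.out.println(hits.length() + " matching documents"); // expect 1
    }
}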

Also, I *strongly* advise that you get a copy of Luke. It is a wonderful
tool that allows you to examine your index, analyze queries, test queries, etc.

But be aware that the site that maintains Luke was having problems yesterday;
look over the user-list messages from yesterday if you have problems.

Best
Erick

On Thu, Mar 5, 2009 at 8:40 AM, liat oren  wrote:

> Hi,
>
> I would like to do a search that will return documents that contain a given
> word.
> For example, I created the following index:
>
> IndexWriter writer = new IndexWriter("C:/TryIndex", new
> StandardAnalyzer());
> Document doc = new Document();
>  doc.add(new Field(WordIndex.FIELD_WORLDS, "111 222 333", Field.Store.YES,
> Field.Index.UN_TOKENIZED));
> writer.addDocument(doc);
> doc = new Document();
> doc.add(new Field(WordIndex.FIELD_WORLDS, "111", Field.Store.YES,
> Field.Index.UN_TOKENIZED));
> writer.addDocument(doc);
>  doc = new Document();
>  doc.add(new Field(WordIndex.FIELD_WORLDS, "222 333", Field.Store.YES,
> Field.Index.UN_TOKENIZED));
>  writer.addDocument(doc);
> writer.optimize();
>  writer.close();
>
> now I want to get all the documents that contain the word "222".
>
> I tried to run the following code, but it doesn't return any docs:
>
>  IndexSearcher searcher = new IndexSearcher(indexPath);
>
> // TermQuery mapQuery = new TermQuery(new Term(FIELD_WORLDS,
> // worldNum)); - this one also didn't work
> Analyzer analyzer = new StandardAnalyzer();
> QueryParser parser = new QueryParser(FIELD_WORLDS, analyzer);
>  Query query = parser.parse(worldNum);
>  Hits mapHits = searcher.search(query);
>
>
> Thanks a lot,
> Liat
>


Learning Lucene

2009-03-05 Thread Tuztuz T
Dear all,
I am really new to Lucene.
Is there anyone who can guide me in learning Lucene?
I have the old Lucene in Action book, but I have a hard time understanding the
syntax in the book against the new Lucene release (2.4).
Can anyone give me a copy of the new Lucene in Action book, or any other
material that I can go through?

thanks a lot

Tuztuz

RE: Learning Lucene

2009-03-05 Thread Sudarsan, Sithu D.
Hi Tuztuz,

Please visit the book's website and its forum; most of your queries
will be answered there.


Sincerely,
Sithu D Sudarsan

-Original Message-
From: Tuztuz T [mailto:tuztu...@yahoo.com] 
Sent: Thursday, March 05, 2009 9:24 AM
To: java-user@lucene.apache.org
Subject: Learning Lucene

dear all,
I am really new to Lucene.
Is there anyone who can guide me in learning Lucene?
I have the old Lucene in Action book, but I have a hard time understanding
the syntax in the book against the new Lucene release (2.4).
Can anyone give me a copy of the new Lucene in Action book, or any other
material that I can go through?

thanks a lot

Tuztuz

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



public apology for company spam

2009-03-05 Thread Yonik Seeley
This morning, an apparently over-zealous marketing firm, on behalf of
the company I work for, sent out a marketing email to a large number
of subscribers of the Lucene email lists.  This was done without my
knowledge or approval, and I can assure you that I'll make all efforts
to prevent it from happening again.

Sincerest apologies,
-Yonik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



similarity function

2009-03-05 Thread Seid Mohammed
For my work, I have read an article stating that " Answer type can be
automatically constructed by indexing different questions and answer
types. Later, when an unseen question appears, the answer type for this
question will be found with the help of a 'similarity function'
computation"

So the argument above is clear to me. My problem is:
1. How can I index individual questions and answer types as-is (not tokenized)?
2. How can I calculate the similarity between indexed questions and
unseen questions (questions of any type that can be asked later)?

To make things clear, the scenario is:
1. Who is the president of UN
  Answer type 
2. When will the presidency of Meles Zenawi hold?
  Answer Type 
These two will be indexed,
and later an unseen question like
"who is the president of Kenya"
  should match the first question and so will have answer
type of 

I appreciate any help.

Seid M

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Learning Lucene

2009-03-05 Thread Erik Hatcher


On Mar 5, 2009, at 9:24 AM, Tuztuz T wrote:

dear all
I am really new to Lucene.
Is there anyone who can guide me in learning Lucene?
I have the old Lucene in Action book, but I have a hard time understanding
the syntax in the book against the new Lucene release (2.4).
Can anyone give me a copy of the new Lucene in Action book, or any other
material that I can go through?


The second edition is available through Manning's MEAP program  
already. Still some writing left to do on it, and hopefully 2.9 will  
be out first, before it goes to print, but it has been updated to the  
latest API and contains lots of great new material primarily thanks to  
Mike McCandless.


   http://www.manning.com/hatcher3/

Erik


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: public apology for company spam

2009-03-05 Thread Glen Newton
Yonik,

Thank-you for your email. I appreciated and accept your apology.

Indeed the spam was annoying, but I think that you and your colleagues
have significant social capital in the Lucene and Solr communities, so
this minor but unfortunate incident should have minimal impact.

That said, you and your colleagues do not have infinite social
capital, and hopefully you will have no  reason to be forced to spend
this capital in such an unfortunate manner in the future.  :-)

sincerely,

Glen Newton

2009/3/5 Yonik Seeley :
> This morning, an apparently over-zealous marketing firm, on behalf of
> the company I work for, sent out a marketing email to a large number
> of subscribers of the Lucene email lists.  This was done without my
> knowledge or approval, and I can assure you that I'll make all efforts
> to prevent it from happening again.
>
> Sincerest apologies,
> -Yonik
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



--
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



indexing but not tokenizing

2009-03-05 Thread John Marks
Hi all,

I'm not able to see what's wrong in the following sample code.
I'm indexing a document with 5 fields, using five different indexing strategies.
I'm fine with the results for 4 of them, but field B is causing me some
trouble in understanding what's going on.

The value of field B is X (uppercase).
The analyzer is a SimpleAnalyzer, which I use on the QueryParser as well.
But when I search for X (uppercase) on field B, the X is converted to lowercase.
Now, I know that SimpleAnalyzer converts to lowercase, but I was
expecting it not to do so on field B, which is NOT_ANALYZED.

How should I fix my code?

Thank you in advance!
-John



--- code ---


package test;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;



public class Test
{
  public static void main(String[] args)
  {
    try
    {
      RAMDirectory idx = new RAMDirectory();
      SimpleAnalyzer analyzer = new SimpleAnalyzer();

      IndexWriter writer = new IndexWriter(idx, analyzer, true,
          IndexWriter.MaxFieldLength.LIMITED);

      Document doc = new Document();
      doc.add(new Field("A", "X", Field.Store.YES, Field.Index.NO));
      doc.add(new Field("B", "X", Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("C", "X", Field.Store.YES, Field.Index.ANALYZED));
      doc.add(new Field("D", "x", Field.Store.NO, Field.Index.NOT_ANALYZED));
      doc.add(new Field("E", "X", Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
      writer.close();

      IndexSearcher searcher = new IndexSearcher(idx);
      String field = "B";
      QueryParser parser = new QueryParser(field, analyzer);
      Query query = parser.parse("X");
      System.out.println("Query: " + query.toString());

      TopDocCollector collector = new TopDocCollector(1);
      searcher.search(query, collector);
      int numHits = collector.getTotalHits();
      System.out.println(numHits + " total matching documents");

      if (numHits > 0)
      {
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        doc = searcher.doc(hits[0].doc);
        System.out.println("A: " + doc.get("A"));
        System.out.println("B: " + doc.get("B"));
        System.out.println("C: " + doc.get("C"));
        System.out.println("D: " + doc.get("D"));
        System.out.println("E: " + doc.get("E"));
      }
    }
    catch (Exception e)
    {
      System.out.println(" caught a " + e.getClass() + "\n with message: "
          + e.getMessage());
    }
  }
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: indexing but not tokenizing

2009-03-05 Thread Ian Lea
Hi


I think that the SimpleAnalyzer you are passing to the query parser
will be downcasing the X.  You can fix it using an analyzer that
doesn't convert to lower case, creating the query directly in code, or
by using PerFieldAnalyzerWrapper, and no doubt other ways too.

If you want a direct suggestion: use PerFieldAnalyzerWrapper,
specifying a different analyzer for field B.
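
For example, a sketch assuming the 2.x API (KeywordAnalyzer passes the whole
field value through as a single unchanged token):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerFieldExample {
    public static void main(String[] args) throws Exception {
        // SimpleAnalyzer everywhere, except field B, which is left verbatim.
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
        wrapper.addAnalyzer("B", new KeywordAnalyzer());

        QueryParser parser = new QueryParser("B", wrapper);
        Query query = parser.parse("X");
        System.out.println("Query: " + query); // B:X (no longer lowercased)
    }
}

The same wrapper would normally be passed to the IndexWriter as well, so
index-time and query-time analysis stay consistent for the analyzed fields.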


--
Ian.


On Thu, Mar 5, 2009 at 3:17 PM, John Marks  wrote:
> Hi all,
>
> I'm not able to see what's wrong in the following sample code.
> I'm indexing a document with 5 fields, using five different indexing 
> strategies.
> I'm fine the the results for 4 of them, but field B is causing me some
> trouble in understanding what's going on.
>
> The value of field B is X (uppercase).
> The analyzer is a SimpleAnalyzer, which I use on the QueryParser as well.
> But when I search for X (uppercase) on field B, the X is converted to 
> lowercase.
> Now, I know that SimpleAnalyzer converts to lowercase, but I was
> expecting it not to do so on field B, which is NOT_ANALYZED.
>
> How should I fix my code?
>
> Thank you in advance!
> -John
>
>
>
> --- code ---
>
>
> package test;
>
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TopDocCollector;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.queryParser.QueryParser;
>
>
>
> public class Test
> {
>   public static void main(String[] args)
>   {
>     try
>     {
>   RAMDirectory idx = new RAMDirectory();
>   SimpleAnalyzer analyzer = new SimpleAnalyzer();
>
>   IndexWriter writer = new IndexWriter(idx, analyzer, true,
>   IndexWriter.MaxFieldLength.LIMITED);
>
>   Document doc = new Document();
>   doc.add(new Field("A", "X",
>   Field.Store.YES, Field.Index.NO));
>   doc.add(new Field("B", "X",
>   Field.Store.YES, Field.Index.NOT_ANALYZED));
>   doc.add(new Field("C", "X",
>   Field.Store.YES, Field.Index.ANALYZED));
>   doc.add(new Field("D", "x",
>   Field.Store.NO, Field.Index.NOT_ANALYZED));
>   doc.add(new Field("E", "X",
>   Field.Store.NO, Field.Index.ANALYZED));
>   writer.addDocument(doc);
>   writer.close();
>
>   IndexSearcher searcher = new IndexSearcher(idx);
>   String field = "B";
>   QueryParser parser = new QueryParser(field, analyzer);
>   Query query = parser.parse("X");
>   System.out.println("Query: " + query.toString());
>
>   TopDocCollector collector = new TopDocCollector(1);
>   searcher.search(query, collector);
>   int numHits = collector.getTotalHits();
>   System.out.println(numHits + " total matching documents");
>
>   if ( numHits > 0)
>   {
>     ScoreDoc[] hits = collector.topDocs().scoreDocs;
>     doc = searcher.doc(hits[0].doc);
>     System.out.println("A: " + doc.get("A"));
>     System.out.println("B: " + doc.get("B"));
>     System.out.println("C: " + doc.get("C"));
>     System.out.println("D: " + doc.get("D"));
>     System.out.println("E: " + doc.get("E"));
>   }
>     }
>     catch (Exception e)
>     {
>   System.out.println(" caught a " + e.getClass() + "\n with message: "
>   + e.getMessage());
>     }
>   }
>
> }
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Re: Re: Lucene in large database contexts

2009-03-05 Thread Patrick Turcotte


On 8/10/07, Askar Zaidi  wrote:

Hey Guys,

I am trying to do something similar: make the content searchable as soon as
it is added to the website. The way it can work in my scenario is that I
create the index for every new user account created.

Then, whenever a new document is uploaded, its contents are added to the
user's index using writer.addDocument(...).

As for closing the writer: yes! I'll close the writer and optimize after
it's added to the index.

I really think this should work. Don't you?

thanks
AZ

On 8/10/07, Erick Erickson  wrote:


Well, closing/opening an index is MUCH less expensive than
rebuilding the whole thing, so I don't understand part of your
statements.

It *may* (but I haven't tried it) be possible to flush the writer rather
than close/open it. But you MUST close/reopen the reader you search with,
even if flush works like I think it does.

But it's also possible to use a two-tiered approach. 1G isn't all that big.
Could you read it into a RAMDir and use that for your searches? Then, when
you add data, you add it to *both* indexes, but close/open the RAMdir for
searching.

It's also possible to keep the RAMdir as the delta between the FSdir and
"current" states of your index. Add to both and search both. Although
deletes may be a problem here. (A sketch of this two-tiered idea follows
below.)

You haven't specified how often you expect changes, though. 100/second?
1/minute? How real is "real time"? You could do something like warm up
a new reader in the background whenever you decided you needed to be
absolutely up to date, and swap your "live" reader for the newly warmed-up
one whenever you deemed it wise.

Or you could just close/open your reader after each modification, fire off
a couple of warmup queries at it, and let the users live with slow responses
if they happen to search before your warm-up queries completed.

The point is that there are many options, but to suggest the best one, we
need some throughput numbers and a better definition of what "real time"
means. Is a one-minute delay acceptable? 10 seconds? A millisecond?
The answer defines the scope of reasonable solutions.

Best
Erick
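
A rough sketch of that two-tiered/delta idea, under some assumptions:
Lucene 2.x-era APIs, no deletes, and new documents also being persisted to
the disk index elsewhere; class and method names are illustrative:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class TwoTierIndex {
    private final Analyzer analyzer = new StandardAnalyzer();
    private final RAMDirectory ramDir = new RAMDirectory();
    private final IndexSearcher fsSearcher; // opened once over the big on-disk index
    private IndexWriter ramWriter;
    private Searcher combined;

    public TwoTierIndex(String path) throws Exception {
        fsSearcher = new IndexSearcher(FSDirectory.getDirectory(path));
        ramWriter = new IndexWriter(ramDir, analyzer, true);
        reopen();
    }

    // New documents go into the small RAM delta (and, for durability, would
    // also be written to the FS index; omitted here).
    public synchronized void add(Document doc) throws Exception {
        ramWriter.addDocument(doc);
        ramWriter.close(); // commit the delta so a new reader can see it
        ramWriter = new IndexWriter(ramDir, analyzer, false);
        reopen();          // cheap: only the RAM side is reopened
    }

    private void reopen() throws Exception {
        // Old searchers should be closed once no query uses them; omitted.
        combined = new MultiSearcher(new Searchable[] {
                fsSearcher, new IndexSearcher(ramDir) });
    }

    public synchronized Searcher searcher() { return combined; }
}

Closing and recreating the RAM writer is the blunt way to commit; the point
is only that the large disk index never has to be reopened per add.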

On 8/10/07, Antonello Provenzano  wrote:


Kai,

The context I'm going to work with requires a continuous addition of
documents to the indexes, since it's user-driven content, and this
would require the content to be always up-to-date.
This is the problem I'm facing, since I cannot rebuild a 1Gb (at
least) index every time a user inserts a new entry into the database.


I know Digg, for instance, is using Lucene as its search engine: since
the amount of data they're dealing with is much higher than mine, I
would like to understand the way they implemented this kind of
solution.

Thank you again.
Antonello


On 8/10/07, Kai Hu  wrote:

Antonello,
   You are right. I think the Lucene IndexSearcher will search the old
information if the IndexWriter was not closed (I think Lucene releases the
lock here), so I only add a few documents at a time from a buffer to
implement "real time" indexing.


kai


From: antonellop...@gmail.com [mailto:antonellop...@gmail.com] On behalf of
Antonello Provenzano

Sent: Friday, August 10, 2007 17:59
To: java-user@lucene.apache.org
Subject: Re: Re: Lucene in large database contexts

Kai,

Thanks. The problem I see is that although I can add a Document
through IndexWriter or IndexModifier, it won't be searchable until
the index is closed and, possibly, optimized, since the score of the
document in the index context must be re-calculated on the basis of
the whole context.

Is this assumption true? Or am I completely wrong?

Cheers.
Antonello


On 8/10/07, Kai Hu  wrote:

Hi, Antonello
   You can use IndexWriter.addDocument(Document document) to add a
single document; the same goes for update and delete operations.


kai

-Original Message-
From: Antonello Provenzano [mailto:antonellop...@gmail.com]
Sent: Friday, August 10, 2007 17:09
To: java-user@lucene.apache.org
Subject: Lucene in large database contexts

Hi There!

I've been working for a while on the implementation of a website
oriented to contents that would contain millions of entries, most of
them indexable (such as descriptions, texts, names, etc.).
The ideal solution to make them searchable would be to use Lucene as
the index and search engine.

The reason I'm posting to the mailing list is the following: since all
the entries will be stored in a database (most likely MySQL InnoDB or
Oracle), what's the best technique to implement a system that indexes
content in "real time" (e.g. when an entry is inserted into the database)
and makes it searchable? Based on my understanding of Lucene, such a
thing is not possible, since the index must be re-created to be able
to search the indexed contents. Is this true?

Eventually, could anyone point me to a working example of how to
implement such a similar context?


Thank you for the support.
Antonello



--

Re: public apology for company spam

2009-03-05 Thread Erick Erickson
Let's see, you guys generously contributed your time and saved
my butt way more than once. I *think* I can stand an inadvertent
message or two ...

Best
Erick

On Thu, Mar 5, 2009 at 10:12 AM, Glen Newton  wrote:

> Yonik,
>
> Thank-you for your email. I appreciated and accept your apology.
>
> Indeed the spam was annoying, but I think that you and your colleagues
> have significant social capital in the Lucene and Solr communities, so
> this minor but unfortunate incident should have minimal impact.
>
> That said, you and your colleagues do not have infinite social
> capital, and hopefully you will have no  reason to be forced to spend
> this capital in such an unfortunate manner in the future.  :-)
>
> sincerely,
>
> Glen Newton
>
> 2009/3/5 Yonik Seeley :
> > This morning, an apparently over-zealous marketing firm, on behalf of
> > the company I work for, sent out a marketing email to a large number
> > of subscribers of the Lucene email lists.  This was done without my
> > knowledge or approval, and I can assure you that I'll make all efforts
> > to prevent it from happening again.
> >
> > Sincerest apologies,
> > -Yonik
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
>
>
> --
>
> -
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Instantiating a RAMDirectory from a mutating directory

2009-03-05 Thread Kieran Topping

Hello,

I would like to be able to instantiate a RAMDirectory from a directory 
that an IndexWriter in another process might currently be modifying.


Ideally, I would like to do this without any synchronizing or locking. 
Kind-of like the way in which an IndexReader can open an index in a 
directory, even if it's currently being modified by an IndexWriter.


However, simply calling:
 RAMDirectory rd = new RAMDirectory("/path/to/index");
will not work. It will periodically fail with a FileNotFoundException.
It's fairly obvious why this happens: Directory.copy() gets a list of
the files it needs to copy, and then copies them into the RAMDirectory
instance one by one. If, in the meantime, the IndexWriter deletes one of
these files, a FileNotFoundException occurs.


One thought that I had was that I would take advantage of the fact that 
it's possible to open an IndexReader on the mutating directory, and then 
use the "addIndexes()" method, as follows:


  // 1. create RAMDirectory.
  RAMDirectory ramDirectory = new RAMDirectory();
  // 2. create an index in the RAMDirectory.
  IndexWriter writer = new IndexWriter(ramDirectory, null/*analyzer*/, 
true /*create*/) ;

  // 3. open the (possibly mutating) source index.
  IndexReader reader = IndexReader.open("/path/to/index");
  // 4. copy the source index into the RAMDirectory index.
  writer.addIndexes(new IndexReader [] {reader});

However ... there is a fairly unambiguous warning in 
IndexWriter.addIndexes()'s documentation:


>>   NOTE: the index in each Directory must not be changed (opened by a 
writer) while this method is running. This method does not acquire a 
write lock in each input Directory, so it is up to the caller to enforce 
this.


I'm slightly confused by this warning though, as IndexReader's 
documentation implies that it is OK to open an IndexReader in this fashion.


I'm wondering whether anyone knows the internals of 
IndexWriter.addIndexes() well enough to know whether my proposed 
solution will work reliably?


Or, indeed, whether there might be another way of instantiating a 
RAMDirectory from a directory which might currently be being modified by 
an IndexWriter?
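
(One low-tech workaround I can think of is simply retrying the copy until it
happens to run without hitting a concurrent delete; sketched below purely as
an illustration, with no guarantee that the snapshot is a consistent commit
point:)

import java.io.FileNotFoundException;
import org.apache.lucene.store.RAMDirectory;

public class SnapshotLoader {
    // Retry the copy until it completes without a concurrent delete.
    public static RAMDirectory load(String path, int maxAttempts) throws Exception {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return new RAMDirectory(path);
            } catch (FileNotFoundException e) {
                Thread.sleep(250); // a file vanished mid-copy; try again
            }
        }
        throw new FileNotFoundException(
                "no stable snapshot after " + maxAttempts + " attempts");
    }
}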


Many thanks in advance,

Kieran Topping



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: similarity function

2009-03-05 Thread Vasudevan Comandur
Hi,

   Since you are trying to answer factoid questions to start with, it is
better to use OpenNLP components to identify named entities (NER) in the
documents and use those tags as part of your indexing process.

Regards
 Vasu


On Thu, Mar 5, 2009 at 8:19 PM, Seid Mohammed  wrote:

> For my work, I have read an article stating that " Answer type can be
> automatically constructed by Indexing Different Questions and Answer
> types. Later, when an unseen question appears, answer type for this
> question will be found with the help of 'similarity function'
> computation"
>
> so I am clear with the argument above. My problem is:
> 1. how can I index individual questions and Answer types as is ( not
> tokenized
> 2. how can I calculate the similarity between indexed questions and
> and unseen questions (question of any type that can be asked latter)
>
> to make things clear: the scenario is
> 1. Who is the president of UN
>  Answer type 
> 2. When will the presidency of Meles Zenawi hold?
>  Answer Type 
> these two will be indexed and
> and later an unseen question like
> who is the president of Kenya
>  should match the first question and so that will have answer
> type of 
>
> I appreciate any help
>
> Seid M
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: public apology for company spam

2009-03-05 Thread Shashi Kant
Yes, it is good to learn that Yonik, Erik et al. are also human beings. :-)
Thanks for all your contributions to Lucene/Solr, this list and the OSS
community in general.

Best,
Shashi


On Thu, Mar 5, 2009 at 11:36 AM, Erick Erickson wrote:

> Let's see, you guys generously contributed your time and saved
> my butt way more than once. I *think* I can stand an inadvertent
> message or two ...
>
> Best
> Erick
>
> On Thu, Mar 5, 2009 at 10:12 AM, Glen Newton 
> wrote:
>
> > Yonik,
> >
> > Thank-you for your email. I appreciated and accept your apology.
> >
> > Indeed the spam was annoying, but I think that you and your colleagues
> > have significant social capital in the Lucene and Solr communities, so
> > this minor but unfortunate incident should have minimal impact.
> >
> > That said, you and your colleagues do not have infinite social
> > capital, and hopefully you will have no  reason to be forced to spend
> > this capital in such an unfortunate manner in the future.  :-)
> >
> > sincerely,
> >
> > Glen Newton
> >
> > 2009/3/5 Yonik Seeley :
> > > This morning, an apparently over-zealous marketing firm, on behalf of
> > > the company I work for, sent out a marketing email to a large number
> > > of subscribers of the Lucene email lists.  This was done without my
> > > knowledge or approval, and I can assure you that I'll make all efforts
> > > to prevent it from happening again.
> > >
> > > Sincerest apologies,
> > > -Yonik
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> >
> >
> > --
> >
> > -
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: similarity function

2009-03-05 Thread Grant Ingersoll

Hi Seid,

Do you have a reference for the article?  I've done some QA in my day,  
but don't recall reading that one.


At any rate, I do think it is possible to do what you are after.  See  
below.


On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:


For my work, I have read an article stating that " Answer type can be
automatically constructed by Indexing Different Questions and Answer
types. Later, when an unseen question appears, answer type for this
question will be found with the help of 'similarity function'
computation"

so I am clear with the argument above. My problem is:
1. how can I index individual questions and Answer types as is ( not  
tokenized


I'm not sure you want this, but when constructing your Field, just use  
the NOT_ANALYZED option.




2. how can I calculate the similarity between indexed questions and
and unseen questions (question of any type that can be asked latter)


In line with #1, I think you might be better off to actually tokenize
the question as one field, and the answer type as a second field.
Then, you can let Lucene calculate similarity via its normal query
mechanisms.  In this case, I would try experimenting with things
like exact match, phrase queries with slop, etc.  That way, not only
can you match "Who is the president of UN" but you might also match on
things that are a bit fuzzier.  To do this, you might need to have
several fields per document with variations.  I could also see using
Lucene's payload mechanism as well.
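
A small sketch of that two-field layout (the field names, the "person" label,
and the 2.x-era API usage are all illustrative assumptions; the tag in the
original mail was stripped, so "person" is only a stand-in):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class AnswerTypeIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("qa-index", new StandardAnalyzer(), true);

        Document doc = new Document();
        // The question text is tokenized so Lucene's scoring does the matching.
        doc.add(new Field("question", "Who is the president of UN",
                Field.Store.YES, Field.Index.TOKENIZED));
        // The answer type is a single label, kept as one term.
        doc.add(new Field("answerType", "person",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        // An unseen question: the shared terms (who, president) score the match.
        IndexSearcher searcher = new IndexSearcher("qa-index");
        QueryParser parser = new QueryParser("question", new StandardAnalyzer());
        Query q = parser.parse("who is the president of Kenya");
        Hits hits = searcher.search(q);
        if (hits.length() > 0) {
            System.out.println("answer type: " + hits.doc(0).get("answerType"));
        }
    }
}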


But, as Vasu said, you will likely need other parts too, like OpenNLP.

HTH,
Grant

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: similarity function

2009-03-05 Thread patrick o'leary
Sounds like your most difficult part will be the question parser using POS.

This is kind of old school, but use something like the AliceBot AIML library
http://en.wikipedia.org/wiki/AIML
where the subjective terms can be extracted from the questions and indexed
separately.

Or, as Grant and others suggest, use OpenNLP (which rocks) or LingPipe
(the LingPipe license is a little bit of a pain)
for entity extraction.

An interesting way to look at the data would be to construct 3 fields:
Original_Question, Question_base, Subject

Doc:
Original_Question: Who is the president of the UN
Question_base: Who is the president of
Question_base: Who is
Subject: the president of the UN
Subject: the president
Subject: the UN
/Doc

And similarity can be somewhat easier to calculate with similar question
bases, subjects, etc
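
For illustration, building that document in Lucene might look like this (a
sketch only; the decomposition itself would come from the POS/entity step):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class QuestionDoc {
    // One question, decomposed ahead of time into progressively more
    // generic parts. Lucene allows several values under one field name.
    public static Document build() {
        Document doc = new Document();
        doc.add(new Field("Original_Question", "Who is the president of the UN",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("Question_base", "Who is the president of",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("Question_base", "Who is",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("Subject", "the president of the UN",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("Subject", "the president",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("Subject", "the UN",
                Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }
}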

P


On Thu, Mar 5, 2009 at 3:05 PM, Grant Ingersoll  wrote:

> Hi Seid,
>
> Do you have a reference for the article?  I've done some QA in my day, but
> don't recall reading that one.
>
> At any rate, I do think it is possible to do what you are after.  See
> below.
>
> On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:
>
>  For my work, I have read an article stating that " Answer type can be
>> automatically constructed by Indexing Different Questions and Answer
>> types. Later, when an unseen question appears, answer type for this
>> question will be found with the help of 'similarity function'
>> computation"
>>
>> so I am clear with the argument above. My problem is:
>> 1. how can I index individual questions and Answer types as is ( not
>> tokenized
>>
>
> I'm not sure you want this, but when constructing your Field, just use the
> NOT_ANALYZED option.
>
>
>> 2. how can I calculate the similarity between indexed questions and
>> and unseen questions (question of any type that can be asked latter)
>>
>
> In line with #1, I think you might be better off to actually tokenize the
> question as one one field, and the answer type as a second field.  Then, you
> can let Lucene calculate similarity via it's normal query mechanisms.  In
> this case, I would like try experimenting with things like: exact match,
> phrase queries with slop, etc.  That way, not only can you match "Who is the
> president of UN" but you might also match on things that are a bit fuzzier.
>  To do this, you might need to have several fields per document with
> variations.  I could also see using Lucene's payload mechanism as well.
>
> But, as Vasu said, you will likely need other parts too, like OpenNLP.
>
> HTH,
> Grant
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Query against newly created index.. Do not work

2009-03-05 Thread Chris Hostetter

: I can now create indexes with Nutch, and see them in Luke.. this is
: fantastic news, well for me it is beyond fantastic.. 
: Now I would like to (need to) query them, and to that end I wrote the
: following code segment.
: 
:   int maxHits = 1000;
:   NutchBean nutchBean = new NutchBean(nutchConf);
:   Query nutchQuery = Query.parse(nutchSearchTerm,
: nutchConf);
:   Hits nutchHits = nutchBean.search(nutchQuery, maxHits);
:   return nutchHits.getLength();

...even though your code is written in Java, "java-u...@lucene" isn't the
appropriate mailing list for this type of question; java-user is for users
of the Lucene Java API that is the underpinning of Nutch (it's slightly
confusing that the sub-project name has "java" in it).

If you ask your question on the nutch-u...@lucene mailing list, I'm
guessing you'll get a lot of feedback from people who are familiar with
the Nutch Java code. (Most people on this list probably have no idea what
a NutchBean is.)




-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



execute on server and read from file

2009-03-05 Thread futurpc

Hello.
I have data files on a web server that contain some values (I need to build
a chart from them).
I made an applet that reads the information from the files and builds the
chart, but when I upload the applet to the server, it doesn't find the files.
Can you please suggest how I can make a Java program that will execute on the
server and read the files there?

thank you
-- 
View this message in context: 
http://www.nabble.com/execute-on-server-and-read-from-file-tp22363229p22363229.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Confidence scores at search time

2009-03-05 Thread Chris Hostetter

: > Hmm, bugzilla has moved to JIRA.  I'm not sure where the mapping is
: > anymore.   There used to be a Bugzilla Id in JIRA, I think. Sorry.

FYI...

by default the jira homepage has a form for searching by legacy 
bugzilla ID...
  https://issues.apache.org/jira/
...if you create a Jira account you can customize that page (which is why 
some people might not see it if they are logged in)

Also: if you go to "Find Issues" and select a project that was migrated
from Bugzilla, you can then click the link that appears to refresh the
search menu to show you new options specific to that project ... a
search-by-bugzilla-id box will appear at the bottom of the left nav.



-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Confidence scores at search time

2009-03-05 Thread Chris Hostetter

: That being said, I could see maybe determining a delta value such that if the
: distance between any two scores is more than the delta, you cut off the rest
: of the docs.  This takes into account the relative state of scores and is not
: some arbitrary value (although, the delta is, of course)

I read an interesting paper a while back that suggested a similar 
strategy for a related problem...

   http://www.isi.edu/integration/people/michelso/paps/ijdar2007.pdf 

...while the whole paper might be interesting to some, the relevant parts
to this discussion are Section 2.1 and Table 1.  The goal there is to
identify which reference set(s) are relevant to an input set -- they
compute a similarity score for each set, sort them, and then compute the
percentage difference for each successive pair.  They consider any set
with a score above the average score for all sets *and* with a score
percentage diff (relative to the next highest scoring set) greater than some
arbitrary delta to be a match.  (The theory being that an arbitrary
percentage delta is better than an arbitrary score cutoff, and that you
only want things scoring better than average, because as scores taper off
on the lower end, they can taper off quickly and show very high percentage
differences.)

I have no idea how well this approach would work for general search (with 
a large set of documents and a large number of matches)
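
Purely as an illustration, one reading of that rule over a descending-sorted
score array (the delta stays an arbitrary parameter; this is not anything in
the Lucene API):

public class ScoreCutoff {
    // Keep results that score above the mean, cutting at the first big
    // relative drop between successive scores.
    public static int cutoff(float[] scores /* sorted descending */, float delta) {
        float sum = 0f;
        for (float s : scores) sum += s;
        float mean = sum / scores.length;

        int keep = 0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] <= mean) break;      // only above-average scores qualify
            keep = i + 1;
            if (i + 1 < scores.length) {
                float pctDiff = (scores[i] - scores[i + 1]) / scores[i];
                if (pctDiff > delta) break;    // big relative drop: cut here
            }
        }
        return keep;                           // number of results to keep
    }
}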


To keep in mind just how diverse the approaches to this type of problem
can be depending on the nitty-gritty specifics of your use case, consider
the "GuardianComponent" example from my BTB talk at ApacheCon last year
(slides 32-25)...
http://people.apache.org/~hossman/apachecon2008us/btb/apache-solr-beyond-the-box.pdf

...either of the approaches mentioned there tackles the "sacrifice recall to
achieve greater precision" aspect of your problem in the specific domain
of short documents where you want to eliminate matches that are
significantly longer than the input (even if they score well using
traditional tf/idf metrics)


-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: execute on server and read from file

2009-03-05 Thread Erick Erickson
Uhhhm, this is the Lucene users' list, not a general Java programming
list, so unless this has something to do with Lucene I doubt
you'll get much help.

I'd suggest one of the Java programming language lists rather than
this one.

Best
Erick

On Thu, Mar 5, 2009 at 6:32 PM, futurpc  wrote:

>
> Hello.
> I have data files on a web server that contain some values (I need to build
> a chart from them).
> I made an applet that reads the information from the files and builds the
> chart, but when I upload the applet to the server, it doesn't find the files.
> Can you please suggest how I can make a Java program that will execute on
> the server and read the files there?
>
> thank you
> --
> View this message in context:
> http://www.nabble.com/execute-on-server-and-read-from-file-tp22363229p22363229.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


deletion of index-files fails

2009-03-05 Thread rolarenfan
So, I have a (small) Lucene index, all fine; I use it a bit, and then (on app
shutdown) want to delete its files and the containing directory (the index is
intended as a temporary object). At some earlier time this was working just
fine, using java.io.File.delete(). Now, however, some of the files get deleted
(segments*) whereas others fail (no exception is thrown; java.io.File.delete()
just returns false: _0.cfs, _0.cfx). I've tried closing the IndexReader (no
IndexWriter exists at shutdown), but that makes no difference.

Any ideas? 
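
(For what it's worth, a minimal sketch of the shutdown order that usually
matters here: every IndexReader/IndexSearcher closed before any delete().
An open .cfs handle is the classic reason File.delete() returns false,
especially on Windows.)

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class TempIndexCleanup {
    // Close everything that can hold index files open, then delete the
    // children before the directory itself.
    public static boolean deleteIndex(IndexSearcher searcher, IndexReader reader,
                                      File indexDir) throws Exception {
        if (searcher != null) searcher.close();
        if (reader != null) reader.close();
        boolean ok = true;
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) ok &= f.delete();
        }
        return ok && indexDir.delete();
    }
}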

thanks
Paul 




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



error in code

2009-03-05 Thread nitin gopi
Hi all,

 I am getting errors when compiling this code. Can somebody please tell me what
the problem is? The code is given below. The bold lines were giving the error
*cannot find symbol*.



import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This class demonstrates the process of creating an index with Lucene
 * for text files in a directory.
 */
public class TextFileIndexer {
 public static void main(String[] args) throws Exception{
   //fileDir is the directory that contains the text files to be indexed
   File   fileDir  = new File("C:\\files_to_index ");

   //indexDir is the directory that hosts Lucene's index files
   File   indexDir = new File("C:\\luceneIndex");
   Analyzer luceneAnalyzer = new StandardAnalyzer();
   IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);
   File[] textFiles  = fileDir.listFiles();
   long startTime = new Date().getTime();

   //Add documents to the index
   for(int i = 0; i < textFiles.length; i++){
 if(textFiles[i].isFile() > textFiles[i].getName().endsWith(".txt")){
   System.out.println("File " + textFiles[i].getCanonicalPath()
  + " is being indexed");
   Reader textReader = new FileReader(textFiles[i]);
   Document document = new Document();
   *document.add(Field.Text("content",textReader));
   document.add(Field.Text("path",textFiles[i].getPath()));*
   indexWriter.addDocument(document);
 }
   }

   indexWriter.optimize();
   indexWriter.close();
   long endTime = new Date().getTime();

   System.out.println("It took " + (endTime - startTime)
  + " milliseconds to create an index for the files in the directory "
  + fileDir.getPath());
  }
}

Regards ,
Nitin Gopi


Re: error in code

2009-03-05 Thread Ganesh

Hello Gopi,

My comments:


if(textFiles[i].isFile() > textFiles[i].getName().endsWith(".txt")){

   && should be used: if(textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")){


*document.add(Field.Text("content",textReader));

   Field.Text is gone from the 2.x API; use: document.add(new Field("content", textReader));


document.add(Field.Text("path",textFiles[i].getPath()));*

   document.add(new Field("path", textFiles[i].getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
   (a Field built from a String needs explicit Store/Index flags; the choices here are one reasonable option)
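
Putting those corrections together, the loop body might read as follows (a
sketch against the 2.x Field API; the Store/Index choices for "path" are an
assumption, with UN_TOKENIZED keeping the path as a single term):

if (textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")) {
    System.out.println("File " + textFiles[i].getCanonicalPath()
            + " is being indexed");
    Reader textReader = new FileReader(textFiles[i]);
    Document document = new Document();
    // Field(String, Reader): indexed and tokenized, but not stored.
    document.add(new Field("content", textReader));
    // The path is one opaque token: store it and index it un-analyzed.
    document.add(new Field("path", textFiles[i].getPath(),
            Field.Store.YES, Field.Index.UN_TOKENIZED));
    indexWriter.addDocument(document);
}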

Regards
Ganesh

- Original Message - 
From: "nitin gopi" 

To: 
Sent: Friday, March 06, 2009 8:24 AM
Subject: error in code



Hi all,

I am getting error in running this code. Can somebody please tell me what
is the problem? The code is given below. The bold lines were giving error 
as

*cannot find symbol *



import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
* This class demonstrates the process of creating an index with Lucene
* for text files in a directory.
*/
public class TextFileIndexer {
public static void main(String[] args) throws Exception{
  //fileDir is the directory that contains the text files to be indexed
  File   fileDir  = new File("C:\\files_to_index ");

  //indexDir is the directory that hosts Lucene's index files
  File   indexDir = new File("C:\\luceneIndex");
  Analyzer luceneAnalyzer = new StandardAnalyzer();
  IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);
  File[] textFiles  = fileDir.listFiles();
  long startTime = new Date().getTime();

  //Add documents to the index
  for(int i = 0; i < textFiles.length; i++){
if(textFiles[i].isFile() > textFiles[i].getName().endsWith(".txt")){
  System.out.println("File " + textFiles[i].getCanonicalPath()
 + " is being indexed");
  Reader textReader = new FileReader(textFiles[i]);
  Document document = new Document();
  *document.add(Field.Text("content",textReader));
  document.add(Field.Text("path",textFiles[i].getPath()));*
  indexWriter.addDocument(document);
}
  }

  indexWriter.optimize();
  indexWriter.close();
  long endTime = new Date().getTime();

  System.out.println("It took " + (endTime - startTime)
 + " milliseconds to create an index for the files in the directory "
 + fileDir.getPath());
 }
}

Regards ,
Nitin Gopi



Send instant messages to your online friends http://in.messenger.yahoo.com 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Using Lucene for user query parsing

2009-03-05 Thread Srinivas Bharghav
I am trying to evaluate whether Lucene is the right candidate for the
problem at hand.

Say I have 3 indexes:

Index 1 has street names.
Index 2 has business names.
Index 3 has area names.

All these names can be single words or a combination of words like woodward
street or marks and spencers street etc etc.

Now the user enters a query saying "mc donalds woodward street kingston
precinct".

I have to parse this query and come up with the best match possible. The
problem is that in the query I do not know which part is the business name,
area name, or street name. Also, the user may give the query in any order; for
example, he may give it as "kingston precinct mc donalds woodward street".
There might be spelling mistakes in the query entered by the user. Also, he
might use road for street, or lane for street, and such things. I know that
Lucene is the right candidate for the synonym and spelling-mistake parts, but
I am a bit hazy regarding the user-query-parsing part, as to which index to
search for what. Any help is greatly appreciated.

Thanks,
Srini.


Re: indexing but not tokenizing

2009-03-05 Thread John Marks
Thank you Ian,

> If you want a direct suggestion: use PerFieldAnalyzerWrapper,
> specifying a different analyzer for field B.
>
>
> --
> Ian.


this makes a lot of sense.

-John

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Questions about analyzer

2009-03-05 Thread Ganesh

Hello all,

1)
Which is better to use: the Snowball analyzer or the Lucene contrib
analyzers? Is there no built-in stop-word list for the Snowball analyzer?


2)
Are Analyzer and QueryParser thread-safe? Can they be created once and
used in as many threads as needed?


3)
I am using the Snowball analyzer for indexing and search. When I search for
windows AND vista, QueryParser is adding AND as part of the search, but I am
expecting something like +windows +vista.
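
For (3), a sketch of one way to get the +windows +vista form (assuming the
2.x QueryParser and the contrib Snowball analyzer; the field name is made up):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class AndOperatorExample {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("content",
                new SnowballAnalyzer("English"));
        // Make bare terms conjunctive instead of relying on the AND keyword.
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query q = parser.parse("windows vista");
        System.out.println(q); // roughly: +content:window +content:vista (stemmed)
    }
}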


Regards
Ganesh 

Send instant messages to your online friends http://in.messenger.yahoo.com 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using Lucene for user query parsing

2009-03-05 Thread Anshum
Hi Srinivas,

Perhaps what you need here is query-formation logic which assigns the
right keywords to the right fields. Let me know in case I got it wrong. One
way to do that could be by using index-time boosts for fields and then
running a query (so that a particular field is preferred over the others).
As far as I know, Lucene should be a better solution than anything else
for such a thing, but there'd be a few things that you
would have to build yourself as well.
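
As a concrete starting point, a sketch (this assumes a single index with
three fields rather than three separate indexes, and the field names are
made up; fuzzy and synonym handling would still be layered on top):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class PlaceQueryExample {
    public static void main(String[] args) throws Exception {
        String[] fields = { "street", "business", "area" };
        MultiFieldQueryParser parser =
                new MultiFieldQueryParser(fields, new StandardAnalyzer());
        // Every keyword is tried against every field; scoring picks the best fit.
        Query q = parser.parse("mc donalds woodward street kingston precinct");
        System.out.println(q);
    }
}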

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Fri, Mar 6, 2009 at 11:55 AM, Srinivas Bharghav  wrote:

> I am trying to evaluate as to whether Lucene is the right candidate for the
> problem at hand.
>
> Say I have 3 indexes:
>
> Index 1 has street names.
> Index 2 has business names.
> Index 3 has area names.
>
> All these names can be single words or a combination of words like woodward
> street or marks and spencers street etc etc.
>
> Now the user enters a query saying "mc donalds woodward street kingston
> precinct".
>
> I have to parse this query and come up with the best match possible. The
> problem is, in the query I do not know which part is the business name or
> area name or street name. Also the user may give the query in any order for
> example he may give it as "kingston precinct mc donalds woodward street".
> There might be spelling mistakes in the query entered by the user. Also he
> might use road for street or lane for street and such things. I know that
> Lucene is the right candidate for the synonym and spelling mistakes part
> but
> am a bit hazy regarding the user query parsing part as to in which index to
> search what. Any help is greatly appreciated.
>
> Thanks,
> Srini.
>