Questions about use of SpellChecker: Constructor and Similarity...
Hi,

I have two questions about this GREAT tool (framework, library... "whatever").

I decided to add a spell checker to my applications, started reading some papers, and "found out" about the Lucene project. Anyway, I got it working, but I just want to know:

1) Why do I have to pass a Directory object to the SpellChecker constructor (it is mandatory)?

2) Suppose my dictionary contains these entries:

"The Lord of the Rings: The Two Towers"
"The Lord of the Rings: The Fellowship of the Ring"
"The Lord of the Rings: The Return of the King"

How can I code something so that, when the user queries "The Lord of the Rings: The Two Towers", the application suggests:

"The Lord of the Rings: The Fellowship of the Ring"
"The Lord of the Rings: The Return of the King"

Is this possible using only Lucene?

My test class:

    SpellChecker spell;
    spell = new SpellChecker(FSDirectory.getDirectory(".")); // why this...?!!
    spell.indexDictionary(new Dicionario());
    String[] l = spell.suggestSimilar(args[0], 5);
    for (String vl : l) {
        System.out.println("Suggested : " + vl);
    }

My dictionary:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class Dicionario implements org.apache.lucene.search.spell.Dictionary {
        // Returns an iterator over the words to be indexed.
        public Iterator getWordsIterator() {
            List lista = new ArrayList();
            lista.add("peter");
            lista.add("spider man 3");
            lista.add("johnny depp");
            lista.add("the edge");
            lista.add("monk");
            lista.add("arnold schwarzenegger");
            return lista.iterator();
        }
    }

Thanks in advance... :D
Re: Questions about use of SpellChecker: Constructor and Similarity...
> > 1) Why do I have to pass a Directory object to the SpellChecker
> > constructor (it is mandatory)?
>
> Mainly because it is a nasty piece of code. But it does a good job.

Thanks. How can we suggest to the team that they add a plain constructor with no parameters?

> > 2) Suppose my dictionary contains these entries:
> >
> > "The Lord of the Rings: The Two Towers"
> > "The Lord of the Rings: The Fellowship of the Ring"
> > "The Lord of the Rings: The Return of the King"
> >
> > How can I code something so that, when the user queries
> > "The Lord of the Rings: The Two Towers", the application suggests:
> > "The Lord of the Rings: The Fellowship of the Ring"
> > "The Lord of the Rings: The Return of the King"
> >
> > Is this possible using only Lucene?
>
> There are no typos in your example so you really don't even need a spell
> checker for that. Using OR clauses in your query would be enough.

I don't think so, because the user will enter:

"The Lord of the Rings: The Return of the King"

...and the system should respond with:

Similar:
The Lord of the Rings: The Two Towers
The Lord of the Rings: The Fellowship of the Ring

I can't see how to do that using only OR clauses. For example:

name like '%the%' or name like '%Lord%' or name like '%of%' or name like '%the%' or name like '%Rings%'

would return far too many results, and it would also perform poorly.

> Perhaps you want to combine one variant with MUST clauses that has a bit
> more boost than the OR clauses.
>
> karl

Thanks so much, Karl!!!
Re: Questions about use of SpellChecker: Constructor and Similarity...
> > Mainly because it is a nasty piece of code. But it does a good job.
>
> Because spellChecker use a directory to store data. It can be FSDirectory,
> RAMDirectory

Perfect explanation!!! So using a RAMDirectory should perform better:

    spell = new SpellChecker(FSDirectory.getDirectory("."));
    spell = new SpellChecker(new RAMDirectory());

The second should be faster for small amounts of data.

Thanks so much, now I understand. This should be in the official documentation.

> A classical OR query will match shuffled data : "The king of lord got a
> ring" should match.
> With shingle, you will match title in the right order.

So shingles split the text into "couples" of words, and I can combine them with OR. (Good idea, I'll try this.)

Thanks so much!!!
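To make the shingle point above concrete, here is a minimal plain-Java sketch (it does not use Lucene's ShingleFilter). The class and method names are made up for illustration, and two-word shingles are assumed. It compares how many single words and how many word pairs a title shares with the shuffled sentence from the reply.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
public class ShingleOrderDemo {

    // Splits text into lowercase words, dropping punctuation.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Builds the set of two-word shingles (pairs of adjacent words).
    static Set<String> shingles(String text) {
        List<String> w = words(text);
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 1 < w.size(); i++) {
            out.add(w.get(i) + " " + w.get(i + 1));
        }
        return out;
    }

    // Counts the elements that two sets have in common.
    static int shared(Set<String> a, Set<String> b) {
        Set<String> copy = new HashSet<>(a);
        copy.retainAll(b);
        return copy.size();
    }

    public static void main(String[] args) {
        String title = "The Lord of the Rings";
        String shuffled = "The king of lord got a ring";
        int sharedWords = shared(new HashSet<>(words(title)), new HashSet<>(words(shuffled)));
        int sharedShingles = shared(shingles(title), shingles(shuffled));
        System.out.println("shared words: " + sharedWords);
        System.out.println("shared shingles: " + sharedShingles);
    }
}
```

The two texts share the words "the", "lord" and "of", so a plain OR over single words matches, but they share no adjacent word pair, which is why shingles respect word order.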
Re: Questions about use of SpellChecker: Constructor and Similarity...
> Sorry, I misunderstood your question. See other reply.

Yes, I got it. Thanks.

> Are you sure about that? Did you benchmark? Can we see the results?

Hey man, take it easy, I was only guessing. But I think using the ShingleFilter will help.
Re: Questions about use of SpellChecker: Constructor and Similarity...
> I'm cool :) I just think you are overcomplicating things.

Yes... I can use pairs of words combined with OR. Suppose I query against these titles:

The Lord of Rings: Return of King
The Lord of Rings: Fellowship
The Lord of Rings: The Two towers
The Lord of Weapons
The Lord of War

Suppose a user searches for "The Lord of Rings Return of King":

WHERE name like '%the lord%' or name like '%lord of%' or name like '%of rings%' or name like '%rings return%' or name like '%return of%' or name like '%of king%'

This will return every row, so the question now is how to get the best ranking.

Anyway, you have all helped me so much. THANKS SO MUCH!!! (I won't say anything bad about the SpellChecker constructor anymore.)
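One possible answer to the ranking question above is to score each title by the number of word pairs it shares with the query. The sketch below is plain Java rather than Lucene's own scoring, and the class and method names are hypothetical. It uses the five titles and the query from the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
public class ShingleRanking {

    // Lowercases the text and returns its set of two-word shingles.
    static Set<String> shingles(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            result.add(words.get(i) + " " + words.get(i + 1));
        }
        return result;
    }

    // Number of distinct shingles that the query and the title have in common.
    static int score(String query, String title) {
        Set<String> shared = shingles(query);
        shared.retainAll(shingles(title));
        return shared.size();
    }

    // Returns a copy of the titles sorted by descending score (ties keep input order).
    static List<String> rank(String query, List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparingInt((String t) -> score(query, t)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "The Lord of War",
                "The Lord of Rings: Fellowship",
                "The Lord of Weapons",
                "The Lord of Rings: Return of King",
                "The Lord of Rings: The Two towers");
        String query = "The Lord of Rings Return of King";
        for (String title : rank(query, titles)) {
            System.out.println(score(query, title) + "  " + title);
        }
    }
}
```

Every title still matches, as with the chain of LIKE clauses, but the exact title scores 6, the other "Rings" titles score 3, and the "War" and "Weapons" titles score 2, so sorting by score puts the closest titles first.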
Re: Use of Lucene for DB Search
> Hi,
>
> We are planning to provide search functionality in a web-based
> application. Can we use Lucene to search data from a database such as
> Oracle or MS SQL?

Yes, you can.

> Thanks and Regards
> प्रशांत सराफ
> (Prashant Saraf)
> SE-II
> Cross Country Infotech
> Ext : 72543
> www.crosscountry.in
Problem when try to make a bench of indexing (a dictionary with 120.000 words)
Hello,

Sample code:

    SpellChecker spell;
    RAMDirectory dram = new RAMDirectory();
    Dicionario dic = new Dicionario(); // an implementation of spell.Dictionary
    spell = new SpellChecker(dram);
    spell.indexDictionary(dic); // indexing...

Then I got these results:

machine1: Windows XP SP2, Celeron 2.66 GHz, 256 MB
words: 60.000 (40~53 characters each)
memory alloc: 16 (MB)
time to index: 55108 (ms)

So I tried with 120.000 words, and when I run the program I get:

Exception in thread "Thread-1" org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.store.RAMFile.newBuffer(RAMFile.java:88)
    at org.apache.lucene.store.RAMFile.addBuffer(RAMFile.java:61)
    at org.apache.lucene.store.RAMOutputStream.switchCurrentBuffer(RAMOutputStream.java:128)
    at org.apache.lucene.store.RAMOutputStream.writeByte(RAMOutputStream.java:105)
    ...

Why does this occur?
Re: Problem when try to make a bench of indexing (a dictionary with 120.000 words)
> If the 16M means you're only giving the process that much memory, it
> surprises me that it runs at all. Especially since you're putting it all
> in a RAMdir.

Sorry, that 16M is dictonarySizeInBytes(). I imagined the index would be about the same size...

So once my dictionary has more than 60.000 words, do I need to use FSDirectory?

> Or is that 16M referring to something else?

Just the dictionary size... :(

> Best
> Erick
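On the assumption that the index is about the same size as the dictionary: as far as I understand, SpellChecker indexes character n-grams of every word rather than only the word itself, so the index can end up noticeably larger than the raw dictionary. The plain-Java sketch below only counts n-grams. The gram sizes 3 and 4 and the class name are assumptions for illustration, not values read from the SpellChecker source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
public class GramCount {

    // Returns every substring of length n, in order of position.
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Total number of n-grams for all sizes from minN to maxN inclusive.
    static int totalGrams(String word, int minN, int maxN) {
        int total = 0;
        for (int n = minN; n <= maxN; n++) {
            total += grams(word, n).size();
        }
        return total;
    }

    public static void main(String[] args) {
        // An entry in the 40~53 character range mentioned in the thread.
        String entry = "the lord of the rings: the return of the king";
        System.out.println("length: " + entry.length());
        System.out.println("3-grams: " + grams(entry, 3).size());
        System.out.println("3- and 4-grams together: " + totalGrams(entry, 3, 4));
    }
}
```

If each entry really expands into dozens of small terms like this, a 16 MB dictionary held in a RAMDirectory could plausibly need far more than 16 MB of heap, which would fit the OutOfMemoryError. Raising the JVM heap (-Xmx) or using an FSDirectory are the usual ways out.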
Index evolution
Hi all.

I'm very new to Lucene. All I have done is read some docs about how it works, which brings me to this question: how easy is it to add new fields to the documents in the index? Suppose that today I can search by book title and later decide that including the author in the search would be a good idea. How easy is it to do that with Lucene?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
Joining searches on multiple indexes
My second question is: can I join the results of multiple indexes using a common field? Suppose I have user info in two different sources (indexes) and want to search fields in both, but the search should join the resulting records on a common field (user id, for example). Is this possible?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
Lucene usage
Hi all.

I'm writing a wrapper component around Lucene (using Avalon) and I'd like to know the common API usage. How should I bootstrap the index? Should I create the IndexSearcher when I initialize the component? For how long should I keep the IndexWriter open? For a single document, should I create the writer, add the document, and close it?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
Creating initial index using FSDirectory
Hi all.

I'm writing an Avalon component that wraps Lucene. My problem is that I can't start the component using FSDirectory unless the index files (segment, etc.) are already in place, or I set the rewrite flag to true. In my case, I'd like to create the index file structure only the first time I initialize the component, then reuse the same index on each application run. Any help?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
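One way to get "create only on the first run" is to compute the create flag at startup from what is already on disk. The sketch below is plain Java with hypothetical names. It assumes an existing index can be recognised by a file whose name starts with "segments" (based on the segment files mentioned above), which may not hold for every Lucene version. If your release provides IndexReader.indexExists(...), that is probably the safer check.

```java
import java.io.File;

// Hypothetical helper, not part of Lucene.
public class IndexBootstrap {

    // fileNames is what File.list() returned for the index directory;
    // it is null when the directory does not exist.
    static boolean shouldCreate(String[] fileNames) {
        if (fileNames == null) {
            return true; // no directory yet: first run
        }
        for (String name : fileNames) {
            if (name.startsWith("segments")) {
                return false; // assumed marker of an existing index
            }
        }
        return true; // directory exists but holds no index
    }

    public static void main(String[] args) {
        // "index" is a placeholder path for illustration.
        File indexDir = new File(args.length > 0 ? args[0] : "index");
        boolean create = shouldCreate(indexDir.list());
        System.out.println("create = " + create);
    }
}
```

The resulting boolean would then be passed as the create (rewrite) flag when the component opens its directory or writer, so the index is built once and reused afterwards.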
Removing document from index
Hi all.

I can remove a document from the index using IndexReader.delete(Term), but the search still returns this document. What am I doing wrong?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
Re: Search with accents
Hi Eduardo. I'm using the StandardAnalyzer and I can search for words with accents. In my case, "saúde".

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net

On 8/1/06, Eduardo S. Cordeiro <[EMAIL PROTECTED]> wrote:

Yes... here's how I create my QueryParser:

    QueryParser parser = new QueryParser("text", new BrazilianAnalyzer());

2006/8/1, Zhang, Lisheng <[EMAIL PROTECTED]>:
> Hi,
>
> Have you used the same BrazilianAnalyzer when
> searching?
>
> Best regards, Lisheng
>
> -Original Message-
> From: Eduardo S. Cordeiro [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 01, 2006 1:40 PM
> To: java-user@lucene.apache.org
> Subject: Search with accents
>
> Hello there,
>
> I have a brazilian portuguese index, which has been analyzed with
> BrazilianAnalyzer. When searching words with accents, however, they're
> not found -- for instance, if the index contains some text with the
> word "maçã" and I search for that very word, I get no hits, but if I
> search "maca" (which is another portuguese word) then the document
> containing "maçã" is found.
>
> I've seen posts in the archive indicating that I should use
> ISOLatin1AccentFilter to handle this, but I don't quite see how:
> should I leave indexation as it is and use this filter only for search
> queries or should I apply it in both cases?
>
> Thank you,
> Eduardo Cordeiro
Re: Search with accents
I'm using StandardAnalyzer everywhere, so yes, Portuguese stopwords won't be eliminated.

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net

On 8/2/06, Eduardo S. Cordeiro <[EMAIL PROTECTED]> wrote:

But was your index created with BrazilianAnalyzer? Because otherwise you wouldn't have portuguese stopwords eliminated, like "e", "ou", etc.

2006/8/2, Leandro Saad <[EMAIL PROTECTED]>:
> Hi Eduardo. I'm using the StandardAnalyser and I can search for words with
> accents. In my case "saúde"
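On the question of whether accent removal belongs at index time, at query time, or both: analysis generally has to be the same on both sides, otherwise the stored term and the query term differ. The sketch below shows this in plain Java with java.text.Normalizer rather than ISOLatin1AccentFilter. The class and method names are hypothetical, and the accented word is written with Unicode escapes to avoid source-encoding problems.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
public class AccentFolding {

    // Decomposes accented characters and drops the combining marks.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The accented word from the question, as Unicode escapes.
        String indexedWord = "ma\u00e7\u00e3";
        String query = "ma\u00e7\u00e3";

        // Same folding at index time and at query time: the terms agree.
        boolean foldedOnBothSides = fold(indexedWord).equals(fold(query));

        // Folding only the query: the stored term keeps its accents.
        boolean foldedQueryOnly = indexedWord.equals(fold(query));

        System.out.println("folded on both sides: " + foldedOnBothSides);
        System.out.println("folded query only:    " + foldedQueryOnly);
    }
}
```

The trade-off is the one already noted in the thread: once both sides are folded, a search for "maca" also finds documents containing the accented word.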
Multiple lock files
Hi all.

How do I remove Lucene locks (at startup) if there are multiple applications using Lucene on the same box and all use the same lock dir?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
Re: Multiple lock files
Yeah. But how do I know whether a lock file belongs to a particular index or app? I don't want to remove a lock file that another app is using.

:: Leandro

On 8/8/06, Michael McCandless <[EMAIL PROTECTED]> wrote:

> How do I remove lucene locks (startup) if there are multiple applications
> using lucene on the same box and all use the same lock dir?

The lock files are just files, so you can up and remove them.

However: this is in general dangerous and should not be necessary.

Lucene uses the lock files to ensure index readers/writers across different JVMs, or within a single JVM, do not step on each other. If you remove them you can corrupt your index.

It's fine if you have multiple Lucene indices sharing the same lock directory; each index will create a different name for its lock file.

Mike
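To illustrate how each index can get a different lock file name in a shared lock directory, the sketch below derives a name from a digest of the index path. This is an assumption about the general idea, not a copy of Lucene's implementation. The "lucene-" prefix, the use of MD5, the paths, and the class name are all illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Lucene.
public class LockNames {

    // Builds "lucene-<hex digest of the index path>-<lockName>".
    static String lockFileName(String indexPath, String lockName) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(indexPath.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("lucene-");
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.append('-').append(lockName).toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder index paths for two applications on the same box.
        System.out.println(lockFileName("/data/app1/index", "write.lock"));
        System.out.println(lockFileName("/data/app2/index", "write.lock"));
    }
}
```

Under such a scheme an application that knows its own index path can work out which lock file is its own and leave the others alone. Whether your Lucene version names its locks this way should be checked before relying on it.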
Re: Multiple lock files
I'm trying to use them, and I may be wrong, but I can't unlock the dir before I create the Directory, right? Do you know if the lock is created when I create the Directory?

:: Leandro

On 8/8/06, Michael Busch <[EMAIL PROTECTED]> wrote:

> Yeah. But how do I know if a lock file is related to an index or app? I
> don't want to remove a lock file that another app is using

Leandro,

check out the static method of IndexReader: unlock(Directory). Link:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#unlock(org.apache.lucene.store.Directory)

You can use that method to forcibly unlock a particular index directory. Furthermore, you can use the method boolean isLocked(Directory) to check whether an index is actually locked.

Michael

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
Re: Multiple lock files
I want to use the same lock dir, but remove only the associated lock file when I start the application.

:: Leandro

On 8/8/06, Simon Willnauer <[EMAIL PROTECTED]> wrote:

You can start your applications with a system property set: "org.apache.lucene.lockDir" to specify your lock directory.

Hope that helps...

regards
Simon

On 8/8/06, Leandro Saad <[EMAIL PROTECTED]> wrote:
> Yeah. But how do I know if a lock file is related to an index or app? I
> don't want to remove a lock file that another app is using
>
> :: Leandro
Fields with phrases
Hi all,

I have a field called "location" on my index. For example, this string was stored on my index:

"A B" "A C" D

When I search for "location: ", these are the results that I'd like to retrieve:

1) location: D -- 1 hit
2) location: A -- no hits
3) location: "A B" -- 1 hit
4) location: "A C" -- 1 hit

Is there any way I can make this work?

--
Leandro Rodrigo Saad Cruz
software developer - certified scrum master
:: scrum.com.br
:: db.apache.org/ojb
:: guara-framework.sf.net
:: xingu.sf.net
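The behaviour asked for above amounts to treating each quoted phrase as one indivisible term and matching only whole terms. The plain-Java sketch below shows that matching rule. The class and method names are hypothetical. In Lucene this would presumably correspond to adding each value as its own untokenized "location" field value, which is an assumption to check against your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
public class LocationTerms {

    // Either a double-quoted phrase or a run of non-space characters.
    private static final Pattern TERM = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Splits the raw field value into whole terms, keeping phrases intact.
    static List<String> parseTerms(String raw) {
        List<String> terms = new ArrayList<>();
        Matcher m = TERM.matcher(raw);
        while (m.find()) {
            terms.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return terms;
    }

    // A query hits only when it equals one complete term.
    static boolean matches(String raw, String query) {
        return parseTerms(raw).contains(query);
    }

    public static void main(String[] args) {
        String stored = "\"A B\" \"A C\" D";
        for (String query : new String[] {"D", "A", "A B", "A C"}) {
            System.out.println("location: " + query + " -> " + (matches(stored, query) ? "hit" : "no hit"));
        }
    }
}
```

With whole-term matching, "A" alone finds nothing because it never appears as a complete term, while "A B", "A C" and "D" each match once, which is the result table in the question.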