Questions about use of SpellChecker: Constructor and Similarity...

2008-04-08 Thread Leandro
Hi,

I have two questions about this GREAT tool (framework, library,
"whatever").
I decided to add a spell checker to my applications, started reading some
papers, and "found out" about the Lucene project...

Anyway, I got it working, but I would like to know:

1) Why must I pass a Directory object to the constructor of
SpellChecker?
2) Suppose that my dictionary contains these entries:

"The Lord of the Rings: The Two Towers"
"The Lord of the Rings: The Fellowship of the Ring"
"The Lord of the Rings: The Return of the King"

How can I code something so that, when the user queries
"The Lord of the Rings: The Two Towers", the application suggests:

"The Lord of the Rings: The Fellowship of the Ring"
"The Lord of the Rings: The Return of the King"

Is this possible using only Lucene?

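### My Test Class ###
// Needs org.apache.lucene.search.spell.SpellChecker
// and org.apache.lucene.store.FSDirectory on the classpath.
SpellChecker spell;
spell = new SpellChecker(FSDirectory.getDirectory(".")); // why is this needed?
spell.indexDictionary(new Dicionario());

// Ask for up to 5 suggestions for the word given on the command line.
String[] suggestions = spell.suggestSimilar(args[0], 5);

for (String suggestion : suggestions) {
    System.out.println("Suggested: " + suggestion);
}
###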



### My Dictionary ###
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Dicionario implements
        org.apache.lucene.search.spell.Dictionary {

    // Returns the words that the spell checker should index.
    public Iterator getWordsIterator() {
        List<String> lista = new ArrayList<String>();
        lista.add("peter");
        lista.add("spider man 3");
        lista.add("johnny depp");
        lista.add("the edge");
        lista.add("monk");
        lista.add("arnold schwarzenegger");
        return lista.iterator();
    }
}
###

Thanks in advance... :D


Re: Questions about use of SpellChecker: Constructor and Similarity...

2008-04-08 Thread Leandro
> > 1) Why must I pass a Directory object to the constructor of
> > SpellChecker?
> >
>
> Mainly because it is a nasty piece of code. But it does a good job.
>

Thanks.
How can we suggest this (adding a plain constructor with no parameters) to
the team?

> > 2) Suppose that my dictionary contains these entries:
> >
> > "The Lord of the Rings: The Two Towers"
> > "The Lord of the Rings: The Fellowship of the Ring"
> > "The Lord of the Rings: The Return of the King"
> >
> > How can I code something so that, when the user queries
> > "The Lord of the Rings: The Two Towers", the application suggests:
> > "The Lord of the Rings: The Fellowship of the Ring"
> > "The Lord of the Rings: The Return of the King"
> >
> > Is this possible using only Lucene?
> >
>
> There are no typos in your example so you really don't even need a spell
> checker for that. Using OR clauses in your query would be enough.


I guess not, because the user will enter "The Lord of the Rings: The Return of
the King" and the system should respond with:


Similar:
The Lord of the Rings: The Two Towers
The Lord of the Rings: The Fellowship of the Ring

I can't see how to do that using only OR clauses.
For example:

name like '%the%'
or
name like '%Lord%'
or
name like '%of%'
or
name like '%the%'
or
name like '%Rings%'

will produce far too many results, besides performing poorly...

> Perhaps you want to combine one variant with MUST clauses that has a bit
> more boost than the OR clauses.
>
> karl


Thanks so much Karl!!!


Re: Questions about use of SpellChecker: Constructor and Similarity...

2008-04-08 Thread Leandro
> > Mainly because it is a nasty piece of code. But it does a good job.
> >
> Because SpellChecker uses a Directory to store its data. It can be an
> FSDirectory, RAMDirectory


Perfect explanation!
So using a RAMDirectory is better (performance-wise):

spell = new SpellChecker(FSDirectory.getDirectory("."));
spell = new SpellChecker(new RAMDirectory());

The second is better (faster) for small amounts of data...
Thanks so much, now I understand. This should be in the official documentation...



> A classical OR query will match shuffled data : "The king of lord got a
> ring" should match.
> With shingle, you will match title in the right order.


Shingles will split it into pairs of words, so I can use them with OR...
(Good idea, I'll try this.)


Thanks so much!!!


Re: Questions about use of SpellChecker: Constructor and Similarity...

2008-04-08 Thread Leandro
> Sorry, I misunderstood your question. See other reply.
>

Yes I got it. thanks

> Are you sure about that? Did you benchmark? Can we see the results?


Hey man, take it easy, I was just guessing. But I think using the ShingleFilter
will help.


Re: Questions about use of SpellChecker: Constructor and Similarity...

2008-04-08 Thread Leandro
>
>
>
> I'm cool :) I just think you are overcomplicating things.
>
>
Yes... I can use pairs of words and OR.
Suppose I query against these titles:

The Lord of Rings: Return of King
The Lord of Rings: Fellowship
The Lord of Rings: The Two towers
The Lord of Weapons
The Lord of War

Suppose a user searches for: "The Lord of Rings Return of King"
WHERE
name like '%the lord%' or
name like '%lord of%' or
name like '%of rings%' or
name like '%rings return%' or
name like '%return of%' or
name like '%of king%'


So it will return all rows... the question now is which 'ranking' is best.
Anyway, you have all helped me a lot, THANKS SO MUCH!!!
(Now I won't say anything bad about the SpellChecker constructor.)
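
One way to think about the ranking question is to count how many word pairs (shingles) each title shares with the query. The sketch below does this in plain Java with no Lucene dependency. The class `ShingleRanker` and its methods are hypothetical names made up for this illustration, and counting shared pairs is a simplification, not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Lucene): ranks titles by shared word pairs.
final class ShingleRanker {

    // Splits text into lower-case words and returns every pair of adjacent words.
    static Set<String> shingles(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        List<String> clean = new ArrayList<>();
        for (String w : words) {
            if (!w.isEmpty()) {
                clean.add(w);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < clean.size(); i++) {
            result.add(clean.get(i) + " " + clean.get(i + 1));
        }
        return result;
    }

    // Counts the shingles that the query and the title have in common.
    static int score(String query, String title) {
        Set<String> common = shingles(query);
        common.retainAll(shingles(title));
        return common.size();
    }

    // Returns the titles sorted by descending score; ties keep their input order.
    static List<String> rank(String query, List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort((a, b) -> Integer.compare(score(query, b), score(query, a)));
        return sorted;
    }
}
```

For the query "The Lord of Rings Return of King", this gives "The Lord of Rings: Return of King" a score of 6, the other two "Rings" titles a score of 3, and "The Lord of Weapons" and "The Lord of War" a score of 2, so every row still matches but the closest titles come first.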


Re: Use of Lucene for DB Search

2008-04-10 Thread Leandro
>  Hi,
>
>  We are planning to provide search functionality in a web-based
> application. Can we use Lucene to search data from databases such as
> Oracle and MS SQL?
>
Yes, you can.

>
>
>
>
> Thanks and Regards
>प्रशांत सराफ
> (Prashant Saraf)
> SE-II
> Cross Country Infotech
> Ext : 72543
> www.crosscountry.in
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


Problem when trying to benchmark indexing (a dictionary with 120,000 words)

2008-04-10 Thread Leandro
Hello,

*Sample code:*
SpellChecker spell;
RAMDirectory dram = new RAMDirectory();
Dicionario dic = new Dicionario(); //one implementation of spell.Dictionary
spell= new SpellChecker(dram);
spell.indexDictionary(dic); //indexing...

*Then I got:*
machine1: Windows XP SP2, Celeron 2.66 GHz and 256 MB
words: 60,000 (40~53 characters each)
memory alloc: 16 (MB)
time to index: 55108 (ms)

*So I tried with 120,000 words*... when I ran the program...

Exception in thread "Thread-1" org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.store.RAMFile.newBuffer(RAMFile.java:88)
at org.apache.lucene.store.RAMFile.addBuffer(RAMFile.java:61)
at org.apache.lucene.store.RAMOutputStream.switchCurrentBuffer(RAMOutputStream.java:128)
at org.apache.lucene.store.RAMOutputStream.writeByte(RAMOutputStream.java:105)
...

*Why does this occur?*


Re: Problem when trying to benchmark indexing (a dictionary with 120,000 words)

2008-04-10 Thread Leandro
>
> If the 16M means you're only giving the process that much memory, it
> surprises me that it runs at all. Especially since you're putting it all
> in a RAMdir.
>

Sorry, that 16M is dictonarySizeInBytes(); I imagined the index would be
about the same size...

So when I use a dictionary with more than 60,000 words, do I need to use
FSDirectory?



>
> Or is that 16M referring to something else?


Just the dictionary size...
:(


>
> Best
> Erick
>
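
The spell index stores n-grams for every word, so it can be noticeably larger than the raw dictionary, and a RAMDirectory keeps all of it on the Java heap. A rough sanity check before choosing an in-memory index could look like the sketch below. The class `HeapCheck`, the expansion factor and the "half of the heap" rule are all assumptions made for illustration; they are not measured values and not part of Lucene.

```java
// Hypothetical helper: guesses whether an in-memory index is plausible.
final class HeapCheck {

    // Returns true if rawBytes * expansionFactor fits within half of maxHeapBytes.
    // Both the factor and the one-half margin are assumed, not measured.
    static boolean fitsInHeap(long rawBytes, int expansionFactor, long maxHeapBytes) {
        long estimated = rawBytes * expansionFactor;
        return estimated <= maxHeapBytes / 2;
    }

    // Convenience overload that asks the running JVM for its heap limit.
    static boolean fitsInHeap(long rawBytes, int expansionFactor) {
        return fitsInHeap(rawBytes, expansionFactor, Runtime.getRuntime().maxMemory());
    }
}
```

If the check fails, the usual options are to raise the heap limit with the JVM's `-Xmx` option or to build the index in an FSDirectory instead.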


Index evolution

2006-05-26 Thread Leandro Saad

Hi all. I'm very new to Lucene. All I have done is read some docs about how
it works, which brings me to the question:

How easy is it to add new fields to the documents in the index?
Suppose that today I can search by book title and then decide that including
the author in the search would be a good idea. How easy is it to do that with
Lucene?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net


Joining searches on multiple indexes

2006-05-26 Thread Leandro Saad

My second question is: can I join the results of multiple indexes using a
common field?
Suppose I have user info in 2 different sources (indexes) and want to search
fields in both, but the search should join the resulting records using a
common field (user id, for example). Is this possible?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
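
One possible approach is to search each index separately and join the two hit lists in application code. The sketch below shows only that join step in plain Java. It assumes each hit has already been copied into a `Map` of field name to value; the class `ResultJoiner` and the field names used with it are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: inner join of two hit lists on a shared field.
final class ResultJoiner {

    // Keeps only left hits whose keyField value also occurs in a right hit,
    // and merges the fields of both records (left values win on conflict).
    static List<Map<String, String>> join(List<Map<String, String>> left,
                                          List<Map<String, String>> right,
                                          String keyField) {
        // Index the right-hand hits by their key for constant-time lookup.
        Map<String, Map<String, String>> rightByKey = new HashMap<>();
        for (Map<String, String> hit : right) {
            String key = hit.get(keyField);
            if (key != null) {
                rightByKey.put(key, hit);
            }
        }
        List<Map<String, String>> joined = new ArrayList<>();
        for (Map<String, String> hit : left) {
            Map<String, String> match = rightByKey.get(hit.get(keyField));
            if (match != null) {
                Map<String, String> merged = new HashMap<>(match);
                merged.putAll(hit);
                joined.add(merged);
            }
        }
        return joined;
    }
}
```

This keeps the join outside the search engine, so each index can still be queried on its own fields.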


Lucene usage

2006-06-13 Thread Leandro Saad

Hi all.

I'm writing a wrapper component around Lucene (using Avalon) and I'd like
to know the common API usage.

How should I bootstrap the index? Should I create the IndexSearcher when I
initialize the component?
How long should I keep the IndexWriter open? For a single document, should I
create the writer, add the document and close it?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net


Creating initial index using FSDirectory

2006-06-21 Thread Leandro Saad

Hi all. I'm writing an Avalon component that wraps Lucene. My problem is
that I can't start the component using FSDirectory unless the index files
are already in place (segment, etc.), or I set the rewrite flag to true.

In my case, I'd like to create the index file structure only the first time I
initialize the component, then reuse the same index for each application
run.

Any help?
--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net
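
A minimal sketch of one way to decide the flag at startup: look into the index directory and create a new index only if no file whose name starts with `segments` is there yet. The class `IndexBootstrap` is a hypothetical name, and the check relies on the assumption that an existing Lucene index always contains such a file. If your Lucene version offers `IndexReader.indexExists(...)`, that is probably the better test.

```java
import java.io.File;

// Hypothetical helper: decides whether the index must be created from scratch.
final class IndexBootstrap {

    // Returns true when indexDir is missing or holds no file named "segments...".
    static boolean needsCreate(File indexDir) {
        String[] names = indexDir.list(); // null if missing or not a directory
        if (names == null) {
            return true;
        }
        for (String name : names) {
            if (name.startsWith("segments")) {
                return false;
            }
        }
        return true;
    }
}
```

The result would then be passed as the create argument when opening the writer, so the first run builds the file structure and later runs reuse it.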


Removing document from index

2006-06-28 Thread Leandro Saad

Hi all. I can remove a document from the index using
IndexReader.delete(Term), but the search still returns this document.
What am I doing wrong?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net


Re: Search with accents

2006-08-02 Thread Leandro Saad

Hi Eduardo. I'm using the StandardAnalyzer and I can search for words with
accents. In my case "saúde".

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net

On 8/1/06, Eduardo S. Cordeiro <[EMAIL PROTECTED]> wrote:


Yes...here's how I create my QueryParser:

QueryParser parser = new QueryParser("text", new BrazilianAnalyzer());

2006/8/1, Zhang, Lisheng <[EMAIL PROTECTED]>:
> Hi,
>
> Have you used the same BrazilianAnalyzer when
> searching?
>
> Best regards, Lisheng
>
> -Original Message-
> From: Eduardo S. Cordeiro [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 01, 2006 1:40 PM
> To: java-user@lucene.apache.org
> Subject: Search with accents
>
>
> Hello there,
>
> I have a brazilian portuguese index, which has been analyzed with
> BrazilianAnalyzer. When searching words with accents, however, they're
> not found -- for instance, if the index contains some text with the
> word "maçã" and I search for that very word, I get no hits, but if I
> search "maca" (which is another portuguese word) then the document
> containing "maçã" is found.
>
> I've seen posts in the archive indicating that I should use
> ISOLatin1AccentFilter to handle this, but I don't quite see how:
> should I leave indexation as it is and use this filter only for search
> queries or should I apply it in both cases?
>
> Thank you,
> Eduardo Cordeiro
>
>
>
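
On the question of where to apply accent removal: the usual rule is that index-time and query-time text must go through the same normalization, otherwise the terms cannot match. The sketch below shows the idea in plain Java using `java.text.Normalizer`. It is not the `ISOLatin1AccentFilter` implementation, and the class `AccentFolder` is a hypothetical name used only here.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: folds accented letters to their plain form.
final class AccentFolder {

    // Lower-cases the text, decomposes accented letters (NFD)
    // and then drops the combining marks.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(
                text.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

With this, both "maçã" and "maca" fold to "maca", so a term folded at index time is only found if the query is folded the same way. The cost is that the two Portuguese words can no longer be told apart.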



Re: Search with accents

2006-08-03 Thread Leandro Saad

I'm using StandardAnalyzer everywhere, so, yes, Portuguese stopwords won't be
eliminated.

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net

On 8/2/06, Eduardo S. Cordeiro <[EMAIL PROTECTED]> wrote:


But was your index created with BrazilianAnalyzer? Because otherwise
you wouldn't have portuguese stopwords eliminated, like "e", "ou",
etc.

2006/8/2, Leandro Saad <[EMAIL PROTECTED]>:
> Hi Eduardo. I'm using the StandardAnalyzer and I can search for words with
> accents. In my case "saúde"
>
> --
> Leandro Rodrigo Saad Cruz
> CTO - InterBusiness Technologies
> db.apache.org/ojb
> guara-framework.sf.net
> xingu.sf.net
>
> On 8/1/06, Eduardo S. Cordeiro <[EMAIL PROTECTED]> wrote:
> >
> > Yes...here's how I create my QueryParser:
> >
> > QueryParser parser = new QueryParser("text", new BrazilianAnalyzer());
> >
> > 2006/8/1, Zhang, Lisheng <[EMAIL PROTECTED]>:
> > > Hi,
> > >
> > > Have you used the same BrazilianAnalyzer when
> > > searching?
> > >
> > > Best regards, Lisheng
> > >
> > > -Original Message-
> > > From: Eduardo S. Cordeiro [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, August 01, 2006 1:40 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Search with accents
> > >
> > >
> > > Hello there,
> > >
> > > I have a brazilian portuguese index, which has been analyzed with
> > > BrazilianAnalyzer. When searching words with accents, however, they're
> > > not found -- for instance, if the index contains some text with the
> > > word "maçã" and I search for that very word, I get no hits, but if I
> > > search "maca" (which is another portuguese word) then the document
> > > containing "maçã" is found.
> > >
> > > I've seen posts in the archive indicating that I should use
> > > ISOLatin1AccentFilter to handle this, but I don't quite see how:
> > > should I leave indexation as it is and use this filter only for search
> > > queries or should I apply it in both cases?
> > >
> > > Thank you,
> > > Eduardo Cordeiro
> > >
> > >
> > >
> > >
> >
>
>



Multiple lock files

2006-08-08 Thread Leandro Saad

Hi all.

How do I remove lucene locks (startup) if there are multiple applications
using lucene on the same box and all use the same lock dir?

--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net


Re: Multiple lock files

2006-08-08 Thread Leandro Saad

Yeah. But how do I know if a lock file is related to an index or app? I
don't want to remove a lock file that another app is using

:: Leandro

On 8/8/06, Michael McCandless <[EMAIL PROTECTED]> wrote:



> How do I remove lucene locks (startup) if there are multiple applications
> using lucene on the same box and all use the same lock dir?

The lock files are just files, so you can up and remove them.

However: this is in general dangerous and should not be necessary.

Lucene uses the lock files to ensure index readers/writers across
different JVMs, or within a single JVM, do not step on each other.  If
you remove them you can corrupt your index.

It's fine if you have multiple Lucene indices sharing the same lock
directory; each index will create a different name for its lock file.

Mike
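
To illustrate how indices that share one lock directory can still end up with different lock files, the sketch below derives a lock file name from a digest of the index path. This is only an illustration of the idea: the class `LockNames`, the `lucene-` prefix and the `-write.lock` suffix are assumptions, and the exact naming scheme depends on the Lucene version.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds a per-index lock file name from the index path.
final class LockNames {

    // Returns "lucene-" + MD5 of the path as hex + "-write.lock".
    static String lockNameFor(String indexPath) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(indexPath.getBytes(StandardCharsets.UTF_8));
            StringBuilder name = new StringBuilder("lucene-");
            for (byte b : digest) {
                name.append(String.format("%02x", b & 0xff));
            }
            return name.append("-write.lock").toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the name depends on the index location, two applications with different index directories get different lock files, and an application can recognize its own lock by recomputing the name for its own path.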





Re: Multiple lock files

2006-08-08 Thread Leandro Saad

I'm trying to use them, and I may be wrong, but I can't unlock the dir
before I create the Directory, right? Do you know if the lock is created when
I create the Directory?

:: Leandro

On 8/8/06, Michael Busch <[EMAIL PROTECTED]> wrote:



> Yeah. But how do I know if a lock file is related to an index or app? I
> don't want to remove a lock file that another app is using
>
Leandro,

check out the static method of IndexReader: unlock(Directory). Link:

http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#unlock(org.apache.lucene.store.Directory)

You can use that method to forcibly unlock a particular index directory.
Furthermore you can use the method boolean isLocked(Directory) to check
whether an index is actually locked.

Michael






--
Leandro Rodrigo Saad Cruz
CTO - InterBusiness Technologies
db.apache.org/ojb
guara-framework.sf.net
xingu.sf.net


Re: Multiple lock files

2006-08-08 Thread Leandro Saad

I want to use the same lock dir, but remove only the associated lock file
when I start the application.

:: Leandro

On 8/8/06, Simon Willnauer <[EMAIL PROTECTED]> wrote:


You can start your applications with a system property set:
"org.apache.lucene.lockDir"
to specify your lock directory

Hope that helps...

regards Simon

On 8/8/06, Leandro Saad <[EMAIL PROTECTED]> wrote:
> Yeah. But how do I know if a lock file is related to an index or app? I
> don't want to remove a lock file that another app is using
>
> :: Leandro
>
> On 8/8/06, Michael McCandless <[EMAIL PROTECTED]> wrote:
> >
> >
> > > How do I remove lucene locks (startup) if there are multiple applications
> > > using lucene on the same box and all use the same lock dir?
> >
> > The lock files are just files, so you can up and remove them.
> >
> > However: this is in general dangerous and should not be necessary.
> >
> > Lucene uses the lock files to ensure index readers/writers across
> > different JVMs, or within a single JVM, do not step on each other.  If
> > you remove them you can corrupt your index.
> >
> > It's fine if you have multiple Lucene indices sharing the same lock
> > directory; each index will create a different name for its lock file.
> >
> > Mike
> >
> >
> >
>
>





Fields with phrases

2006-09-11 Thread Leandro Saad

Hi all,

I have a field called "location" in my index. For example, this string was
stored in it:  "A B" "A C" D
When I search on "location:", these are the results that I'd like to
retrieve:
1) location: D -- 1 hit
2) location: A -- no hits
3) location: "A B" -- 1 hit
4) location: "A C" -- 1 hit

Is there any way I can make this work?

--
Leandro Rodrigo Saad Cruz
software developer - certified scrum master
:: scrum.com.br
:: db.apache.org/ojb
:: guara-framework.sf.net
:: xingu.sf.net

