Re: [ANN] General Availability of LucidWorks Enterprise

2010-12-15 Thread Andy
Congrats!

A couple questions:

1) Which version of Solr is this based on?
2) How is LWE different from standard Solr? How should one choose between the 
two?

Thanks.

--- On Wed, 12/15/10, Grant Ingersoll  wrote:

> From: Grant Ingersoll 
> Subject: [ANN] General Availability of LucidWorks Enterprise
> To: solr-u...@lucene.apache.org, java-user@lucene.apache.org
> Date: Wednesday, December 15, 2010, 4:39 PM
> Lucid Imagination is pleased to
> announce the general availability of our Apache Solr/Lucene
> powered LucidWorks Enterprise (LWE).  LWE is designed
> to make it easier for people to get up to speed on search by
> providing easier management, integration with libraries
> commonly used in building search applications (such as
> crawling), as well as value-add components developed by Lucid
> Imagination, all packaged on top of Apache Solr while still
> giving access to Solr.
> 
> You can get more info in the press release: 
> http://www.lucidimagination.com/About/Company-News/Lucid-Imagination-Announces-General-Availability-and-Free-Download-LucidWorks-Ent
> 
> Other Details:
> Download LucidWorks Enterprise software:
> www.lucidimagination.com/lwe/download
> View free documentation: http://lucidworks.lucidimagination.com
> View a demonstration of LucidWorks Enterprise: 
> http://www.lucidimagination.com/lwe/demos
> 
> Access LucidWorks Enterprise whitepapers and tutorials:
> www.lucidimagination.com/lwe/whitepapers
> Read further commentary on the Lucid Imagination blog
> 
> Cheers,
> Grant
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

2014-01-15 Thread andy
Hi guys,

As the subject says, it seems that the length of the field does not affect the doc score
accurately for the Chinese analyzer in my code.

index source code

private static Directory DIRECTORY;

@BeforeClass
public static void before() throws IOException {
    DIRECTORY = new RAMDirectory();
    Analyzer chineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_40, chineseAnalyzer);
    FieldType nameType = new FieldType();
    nameType.setIndexed(true);
    nameType.setStored(true);
    nameType.setOmitNorms(false);
    try {
        IndexWriter indexWriter = new IndexWriter(DIRECTORY, indexWriterConfig);

        List<String> nameList = new ArrayList<String>();
        nameList.add("咨询公司");
        nameList.add("飞鹰咨询管理咨询公司");
        nameList.add("北京中标咨询公司");
        nameList.add("重庆咨询公司");
        nameList.add("商务咨询服务公司");
        nameList.add("法律咨询公司");
        for (int i = 0; i < nameList.size(); i++) {
            Document document = new Document();
            document.add(new Field("name", nameList.get(i), nameType));
            document.add(new Field("id", String.valueOf(i + 1), nameType));
            indexWriter.addDocument(document);
        }
        indexWriter.commit();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

search snippet:
@Test
public void testChinese() throws IOException, ParseException {
    String keyword = "咨询公司";
    System.out.println("Searching for:" + keyword);
    System.out.println();
    IndexReader indexReader = DirectoryReader.open(DIRECTORY);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    Query query = new QueryParser(Version.LUCENE_40, "name",
            new SmartChineseAnalyzer(Version.LUCENE_40)).parse(keyword);
    TopDocs topDocs = indexSearcher.search(query, 15);
    System.out.println("Search Result:");
    if (null != topDocs && 0 < topDocs.totalHits) {
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            System.out.println("doc id:" + indexSearcher.doc(scoreDoc.doc).get("id"));
            String name = indexSearcher.doc(scoreDoc.doc).get("name");
            System.out.println("content of Field:" + name);
            dumpCNTokens(name);
            System.out.println("score:" + scoreDoc.score);
            System.out.println("---");
        }
    } else {
        System.out.println("no results");
    }
}


And search result as follows:
Searching for:咨询公司

Search Result:
doc id:1
content of Field:咨询公司
Terms:咨询公司  
score:0.74763227
---
doc id:2
content of Field:飞鹰咨询管理咨询公司
Terms:飞鹰咨询  管理  咨询  公司  
score:0.6317303
---
doc id:3
content of Field:北京中标咨询公司
Terms:北京中标  咨询  公司  
score:0.5981058
---
doc id:4
content of Field:重庆咨询公司
Terms:重庆咨询  公司  
score:0.5981058
---
doc id:5
content of Field:商务咨询服务公司
Terms:商务咨询  服务  公司  
score:0.5981058
---
doc id:6
content of Field:法律咨询公司
Terms:法律咨询  公司  
score:0.5981058
---

Docs 3, 4, 5, and 6 all have the same score, but I think doc 4 and doc 6 should
have a higher score than docs 3 and 5, because doc 4 and doc 6 have three
terms while docs 3 and 5 have four terms.
Am I right? Can anyone give me an explanation? And how do I get the expected
result?
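
For anyone debugging this kind of scoring question: Lucene can print its own
breakdown of every score, which is also what Uwe suggests later in the thread. A
minimal sketch against the test above (indexSearcher, query and topDocs are the
variables already defined there):

// Sketch only: ask Lucene to explain each hit. The "fieldNorm" line in the
// output is the encoded length norm, which is where the field-length precision is lost.
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Explanation explanation = indexSearcher.explain(query, scoreDoc.doc);
    System.out.println(explanation.toString());
}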



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

2014-02-12 Thread andy
Thanks for your reply, Erick, that is indeed the case. But how can I keep the
precision of the field length?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390p4116832.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

2014-02-12 Thread andy
Thanks Uwe, could you please give me a more detailed example of how to change
the Lucene behavior?


Uwe Schindler wrote
> Hi Erick,
> 
> a statement like " Adding &debug=all to the query will show you if this is
> the case" will not help a Lucene user, as it is only available in the Solr
> server. But Andy uses Lucene directly. In his case he should use
> IndexSearcher's explain functionalities to retrieve a structured output of
> how the documents are scored for this query for debugging:
> 
> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query,
> int)
> 
> But yes, the length norm is encoded with loss of precision in Lucene (it
> is a float value encoded into a single byte). With Lucene 4 there are ways to
> change that behavior, but that involves changing the similarity
> implementation and using a different DocValues type for encoding the norms.
> In most cases this is not needed, because users won't notice.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: 

> uwe@

> 
> 
>> -Original Message-
>> From: Erick Erickson [mailto:

> erickerickson@

> ]
>> Sent: Wednesday, January 15, 2014 1:30 PM
>> To: java-user
>> Subject: Re: Length of the filed does not affect the doc score accurately
>> for
>> chinese analyzer(SmartChineseAnalyzer)
>> 
>> The lengths of fields are encoded and lose some precision. So I suspect
>> the
>> length of the field calculated for the two documents is the same after
>> encoding.
>> 
>> Adding &debug=all to the query will show you if this is the case.
>> 
>> Best
>> Erick
>> 
>> On Wed, Jan 15, 2014 at 3:39 AM, andy <

> yhlweb@

> > wrote:
>> > Hi guys,
>> >
>> > As the topic,it seems that the length of filed does not affect the doc
>> > score accurately for chinese analyzer in my source code
>> >
>> > index source code
>> >
>> >  private static Directory DIRECTORY;
>> >
>> >
>> > @BeforeClass
>> > public static void before() throws IOException {
>> >   DIRECTORY = new RAMDirectory();
>> >   Analyzer chineseanalyzer = new
>> > SmartChineseAnalyzer(Version.LUCENE_40);
>> >   IndexWriterConfig indexWriterConfig = new
>> > IndexWriterConfig(Version.LUCENE_40,chineseanalyzer);
>> >   FieldType nameType = new FieldType();
>> >   nameType.setIndexed(true);
>> >   nameType.setStored(true);
>> >   nameType.setOmitNorms(false);
>> >   try {
>> >   IndexWriter indexWriter = new IndexWriter(DIRECTORY,
>> > indexWriterConfig);
>> >
>> >   List
> 
>  nameList = new ArrayList
> 
> ();
>> >
>> > nameList.add("咨询公司");nameList.add("飞鹰咨询管理咨询公司
>> ");nameList.add("北京中标咨询公司");nameList.add("重庆咨询公司
>> ");nameList.add("商务咨询服务公司");nameList.add("法律咨询公司");
>> >   for (int i = 0; i < nameList.size(); i++) {
>> >   Document document = new Document();
>> >   document.add(new Field("name", nameList.get(i),
>> > nameType));
>> >   document.add(new
>> > Field("id",String.valueOf(i+1),nameType));
>> >   indexWriter.addDocument(document);
>> > }
>> >   indexWriter.commit();
>> >   } catch (IOException e) {
>> >   // TODO Auto-generated catch block
>> >   e.printStackTrace();
>> >   }
>> > }
>> >
>> > search snippet:
>> >  @Test
>> > public void testChinese() throws IOException, ParseException {
>> > String keyword = "咨询公司";
>> > System.out.println("Searching for:" + keyword);
>> > System.out.println();
>> > IndexReader indexReader = DirectoryReader.open(DIRECTORY);
>> > IndexSearcher indexSearcher = new IndexSearcher(indexReader);
>> > Query query = null;
>> > query = new QueryParser(Version.LUCENE_40,"name",new
>> > SmartChineseAnalyzer(Version.LUCENE_40)).parse(keyword);
>> > TopDocs topDocs = indexSearcher.search(query,15);
>> > System.out.println(&q

RE: Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

2014-02-12 Thread andy
Hi Uwe, 

thanks a lot, I will try with that. 
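
For reference, a minimal sketch of the kind of change Uwe describes below:
overriding lengthNorm on DefaultSimilarity and wiring it in at both index and
search time, reusing the indexWriterConfig and indexSearcher from the code
earlier in the thread. This is only an illustration (assuming the Lucene 4.x
similarities API), not the full DocValues-based change; the norm is still
quantized to a single byte, so it only helps if the adjusted values land in
different encoded buckets.

// Sketch: make the length norm fall off faster than the default 1/sqrt(numTerms),
// so very short fields of different lengths get different encoded norms.
Similarity lengthSensitive = new DefaultSimilarity() {
    @Override
    public float lengthNorm(FieldInvertState state) {
        int numTerms = state.getLength() - state.getNumOverlap();
        return state.getBoost() * (1.0f / numTerms);
    }
};

// must be set consistently on both sides
indexWriterConfig.setSimilarity(lengthSensitive);
indexSearcher.setSimilarity(lengthSensitive);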


Uwe Schindler wrote
> Hi andy,
> 
> unfortunately, that is not easy to show with one simple code example. You have
> to change the Similarity that is used.
> 
> Before starting to do this, you should be sure that this affects your
> users. The example you gave shows very short documents. Lucene is
> optimized to handle larger documents; for short documents, the document
> statistics do not behave in an ideal way - that's the main issue here.
> Instead of trying to change the very basic Lucene statistics, you should
> first verify that this affects a large part of your user queries and
> documents, not just this example, which looks like a special case. Otherwise
> it is not worth the effort.
> 
> Please read the documentation of Lucene how to change the similarity,
> specifically the length norm, while indexing/searching:
> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/package-summary.html#changingScoring
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: 

> uwe@

> 
> 
>> -Original Message-
>> From: andy [mailto:

> yhlweb@

> ]
>> Sent: Wednesday, February 12, 2014 10:53 AM
>> To: 

> java-user@.apache

>> Subject: RE: Length of the filed does not affect the doc score accurately
>> for
>> chinese analyzer(SmartChineseAnalyzer)
>> 
>> Thanks Uwe,could you please give me a more detail example about how to
>> change the lucene behavior
>> 
>> 
>> Uwe Schindler wrote
>> > Hi Erick,
>> >
>> > a statement like " Adding &debug=all to the query will show you if
>> > this is the case" will not help a Lucene user, as it is only available
>> > in the Solr server. But Andy uses Lucene directly. In his case he
>> > should use IndexSearcher's explain functionalities to retrieve a
>> > structured output of how the documents are scored for this query for
>> debugging:
>> >
>> >
>> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/Inde
>> > xSearcher.html#explain(org.apache.lucene.search.Query,
>> > int)
>> >
>> > But yes, the length norm is encoded with loss of precsision in Lucene
>> > (it is a float values encoded to 1 byte only). With Lucene 4 there are
>> > ways to change that behavior, but that included changing the
>> > similarity implementation and use a different DocValues type for
>> encoding
>> the norms.
>> > In most cases this is not needed, because user won't notice.
>> >
>> > Uwe
>> >
>> > -
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail:
>> 
>> > uwe@
>> 
>> >
>> >
>> >> -Original Message-
>> >> From: Erick Erickson [mailto:
>> 
>> > erickerickson@
>> 
>> > ]
>> >> Sent: Wednesday, January 15, 2014 1:30 PM
>> >> To: java-user
>> >> Subject: Re: Length of the filed does not affect the doc score
>> >> accurately for chinese analyzer(SmartChineseAnalyzer)
>> >>
>> >> the lengths of fields are encoded and lose some precision. So I
>> >> suspect the length of the field calculated for the two documents are
>> >> the same after encoding.
>> >>
>> >> Adding &debug=all to the query will show you if this is the case.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Wed, Jan 15, 2014 at 3:39 AM, andy <
>> 
>> > yhlweb@
>> 
>> > > wrote:
>> >> > Hi guys,
>> >> >
>> >> > As the topic,it seems that the length of filed does not affect the
>> >> > doc score accurately for chinese analyzer in my source code
>> >> >
>> >> > index source code
>> >> >
>> >> >  private static Directory DIRECTORY;
>> >> >
>> >> >
>> >> > @BeforeClass
>> >> > public static void before() throws IOException {
>> >> >   DIRECTORY = new RAMDirectory();
>> >> >   Analyzer chineseanalyzer = new
>> >> > SmartChineseAnalyzer(Version.LUCENE_40);
>> >> >   IndexWriterConfig indexWriterConfig = new
>> >> > IndexWriterConfig(Version.LUCENE_40,chineseanalyzer);
>> >> >   FieldType nameType = new Field

Re: Lucene searching across documents

2009-04-08 Thread Andy
Hello all,



I'm trying to implement a vector space model using Lucene. I need to
have a file (or an in-memory structure) with the TF/IDF weight of each term in each
document. (In fact that is a matrix with documents represented as
vectors, in which the elements of each vector are the TF weights ...)



Please Please help me on this

contac me if you need any further info via andykan1...@yahoo.com

Many Many thanks


  

Vector space implementation

2009-04-08 Thread Andy
Hello all,

I'm trying to implement a vector space model using Lucene. I need to have a 
file (or an in-memory structure) with the TF/IDF weight of each term in each document. (In fact 
that is a matrix with documents represented as vectors, in which the elements of 
each vector are the TF weights ...)

Please Please help me on this
contact me if you need any further info via andykan1...@yahoo.com
Many Many thanks




  

Vector space implementation

2009-04-09 Thread Andy
Hello all,

I'm new to Lucene and trying to implement a vector space model with it. I 
need to have a file (or an in-memory structure) with the TF/IDF weight of each term in each 
document. (In fact that is a matrix with documents represented as vectors, in 
which the elements of each vector are the TF weights ...)

Please Please help me on this
contact me if you need any further info via andykan1...@yahoo.com
Many Many thanks




  --- Begin Message ---

Hello all,

I'm new to lucene and trying to implement a vector space model using lucene. I 
need to have a file (or on memory) with TF/IDF weight of each term in each 
document. (in fact that is a matrix with documents presented as vectors, in 
which the elements of each vector is the TF weight ...)

Please Please help me on this
contact me if you need any further info via andykan1...@yahoo.com
Many Many thanks

--- On Thu, 4/9/09, John Byrne  wrote:

From: John Byrne 
Subject: Re: query c++
To: java-user@lucene.apache.org
Date: Thursday, April 9, 2009, 12:57 PM

Hi,

This came up before, a while ago: 
http://www.nabble.com/searching-for-C%2B%2B-to18093942.html#a18093942

I don't think there is an easier way than modifying the standard 
analyzer. As I suggested in that earlier thread, I would make the 
analyzer recognize token patterns that consist of words with prefixed or 
postfixed symbols[1] Then you will receive tokens like "c++" or 
"~/.file" in your token filter. You can then choose to pass them as 
single tokens, or split them down further into two or more tokens.

-John

[1] If you decide to try matching words with symbols in the middle, be 
aware that the StandardAnalyzer already handles some examples of this, 
such as e-mail addresses, so you may make something redundant.

Weiwei Wang wrote:
> to be detailed, I implemented a ftp search engine for campus students. I
> have handle many different words including chinese words, as a result I
> can't only use whitespaceanalyzer. My analyzer is now like this:
>
>     StandardTokenizer tokenStream = new StandardTokenizer(reader,
> replaceInvalidAcronym);
>     tokenStream.setMaxTokenLength(maxTokenLength);
>     TokenStream result = new StandardFilter(tokenStream);
>     result = new LowerCaseFilter(result);
>     result = new StopFilter(result, stopSet);
>     result = new SnowballFilter(result,STEMMER);
>
> I modified StandardTokenizer to split words like season09 (as in a search
> for "friends season 09") into "season" and "09".
> The word "c++" is analyzed as just "c".
>
> I know i can modify the standardtokenizer to achieve my goal. But are there
> any other neat methods?
>
> 2009/4/9 hyj 
>
>   
>> ???,??!
>>
>>        WhitespaceAnalyzer can work.
>>
>> === 2009-04-09 15:15:14 ???:===
>>
>>     
>>> I want to make my lucene can search word like c++, c#,  how can i modify
>>>       
>> my
>>     
>>> analyzer to achieve this goal?
>>>
>>> --
>>> ???(Weiwei Wang)
>>> Department of Computer Science
>>> Gulou Campus of Nanjing University
>>> Nanjing, P.R.China, 210093
>>>
>>> Mobile: 86-13913310569
>>> MSN: ww.wang...@gmail.com
>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>>       
>> = = = = = = = = = = = = = = = = = = = =
>>
>>
>> ?
>> ?!
>>
>>
>> hyj
>> hongyin...@163.com
>> 2009-04-09
>>
>>
>>     
>
>
>   
> 
>
>
> No virus found in this incoming message.
> Checked by AVG - www.avg.com 
> Version: 8.0.238 / Virus Database: 270.11.48/2048 - Release Date: 04/08/09 
> 19:02:00
>
>   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




  
No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 8.0.238 / Virus Database: 270.11.48/2048 - Release Date: 04/08/09 
19:02:00
--- End Message ---

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Vector space implementation

2009-04-09 Thread Andy

Well, I'm planning to have the term weights (say, in a matrix) and then use
an adaptive learning system to transform them into new weights in such a way
that the index formed from them is optimized. It's just a test to see whether this
hypothesis works or not.


--- On Thu, 4/9/09, Grant Ingersoll  wrote:

From: Grant Ingersoll 
Subject: Re: Vector space implementation
To: java-user@lucene.apache.org
Date: Thursday, April 9, 2009, 6:29 PM

Assuming you want to handle the vectors yourself, as opposed to relying on the 
fact that Lucene itself implements the VSM, you should index your documents 
with TermVector.YES.  That will give you the term freq on a per doc basis, but 
you will have to use the TermEnum to get the Doc Freq.  All in all, this is 
not going to be very efficient for you, but you should be able to build up a 
matrix from it.

What is the problem you are trying to solve?
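
To make that concrete, a rough sketch of building the term/document weight matrix
from stored term vectors. It assumes a "contents" field (an illustrative name)
indexed with Field.TermVector.YES, and the 2.x/3.x-era IndexReader API:

IndexReader reader = IndexReader.open(directory);
int numDocs = reader.numDocs();

for (int docId = 0; docId < reader.maxDoc(); docId++) {
    if (reader.isDeleted(docId)) continue;
    TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
    if (tfv == null) continue;                    // no term vector stored for this doc

    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        int df = reader.docFreq(new Term("contents", terms[i]));  // document frequency
        double idf = Math.log((double) numDocs / df);
        double weight = freqs[i] * idf;           // tf * idf; add your own normalization
        // store 'weight' in the matrix cell (terms[i], docId) here
    }
}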



On Apr 9, 2009, at 2:33 AM, Andy wrote:

> Hello all,
> 
> I'm trying to implement a vector space model using lucene. I need to have a 
> file (or on memory) with TF/IDF weight of each term in each document. (in 
> fact that is a matrix with documents presented as vectors, in which the 
> elements of each vector is the TF weight ...)
> 
> Please Please help me on this
> contact me if you need any further info via andykan1...@yahoo.com
> Many Many thanks
> 
> 
> 
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




  

Index in text format

2009-04-09 Thread Andy
Is there a way to have Lucene write its index to a text file?



  

Piece of code needed

2009-04-24 Thread Andy


--- On Sat, 4/25/09, andykan1...@yahoo.com  wrote:

From: andykan1...@yahoo.com 
Subject: Piece of code needed
To: java-user@lucene.apache.org
Date: Saturday, April 25, 2009, 1:37 AM

Hi every body

I know it may seem stupid, but I'm in the middle of a research project and I need a
piece of Lucene code that gives me the weight matrix of a text collection and a
given query:

w(i,j) = f(i,j) x idf(i)
and, for the query:
w(i,q) = (0.5 + (0.5 x freq(i,q)) / max(freq(l,q))) x idf(i)

where:

f(i,j)    = normalized frequency = freq(i,j) / max(freq(l,j))
freq(i,j) = frequency of term k(i) in document j (d(j))


idf(i) = log(N / n(i))   (idf = inverse document frequency)
N      = total number of documents in the collection
n(i)   = number of documents which contain term i (k(i))

Could anybody help?
Many thanks in advance
best wishes to all





  


  

How to search multiple fields using multiple search terms

2010-04-15 Thread Andy

Hi, I am trying to use the MultiFieldQueryParser to search "title" and "desc" 
fields.  However the Lucene API appears to only let me provide a single search 
term.  Is it possible to use multiple search terms (one for each field)?

 

For example, the SQL equivalent would be:


select *
from lucene
where title = 'abc'
and desc = '123'


Thanks!
  
_
Hotmail has tools for the New Busy. Search, chat and e-mail from your inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_1

RE: How to search multiple fields using multiple search terms

2010-04-15 Thread Andy

I am just getting started with Lucene, so I didn't know you could just use a 
regular query parser.  That seems to work.

Thanks
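
For anyone who finds this later, a minimal sketch of the two equivalent forms
(assuming Lucene 3.x; "title" and "desc" as in the question):

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

// 1) fielded clauses through the regular QueryParser, as Erick suggests below
QueryParser parser = new QueryParser(Version.LUCENE_30, "title", analyzer);
Query q1 = parser.parse("+title:abc +desc:123");

// 2) the same query built programmatically
BooleanQuery q2 = new BooleanQuery();
q2.add(new TermQuery(new Term("title", "abc")), BooleanClause.Occur.MUST);
q2.add(new TermQuery(new Term("desc", "123")), BooleanClause.Occur.MUST);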


 
> Date: Thu, 15 Apr 2010 19:32:50 -0400
> Subject: Re: How to search multiple fields using multiple search terms
> From: erickerick...@gmail.com
> To: java-user@lucene.apache.org
> 
> Why are you locked into using MultiFieldQueryParser? The simpler approach is
> just send something like +title:abc +desc:123 through the regular query
> parser
> 
> HTH
> Erick
> 
> On Thu, Apr 15, 2010 at 6:34 PM, Andy  wrote:
> 
> >
> > Hi, I am trying to use the MultiFieldQueryParser to search "title" and
> > "desc" fields. However the Lucene API appears to only let me provide a
> > single search term. Is it possible to use multiple search terms (one for
> > each field)?
> >
> >
> >
> > For example, the SQL equivalent would be:
> >
> >
> > select *
> > from lucene
> > where title = 'abc'
> > and desc = '123'
> >
> >
> > Thanks!
> >
> > _
> > Hotmail has tools for the New Busy. Search, chat and e-mail from your
> > inbox.
> >
> > http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_1
> >
  
_
The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with 
Hotmail. 
http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5

How to search by numbers

2010-04-19 Thread Andy

Hi, I have indexed the following two fields:
org_id - NOT_ANALYZED
org_name - ANALYZED
However, when I try to search by org_id, for example 12345, I get no hits.
I am using the StandardAnalyzer to index and search.
And I am using:  Query query = queryParser.parse("org_id:12345");
Any ideas?  Thx
_
Hotmail is redefining busy with tools for the New Busy. Get more from your 
inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2

RE: How to search by numbers

2010-04-19 Thread Andy

That works, and now that I re-test my original code, it also works.

> Date: Mon, 19 Apr 2010 10:52:45 -0700
> From: iori...@yahoo.com
> Subject: Re: How to search by numbers
> To: java-user@lucene.apache.org
> 
> 
> > Hi, I have indexed the following two fields:
> > org_id - NOT_ANALYZEDorg_name - ANALYZED
> > However when I try to search by org_id, for example, 12345,
> > I get no hits.  
> > I am using the StandardAnalyzer to index and search. 
> > 
> > And I am using:  Query query =
> > queryParser.parse("org_id:12345");
> 
> What happens when you search with this query? 
> Query query  = new TermQuery(new Term("org_id","12345"));
> 
> 
>   
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
  
_
Hotmail has tools for the New Busy. Search, chat and e-mail from your inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_1

Re: sort by field and score

2012-11-28 Thread Andy Yu
I revised the code to

SortField sortField[] = {new SortField("id", new
CustomComparatorSource(bitSet)),SortField.FIELD_SCORE};

Sort sort = new Sort(sortField);

TopFieldCollector topFieldCollector =
TopFieldCollector.create(sort, 1000, true, true, true, true);
indexSearcher.search(query, topFieldCollector);
TopDocs topDocs = topFieldCollector.topDocs();

but I got the same result as with the previous code. Do I need to customize the
TopFieldCollector class?

Thank you, Ian


2012/11/27 Ian Lea 

> What are you getting for the scores?  If it's NaN I think you'll need
> to use a TopFieldCollector.  See for example
> http://www.gossamer-threads.com/lists/lucene/java-user/86309
>
>
> --
> Ian.
>
>
> On Tue, Nov 27, 2012 at 3:51 AM, Andy Yu  wrote:
> > Hi All,
> >
> >
> > Now  I want to sort by a field and the relevance
> > For example
> >
> > SortField sortField[] = {new SortField("id", new
> > CustomComparatorSource(bitSet)),SortField.FIELD_SCORE};
> > Sort sort = new Sort(sortField);
> > TopDocs topDocs = indexSearcher.search(query, 10,sort);
> >
> > if (0 < topDocs.totalHits) {
> > for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
> >
> > System.out.println(indexSearcher.doc(scoreDoc.doc).get("id"));
> > System.out.println("score is " + scoreDoc.score);
> >
> >  System.out.println(indexSearcher.doc(scoreDoc.doc).get("name"));
> > }
> > }
> >
> > I found that the search results are sorted only by [new SortField("id", new
> > CustomComparatorSource(bitSet))];
> > [SortField.FIELD_SCORE] does not take effect at all.
> >
> >
> > PS: my Lucene version is 3.6
> >
> > Does anybody know the reason, or how to solve it?
> >
> >
> > Thanks ,
> > Andy
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: sort by field and score

2012-12-02 Thread Andy Yu
CustomComparatorSource is a class which extends FieldComparatorSource. I
just want to customize the sort and add some business logic into the
comparator.
Actually my desired behaviour is:
first sort by the CustomComparatorSource,
and then sort by the score.

Thanks man


2012/11/30 Ian Lea 

> Using a TopFieldCollector works fine for me in a little test program.
> My program sorts on a simple String field rather than your
> CustomComparatorSource, whatever that is.
>
> SortField sortField[] = {
> new SortField("cat", SortField.STRING),
> SortField.FIELD_SCORE
> };
> Sort sort = new Sort(sortField);
>
> Query q = whatever;
>
> TopFieldCollector tfc = TopFieldCollector.create(sort,
>  1000,
>  true,
>  true,
>  true,
>  true);
> searcher.search(q, tfc);
> TopDocs td = tfc.topDocs();
>
>
> I suggest you break your code down into a simple standalone program
> and post that if it still doesn't work.
>
>
> --
> Ian.
>
> On Thu, Nov 29, 2012 at 4:20 AM, Andy Yu  wrote:
> > I revise the code to
> >
> > SortField sortField[] = {new SortField("id", new
> > CustomComparatorSource(bitSet)),SortField.FIELD_SCORE};
> >
> > Sort sort = new Sort(sortField);
> >
> > TopFieldCollector topFieldCollector =
> > TopFieldCollector.create(sort, 1000, true, true, true, true);
> > indexSearcher.search(query, topFieldCollector);
> > TopDocs topDocs = topFieldCollector.topDocs();
> >
> > but I got the same result with the previous code, need I custom the
> > class TopFieldCollector?
> >
> > thank you lan
> >
> >
> > 2012/11/27 Ian Lea 
> >
> >> What are you getting for the scores?  If it's NaN I think you'll need
> >> to use a TopFieldCollector.  See for example
> >> http://www.gossamer-threads.com/lists/lucene/java-user/86309
> >>
> >>
> >> --
> >> Ian.
> >>
> >>
> >> On Tue, Nov 27, 2012 at 3:51 AM, Andy Yu  wrote:
> >> > Hi All,
> >> >
> >> >
> >> > Now  I want to sort by a field and the relevance
> >> > For example
> >> >
> >> > SortField sortField[] = {new SortField("id", new
> >> > CustomComparatorSource(bitSet)),SortField.FIELD_SCORE};
> >> > Sort sort = new Sort(sortField);
> >> > TopDocs topDocs = indexSearcher.search(query, 10,sort);
> >> >
> >> > if (0 < topDocs.totalHits) {
> >> > for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
> >> >
> >> > System.out.println(indexSearcher.doc(scoreDoc.doc).get("id"));
> >> > System.out.println("score is " + scoreDoc.score);
> >> >
> >> >  System.out.println(indexSearcher.doc(scoreDoc.doc).get("name"));
> >> > }
> >> > }
> >> >
> >> > I found that the search result sort just by [new SortField("id", new
> >> > CustomComparatorSource(bitSet))]
> >> > [SortField.FIELD_SCORE] does not work at all
> >> >
> >> >
> >> > PS: my lucene version is 3.6
> >> >
> >> > does anybodu know the reason or how to solve it ?
> >> >
> >> >
> >> > Thanks ,
> >> > Andy
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


custom solr sort problem

2013-01-05 Thread Andy Yu
etRelation(doc);

}

@Override
public void setBottom(int slot) {
bottom = values[slot];
}

@Override
public FieldComparator setNextReader(
AtomicReaderContext ctx) throws IOException {
uidDoc = FieldCache.DEFAULT.getInts(ctx.reader(), "userID",
true);
return this;
}

@Override
public Float value(int slot) {
return new Float(values[slot]);
}

private float getRelation(int doc) throws IOException {
if (dg3.get(uidDoc[doc])) {
return 3.0f;
} else if (dg2.get(uidDoc[doc])) {
return 4.0f;
} else if (dg1.get(uidDoc[doc])) {
return 5.0f;
} else {
return 1.0f;
}
}

@Override
public int compareDocToValue(int arg0, Object arg1)
throws IOException {
// TODO Auto-generated method stub
return 0;
    }
}

}
}


and solrconfig.xml configuration is




   
mySortComponent
  



Andy


combining MultiFieldQueryParser with FuzzyQuery

2010-10-18 Thread Andy Yang
I would like to use MultiFieldQueryParser to search multiple fields, and then within
each field, I want to use fuzzy search. How can that be done? Any example
will be appreciated.

Thanks,
Andy
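
A minimal sketch of one way to combine the two, assuming Lucene 3.x: the query
string's ~ operator makes a term fuzzy, and MultiFieldQueryParser expands every
clause against each listed field (the field names here are only illustrative).

String[] fields = {"title", "body"};
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_30, fields, analyzer);

// "serach" is deliberately misspelled; each fuzzy term is expanded against
// both fields, e.g. (title:serach~0.7 body:serach~0.7) ...
Query query = parser.parse("lucene~0.7 serach~0.7");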


minimum string length for proximity search

2011-03-30 Thread Andy Yang
Is there a minimum string length requirement for proximity search? For
example, would "a~" or "an~" trigger proximity search? The result
would be horrible if there is no such requirement.

Thanks,
Andy

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: minimum string length for proximity search

2011-03-30 Thread Andy Yang
We are trying to do a proximity search over multiple terms, and we don't care
about the order of the terms. Therefore "term1 term2"~5 probably will not
get you "term2 term1" if both terms are long. So instead of applying
distance at the end, we apply distance to each word, "term1~2
term2~2". I am wondering if we should skip short words if it is not
done automatically by the engine.

Thanks,
Andy

On Wed, Mar 30, 2011 at 4:02 PM, Erick Erickson  wrote:
> Uhhhm, doesn't "term1 term2"~5 work? If not, why not?
>
> You might get some use from
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
>
> Or if that's not germane, perhaps you can explain your use case.
>
> Best
> Erick
>
> On Wed, Mar 30, 2011 at 5:49 PM, Andy Yang  wrote:
>> Is there a minimum string length requirement for proximity search? For
>> example, would "a~" or "an~" trigger proximity search? The result
>> would be horrible if there is no such requirement.
>>
>> Thanks,
>> Andy
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: minimum string length for fuzzy search

2011-03-30 Thread Andy Yang
My question should really be on "fuzzy search". Is there a minimum
length requirement for fuzzy search to start? For example, would
"an~0.8" kick off fuzzy search?

Thanks,
Andy
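
In case it helps while waiting for a definitive answer: one way to keep very
short terms from kicking off an expensive fuzzy expansion is to tighten the
parser's fuzzy settings rather than rely on a length cutoff. A sketch, assuming
the Lucene 3.x QueryParser API:

QueryParser parser = new QueryParser(Version.LUCENE_30, "body",
        new StandardAnalyzer(Version.LUCENE_30));
parser.setFuzzyMinSim(0.7f);      // default is 0.5; higher means stricter matches
parser.setFuzzyPrefixLength(2);   // first 2 characters must match exactly, pruning the term enumeration
Query q = parser.parse("analyzer~");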

On Wed, Mar 30, 2011 at 4:02 PM, Erick Erickson  wrote:
> Uhhhm, doesn't "term1 term2"~5 work? If not, why not?
>
> You might get some use from
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
>
> Or if that's not germane, perhaps you can explain your use case.
>
> Best
> Erick
>
> On Wed, Mar 30, 2011 at 5:49 PM, Andy Yang  wrote:
>> Is there a minimum string length requirement for proximity search? For
>> example, would "a~" or "an~" trigger proximity search? The result
>> would be horrible if there is no such requirement.
>>
>> Thanks,
>> Andy
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



solr facet query with Tagging and Excluding Filters

2014-09-18 Thread Andy Yu
Hi guys,

I want to do faceting with a facet query, and have it support the [Tagging and
Excluding Filters] (
https://cwiki.apache.org/confluence/display/solr/Faceting) style that
facet.field has. How do I do that? Please guide me!

Thanks,

Andy


phrases and slop

2008-08-28 Thread Andy Goodell
I thought I understood phrases and slop until one of my coworkers
brought by the following example

For a document that contains
"quick brown fox"

"quick brown fox"~0
"quick fox brown"~2
"fox quick brown"~3

all match.

I would have expected "fox quick brown" to require a 4 instead of a 3,
two to transpose brown and fox, two to transpose quick and fox.  Why
is this only 3?

- andy g

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Luke is coming .. not there yet.

2008-10-30 Thread Andy Triana
whichever is chosen.

Just a huge thank you for making this tool available!

Great tool!

//andy

On Thu, Oct 30, 2008 at 4:06 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Many people ask me when the next version of Luke becomes available. It's
> almost ready, and the release should happen in about a week, depending on
> the situation in my daily job.
>
> I'd like to ask the Lucene user community what version of Lucene would be
> preferable to include in this Luke release:
>
> 1) Lucene 2.4 release. This has the advantage of being an official stable
> release, with a well-defined functionality. The disadvantage is that you
> will miss some new features available in 2.9 (current trunk) for a long time
> to come, at least until the next Lucene release.
>
> 2) Lucene 2.9-dev snapshot. This has the advantage that you get the
> cutting-edge features, and it's easier to update Luke to the most recent
> version of Lucene. However, it means that any modifications to existing
> indexes (such as e.g. deleting a doc, or optimizing an index) will promote
> the index to a new format, incompatible with earlier versions of Lucene
> (including 2.4 release).
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scaling out/up or a mix

2009-06-30 Thread Andy Goodell
I have improved date-sorted searching performance pretty dramatically by
replacing the two step "search then sort" operation with a one step "use the
date as the score" algorithm.  The main gotcha was making sure to not affect
which results get counted as hits in boolean searches, but overall I only
spent about a week on the project, and got a 60x speed improvement on the
target set. (from minutes to seconds)  YMMV however, since the app requires
the collection of the complete set of results for analysis.

- andy g
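
Andy doesn't spell out how he wired this up, but a hedged sketch of the
"use the date as the score" idea with the stock function package
(org.apache.lucene.search.function, 2.9/3.x era) might look like the following.
The field name "dateAsInt" is hypothetical: one int per document, e.g. days
since the epoch, indexed un-tokenized so the FieldCache can read it.

// the user's query only decides WHAT matches; wrapping it in a filter keeps
// the hit set identical to a normal search
Query userQuery = new TermQuery(new Term("title", "iphone"));
Filter matchFilter = new QueryWrapperFilter(userQuery);

// every hit's score is driven by the date value, so "newest first" falls out of ranking
Query dateAsScore = new ValueSourceQuery(new IntFieldSource("dateAsInt"));
TopDocs newestFirst = searcher.search(dateAsScore, matchFilter, 10);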

On Mon, Jun 29, 2009 at 12:47 AM, Marcus Herou
wrote:

> Thanks for the answer.
>
> Don't you think that part 1 of the email would give you a hint of nature of
> the index ?
>
> Index size(and growing): 16Gx8 = 128G
> Doc size (data): 20k
> Num docs: 90M
> Num users: Few hundred but most critical is that the admin staff which is
> using the index all day long.
> Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
> publishedDate... = Very simple, no fuzzy searches etc. However since the
> dataset is large it will consume memory on sorting I guess.
>
> Could not one draw any conclusions about best-practice in terms of hardware
> given the above "specs" ?
>
> Basically I would like to know if I really need 8 cores since machines with
> dual-cpu support are the most expensive and I would like to not throw away
> money so getting it right is a matter of economy.
>
> I mean it is very simple: Let's say someone gives me a budget of 50 000 USD
> and I then want to get the most bang for the buck for my workload.
> Should I go for
> X machines with quad-core 3.0GHz, 4 disks RAID1+0, 8G RAM costing me
> 1200USD
> a piece (giving me 40 machines: 160 disks, 160 cores, 320G RAM)
> or
> X machines with dual quad-core 2.0GHz, 4 disks RAID1+0, 36G RAM costing me
> 3400 USD a piece (giving me 15 machines:  60 disks, 120 cores,  540G RAM)
>
> Basically I would like to know what factors make the workload IO bound vs
> CPU bound ?
>
> //Marcus
>
>
>
>
>
>
> On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman  wrote:
>
> > There is no single answer -- this is always application specific.
> >
> > Without knowing anything about what you are doing:
> >
> > 1. disk i/o is probably the most critical.  Go SSD or even RAM disk if
> > you can, if performance is absolutely critical
> > 2. Sometimes CPU can become an issue, but 8 cores is probably enough
> > unless you are doing especially cpu-bound searches.
> >
> > Unless you are doing something with hard performance requirements, or
> > really quite unusual, buying "good" kit is probably good enough, and you
> > won't really know for sure until you measure.  Lucene is a general
> > enough tool that there isn't a terribly universal answer to this.  We
> > were a bit surprised to end up cpu-bound instead of disk i/o-bound, for
> > instance, but we ended up taking an unusual path.  YMMV.
> >
> > Marcus Herou wrote:
> > > Hi. I think I need to be more specific.
> > >
> > > What I am trying to find out is if I should aim for:
> > >
> > > CPU (2x4 cores, 2.0-3.0Ghz)? or perhaps just a 4 cores is enough.
> > > Fast disk IO: 8 disks, RAID1+0 ? or perhaps 2 disks is enough...
> > > RAM - if the index does not fit into RAM how much RAM should I then buy
> ?
> > >
> > > Please any hints would be appreciated since I am going to invest soon.
> > >
> > > //Marcus
> > >
> > > On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou
> > > wrote:
> > >
> > >
> > >> Hi.
> > >>
> > >> I currently have an index which is 16GB per machine (8 machines =
> 128GB)
> > >> (data is stored externally, not in index) and is growing like crazy
> (we
> > are
> > >> indexing blogs which is crazy by nature) and have only allocated 2GB
> per
> > >> machine to the Lucene app since we are running some other stuff there
> in
> > >> parallell.
> > >>
> > >> Each doc should be roughly the size of a blog post, no more than 20k.
> > >>
> > >> We currently have about 90M documents and it is increasing rapidly so
> > >> getting into the G+ document range is not going to be too far away.
> > >>
> > >> Now due to search performance I think I need to move these instances
> to
> > >> dedicated index/search machines (or index on some machines and search
> on
> > >> others). Anyway I would like to get some feedback about two things:
> > >>
&

Is my app a good fit for Lucene?

2009-07-10 Thread Andy Faibishenko
I have a GUI application which needs to open large files (hundreds of MB)
and be able to search through them quickly for user specified strings.
These files are frequently updated while the user is viewing them and the
updates are captured by the application.  Also, the files contain records
which are KEY=VALUE pairs separated by a non-printable ASCII character
instead of normal English text.
 
I installed Lucene in Eclipse and tried to play around with some sample
code.  One thing I noticed is that the wildcard searching doesn't seem to
work right on this data.  I am guessing it is because the text format is
tripping up the tokenizing.
 
I am trying to figure out whether using Lucene to implement this is a good
thing or whether I should just try to implement my own search logic.  
 
Andy Faibishenko
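
If the non-printable delimiter is what trips up the tokenizer, a tiny custom
analyzer that only splits on that delimiter and on '=' may make wildcard queries
behave more predictably. A sketch, assuming a Lucene 2.x/3.0-era analyzer API,
with '\u0001' standing in for the real delimiter:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public final class RecordAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {
            @Override
            protected boolean isTokenChar(char c) {
                // keep whole KEYs and VALUEs intact; only the record delimiter
                // and '=' break tokens
                return c != '\u0001' && c != '=';
            }
        };
    }
}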


lucene search

2010-01-28 Thread andy green

hello,

I wrote some Lucene code to handle the search on my site ... the
articles indexed are those stored in a database, and I do a search with
"lucene.queryparser" on the field "code" of various objects (a "code" is a
word of 3 to 6 characters) ...

My problem is that when I search, I am obliged to enter exactly
the full "code" to get a result. For example, if in the database I have an
object whose "code" is "lpg", then by typing "lp" in my textbox (for searching)
I get nothing ... I must enter the real entire code ... "lpg"

In addition, my search does not work with digits or characters such as
"_"

What can I do? I think the problem may be due to the analyzer I chose? (I
tried using SimpleAnalyzer and StandardAnalyzer.)


Thank you for your help!
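
One way to get prefix matching without changing much else: use a prefix query
(or a trailing wildcard through the parser) and index the "code" field with an
analyzer that leaves digits and '_' untouched. A sketch, assuming Lucene 3.0:

// direct prefix query on the "code" field: matches "lpg", "lp_01", ...
Query byPrefix = new PrefixQuery(new Term("code", "lp"));

// or through the parser, which turns a trailing * into a prefix query;
// WhitespaceAnalyzer keeps digits and underscores exactly as typed
QueryParser parser = new QueryParser(Version.LUCENE_30, "code", new WhitespaceAnalyzer());
Query parsed = parser.parse("lp*");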

-- 
View this message in context: 
http://old.nabble.com/lucene-search-tp27358766p27358766.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene search

2010-01-29 Thread andy green


   

  Thanks 


-- 
View this message in context: 
http://old.nabble.com/lucene-search-tp27358766p27367213.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing Wikipedia dumps

2007-12-12 Thread Andy Goodell
My firm uses a parser based on javax.xml.stream.XMLStreamReader to
break (English and non-English) Wikipedia XML dumps into Lucene-style
"documents and fields."  We use Wikipedia to test our
language-specific code, so we've probably indexed 20 Wikipedia dumps.

- andy g
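
For anyone wanting to try the same approach, a rough sketch of such a StAX loop
over a MediaWiki dump. It assumes the usual <page>/<title>/<text> layout, a
2.x-era IndexWriter already in scope as `writer`, an illustrative file name, and
skips error handling and the other page fields:

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader xml = factory.createXMLStreamReader(new FileInputStream("wiki-pages-articles.xml"));

String title = null;
while (xml.hasNext()) {
    if (xml.next() == XMLStreamConstants.START_ELEMENT) {
        String element = xml.getLocalName();
        if ("title".equals(element)) {
            title = xml.getElementText();
        } else if ("text".equals(element)) {
            Document doc = new Document();
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("body", xml.getElementText(), Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
    }
}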

On Dec 11, 2007 9:35 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I need to index a Wikipedia dump.  I know there is code in contrib/benchmark 
> for indexing *English* Wikipedia for benchmarking purposes.  However, I'd 
> like to index a non-English dump, and I actually don't need it for 
> benchmarking, I just want to end up with a Lucene index.
>
> Any suggestions where I should start?  That is, can anything in 
> contrib/benchmark already do this, or is there anything there that I should 
> use as a starting point?  As opposed to writing my own Wikipedia XML dump 
> parser+indexer.
>
> Thanks,
> Otis
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Using Lucene to find duplicate/similar names

2008-04-16 Thread Andy DePue
I'm new to Lucene, and would like to use it to find duplicate (or 
similar) names in a contact list.  Is Lucene a good fit?
We have a form where a user enters a company or person's name, and we 
want the system to warn them if there is already a company or person 
entered with the same or similar name.
Based on the little I know of Lucene, I'm thinking an NGram algorithm 
(based on characters, not words) would work best... but, I'm not sure if 
Lucene takes proximity or edit distances into account?  For example, say 
you have these two names:

 Andrew John
 John Andrew

If a user enters Andy John, without proximity or edit distance, these 
two names will match about the same, while, obviously, the first name 
should be ranked higher.

Thanks in advance for any help or advice.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using Lucene to find duplicate/similar names

2008-04-16 Thread Andy DePue
Thanks for the pointer.  I found the thread, and there is certainly some 
interesting information there.  I'd like to stick to what Lucene has 
available today, mainly because I lack the time to implement anything 
more than that.  I originally thought Levenshtein, but then realized 
that Lucene would probably have to do a whole index scan for that?  I 
don't need anything too fancy, so I'm still wondering if NGram with some 
sort of proximity ranking would do the trick.  By proximity, I mean how 
closely the NGrams in the document field match, in position and order, the 
same NGrams in the search string.  I'm hoping NGrams 
would avoid the need for a whole index scan.  Does Lucene already factor 
this into its hit score, or would I need to do some custom work?


 - Andy
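
For what it's worth: as far as I know, two independent term (or fuzzy) clauses
are scored without regard to term order, but sloppy PhraseQuery scoring does
reward proximity. So one sketch of a query that ranks in-order matches higher
without custom scoring code (2.x-era API; the literal terms are just the example
names, lowercased as an analyzer would emit them):

BooleanQuery query = new BooleanQuery();

// recall: each entered token may match anywhere in the field
query.add(new FuzzyQuery(new Term("name", "andy")), BooleanClause.Occur.SHOULD);
query.add(new FuzzyQuery(new Term("name", "john")), BooleanClause.Occur.SHOULD);

// ordering: a boosted sloppy phrase rewards "Andrew John"-style order; note it
// only fires where the tokens match exactly, the fuzzy clauses above give the recall
PhraseQuery inOrder = new PhraseQuery();
inOrder.add(new Term("name", "andy"));
inOrder.add(new Term("name", "john"));
inOrder.setSlop(1);       // small gaps allowed; a full reversal needs more slop
inOrder.setBoost(2.0f);
query.add(inOrder, BooleanClause.Occur.SHOULD);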

Grant Ingersoll wrote:
I believe there were some posts on this about a year ago.  Try 
searching in the archives for duplicate names, as well as "record 
linkage" or any other various synonyms that you can think of.  The 
short answer is Lucene is reasonable to attempt this with, but you may 
need some help.  The long answer is to dig into those archives and see 
the other recommendations.


-Grant

On Apr 16, 2008, at 12:37 PM, Andy DePue wrote:

I'm new to Lucene, and would like to use it to find duplicate (or 
similar) names in a contact list.  Is Lucene a good fit?
We have a form where a user enters a company or person's name, and we 
want the system to warn them if there is already a company or person 
entered with the same or similar name.
Based on the little I know of Lucene, I'm thinking an NGram algorithm 
(based on characters, not words) would work best... but, I'm not sure 
if Lucene takes proximity or edit distances into account?  For 
example, say you have these two names:

Andrew John
John Andrew

If a user enters Andy John, without proximity or edit distance, these 
two names will match about the same, while, obviously, the first name 
should be ranked higher.

Thanks in advance for any help or advice.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using lucene as a database... good idea or bad idea?

2008-07-31 Thread Andy Liu
If essentially all you need is key-value storage, Berkeley DB for Java works
well.  Lookup by ID is fast, can iterate through documents, supports
secondary keys, updates, etc.

Lucene would work relatively well for this, although inserting documents
might not be as fast, because segments need to be merged and data ends up
getting copied over again at certain points.  So if you're running a batch
process with a lot of inserts, you might get better throughput with BDB as
opposed to Lucene, but, of course, benchmark to confirm ;)

Andy

On Thu, Jul 31, 2008 at 9:12 AM, Karsten F.
<[EMAIL PROTECTED]>wrote:

>
> Hi Ganesh,
>
> in this thread nobody said that Lucene is a good storage server.
> Only that "it could be used as a storage server" (Grant: Connect data storage with
> simple, fast lookup and Lucene..)
>
> I don't know about automatic retention.
> But for the rest in your list of features I suggest to take a deep look to
>  - Jackrabbit (Standard jcr jsr170 implemention, I like the webDAV support)
>  - dSpace (real working content repository software, with good permissions
> management)
>
> Both use lucene for searching
>
> Best regards
>Karsten
>
>
> Ganesh - yahoo wrote:
> >
> > which one will be the best to use as storage server. Lucene or
> Jackrabbit.
> >
> > My requirement is to provide support to
> > 1) Archive the documents
> > 2) Do full text search on the documents.
> > 3) Do backup the index store and archive store. [periodical basis]
> > 4) Remove the documents after certain period [rentention policy]
> >
> > Whether Lucene could be used as archival store. Most of them in this
> > mailing
> > list said 'yes'. If so going for separate database to archive the data
> and
> > separate database to index it, will be better option or one database to
> be
> > used as archive and index.
> >
> > One more idea from this list is to use Jackrabbit / JDBM / My SQL to
> > archive
> > the data. Which will be the best?
> >
> > I am in desiging phase and i have time to explore and prototype any other
> > products. Please do suggest me a good one.
> >
> > Regards
> > Ganesh
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Using-lucene-as-a-database...-good-idea-or-bad-idea--tp18703473p18754258.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: delete by doc id

2008-08-08 Thread Andy Triana
I rarely submit but I've been seeing this sort of thing more and more
on this board.

It seems that there is a need to treat Lucene as if it were a data
store or database-like repository, when in
fact it isn't.

In our case, for large indexes we run either a parallel process to
create the index or a nightly process. Always treating
the index as a disposable lookup mechanism.

When it comes to deleting items from an index, we simply just mark
those items in a separate list or data structure and then filter them
away until the index is refreshed the next go around.

My point is, that it seems like folks want Lucene to act like a
relational database when in fact it is not meant to be used that way.

Perhaps I'm wrong and upgrades will allow for efficient deletions, but
it has always been clear to me that Lucene is strictly
an index and should be treated as a feature of your storage
repository, i.e. database, file system, web, whatever. But not relied
upon for that very storage or management of that storage.

The deletion issue is true of almost any indexing engine I've used in
the past, i.e. DT Search, Verity, etc.

Am I missing something? For us it has been a "best practice" to treat
Lucene as described.

//andy
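
For concreteness, the "mark now, filter until the next rebuild" pattern can be
as simple as excluding the marked keys at query time. A sketch (2.x-era API;
"softDeleted" and the field names are whatever the application uses; for very
large sets a Filter is a better fit than boolean clauses):

// ids the application has marked as deleted since the last index rebuild
Set<String> softDeleted = new HashSet<String>(Arrays.asList("42", "97"));

Query userQuery = new TermQuery(new Term("body", "lucene"));
BooleanQuery visible = new BooleanQuery();
visible.add(userQuery, BooleanClause.Occur.MUST);
for (String id : softDeleted) {
    visible.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.MUST_NOT);
}
TopDocs hits = searcher.search(visible, null, 10);   // marked docs never surface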


On Fri, Aug 8, 2008 at 2:39 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> hello,
>
> what would happen if I modified the class IndexWriter, and made the delete
> by id method public?
>
> I have two fields in my documents and I got to be able to delete by those
> two fields, (by query in other words) and I do not wish to go trunk version.
>
> I am getting quite desperate, and if not found a solution I will have to
> make my documents with 3 fields, a, b and a + b so I can delete by a and b.
>
> Best.
>
> could there be a side effect?
>
> Best.
>
> -c.b.
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Using ParallelReader over large immutable index and small updatable index

2007-03-06 Thread Andy Liu

Is there a working solution out there that would let me use ParallelReader
to search over a large, immutable index and a smaller, auxillary index that
is updated frequently?  Currently, from my understanding, the ParallelReader
fails when one of the indexes is updated because the document ID's get out
of synch.  Using ParallelReader in this way is attractive for me because it
would allow me to quickly make updates to only the fields that change.

The alternative is to use one index.  However, an update would require me to
delete the entire document (which is quite large in my application) and
reinsert it after making updates.  This requires a lot more I/O and is a lot
slower, and I'd like to avoid this if possible.

I can think of other alternatives, but all involve storing data and/or
bitsets in memory, which is not very scalable.  I need to be able to handle
millions of documents.

I'm also open to any solution that don't involve ParallelReader that would
help me make quick updates in the most non-disruptive and scalable fashion.
But it just seems that ParallelReader would be perfect for me needs, if I
can get past this issue.

I've seen posts about this issue on the list, but nothing pointing to a
solution.  Can somebody help me out?

Andy


Re: Using ParallelReader over large immutable index and small updatable index

2007-03-07 Thread Andy Liu

From my understanding, MultiSearcher is used to combine two indexes that
have the same fields but different documents.  ParallelReader is used to
combine two indexes that have same documents but different fields.  I'm
trying to do the latter.  Is my understanding correct?  For example, what
I'm trying to do is have one immutable index that has these fields:

field1
field2
field3

and my "update" index that has one field

field4

Both indexes have the same documents, and the docId's are synchronized.
This allows me to execute searches like:

+field1:foo +field4:bar

field4 is a field that would be updated frequently and as real-time as
possible.  However, once I update field4, the docId's are no longer
synchronized, and ParallelReader fails.

Andy
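
For reference, a sketch of the basic wiring for the layout described above
(mainIndexDir and auxIndexDir are the two Directory instances); the constraint
is exactly the one under discussion, namely that both indexes must keep
identical document numbering:

ParallelReader parallel = new ParallelReader();
parallel.add(IndexReader.open(mainIndexDir));   // large, immutable: field1, field2, field3
parallel.add(IndexReader.open(auxIndexDir));    // small, rewritten often: field4
IndexSearcher searcher = new IndexSearcher(parallel);

// fields from both indexes can then be mixed in one query, e.g. +field1:foo +field4:bar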

On 3/6/07, Alexey Lef <[EMAIL PROTECTED]> wrote:


We use MultiSearcher for a similar scenario. This way you can keep the
Searcher/Reader for the read-only index alive and refresh the small index
Searcher whenever an update is made. If you have any cached filters, they
are mapped to a Reader, so the cached filters for the big index will stay
alive as well. The only (small) problem I have found so far is how
MultiSearcher handles custom Similarity (see
https://issues.apache.org/jira/browse/LUCENE-789).

Hope this helps,

Alexey

-Original Message-
From: Andy Liu [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 06, 2007 3:34 PM
To: java-user@lucene.apache.org
Subject: Using ParallelReader over large immutable index and small
updatable index

Is there a working solution out there that would let me use ParallelReader
to search over a large, immutable index and a smaller, auxillary index
that
is updated frequently?  Currently, from my understanding, the
ParallelReader
fails when one of the indexes is updated because the document ID's get out
of synch.  Using ParallelReader in this way is attractive for me because
it
would allow me to quickly make updates to only the fields that change.

The alternative is to use one index.  However, an update would require me
to
delete the entire document (which is quite large in my application) and
reinsert it after making updates.  This requires a lot more I/O and is a
lot
slower, and I'd like to avoid this if possible.

I can think of other alternatives, but all involve storing data and/or
bitsets in memory, which is not very scalable.  I need to be able to
handle
millions of documents.

I'm also open to any solution that don't involve ParallelReader that would
help me make quick updates in the most non-disruptive and scalable
fashion.
But it just seems that ParallelReader would be perfect for me needs, if I
can get past this issue.

I've seen posts about this issue on the list, but nothing pointing to a
solution.  Can somebody help me out?

Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Range search in numeric fields

2007-04-03 Thread Andy Liu

You can try using MemoryCachedRangeFilter.

https://issues.apache.org/jira/browse/LUCENE-855

It stores field values in memory as longs so your values don't have to be
lexicographically comparable.  Also, MemoryCachedRangeFilter can be orders of
magnitude faster than standard RangeFilter, depending on your data.

Andy

On 4/3/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:


Hi All,
I have the following problem:
I have to implement range search for fields that contain numbers. For
example the field size that contains file size. The problem is that the
numbers are not kept in strings with strict length. There are field
values like this: "32", "421", "1201". So when making a search like this:
+size:[10 TO 50], as the ordering for strings is lexicographical, the result
contains the documents with size 32 and 1201. I can see the following
possible approaches:
1. Changing the indexing process so that all data entered in those fields has
a fixed length. Example 032, 421, 0001201. (This approach is sketched below.)
Disadvantages here are:
- All existing indexes have to be reindexed;
- The index will grow a bit.

2. Generating a query without ranges but including all numbers between the
bounds - +size=10 +size=11 +size=12 ... +size=49 +size=50. For
narrow ranges it makes sense but for large ones... :)

3. Generating a query with intervals (inclusive and exclusive), but the
number of these intervals will be the same (or one more) as the
conditions in point 2. +size:[10 TO 50] -size:[10 TO 119] -
size:[11 TO 1299] ... etc.

So if someone can help with some new opportunity please mail.

Thanks in advance.
Ivan
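
A small sketch of approach 1 above (zero-padding at index time), assuming
Lucene 1.9/2.x field constants and an assumed maximum width of ten digits:

    java.text.DecimalFormat padded = new java.text.DecimalFormat("0000000000");

    // index time: store "0000001201" instead of "1201"
    Document doc = new Document();
    doc.add(new Field("size", padded.format(1201), Field.Store.YES, Field.Index.UN_TOKENIZED));

    // query time: pad the bounds the same way before building the range query
    Query range = new RangeQuery(new Term("size", padded.format(10)),
                                 new Term("size", padded.format(50)), true);

Padding keeps lexicographic order in line with numeric order, at the cost of
reindexing the existing data.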

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Index updates between machines

2007-04-03 Thread Andy Liu

Sounds like you might have an I/O issue.  If you have multiple partitions /
disks on the searching server you can search from one partition and copy to
another and alternate.  If you're using RAID different RAID levels are
optimized for simultaneous reads and writes.

If you have a 3rd machine you can load balance 2 search servers and take one
out of the cluster when the index is being copied.  Alternatively, if it's
possible, you can copy the index at an off-peak hour.

Andy

On 4/3/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:


How fast are your disks?  Perhaps they are having trouble keeping up with
simultaneous searches and massive file copying.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Chun Wei Ho <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, April 3, 2007 10:40:16 AM
Subject: Index updates between machines

We are running a search service on the internet using two machines. We
have a crawler machine which crawls the web and merges new documents
found into the Lucene index. We have a searcher machine which allows
users to perform searches on the Lucene index.

Periodically, we would copy the newest version of the index from the
crawler machine over to the searcher machine (via copy over a NFS
mount). The searcher would then detect the new version, close the old
index, open the new index and resume the search service.

As the index has been growing in size, we have been noticing that the
search response time on the searcher machine increases drastically
when an index (about 15GB) is being copied from the crawler to the
searcher. Both machines run Fedora Core 4 and are on a gbps lan.

We've tried a number of ways to reduce the impact of the copy over NFS
on searching performance, such as "nice"ing the copy process, but to
no avail. I wonder if anyone is running a lucene search service over a
similar architecture and how you are managing the updates to the
lucene index.

Thanks!

Regards,
CW

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: How many Searches is a Searcher Worth?

2007-04-05 Thread Andy Goodell

My approach to dealing with these kinds of issues (which has worked well for
me thus far) is:

- Run java with -XX:+HeapDumpOnOutOfMemoryError command-line option
- use jhat to inspect the heap dump, like so:
$ /usr/java/jdk1.6/bin/jhat ./java_pid1347.hprof

jhat will take a while to parse the heap dump, and will start an http
listener on port 7000 by default.

Interesting statistics can be found at the bottom of the front page.  These
will enable you to discover whether it is a memory leak in the java runtime
or in the lucene library.

- andy g



On 4/5/07, Craig W Conway <[EMAIL PROTECTED]> wrote:


So, forgetting the RMI stuff, I put together a test client very similar to
the one in the book "Lucene in Action" page 182.

The  client:

1. instantiates an IndexSearcher
2. loops through queries, searches, prints hit count, saves nothing

I am only able to run through about 40 searches before I get an
OutOfMemoryException.  JDK 1.5

Because of this, I have put a counter in my search server to close and
re-open the IndexSearcher after a certain number of searches. But this
shouldn't be necessary, right? What's eating up all the memory?

Source code @ http://urbanmarsupial.com/share/TestLuceneMemory.java

Any hints would be greatly appreciated!

Thanks,

Craig

- Original Message 
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, April 4, 2007 10:39:50 AM
Subject: Re: How many Searches is a Searcher Worth?

No reason that I can think of.  What makes you think the problem is with
the IndexSearcher?  Maybe it's something else in your code, for instance.
Make sure you have the same version of Java on both ends of the
call.  Also, Java 6 made our RMI calls a lot more stable than even 1.5.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Craig W Conway <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, April 4, 2007 1:25:27 PM
Subject: How many Searches is a Searcher Worth?

I am using an RMI architecture for calling a remote service which uses an
IndexSearcher in its own JVM. I am starting the service with the following
provisions for memory allocation and garbage collection: java -server
-Xmx1024m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

After about 1000 search calls I start to run out of memory, and I have to
close and re-open the IndexSearcher, losing any cached data and filters...
Is there any reason why I shouldn't be able to use my IndexSearcher forever,
until I want to close it?

Thanks!

Craig










-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]













Searching with a score cutoff

2007-06-04 Thread Andy Goodell

Currently our application implements a score cutoff by iterating through the
hits and then stopping once it reaches a hit whose score is below our
threshold.  We'd like to optimize this (and avoid looking at the entire hits
when we don't need to) by having the score cutoff applied when the hits are
gathered.  The only way I can see of doing this is by over-riding
Similarity, which seems like an incredibly complex procedure.  What am I
missing?

- andy g.
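
One alternative that avoids touching Similarity, sketched under the assumption
that a fixed absolute threshold is acceptable: apply the cutoff in a HitCollector
so below-threshold hits are never gathered in the first place (the threshold
value here is made up).

    final float cutoff = 0.5f;                       // hypothetical threshold
    final List matchingDocs = new ArrayList();
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            if (score >= cutoff) {
                matchingDocs.add(new Integer(doc));  // keep the doc id; load documents later
            }
        }
    });

Note that a HitCollector sees raw scores, not the normalised scores exposed
through Hits, so the threshold has to be chosen with that in mind.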


trying to boost a phrase higher than its individual words

2005-10-27 Thread Andy Lee
I have a situation where I want to search for individual words in a  
phrase as well as the phrase itself.  For example, if the user enters  
["classical music"] (with quotes) I want to find documents that  
contain "classical music" (the phrase) *and* the individual words  
"classical" and "music".


Of course, I could just search for the individual words and the  
phrase would get found as a consequence.  But I want documents  
containing the phrase to appear first in the search results, since  
the phrase is the user's primary interest.


I've constructed the following query, using boost values...

[+(content:"classical music"^5.0 content:classical^0.1  
content:music^0.1)]


...but the boost values don't seem to affect the order of the search  
results.


Am I misunderstanding the purpose or proper usage of boosts, and if  
so, can someone explain (at least roughly) how to achieve the desired  
result?


--Andy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: trying to boost a phrase higher than its individual words

2005-10-28 Thread Andy Lee

On Oct 28, 2005, at 10:38 AM, Erik Hatcher wrote:
So in this case a matching document must have both terms?  Or could  
it just have one or the other?  If it must have both, you could try  
a PhraseQuery with a slop of Integer.MAX_VALUE.  PhraseQuery scores  
closer matches higher.


Good to know, thanks.  I saw references to slop but didn't know what  
they meant.  I'll see if this is one way I could solve my problem.
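
A tiny sketch of that sloppy-PhraseQuery idea, assuming both words are required
anyway: the query matches the two words at any distance, but scores closer, more
phrase-like matches higher.

    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("content", "classical"));
    phrase.add(new Term("content", "music"));
    phrase.setSlop(Integer.MAX_VALUE);   // effectively "both words, any distance apart"
    Hits hits = searcher.search(phrase);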


But as Chris suggested - check the IndexSearcher.explain() for some  
documents you feel should be ranked higher and work from there.   
You're on the right track, but some tuning appears necessary.


Okay, I looked at the explanations and realized part of the problem  
was that I was applying a sort field to the search results, which I  
had forgotten.  So of course that affected the display order, duh.   
But I also do need to do some tuning, because I'm adding other stuff  
to the query that is also skewing the ranking.


It took me a while to figure out the differences between the  
searcher.explain() example in LIA and the latest changes to the API.   
It was a little annoying that I couldn't find a way to get plain text  
output -- it seems to be only HTML now.  Finally I wrote a  
convenience method that dumps the HTML to a file, which I view in a  
browser.


Thanks, Chris and Erik!

--Andy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: trying to boost a phrase higher than its individual words

2005-10-28 Thread Andy Lee

On Oct 28, 2005, at 8:17 PM, Chris Hostetter wrote:
One thing to keep in mind is that if you have things you are adding  
to the
query to restrict the results, but you don't want them to  
contribute to
the score, then try using a Filter instead.  If you can't find an  
easy way
to replace a query by a filter, try using a boost of 0.0001 ( i'd  
say use

a boost of 0, but I'm not sure that all query types handle that as
correctly as they should)


Thanks for the advice.  I hadn't even noticed the Filter classes  
until very recently.  I really need to take the time to work  
methodically through LIA...



Really? .. the LIA example i found was in 3.3.1, it just printed out
explanation.toString() ... that should still work just fine even  
with the

trunk of SVN.


You know what, I was confusing Nutch and Lucene classes (as I've done  
before), in this case the IndexSearcher classes.  All I could find  
was the *Nutch* IndexSearcher's getExplanation() method, which I see  
sends toHtml() rather than toString() to its internal Lucene  
IndexSearcher.


--Andy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reverse sorting by index order

2005-11-03 Thread Andy Lee

On Nov 3, 2005, at 9:37 AM, Oren Shir wrote:

If I understand correctly, when sorting by Sort.INDEXORDER the oldest
documents that were added to the index will be returned first. I  
want the

reverse, because I'm more interested in newer documents.


Looking at the source, I see that Sort.INDEXORDER is simply an  
instance of Sort:


  public static final Sort INDEXORDER = new Sort(SortField.FIELD_DOC);

Haven't tried this myself, but you could create your own instance  
that uses a reverse sort:


  Sort reverseIndexOrder = new Sort(SortField.FIELD_DOC, true);

And use that wherever you were using Sort.INDEXORDER.

--Andy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reverse sorting by index order

2005-11-03 Thread Andy Lee

On Nov 3, 2005, at 10:22 AM, Oren Shir wrote:
There is no constructor for Sort(SortField, boolean) in Lucene API.  
Which

version are you using?


I think 1.9rc1.  I have a pretty recent svn checkout -- maybe this  
constructor is new.


--Andy



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



IO Error/Jira

2005-12-01 Thread Andy Hind
Hi

 

I would like some pointers for causes of the following error, using
lucene 1.4.3.

 

I have not really got much to go on at the moment other than the error.

(I am aware of the IBM JVM issue and would be surprised if this JVM is
being used - but I do not have enough information to rule this out)

Has anyone seen anything similar?

 

 

Caused by: java.io.IOException: read past EOF
        at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:57)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
        at org.apache.lucene.index.MultiReader.norms(MultiReader.java:159)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
        at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)

 

I am assuming the index has been corrupted in some way.

Is it possible this could be related to issue 415?

 

When I look at bugs in Jira it is not clear to me which version of Lucene I
need in order to pick up a fix.

 

Has version 1.4.3 been fixed up beyond the latest official binary dated
29-Nov-2004?

Should I be getting and building from the repository?

 

Any help appreciated,

 

Regards

 

Andy



Re: sub search

2006-03-07 Thread hu andy
2006/3/7, Anton Potehin <[EMAIL PROTECTED]>:
>
> Is it possible to make search among results of previous search?
>
>
>
>
>
> For example: I made search:
>
>
>
> Searcher searcher =...
>
>
>
> Query query = ...
>
>
>
> Hits hits = 
>
>
>
> hits = Searcher.search(query);
>
>
>
>
>
>
>
> After it I want to not make a new search, I want to make search among
> found results...
>
> You can use it like this

TermQuery termQuery = new TermQuery(new Term("field", "value")); // hypothetical example term
Filter queryFilter = new QueryFilter(termQuery);
hits = searcher.search(query, queryFilter);


Re: sub search

2006-03-07 Thread hu andy
It uses a cache mechanism. The details are described in the book Lucene in
Action. Maybe you can test it to decide which is faster.

2006/3/7, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
>
> As far as I understood, that will make a new search throughout the index. But
> what is the difference between that and the search described below:
>
> TermQuery termQuery = new TermQuery(
> BooleanQuery bq = ..
> bq.add(termQuery,true,false);
> bq.add(query,true,false);
> hits = Searcher.search(bq,queryFilter);
>
>
>
> -Original Message-
> From: hu andy [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 07, 2006 12:40 PM
> To: java-user@lucene.apache.org
> Subject: Re: sub search
> Importance: High
>
> 2006/3/7, Anton Potehin <[EMAIL PROTECTED]>:
> >
> > Is it possible to make search among results of previous search?
> > For example: I made search:
> > Searcher searcher =...
> > Query query = ...
> > Hits hits = 
> > hits = Searcher.search(query);
> > After it I want to not make a new search, I want to make search among
> > found results...
> >
> > You can use it like this
>
> TermQuery termQuery = new TermQuery(new Term("field", "value")); // hypothetical example term
> Filter queryFilter = new QueryFilter(termQuery);
> hits = searcher.search(query, queryFilter);
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


who can tell me how lucene search in the index files

2006-03-14 Thread hu andy
I see there are seven different files with extensions such as .fnm, .tis, etc. I
just can't work out how it looks a term up in the .tis file. Does Lucene use
binary search to locate the term?


About index deletion

2006-03-16 Thread hu andy
Because I will delete indexed documents periodically, the index files
must be cleaned up after that. If I just want to delete some documents added
before some past day from the index, how should I do it?
Thank you in advance
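
A rough sketch of one way to do this, assuming each document was indexed with an
untokenized "date" field holding a lexicographically sortable value such as
"20060115" (both the field name and the format are assumptions):

    IndexReader reader = IndexReader.open("/path/to/index");
    String cutoff = "20060301";                    // delete everything dated before this day
    TermEnum terms = reader.terms(new Term("date", ""));
    try {
        do {
            Term t = terms.term();
            if (t == null || !"date".equals(t.field())) {
                break;                             // ran past the "date" field
            }
            if (t.text().compareTo(cutoff) < 0) {
                reader.delete(t);                  // deletes every document carrying this date term
            }
        } while (terms.next());
    } finally {
        terms.close();
        reader.close();                            // commits the deletions
    }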


Re: question...

2006-03-16 Thread hu andy
Do you mean you pack the index files into the file *.luc? If that is the case,
Lucene can't read it.
If you put the index files and *.luc together under some directory, that's OK.
Lucene knows how to find these files.


2006/3/14, Aditya Liviandi <[EMAIL PROTECTED]>:
>
>  Hi all,
>
>
>
> If I want to embed the index files into another file (say of extension
> *.luc, so now all the index files are flattened inside this new file), can I
> still use the index without having to extract out the index files to a temp
> folder?
>
>
>
> aditya
>
> --- I²R Disclaimer
> --
> This email is confidential and may be privileged.  If you are not the
> intended recipient, please delete it and notify us immediately. Please do
> not copy or use it for any purpose, or disclose its contents to any other
> person. Thank you.
>
> -
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Speed up Indexing

2006-03-22 Thread hu andy
Hi, everyone. I have a large amount of XML files of total size 1GB. I use Lucene (the
.NET edition) to index them. There are 8 fields per document, with 4 keyword
fields and 4 unstored fields. I have set minMergeDocs to 1 and
mergeFactor to 100. It took about 2.5 hours (main memory 3GB, CPU P4). I
also tried in-memory indexing, which also took more than 2.5 hours. Due to the
performance requirement, I need to complete the indexing in one hour without
the use of a distributed or clustered system. Is that possible? Is it
faster to use Java Lucene than the .NET one? Any advice will be appreciated.
Thank you in advance.


Re: Update or Delete Document for Lucene 1.4.x

2006-04-03 Thread hu andy
IndexReader.delete(int docNum) or IndexReader.delete(Term term)
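
For completeness, a sketch of the usual delete-then-re-add "update" idiom in
1.4.x, assuming every document carries a unique Keyword field named "id" (the
field name and paths are made up):

    // 1. remove the old version through an IndexReader
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.delete(new Term("id", "42"));
    reader.close();                          // flushes the deletion to disk

    // 2. add the new version through an IndexWriter
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(Field.Keyword("id", "42"));
    doc.add(Field.Text("contents", "the updated text"));
    writer.addDocument(doc);
    writer.close();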

2006/4/1, Don Vaillancourt <[EMAIL PROTECTED]>:
>
> Hi All,
>
> I need to implement the ability to update one document within a Lucene
> collection.
>
> I haven't been able to find anything in the API.  Is there a way to
> update one document or delete a document so that I can add an update?
>
> Thank You
>
> --
> Don Vaillancourt
> Director of Software Development
> WEB IMPACT INC.
> phone:   416-815-2000 ext. 245
> fax: 416-815-2001
> toll free:   866-319-1573 ext. 245
> email:   [EMAIL PROTECTED] 
> blackberry:  [EMAIL PROTECTED]
> 
> web: http://www.web-impact.com
> address: http://www.mapquest.ca
> <
> http://www.mapquest.com/maps/map.adp?country=CA&addtohistory=&formtype=address&searchtype=address&cat=&address=99%20Atlantic%20Ave&city=Toronto&state=ON&zipcode=M6K%203J8
> >
>
>
> This email message is intended only for the addressee(s) and contains
> information that may be confidential and/or copyright.
>
> If you are not the intended recipient please notify the sender by reply
> email and immediately delete this email.
>
> Use, disclosure or reproduction of this email by anyone other than the
> intended recipient(s) is strictly prohibited. No representation is made
> that this email or any attachments are free of viruses. Virus scanning
> is recommended and is the responsibility of the recipient.
>
>
>


What is the retrieval model for lucene?

2006-04-10 Thread hu andy
I have seen in some documents that there are three kinds of retrieval model
which are used often: Boolean, vector space and probabilistic.
So I want to know which one is used by Lucene. Thank you in advance.


Re: performance differences between 1.4.3 and 1.9.1

2006-04-26 Thread Andy Goodell
For my application we have several hundred indexes, different subsets
of which are searched depending on the situation.  Aside from not
upgrading to lucene 1.9, or making a big index for every possible
subset, do you have any ideas for how we can maintain fast
performance?

- andy g

On 4/26/06, Daniel Naber <[EMAIL PROTECTED]> wrote:
> MultiSearcher in Lucene 1.4 had a broken ranking implementation. This has
> been fixed in Lucene 1.9, but this might have bad effects on performance.
> 23 indexes is quite a lot; maybe you can speed things up greatly by using a
> smaller number of indexes.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Ask for a better solution for the case

2006-04-28 Thread hu andy
Hi, I have an application that needs to mark the retrieved documents which have
been read, so the next time I needn't read the marked documents again.

One idea is to add a particular field to the indexed
document. But as Lucene has no update method, I would have to delete that
document and add it again, which seems a little clumsy. Or I could use
a database to satisfy the marking requirement, but how does the database relate
to the Lucene index, especially when I want to retrieve documents that I have
read? Maybe there is a better idea.

Any suggestion will be greatly appreciated.


Maybe a bug of lucene 1.9

2006-05-29 Thread hu andy

I indexed a collection of Chinese documents. I use a special segmentation
api to do the analysis, because the segmentation of Chinese is different
from English.

A strange thing happened.  With Lucene 1.4 or Lucene 2.0 it is fine: I can retrieve
the corresponding documents given terms that exist in the index's *.tis file (I wrote
a program to pick the terms from the .tis file and search for them).  But with 1.9,
for some terms that existed in the index, I couldn't retrieve the corresponding document.

Can anybody give me some advice about this? Thank you in advance.


Re: Maybe a bug of lucene 1.9

2006-05-30 Thread hu andy

2006/5/29, hu andy <[EMAIL PROTECTED]>:


 I indexed a collection of Chinese documents. I use a special segmentation
api to do the analysis, because the segmentation of Chinese is different
from English.

 A strange thing happened.   With lucene 1.4 or lucene 2.0, it will be all
right to retrieve the corresponding documents given the terms that exist in
the index  *.tis file(I wrote a program to pick the terms from the .tis file
and search them).  But with 1.9, for some terms that existed in the index,
I couldn't retrieve the corresponding document.

Can anybody give me some advice about this? Thank you in advance.



Re: Maybe a bug of lucene 1.9

2006-06-05 Thread hu andy

I am very grateful for all your replies and very sorry for my late response.
You can see that I posted my message twice, because I didn't see it after I
first posted it and thought it wouldn't appear on the list. So these days I
didn't check my gmail box.  I have figured out the problem. The index was a
mixture of two index formats, the compound format (.cfs) and the
non-compound one (.tis, .tii, and so on).

 >I'll second Otis' request about the special segmentation api.  If it

is open source, I'd love to tinker with it.  Chinese is not too hard.  :)


The api is open source, and I can distribute it to you. I am glad you understand
Chinese. How should I deliver it to you?  The api includes a Chinese
lexicon which is nearly 10MB in size. Maybe I can mail it to you.



2006/5/30, Erik Hatcher <[EMAIL PROTECTED]>:



On May 29, 2006, at 6:34 AM, hu andy wrote:
> I indexed a collection of Chinese documents. I use a special
> segmentation
> api to do the analysis, because the segmentation of Chinese is
> different
> from English.

I'll second Otis' request about the special segmentation api.  If it
is open source, I'd love to tinker with it.  Chinese is not too hard.  :)

> A strange thing happened.   With lucene 1.4 or lucene 2.0, it will
> be all
> right to retrieve the corresponding documents given the terms that
> exist in
> the index  *.tis file(I wrote a program to pick the terms from
> the .tis file
> and search them).  But with 1.9, for some terms that existed in the
> index, I
> couldn't retrieve the corresponding document.
>
> Can anybody give me some advice about this? Thank you in advance.

If you can share an example that demonstrates an issue, we'd love to
have it and incorporate it into our test suite and fix the
implementation if a bug exists.   A working example of a bug can get
fixed much easier than looking for needles in a haystack.

   Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Consult some information about adding index while searching

2006-07-27 Thread hu andy

I met this problem: while searching, I add documents to the index. Although I
instantiate a new IndexSearcher, I can't retrieve the newly added
documents. I have to close the program and start it again, then it is
OK.

The platform is Windows XP. Is it the fault of XP?

Thank you in advance.
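
For reference, the ordering Michael asks about below, sketched in Java (the same
applies to the C# port) and assuming analyzer, newDocument and query are set up
elsewhere: flush and close the writer before opening the new searcher.

    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
    writer.addDocument(newDocument);
    writer.close();                          // flush segments to disk first

    IndexSearcher searcher = new IndexSearcher("/path/to/index");   // open only after the close
    Hits hits = searcher.search(query);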


Re: Consult some information about adding index while searching

2006-07-27 Thread hu andy

Yes, I have closed IndexWriter.  But it doesn't work.

2006/7/27, Michael McCandless <[EMAIL PROTECTED]>:



> I met this problem: when searching,  I add documents to index. Although
I
> instantiates a new IndexSearcher,  I can't retrieve the newly added
> documents. I have to close the program and enter the program, then it
will
> be ok.

Did you close your IndexWriter (so it flushes all changes to disk)
before instantiating a new IndexSearcher?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Consult some information about adding index while searching

2006-07-28 Thread hu andy

This code is written in C#. There is a C# version of Lucene 1.9, which
can be downloaded from http://www.dotlucene.net
This implements the indexing:
 public void CreateIndex()
   {
   try
   {
   AddDirectory(directory);
   writer.Optimize();
   writer.Close();
   directory.Refresh();
   }
   catch (Exception e)
   {
   fmLog.AddLog(fmLog.LogType.Error, Current.User.ID, e.Message
);
   return;
   }
   }

This is a wrapper around IndexSearcher. At first I wanted to use a singleton
IndexSearcher, but then I found the updated documents couldn't be retrieved
immediately. So now I instantiate a new IndexSearcher every time, although it is
inefficient.
   public class SingletonSearcher
   {
    static SingletonSearcher searcher;
   IndexSearcher indexSearcher = null;
   static Object o = typeof(SingletonSearcher);



   private SingletonSearcher(String indexPath)
   {
   try
   {
   indexSearcher = new IndexSearcher(indexPath);
   }
   catch (Exception e)
   {
   Console.WriteLine(e.Message);
   searcher = null;
   }
   }
   public static SingletonSearcher GetSearcher()
   {
   //lock (o)
   //{
   //if (searcher == null)
   //   searcher = new SingletonSearcher(Current.Server.Path);
   //return searcher;
   //}
   return new SingletonSearcher(Current.Server.Path);
   }

   public static Hits GetHits(Query query)
   {
   if (GetSearcher() == null)
   return null;
   else if (GetSearcher().indexSearcher == null)
   return null;
   return GetSearcher().indexSearcher.Search(query);
   }
}

2006/7/28, Doron Cohen <[EMAIL PROTECTED]>:


> Yes, I have closed IndexWriter.  But it doesn't work.

This is strange...
Can you post a small version of your code that can be executed to show the
problem?
- Doron


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Consult some information about adding index while searching

2006-07-30 Thread hu andy

Thank you


About the use of HitCollector

2006-08-07 Thread hu andy

How can I  use HitCollector to iterate over every returned document?

Thank you in advance.


Re: About the use of HitCollector

2006-08-07 Thread hu andy

Martin, Thank you for your reply.

But the Lucene API says:
This is called in an inner search loop. For good search performance,
implementations of this method should not call
Searcher.doc(int) or IndexReader.document(int) on
every document number encountered.

Because I have to check a field in the document to determine whether I
should return it, and the total number of documents is about two
hundred thousand, I'm worried about the
performance.


2006/8/7, Martin Braun <[EMAIL PROTECTED]>:


hi andy,
> How can I  use HitCollector to iterate over every returned document?

You have to override the function collect for the HitCollector class and
then store the retrieved Data in an array or map.

Here is just a source-code scratch (is = IndexSearcher)

   is.search(query, null, new HitCollector(){
   public void collect(int docID, float score)
   {
   Document doc = is.doc(docID);
   titles[docID] = doc.get("title");
   }
   });


hth,
martin

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: About the use of HitCollector

2006-08-07 Thread hu andy

Hey,Simon, thanks for your reply
I have an ID field in the index. For the sake of indexing speed, I put
some fields in
a database, because I found that the number of fields in a Document badly
degrades the indexing speed. So for the search, I will first query the
database to get a list of IDs, then use the list to check whether the Lucene
search results should be returned.
Can you give me some suggestions?

Also, can you show me how you use the filter?

2006/8/8, Simon Willnauer <[EMAIL PROTECTED]>:


Hey Andy,

i don't know how you determine whether a document has to be
displayed or not, but I use a filter for such kinds of jobs. We have an
index for a specific website with personalized areas which should be
searchable for users having corresponding usergroups. That works quite
well, and you can use the filter cache, e.g. cache the filter itself for
your queries.

regards Simon

On 8/7/06, hu andy <[EMAIL PROTECTED]> wrote:
> Martin, Thank you for your reply.
>
> But the Lucene API said:
> This is called in an inner search loop. For good search performance,
> implementations of this method should not call
> Searcher.doc(int) or IndexReader.document(int) on
> every document number encountered
>
> Because I have to check a field in the document to determine whether I
> should return the document. The total number of documents is about two
> hundred thousand. So I'm afraid the
> performance
>
>
> 2006/8/7, Martin Braun <[EMAIL PROTECTED]>:
> >
> > hi andy,
> > > How can I  use HitCollector to iterate over every returned document?
> >
> > You have to override the function collect for the HitCollector class
and
> > then store the retrieved Data in an array or map.
> >
> > Here is just a source-code scratch (is = IndexSearcher)
> >
> >is.search(query, null, new HitCollector(){
> >public void collect(int docID, float
score)
> >{
> >Document doc = is.doc(docID);
> >titles[docID] = doc.get
("title");
> >}
> >});
> >
> >
> > hth,
> > martin
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: About the use of HitCollector

2006-08-08 Thread hu andy

Hey, Ryan, Thanks for your reply.
The scenario is that I use a custom Filter which gets some information from a
database table which consists of hundreds of thousands of rows. I use
IndexSearcher.search(query, filter, hitcollector). I found it consumed
more time with the filter than without it.
Can you give me some advice?


2006/8/8, Ryan O'Hara <[EMAIL PROTECTED]>:


Hey Andy,

If you have enough RAM, try using FieldCache:

String[] fieldYouWant = FieldCache.DEFAULT.getStrings
(searcher.getIndexReader(), "fieldYouWant");
searcher.search(query, new HitCollector(){
   public void collect(int doc, float score){
   doWhatYouWant(fieldYouWant[doc]);
   }
}

If you need all results, this is probably the fastest method.
However, this is assuming you have some way of storing the
fieldYouWant array.  For all indexes, especially those containing
many documents, it is a good idea to store the fieldYouWant array,
since creating this array creates serious overhead.

Best,
Ryan

On Aug 7, 2006, at 9:48 AM, hu andy wrote:

> Martin, Thank you for your reply.
>
> But the Lucene API said:
> This is called in an inner search loop. For good search performance,
> implementations of this method should not call
> Searcher.doc(int) or IndexReader.document(int) on
> every document number encountered
>
> Because I have to check a field in the document to determine whether I
> should return the document. The total number of documents is about two
> hundred thousand. So I'm afraid the
> performance
>
>
> 2006/8/7, Martin Braun <[EMAIL PROTECTED]>:
>>
>> hi andy,
>> > How can I  use HitCollector to iterate over every returned
>> document?
>>
>> You have to override the function collect for the HitCollector
>> class and
>> then store the retrieved Data in an array or map.
>>
>> Here is just a source-code scratch (is = IndexSearcher)
>>
>>is.search(query, null, new HitCollector(){
>>public void collect(int docID,
>> float score)
>>{
>>Document doc = is.doc(docID);
>>titles[docID] = doc.get
>> ("title");
>>}
>>});
>>
>>
>> hth,
>> martin
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: SQL-Like Join in Lucene

2006-08-10 Thread hu andy

4. Search for records with filter.

If the filter returns a lot of IDs, it won't be fast.
Recently I ran a test. I customized a filter which gets a list of IDs from a
MySQL database table of size 5000. When I invoked search(query, filter,
hitcollector), it took me more than 40s to retrieve the first 100 hits. Then
I tried search(query, hitcollector); the code fragment follows:

   public MyHitCollector1(ArrayList list , String[] allIDs, ArrayList
idList)
   {
   this.list = list;
   this.idList = idList;
   this.allIDs = allIDs;
   }
   public override void Collect(int doc, float score)
   {
   if (score > 0.0f)
   {
   if (count < 1000)
   {
   if (idList.BinarySearch(allIDs[doc]) >= 0)
   {
   list.Add(IndexSearcher.GetDocument(doc));
   count++;
   }
   }
   else
   return;

   }
   }
you can get String[] allIDs = FieldCache.DEFAULT.GetStrings(IndexReader(),
"ID") in this way.

This method took about 8s to retrieve the first 100/1000 hits. Here the
database table has about 300,000 records


2006/8/11, Aleksei Valikov <[EMAIL PROTECTED]>:


Hi.

I'm investigating a possibility to make a "join" in Lucene/Compass.

Here's the thread:
http://forums.opensymphony.com/thread.jspa?threadID=39685&tstart=0

I have records m:m entities. Entities hold indexed information. Records
consist
of entities. One entity may belong to many records.
I would like to search for records having certain entity information.

Entity documents contain indexable entity fields plus entity id.
Record documents contain indexable record fields, record id and entity
ids.

I'd like to "search for records having entity ids in (search for entity
ids
where entity fields satisfy condition)".

Currently I am using self-written InSetFilter to accomplish the task.
1. Search among entities by the given condition.
2. Put ids of the found entities into a set.
3. Create filter with this set.
4. Search for records with filter.

The join is basically implemented by a filter:

   public BitSet bits(IndexReader reader) throws IOException {

   BitSet bits = new BitSet(reader.maxDoc());

   int[] docs = new int[1];
   int[] freqs = new int[1];
   for (Iterator iterator = set.iterator(); iterator.hasNext();)
{
   final String id = (String) iterator.next();
   final TermDocs termDocs = reader.termDocs(new
Term(fieldName, id));
   final int count = termDocs.read(docs, freqs);
   if (count == 1) {
   bits.set(docs[0]);
   }
   }
   return bits;
   }

Is this approach ok or I missed something and there's an easier way to
join?

Thank you for you time.

Bye.
/lexi

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Lopsided scores for each term in BooleanQuery

2006-09-18 Thread Andy Liu

For multi-word queries, I would like to reward documents that contain a more
even distribution of each word and penalize documents that have a skewed
distribution.  For example, if my search query is:

+content:fast +content:car

I would prefer a document that contains each word an equal number of times
over a document that contains the word "fast" 100 times and the word "car" 1
time.  In other words, I would like to compare the scores of each
BooleanQuery term and adjust the score according to the distribution.

Can somebody point me in the right direction as to how I would implement
this?

Thanks,
Andy
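
One knob worth knowing about here (purely an illustration, not necessarily the
right fix): per-term saturation lives in Similarity.tf(), so a custom Similarity
that damps term frequency harder than the default square root makes 100
occurrences of "fast" worth only a little more than one, which indirectly
favours an even spread across the query terms.

    public class DampedTfSimilarity extends DefaultSimilarity {
        public float tf(float freq) {
            // log-style damping instead of the default sqrt; an arbitrary example curve
            return (float) Math.log(1.0 + freq);
        }
    }

    // ...
    searcher.setSimilarity(new DampedTfSimilarity());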


Re: Lopsided scores for each term in BooleanQuery

2006-09-18 Thread Andy Liu

In our application we have multiple fields that are searched.  So fast car
becomes:

+(field1:fast field2:fast field3:fast) +(field1:car field2:car field3:car)

I understand that the default sqrt implementation of tf() would help the
"lopsided score" phenomenon with searches within the same field.  But when
searching in multiple fields, this effect is obscured since each matching
field adds to the score of that clause.  Is there a way to "peek" at the
scores of each clause, and adjust based on how divergent the scores are?  Or
is there an easier way to do this that I'm just not seeing?

Andy

On 9/18/06, Paul Elschot <[EMAIL PROTECTED]> wrote:


On Monday 18 September 2006 23:08, Andy Liu wrote:
> For multi-word queries, I would like to reward documents that contain a
more
> even distribution of each word and penalize documents that have a skewed
> distribution.  For example, if my search query is:
>
> +content:fast +content:car
>
> I would prefer a document that contains each word an equal number of
times
> over a document that contains the word "fast" 100 times and the word
"car" 1
> time.  In other words, I would like to compare the scores of each
> BooleanQuery term and adjust the score according to the distribution.
>
> Can somebody point me in the right direction as to how I would implement
> this?

It's already there in DefaultSimilarity.tf() which is the square root:

(sqrt(1) + sqrt(1)) > (sqrt(0) + sqrt(2))


Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Obtaining the contexts of hits

2005-03-09 Thread Andy Roberts
Hi,

I've been using Lucene for a few months now, although not in a typical 
"building a search engine" kind of way*. Basically, I have some large 
documents. I would like a system whereby I search for a term, and then I 
receive a hit for each match, with its context, e.g., ten words either side 
of the match.

I've been looking through the API, and there's a SpanNearQuery (or something 
similar) which looked at first glance to be what I want, but after second 
thoughts, it appears it's searching for a set of words within a given 
proximity. 

I know Lucene stores TermPositions: I know how to get the position of a match. 
Is there a way of doing a reverse lookup, i.e., given a position, return the 
term. Because if that were the case, I could easily build a context up by 
finding pos, then looping to get all terms +- a specified window.

Either way, I can't see a way forward. Simply being able to find which 
documents the terms are in isn't very helpful, because let's say I find the 
docs that match, I then have to open up each one, tokenise and re-search for 
my term,etc, etc. All this info is there, somewhere in the index, I just want 
to get to it so that I can benefit from the many speed benefits of Lucene.

I don't expect sample code or anything, just a pointer to the right direction. 
I do own the LIA book, but haven't read it all yet - so if there's anything 
in there which could be relevant, please let me know :)

Much help appreciated,
Andrew Roberts

* for those who are interested, I'm a computer scientist doing research which 
is basically a cross between computational linguistics and machine learning. 
I work with large text corpora, gather information about how words behave 
relative to each other and try to infer word class and grammatical structure from 
it. I'm using Lucene to try and speed up text processing overheads, by having 
my corpora indexed.

-- 
Computer Vision and Language Reseearch Group
School of Computing
University of Leeds
Leeds, UK
LS2 9JT
http://www.comp.leeds.ac.uk/andyr

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Highlighter compile error

2005-03-10 Thread Andy Roberts
I've searched the archives for this error, but it reported no matches...

I'm trying to get hold of the Highlighter code as this could be relevant 
to my earlier post. I've checked out the highlight repo to my PC and 
tried to build.

I get the following error:

$ ant
Buildfile: build.xml

init:
 [echo] Building highlighter

compile:
[javac] Compiling 17 source files 
to /home/andyr/programming/java/lucene/hig
hlighter/build/classes

[javac] /home/andyr/programming/java/lucene/highlighter/src/java/org/apache/
lucene/search/highlight/TokenSources.java:19: cannot find symbol
[javac] symbol  : class TermVectorOffsetInfo
[javac] location: package org.apache.lucene.index
[javac] import org.apache.lucene.index.TermVectorOffsetInfo;
[javac]^

[javac] /home/andyr/programming/java/lucene/highlighter/src/java/org/apache/
lucene/search/highlight/TokenSources.java:124: cannot find symbol
[javac] symbol  : class TermVectorOffsetInfo
[javac] location: class org.apache.lucene.search.highlight.TokenSources
[javac] TermVectorOffsetInfo[] offsets=tpv.getOffsets(t);
[javac] ^

[javac] /home/andyr/programming/java/lucene/highlighter/src/java/org/apache/
lucene/search/highlight/TokenSources.java:124: cannot find symbol
[javac] symbol  : method getOffsets(int)
[javac] location: interface org.apache.lucene.index.TermPositionVector
[javac] TermVectorOffsetInfo[] offsets=tpv.getOffsets(t);
[javac]   ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors

BUILD FAILED
/home/andyr/programming/java/lucene/common.xml:107: Compile failed; see the 
compiler error output for details.

It may not be obvious to those not using fixed-width fonts, but basically it 
can't find the 
TermVectorOffsetInfo class. Which is hardly surprising, since it doesn't seem 
to exist! I've
also downloaded and successfully built the code in the lucene-1.4.2-dev 
branch,
but that doesn't contain that class either!

Any hints? Google didn't shed any light, btw.

Cheers,
Andy Roberts

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Escaping special characters

2005-04-07 Thread Andy Roberts
On Thursday 07 Apr 2005 06:38, Chuck Williams wrote:
> Mufaddal Khumri writes (4/6/2005 11:21 PM):
> >Hi,
> >
> >Am new to Lucene. I found the following page:
> >http://lucene.apache.org/java/docs/queryparsersyntax.html. At the bottom
> >of the page there is a section that in order to escape special
> >characters one would use "\".
> >
> >
> >
> >I have an Indexer that indexes product names. Some product names have
> >"-" character in them. When I use my search class to search for product
> >names with - in them it wont find those products.
>
> How did you index those product names?  I.e., if you used a tokenized
> field for the product names and an analyzer that breaks on the hyphens,
> then there are no hyphenated tokens for you to match.  I would suggest
> using Luke to browse your index and see what you have.
>
> Chuck

The Lucene In Action book has an excellent chapter on Analysers - well worth a 
read. Of particular interest is some code that allows you to see how a given 
Analyser tokenises an input string.

You can download the source code from the book 
(http://www.lucenebook.com/LuceneInAction.zip). If you unzip this file you 
will find a directory called "LuceneInAction/src/lia/analysis" and in there 
is a class called AnalyzerDemo (which depends in AnalyzerUtils). Compile this 
and run to see how the Analysers work. Put in your hyphenated strings to see 
how they work too.

HTH
Andy
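
If you just want a quick look without pulling in the book's code, a rough
equivalent (assuming the 1.4-era TokenStream API) is:

    Analyzer analyzer = new StandardAnalyzer();
    TokenStream stream = analyzer.tokenStream("contents", new StringReader("wi-fi access-point"));
    Token token;
    while ((token = stream.next()) != null) {
        System.out.println(token.termText());   // one line per token the analyzer produced
    }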

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
Can you not provide the user with an option list to specify their input 
language?

Language identification can be a pretty tricky field. There are some tricks 
you can do with unicode to identify language, e.g., \u0600 - \u06FF contains 
the Arabic characters, so if your input contains lots of chars within this 
range, you can guess that the input is Arabic, for example.
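
As a toy illustration of that Unicode-range trick (the "more than half" threshold
is made up):

    public static boolean looksArabic(String input) {
        int letters = 0;
        int arabicLetters = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {
                letters++;
                if (c >= '\u0600' && c <= '\u06FF') {
                    arabicLetters++;
                }
            }
        }
        return letters > 0 && arabicLetters * 2 > letters;   // "lots of" = more than half
    }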

The problem comes with differentiating between the languages that use a Latin 
alphabet. Again, there are multiple approaches, although the only one I know 
of that worked pretty well for identifying European languages was to build a 
model based on character bigrams (that is, sequences of two letters) [1]

At the end of the day, Lucene cannot help you in choosing the correct language 
as it doesn't know, and so it'll be up to you to add the necessary logic to 
tell Lucene which Analyzers to utilise. :(

Andy

[1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C. Bigram and 
trigram models for language identification and classification in: Evett, L & 
Rose,T (editors) Computational Linguistics for Speech and Handwriting 
Recognition AISB'94 Workshop University of Leeds/AISB. 1994.

On Monday 11 Apr 2005 01:21, Eric Chow wrote:
> Hello,
>
> If I don't know the language of the input terms, how can I use
> different analyzer to search it ?
>
> For example, the input box accepts UTF-8 search text, they can be
> anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
> can search any of them or all of them with Lucene?
>
> Any example, please?
>
>
> Best Regards,
> Eric
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Terms & Postion from Hits ...

2005-04-11 Thread Andy Roberts
I've managed something like this from a slightly different perspective.

IndexReader ir = new IndexReader(yourIndex);

String searchTerm = "word";

TermPositions tp = ir.termPositions(new Term("contents", searchTerm);

tp.next();
int termFreq = tp.freq();
System.out.print(currentTerm.text());

for (int i=0; i < termFreq; i++) {
System.out.print(" " + tp.nextPosition());
}
System.out.println();

ir.close();

This will print out something like:

word 1 67 104 155

Where the term "word" occurs at positions 1, 67, 104 and 155 in the field 
"contents" of the index ir.

HTH,
Andy Roberts

On Sunday 10 Apr 2005 15:52, Patricio Galeas wrote:
> Hello,
> I am new with Lucene. I have following problem.
> When I execute a search I receive the list of document Hits.
> I get without problem the content of the documents too:
>
> for (int i = 0; i < hits.length(); i++) {
>   Document doc = hits.doc(i);
>   System.out.println(doc.get("content"));
> }
>
> Now, I would like to obtain the List of all Terms (and their corresponding
> position) from each document (hits.doc(i)).
>
> I have experimented creating a second index with the founded documents
> (Hits), and analyze it to obtain this information, but the algorithm work
> very slow.
>
> Do you have another idea?
>
> Thank You for your help!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
On Monday 11 Apr 2005 14:55, Mike Baranczak wrote:
> Your example with Arabic wouldn't work reliably either - there are
> several other languages that use the Arabic script (Persian for
> example).

Good point. You could try a simple approach to test for the 
additional characters that exist in Persian but not in Arabic, although this 
again is not fool-proof. A letter-model approach would be better but is 
rather time consuming.

>
> This is the sort of problem that the end user can solve much better
> than the software can.
>

I completely agree, which is why I originally suggested prompting the user for 
this info. It may be the case that for the majority of queries, English is 
the usual language. And it is probably more feasible to do a test to 
determine whether the query English or not (still very tricky, mind). If not, 
then prompt the user to specify their input language because otherwise, 
results will be poor.

Andy Roberts

> -MB
>
> On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote:
> > Can you not provide the user with a option list to specify their input
> > language?
> >
> > Language identification can be a pretty tricky field. There are some
> > tricks
> > you can do with unicode to identify language, e.g., \u0600 - \u06FF
> > contains
> > the Arabic characters, so if you're input contains lots of chars
> > within this
> > range, you can guess that the input is Arabic, for example.
> >
> > The problem comes with differentiating between the languages that use
> > a Latin
> > alphabet. Again, there are multiple approaches, although the only one
> > I know
> > of that worked pretty well for identifying European languages was to
> > build a
> > model based on character bigrams (that is, sequences of two letters)
> > [1]
> >
> > At the end of the day, Lucene cannot help you in choosing the correct
> > language
> > as it doesn't know, and so it'll be up to you to add the necessary
> > logic to
> > tell Lucene which Analyzers to utilise. :(
> >
> > Andy
> >
> > [1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C.
> > Bigram and
> > trigram models for language identification and classification in:
> > Evett, L &
> > Rose,T (editors) Computational Linguistics for Speech and Handwriting
> > Recognition AISB'94 Workshop University of Leeds/AISB. 1994.
> >
> > On Monday 11 Apr 2005 01:21, Eric Chow wrote:
> >> Hello,
> >>
> >> If I don't know the language of the input terms, how can I use
> >> different analyzer to search it ?
> >>
> >> For example, the input box accepts UTF-8 search text, they can be
> >> anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
> >> can search any of them or all of them with Lucene?
> >>
> >> Any example, please?
> >>
> >>
> >> Best Regards,
> >> Eric
> >>
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
On Tuesday 12 Apr 2005 00:53, Eric Chow wrote:
> But how about one document contains more than two different languages ??
>
>
> Eric

If you're indexing many documents which contain multiple languages then it's 
probably just better to use a SimpleAnalyser, rather than one that does any 
language specific stemming or removal of stoplist words.

If there are documents where one language is clearly more dominant than the 
other, then it would probably be ok to use an Analyzer for that language and 
hope it doesn't affect the indexing of the other language too much. However, 
it's clear that you can't really accommodate multi-language documents. It 
would be much easier to ensure all docs were in a single language before 
indexing.

Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: getting the number of occurrences within a document

2005-04-14 Thread Andy Roberts
On Thursday 14 Apr 2005 15:15, Pablo Gomes Ludermir wrote:
> Hello all,
>
> I would like to get the following information from the index:
>
> 1. Given a term, how many times the term occurs in each document.
> Something like a triple:
> < Term, Doc1, Freq >, < Term, Doc2, Freq >, < Term, Doc3, Freq >, ...
>
> Is possible to do that?
>
>
> Regards,
> Pablo

Off the top of my head... assuming you have an IndexReader (or MultiReader) 
called reader:

TermEnum te = reader.terms();

while (te.next()) {
    Term currentTerm = te.term();

    TermDocs docs = reader.termDocs(currentTerm);
    int docCounter = 1;
    while (docs.next()) {
        System.out.println(currentTerm.text() + ", doc" + docCounter + ", " + docs.freq());
        docCounter++;
    }
}

HTH,

Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Best way to purposely corrupt an index?

2005-04-19 Thread Andy Roberts
Hi,

Seems like an odd request I'm sure. However, my application relies on an index, 
and should the index become unusable for some unfortunate reason, I'd like my 
app to gracefully cope with this situation.

Firstly, I need to know how to detect a broken index. Opening an IndexReader 
can potentially throw an IOException if a problem occurs, but presumably this 
will be thrown for other reasons, not just an unreadable index. Would the 
IndexReader.indexExists() be better?

Secondly, to test how my code responds to broken indexes, I'd like to 
purposely break an index. Any suggestions, or will removing any file from the 
directory be sufficient?

Many thanks,
Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Best way to purposely corrupt an index?

2005-04-20 Thread Andy Roberts
On Tuesday 19 Apr 2005 22:37, Daniel Herlitz wrote:
> I would suggest you simply do not create unusable indexes. :-)  

I agree! :) I am obviously very confident that my application is building 
indexes correctly. I'm thinking of the rarer instances whereby user or system 
error has caused a problem. These could be permission problems, bad sector 
forming on harddisk that happens to be the same sector containing an index, 
user accidentally removing a file in the index, a virus, etc. All seem pretty 
far fetched, but not impossible.

As the index is rather critical to my program, I just wanted to make it really 
robust, and able to cope should a problem occur with the index itself. 
Otherwise, the user will be left with a non-functioning program with no 
explanation. That's my reasoning anyway.

Andy


> Handle 
> catch/throw/finally correctly and it should not present any problems.
>
> Assume one app builds the index, another uses it:
>
> try: Build the index in a separate catalogue.
> finally: remove ('rm') production index and move ('mv') newly built
> index to its place. Notify using app that it should reopen its IndexReader.
>
> This relies on UNIX file handling semantics. (Can't say a word about
> Windows). Don't know if this applies at all to our situation, but it
> works for us.
>
> /D
>
> Andy Roberts wrote:
> >Hi,
> >
> >Seems like an odd request I'm sure. However, my application relies an
> > index, and should the index become unusable for some unfortunate reason,
> > I'd like my app to gracefully cope with this situation.
> >
> >Firstly, I need to know how to detect a broken index. Opening an
> > IndexReader can potentially throw an IOException if a problem occurs, but
> > presumably this will be thrown for other reasons, not just an unreadable
> > index. Would the IndexReader.indexExists() be better?
> >
> >Secondly, to test how my code responds to broken indexes, I'd like to
> >purposely break an index. Any suggestions, or will removing any file from
> > the directory be sufficient?
> >
> >Many thanks,
> >Andy
> >
> >-
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Best way to purposely corrupt an index?

2005-04-20 Thread Andy Roberts
On Wednesday 20 Apr 2005 08:27, Maik Schreiber wrote:
> > As the index is rather critical to my program, I just wanted to make it
> > really robust, and able to cope should a problem occur with the index
> > itself. Otherwise, the user will be left with a non-functioning program
> > with no explanation. That's my reasoning anyway.
>
> You should perhaps go about implementing an automatic index backup feature
> of some sort. In the case of index corruption you would at least be able to
> go back to the latest backup.

Don't worry, I know what I intend to do *should* an error exist. My original 
post was about how to detect corrupt indexes, and also how to purposely 
corrupt an index for the purposes of testing.

Note that IndexReader throws IOExceptions, but this could be for a multitude 
of reasons, not just a corrupt index. I was rather hoping for a 
CorruptIndexException of some sort!

It looks to me that if I do get an IOException, I will then have to perform a 
number of additional checks to eliminate the other possible causes of 
IOExceptions (such as permission issues), and by a process of elimination, 
determine that the index really is corrupt.
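For what it's worth, the detection flow I have in mind currently looks roughly 
like this (just a sketch; "indexPath" is whatever directory my app uses, and 
the exception handling is abbreviated):

// Rough sketch of the detection flow I'm considering.
boolean usable;
if (!IndexReader.indexExists(indexPath)) {
    usable = false;                    // no recognisable index there at all
} else {
    try {
        IndexReader reader = IndexReader.open(indexPath);
        reader.close();
        usable = true;                 // opened cleanly
    } catch (IOException e) {
        // Could be corruption, permissions, disk trouble... this is the
        // part where I'd need further checks to tell them apart.
        usable = false;
    }
}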

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Best way to purposely corrupt an index?

2005-04-21 Thread Andy Roberts
On Wednesday 20 Apr 2005 12:52, Kevin L. Cobb wrote:
> My policy on this type of exception handling is to only byte off what
> you can chew. If you catch an IOException, then you simply report to the
> user that an unexpected error has occurred and the search engine is
> unobtainable at the moment. Errors should be logged and developers
> should look at the specifics of the error to solve the issue. As you
> implied, either it's a corrupted index, a permission problem, or another
> access problem.


Of course, you are making the assumption that Lucene is only used in the 
context of online search engines. This is not the case here. I have developed 
a stand-alone application for text analysis, and I bundle the Lucene jar with 
it to store text in an efficient index. Once the software is on the users' 
computers, I don't want to be doing any maintenance of their indexes! (And I'm 
sure they'd prefer it that way too.)

Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Digester and simple XML files

2005-04-22 Thread Andy Roberts
Hi all,

Just been playing with Digester after reading chapter 7 in LIA. Seems to fit 
my needs as I have a relatively simple XML structure:

<text>
  <body>
    <p>some sentences</p>
    <p>some more sentences</p>
  </body>
</text>


Now, I want the text that's found between the <p> tags within the body 
section. So, I wrote a little test class using a largely bastardised version 
of DigesterXMLHandler from LIA.

I'm a tad confused, to say the least. I create a Paragraph object upon seeing 
text/body. Then I call setText when I see the next <p> tag. And that's all I 
want per object, so I've added the addSetNext rule to call printParagraph, 
which I hoped would print the previous paragraph's contents. But it doesn't!

My code looks so wrong but I've been hacking at it for a while with little 
fun. Any suggestions?

Thanks,
Andy

public class DigesterTest {

    private Digester dig;

    public DigesterTest(File inFile) throws IOException, SAXException {
        dig = new Digester();
        dig.setValidating(false);

        dig.addObjectCreate("text/body/", Paragraph.class);

        dig.addCallMethod("text/body/p", "setText", 0);

        dig.addSetNext("text/body/p", "printParagraph");

        System.out.println(inFile);
        dig.parse(inFile);
    }

    public void printParagraph(Paragraph p) {
        System.out.println(p.getText());
    }

    public static void main(String[] args) {
        try {
            new DigesterTest(new File(args[0]));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        }
    }

    public class Paragraph {

        private String text;

        public Paragraph() {
        }

        public String getText() {
            return text;
        }

        public void setText(String inText) {
            if (inText != null) {
                text = inText;
            }
        }
    }
}

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Retrieve all terms

2005-05-19 Thread Andy Roberts
On Thursday 19 May 2005 06:53, Morus Walter wrote:

> I think he doesn't want the contents but a term list for these contents.
> Something like
> 1   1
> 4   1
> content 2
> document  2
> for his sample, where the number is the fequency of the term.
>
> I don't think that you can easily get that from one lucene index.
> The easiest way to get a term listing for one field of one document is
> to use the term vector support. But for a document collection that would
> still mean to join all term vectors of all matched documents.

I'm not sure if this helps. I have a method that I used when experimenting 
with Lucene. It takes a String containing the path to an index, opens a 
reader, and enumerates each term in that index. For each term, it prints out 
the term itself, its total frequency, and the number of documents that term 
appears in.

public static void viewTerms(String indexPath) throws IOException {

    IndexReader reader = IndexReader.open(indexPath);

    TermEnum te = reader.terms();

    while (te.next()) {
        Term currentTerm = te.term();

        TermPositions tp = reader.termPositions(currentTerm);

        int termFreq = 0;

        while (tp.next()) {
            termFreq += tp.freq();
        }

        System.out.println(currentTerm.text() + "(" + termFreq
                + "|" + te.docFreq()
                + ")");
    }

    reader.close();
}

HTH,
Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-06-03 Thread Andy Roberts
On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
> For the StandardAnalyzer, will it have to be modified to accept
> different character encodings.
>
> We have customers in China, Taiwan and Hong Kong.  Chinese data may come
> in 3 different encoding:  Big5, GB and UTF8.
>
> What is the default encoding for the StandardAnalyser.

The analysers themselves do not worry about encodings, per se. Java uses 
Unicode strings throughout, which is sufficient for describing all languages. 
When reading in text files, it's a matter of telling the reader which encoding 
the file is in; this lets Java read in the text and essentially map that 
encoding onto Unicode. All the string operations, like analysing, are done on 
these Unicode strings.

So, the task is making sure the file reader you use to open a document for 
indexing is given the information it needs to decode your file correctly. If 
you don't specify an encoding, Java will use a default based on the locale 
your OS uses. For me, that's Latin-1, as I'm in Britain. This is clearly 
inadequate for non-Latin texts and wouldn't be able to read Chinese texts 
properly, as the Latin-1 encoding doesn't support such characters. You need to 
specify Big5 yourself. Read the info on InputStreamReaders:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStreamReader.html
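For example, a quick sketch (the file and field names here are just 
placeholders):

import java.io.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: open a Big5-encoded file with an explicit encoding, then hand the
// resulting Unicode Reader to Lucene (1.4-style Reader field).
public static Document makeDoc(File big5File) throws IOException {
    Reader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(big5File), "Big5"));
    Document doc = new Document();
    doc.add(Field.Text("contents", reader));
    return doc;
}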

Andy

>
> Btw, I did try running the lucene demo (web template) to index the HTML
> files after I added one including English and Chinese characters.  I was
> not able to search for any Chinese in that HTML file (returned no hits).
> I wonder whether I need to change some of the java programs to index
> Chinese and/or accept Chinese as search term.  I was able to search for
> the HTML file if I used English word that appeared in the added HTML
> file.
>
> Thanks,
>
> Bob
>
>
> On May 31, 2005, Erik wrote:
>
> Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
> will keep English as-is (removing stop words, lowercasing, and such)
> and separate CJK characters into separate tokens also.
>
>  Erik
>
> On May 31, 2005, at 5:49 PM, jian chen wrote:
> > Hi,
> >
> > Interesting topic. I thought about this as well. I wanted to index
> > Chinese text with English, i.e., I want to treat the English text
> > inside Chinese text as English tokens rather than Chinese text tokens.
> >
> > Right now I think maybe I have to write a special analyzer that takes
> > the text input, and detect if the character is an ASCII char, if it
> > is, assembly them together and make it as a token, if not, then, make
> > it as a Chinese word token.
> >
> > So, bottom line is, just one analyzer for all the text and do the
> > if/else statement inside the analyzer.
> >
> > I would like to learn more thoughts about this!
> >
> > Thanks,
> >
> > Jian
> >
> > On 5/31/05, Tansley, Robert <[EMAIL PROTECTED]> wrote:
> >> Hi all,
> >>
> >> The DSpace (www.dspace.org) currently uses Lucene to index metadata
> >> (Dublin Core standard) and extracted full-text content of documents
> >> stored in it.  Now the system is being used globally, it needs to
> >> support multi-language indexing.
> >>
> >> I've looked through the mailing list archives etc. and it seems it's
> >> easy to plug in analyzers for different languages.
> >>
> >> What if we're trying to index multiple languages in the same
> >> site?  Is
> >> it best to have:
> >>
> >> 1/ one index for all languages
> >> 2/ one index for all languages, with an extra language field so
> >> searches
> >> can be constrained to a particular language
> >> 3/ separate indices for each language?
> >>
> >> I don't fully understand the consequences in terms of performance for
> >> 1/, but I can see that false hits could turn up where one word
> >> appears
> >> in different languages (stemming could increase the changes of this).
> >> Also some languages' analyzers are quite dramatically different (e.g.
> >> the Chinese one which just treats every character as a separate
> >> token/word).
> >>
> >> On the other hand, if people are searching for proper nouns in
> >> metadata
> >> (e.g. "DSpace") it may be advantageous to search all languages at
> >> once.
> >>
> >>
> >> I'm also not sure of the storage and performance consequences of 2/.
> >>
> >> Approach 3/ seems like it might be the most c

Relative term frequency?

2005-06-06 Thread Andy Liu
Is there a way to calculate term frequency scores that are relative to
the number of terms in the field of the document?  We want to override
tf() in this way to curb keyword spamming in web pages.  In
Similarity, only the document's term frequency is passed into the tf()
method:

float tf(int freq)

It would be nice to have something like:

float tf(int freq, String fieldName, int numTerms)

If this isn't available out of the box, how difficult would it be to
hack up Lucene to allow for this?
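
The closest hook I've found so far is Similarity.lengthNorm(String fieldName, 
int numTokens), which does see the field name and token count, but it's folded 
into the norm at index time rather than applied per query. A rough sketch of 
abusing it (the field name and exponent are just made up for illustration):

import org.apache.lucene.search.DefaultSimilarity;

// Sketch: penalise long fields more steeply than the default
// 1/sqrt(numTokens), as a crude brake on keyword-stuffed pages.
// Index-time only, so it's not the per-query tf(freq, fieldName, numTerms)
// we'd really like.
public class LengthAwareSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTokens) {
        if ("contents".equals(fieldName)) {          // hypothetical field name
            return (float) (1.0 / Math.pow(numTokens, 0.7));
        }
        return super.lengthNorm(fieldName, numTokens);
    }
}

It would need to be set on both the IndexWriter and the Searcher, and the 
index rebuilt, which is why a query-time hook would be nicer.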

Thanks,
Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!
>

I expect you'll need to do some pre-processing. Read your text into a buffer, 
line by line. If a given line ends with a hyphen, you can manipulate the 
buffer to merge the hyphenated tokens.

Andy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
> On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated than I
> > > thought... Thanks a lot, Erik!
> >
> > I expect you'll need to do some pre-processing. Read in your text into a
> > buffer, line-by-line. If a given line ends with a hyphen, you can
> > manipulate the buffer to merge the hyphenated tokens.
>
> As Erik wrote it is not that simple, unfortunately. For example, if
> one line ends with "read-" and the next line begins with "only" the
> correct word is "read-only" not "readonly". Whereas "work-" and "ing"
> should of course be merged into "working".
>
> Markus

Perhaps you could do some crude checking against a dictionary. Combine the 
word anyway and check whether it's in the dictionary. If so, keep it merged; 
otherwise it's a compound, so revert to the hyphenated form.

Word lists come as part of all good OSS dictionary projects, as well as other 
language resources, like the BNC word lists etc.
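
Something along these lines, as a rough sketch (the dictionary is whatever 
word list you load, assumed lower-cased here):

import java.util.Set;

// Sketch: decide whether a line-end hyphen is a soft line break or part of a
// real compound, by checking the merged form against a dictionary.
public static String joinHyphenated(String endOfLine, String startOfNext, Set dictionary) {
    // "work-" + "ing" -> candidate "working"; "read-" + "only" -> "readonly"
    String merged = endOfLine.substring(0, endOfLine.length() - 1) + startOfNext;
    if (dictionary.contains(merged.toLowerCase())) {
        return merged;                      // ordinary word split across lines
    }
    return endOfLine + startOfNext;         // keep the hyphen: a genuine compound
}

It won't get every case right (some compounds also form dictionary words once 
merged), but it should cover the common ones.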

Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: n-gram indexing

2005-07-18 Thread Andy Roberts
On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote:
> At what point do I add n-grams? Does the order in which I add n-grams
> affect exact phrase queries later? My questions are
>
> (1) Should I add all the 1-grams followed by 2-grams followed by
> 3-grams..etc sentence by sentence OR
>
> (2) Add all the 1 grams of entire document first before starting 2-grams
> for the entire document?
>
> What is the general accepted notion of adding n-grams of a document?
>
> thanks,
>
> Rajesh

I can't see any real advantage in storing n-grams explicitly. Just index the 
document and use phrase queries. Order is significant with phrase queries, if 
I recall correctly, although you can use SpanNearQueries to look for unordered 
n-grams, though I don't know why you would want to!

Perhaps if you explain a little more about what you are trying to achieve more 
generally, we can confirm that you don't need to mess with explicit indexing 
of n-grams.

Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: n-gram indexing

2005-07-18 Thread Andy Roberts
On Monday 18 Jul 2005 22:06, Rajesh Munavalli wrote:
> Intution behind adding n-grams is to boost naturally occurring larger
> phrases versus using phrase queries. For example, if I am searching for
> "united states of america", I want the search results to return the
> documents ordered as follows
>
> Rank 1 - Documents containing all the words occurring together
> Rank 2 - Documents containing maximum number of words in the same
> sentence
> Rank 3 - Documents containing all the words but some might appear in the
> same sentence some may not
> Rank 4 - Documents containig atleast one or two words
>
> If we have a n-gram index, most probably document talking about "united
> states" gets preference over document containing "united" and "states"
> seperately. If I am correct, this can be achieved without using phrase
> queries. I am not sure if there is a better way to achieve the same
> effect.
>

I don't think n-grams will help either. You could perform a set of individual 
queries. Firstly, run the phrase query to find hits with the exact phrase; 
then run a SpanNear query to find the docs with the terms close to each other; 
thirdly, do a boolean AND query for all terms; and fourthly, run a boolean OR 
query. It requires a little extra processing, of course, as you are 
technically executing 4 queries in 1, and it only has to be done when there is 
more than one term in the search query. Also, there is obviously going to be 
some duplication of hits, so you could use a HashMap (or Set) of document ids 
when iterating over the Hits to make sure you only keep unique hits as the 
queries are collated.
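
A rough sketch of what I mean, using the Lucene 1.4-style 
BooleanQuery.add(query, required, prohibited) and assuming an open 
IndexSearcher called searcher and a "contents" field:

// Sketch: four progressively looser queries, collating unique hits in order.
String[] words = { "united", "states", "of", "america" };

PhraseQuery phrase = new PhraseQuery();             // 1. exact phrase
SpanQuery[] spans = new SpanQuery[words.length];
BooleanQuery allTerms = new BooleanQuery();         // 3. all terms, any order
BooleanQuery anyTerm = new BooleanQuery();          // 4. at least one term
for (int i = 0; i < words.length; i++) {
    Term t = new Term("contents", words[i]);
    phrase.add(t);
    spans[i] = new SpanTermQuery(t);
    allTerms.add(new TermQuery(t), true, false);    // required
    anyTerm.add(new TermQuery(t), false, false);    // optional
}
SpanNearQuery near = new SpanNearQuery(spans, 10, false);  // 2. near, unordered

Query[] stages = { phrase, near, allTerms, anyTerm };
Set seen = new HashSet();                           // doc ids already collected
for (int s = 0; s < stages.length; s++) {
    Hits hits = searcher.search(stages[s]);
    for (int h = 0; h < hits.length(); h++) {
        if (seen.add(new Integer(hits.id(h)))) {
            // collect hits.doc(h) at rank band s
        }
    }
}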

Andy

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: A very technical question.

2005-09-28 Thread Andy Liu
While you're indexing, you can give each doc a field that records how long the
document is. For example, you can add a field named "docLength" to each
document and assign it discrete values such as "veryshort", "short", "medium",
"long" and "verylong", depending on how granular you need it. Then at query
time you can boost on that field, e.g.

civil war docLength:verylong^5 docLength:long^3
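
At index time that might look roughly like this (Lucene 1.4-style
Field.Keyword; the bucket thresholds are made up):

// Sketch: bucket the document length into a non-tokenized keyword field.
static String lengthBucket(int numTokens) {
    if (numTokens < 100)  return "veryshort";
    if (numTokens < 500)  return "short";
    if (numTokens < 2000) return "medium";
    if (numTokens < 8000) return "long";
    return "verylong";
}

// ...while building each Document:
doc.add(Field.Keyword("docLength", lengthBucket(numTokens)));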

Andy

On 9/28/05, Dawid Weiss <[EMAIL PROTECTED]> wrote:
>
>
> Hi.
>
> I have a very technical question. I need to alter document score (or in
> fact: document boosts) for an existing index, but for each query. In
> other words, I'd like these to have pseudo-queries of the form:
>
> 1. civil war PREFER:shorter
> 2. civil war PREFER:longer
>
> for these two queries, 1. would score shorter documents higher then
> option 2, which would in turn score longer documents higher. Note that
> these preferences can be expressed at query time, so static document
> boosts are of little help.
>
> I'd appreciate if those familiar with the internals of Lucene gave me
> brief instructions on how this could be achieved (my rough guess is that
> I'll need to build my own Scorer... but how to access document length
> and where to plug in that scorer... besides I'd rather hear it from
> somebody with more expertise).
>
> Thanks,
> D.
>
> -----
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


--
Andy Liu
[EMAIL PROTECTED]
(301) 873-8458


Query to return all documents in the index

2005-10-05 Thread Andy Goodell
Hi,
In my project we've been using the Searcher.search(query, filter, sort)
method to gather results. But as it turns out, sometimes we just want all of
the documents that match the filter, sorted by the sort field. Does anyone
know of a query that returns all the documents in the index, so that I could
use it in this case?
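
The only workaround I've come up with so far is to index a constant marker
field into every document and run a TermQuery on it, so the filter and sort do
the real work (sketch below; the field name "all" is made up):

// At index time, every document gets the same marker field:
doc.add(Field.Keyword("all", "1"));

// At search time, a TermQuery on that field matches every document:
Query matchAll = new TermQuery(new Term("all", "1"));
Hits hits = searcher.search(matchAll, filter, sort);

It works, but it feels like a hack, so a cleaner approach would be welcome.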

thanks,
andy g


Do you believe in Clause sanity?

2005-10-13 Thread Andy Lee
The API for BooleanQuery only seems to allow adding clauses.  The  
nearest way I can see to *remove* a clause is by laboriously  
constructing a new BooleanQuery (assuming you aren't absolutely tied  
to the original instance) and adding all the clauses from the  
original query except the one you're removing.  And *that's* rather  
cumbersome because you can't actually add a clause; you have to use  
one of the addRequired-/addProhibited- methods -- and they take  
arrays of String rather than the array of Term that you can get from  
a Clause.


It seems reasonable to me to want to remove clauses from a query.  Is  
there some reasonable way of doing this that I'm missing?


--Andy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Do you believe in Clause sanity?

2005-10-13 Thread Andy Lee
Oops, I'm confusing libraries.  I meant I want to remove a Nutch  
Clause from a Nutch Query.


--Andy

On Oct 13, 2005, at 4:45 PM, Andy Lee wrote:

The API for BooleanQuery only seems to allow adding clauses.  The  
nearest way I can see to *remove* a clause is by laboriously  
constructing a new BooleanQuery (assuming you aren't absolutely  
tied to the original instance) and adding all the clauses from the  
original query except the one you're removing.  And *that's* rather  
cumbersome because you can't actually add a clause; you have to use  
one of the addRequired-/addProhibited- methods -- and they take  
arrays of String rather than the array of Term that you can get  
from a Clause.


It seems reasonable to me to want to remove clauses from a query.   
Is there some reasonable way of doing this that I'm missing?


--Andy


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]