date:20050321

Re: Removing similar documents from search results

2005-03-21 Thread Miles Barr

On Sun, 2005-03-20 at 00:49 -0800, Chris Hostetter wrote:
> Actually, your "Split across several pages" comment implies that you want
> a system which can tell that page 1 of a multipage article should be
> grouped with page 2 -- which may be radically different content.  Most
> multipage documents have very different text on subsequent pages, so I'm
> not sure that a programmatic solution is going to be able to spot that.

Actually I added that in after I saw that Google does it. You're right
that the content is likely to be completely different, so I guess they do
it through some URL matching.

> I may also be reading too much into your message, but it sounds like you
> aren't trying to index generic content -- it sounds like you are trying to
> index content under your control (ie: content on your own web site).
> 
> If that's the case, then presumably you know something about the
> source data and the URL structure -- maybe you could solve this problem
> when you build your index.
> 
> for example, if i look at a site like perl.com, i can see a pattern in the
> way the article URLs look...
> 
> page 1...
> http://www.perl.com/pub/a/2005/02/17/3d_engine.html
> page 2, etc...
> http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
> printable...
> http://www.perl.com/lpt/a/2005/02/17/3d_engine.html
> 
> 
> So instead of putting all of those URLs in the index as separate docs, why
> not create a single doc, with all of those URLs?
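
To illustrate the URL-pattern idea, here is a minimal sketch that maps the three perl.com URL forms quoted above onto one grouping key. The class name UrlGrouper, the method canonicalKey, and both rewrite rules (ignore the query string, treat /lpt/ as /pub/) are assumptions inferred only from these three sample URLs; a real site would need its own rules.

```java
import java.net.URI;

public class UrlGrouper {
    /**
     * Maps "page N" and "printable" variants of an article URL to one key.
     * Both rules below are assumptions based on the sample URLs only.
     */
    public static String canonicalKey(String url) {
        URI uri = URI.create(url);
        String path = uri.getPath();
        // Assumption: the printable copy lives under /lpt/ instead of /pub/.
        if (path.startsWith("/lpt/")) {
            path = "/pub/" + path.substring("/lpt/".length());
        }
        // The query string (for example ?page=2) is ignored entirely.
        return uri.getHost() + path;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.perl.com/pub/a/2005/02/17/3d_engine.html",
            "http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2",
            "http://www.perl.com/lpt/a/2005/02/17/3d_engine.html"
        };
        for (String url : urls) {
            System.out.println(canonicalKey(url) + " <- " + url);
        }
    }
}
```

All three URLs yield the same key, so an indexer could collect them into a single document keyed by that string.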

I have to index several sites and I used some examples of the problems
I've come across so far. I don't control the content for any of them,
and they get picked up by a spider so excluding pages requires adding
special cases.

I'll probably adopt a two stage approach.

1. Prevent duplicate documents from getting into the index in the first
place, e.g. compare MD5 hashes and file sizes, maybe make the spider
configurable to spot certain URL patterns, etc.

2. Try out the various techniques suggested in this thread to spot
similar pages at query time and hide them.
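
Stage 1 could look roughly like the sketch below, which remembers a size-plus-MD5 key for every page the spider has fetched. The class name DuplicateFilter and its methods are hypothetical, and this only catches byte-identical content; near-duplicates still need the query-time techniques from stage 2.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class DuplicateFilter {
    // Keys of the form "<length>:<md5 hex>" for every content seen so far.
    private final Set<String> seen = new HashSet<String>();

    /** Returns true the first time a content is offered, false for exact repeats. */
    public boolean isNew(byte[] content) {
        return seen.add(content.length + ":" + md5Hex(content));
    }

    /** Hex-encoded MD5 digest of the given bytes. */
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        DuplicateFilter filter = new DuplicateFilter();
        byte[] page = "same page body".getBytes(StandardCharsets.UTF_8);
        System.out.println(filter.isNew(page));         // first sighting
        System.out.println(filter.isNew(page.clone())); // exact repeat
    }
}
```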

-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Multiple Field Queries

2005-03-21 Thread Gusenbauer Stefan

Hello,
at the moment I cannot search the mailing list archives, so I will
bother you. I want to search over multiple fields, for example content and
filename. The MultiFieldQueryParser is not applicable for me, so I create
the query programmatically. The query string is parsed with the
QueryParser; in this example I use it twice, once for content and once
for filename. Then I combine the resulting queries with BooleanQuery.add;
the resulting string is, for example, +content:test +filename:test. The
problem is that I would like to construct a query like (+content:test)
OR (+filename:test). Is the only alternative to convert the boolean query
to a string, do some string operations on it, and pass it
through the QueryParser again?
Thanks
Stefan


Re: Multiple Field Queries

2005-03-21 Thread Aad Nales

Perhaps I misunderstand, but it seems to me that if you call add with
false for both boolean arguments (required and prohibited), you will end
up with the required result.

(content:test) (filename:test)
which is equivalent to your requested query.
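
To make the effect of the two flags concrete, here is a toy model in plain Java. It is not Lucene code: ClauseModel and its methods are hypothetical names, and documents are reduced to sets of "field:term" strings. It only mirrors the rule that required clauses must all match, prohibited clauses must not match, and when nothing is required at least one optional clause has to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClauseModel {
    /** One clause: a "field:term" string plus the two flags given to add(). */
    private static final class Clause {
        final String term;
        final boolean required;
        final boolean prohibited;

        Clause(String term, boolean required, boolean prohibited) {
            this.term = term;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    private final List<Clause> clauses = new ArrayList<Clause>();

    public void add(String term, boolean required, boolean prohibited) {
        clauses.add(new Clause(term, required, prohibited));
    }

    /** True if a document containing exactly these terms satisfies all clauses. */
    public boolean matches(Set<String> docTerms) {
        boolean sawRequired = false;
        boolean optionalHit = false;
        for (Clause c : clauses) {
            boolean hit = docTerms.contains(c.term);
            if (c.prohibited) {
                if (hit) return false;      // a prohibited term rules the doc out
            } else if (c.required) {
                sawRequired = true;
                if (!hit) return false;     // every required term must be present
            } else if (hit) {
                optionalHit = true;         // optional clauses behave like OR
            }
        }
        return sawRequired || optionalHit;
    }

    public static void main(String[] args) {
        ClauseModel and = new ClauseModel();
        and.add("content:test", true, false);
        and.add("filename:test", true, false);

        ClauseModel or = new ClauseModel();
        or.add("content:test", false, false);
        or.add("filename:test", false, false);

        // A document that matches in the content field only.
        Set<String> doc = new HashSet<String>(Arrays.asList("content:test"));
        System.out.println("+content:test +filename:test -> " + and.matches(doc));
        System.out.println("content:test filename:test   -> " + or.matches(doc));
    }
}
```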
hope this helps,
Aad Nales
Gusenbauer Stefan wrote:
> Hello,
> at the moment I cannot search the mailing list archives, so I will
> bother you. I want to search over multiple fields, for example content
> and filename. The MultiFieldQueryParser is not applicable for me, so I
> create the query programmatically. The query string is parsed with the
> QueryParser; in this example I use it twice, once for content and once
> for filename. Then I combine the resulting queries with
> BooleanQuery.add; the resulting string is, for example, +content:test
> +filename:test. The problem is that I would like to construct a query
> like (+content:test) OR (+filename:test). Is the only alternative to
> convert the boolean query to a string, do some string operations on it,
> and pass it through the QueryParser again?
> Thanks
> Stefan





RE: how to detect index integrity?

2005-03-21 Thread Ravi Rao

> From: [EMAIL PROTECTED]
> Sent: Fri 3/18/2005 11:34 PM

> Is there any way to detect the index's integrity?
> Sometimes I came upon exceptions like these. If it happens, my only way 
> is to delete the corrupted index.

>* Exception in thread "main" java.io.IOException : read past EOF
>* java.lang.ArrayIndexOutOfBoundsException

> [ ... ]

I did too, which is why I wrote NullDirectory.  You can find the
sources and a description in bugzilla.

   http://issues.apache.org/bugzilla/show_bug.cgi?id=33851

Look at the tests for examples of use.  I would value your feedback.
-- 
Ravi/


Re: Multiple Field Queries

2005-03-21 Thread Gusenbauer Stefan

Aad Nales wrote:
> Perhaps I misunderstand, but it seems to me that if you call add with
> false for both boolean arguments (required and prohibited), you will end
> up with the required result.
>
> (content:test) (filename:test)
> which is equivalent to your requested query.
> hope this helps,
> Aad Nales
>
> Gusenbauer Stefan wrote:
>> Hello,
>> at the moment I cannot search the mailing list archives, so I will
>> bother you. I want to search over multiple fields, for example content
>> and filename. The MultiFieldQueryParser is not applicable for me, so I
>> create the query programmatically. The query string is parsed with the
>> QueryParser; in this example I use it twice, once for content and once
>> for filename. Then I combine the resulting queries with
>> BooleanQuery.add; the resulting string is, for example, +content:test
>> +filename:test. The problem is that I would like to construct a query
>> like (+content:test) OR (+filename:test). Is the only alternative to
>> convert the boolean query to a string, do some string operations on it,
>> and pass it through the QueryParser again?
>> Thanks
>> Stefan

Thanks,
add(query,false,false) works now. The earlier failure occurred because I
had added a field to all documents for searching with a datefile;
therefore all documents were always returned.
Stefan


Re: new added documents not showing

2005-03-21 Thread roy-lucene-user

On Sat, 19 Mar 2005 22:43:44 +0300, Pasha Bizhan <[EMAIL PROTECTED]> wrote:
> Could you provide the code snippets for your process?
> 

Sure (thanx for helping, btw)

I just realized that the way I described our process was off a little bit.

Here's the process again:

1.  grab all index Directorys (index parts)
2.  loop newest to oldest and make documents unique (by deleting older 
documents)
3.  get list of documents from index parts to delete from our main index
4.  delete documents from main index
5.  add all documents from index parts into the main index

I apologize for the amount of code below.

Here is the code that loops through all the index parts, from newest to oldest, 
and then deletes the documents from any older index parts.

The unique ID we use as a Key Field is "ReceivedDate".

IndexReader reader = null;
IndexReader reader2 = null;

try {
    /*
     * Loop backwards (latest to oldest) through parts
     */
    for ( int i = ( directories.length - 1 ); i >= 0; i-- ) {
        reader = IndexReader.open( FSDirectory.getDirectory( directories[i], false ) );
        int numDocuments = reader.numDocs();

        /*
         * Loop forward (oldest to latest) up to the current part
         * being looked at.
         * Delete any messages from the older parts that exist in the
         * current part.
         */
        for ( int x = 0; x < i; x++ ) {
            String partName = directories[x].getName();
            reader2 = IndexReader.open( FSDirectory.getDirectory( directories[x], false ) );

            for ( int h = 0; h < numDocuments; h++ ) {
                if ( !reader.isDeleted( h ) ) {
                    Document d = reader.document( h );
                    String receivedDate = d.get( "ReceivedDate" );
                    Term term = new Term( "ReceivedDate", receivedDate );
                    int num = reader2.delete( term );
                }
            }

            reader2.close();
            reader2 = null;
        }

        reader.close();
        reader = null;
    }
}
catch ( Exception e ) {
    // log error
}
finally {
    try {
        if ( reader != null ) reader.close();
        if ( reader2 != null ) reader2.close();
    }
    catch ( IOException e ) {
        // log error
    }
}

Here we build up a list of ReceivedDates to help us delete from the main.index. 
 I just realized that we could build this list from the previous section.

List list = new ArrayList();
for ( int i = 0; i < directories.length; i++ ) {
    IndexReader r = null;
    try {
        r = IndexReader.open( directories[i] );
        int num = r.numDocs();

        for ( int x = 0; x < num; x++ ) {
            if ( !r.isDeleted( x ) ) {
                Map map = new HashMap();
                Document d = r.document( x );
                map.put( "ReceivedDate", d.get( "ReceivedDate" ) );
                list.add( map );
            }
        }
    }
    catch ( Exception e ) {
        e.printStackTrace();
    }
    finally {
        if ( r != null ) try { r.close(); } catch ( Exception e ) {}
    }
}
return list;

Here we actually go through and delete the documents from the main index.

IndexReader reader = null;

Map message;
try {
    reader = IndexReader.open( mainindex );
    Iterator it = indexList.iterator(); // returned from previous section

    /*
     * Loop through messages to clear from the index
     */
    while ( it.hasNext() ) {
        message = (Map)it.next();

        /*
         * Delete based on received date
         */
        String receivedDate = (String)message.get( "ReceivedDate" );
        Term term = new Term( "ReceivedDate", receivedDate );
        int num = reader.delete( term );

Re: NumberTools

2005-03-21 Thread Chris Hostetter

: One annoyance I have run across is the impedance mismatch between
: range queries and sorting.
:
: If your terms are  indexed as standard numbers, then integer sorting
: is fast, but range queries don't work (for negative values).  If you
: format the terms such that range queries work for any integer, then
: you have to use the slower string (or custom) sorting.
:
: Is there a way around this besides writing my own custom sorting hit collector?

yeah, this is something that's never really made sense to me. I've tried
digging into the code to understand this a couple of times, but I've never
had much success; maybe my assumptions/understanding are wrong...

   1) Lucene stores all fields as Strings
   2) You can construct a "Sort" object with a SortField of type "INT"
   3) according to tribal wisdom (and Lucene in Action) sorting by a
      numeric field caches the numeric value and is more efficient than
      sorting by a string field (in which the string value needs to be cached)

1+2+3 tells me that at some point, when the search/sort code sees a
SortField of type "INT" (or of type AUTO and the value of that field in
the first doc looks like an INT), a single pass is done to convert
the string value of the field from disk into a numeric value for caching
(and sorting).

 So why couldn't a user-specified NumberFormat object be used to
 convert that string into an Integer?  Allowing people to format
 their numbers in a way that sorts lexicographically for Range Filters,
 but still get the good numeric sort efficiency?


I can see in FieldDocSortedHitQueue where the case statement deals with
the various types of SortField, but at that point it's comparing FieldDoc
objects whose fields[i] is expected to already be an "Integer" object.
Where is that "Integer" object parsed from the String value of the field?
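
For readers wondering what "a format that sorts lexicographically" could look like, here is one possible sketch. The class SortableInt and its encoding (flip the sign bit, print as 8 lowercase hex digits) are assumptions chosen for illustration; this is not the NumberTools format, and nothing here is wired into Lucene's sort code. The point is only that such a string can be parsed back into an int in one pass.

```java
public class SortableInt {
    /** Encodes an int as 8 hex digits whose String order equals numeric order. */
    public static String encode(int value) {
        // Flipping the sign bit maps Integer.MIN_VALUE..MAX_VALUE onto the
        // unsigned range 0..2^32-1, which fixed-width hex orders correctly.
        return String.format("%08x", value ^ 0x80000000);
    }

    /** Reverses encode(): parses the hex digits and flips the sign bit back. */
    public static int decode(String term) {
        return Integer.parseUnsignedInt(term, 16) ^ 0x80000000;
    }

    public static void main(String[] args) {
        int[] samples = { Integer.MIN_VALUE, -10, -1, 0, 1, 10, Integer.MAX_VALUE };
        for (int v : samples) {
            String term = encode(v);
            System.out.println(v + " -> " + term + " -> " + decode(term));
        }
    }
}
```

Because every term has the same width, comparing the encoded strings gives the same answer as comparing the integers, including for negative values.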



-Hoss



RE: new added documents not showing

2005-03-21 Thread Pasha Bizhan

Hi, 

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 

> I just realized that the way I described our process was off 
> a little bit.
> 
> Here's the process again:
> 
> I apologize for the amount of code below.

When do you open the index writer? Where is the code?

Pasha Bizhan
http://lucenedotnet.com
 




Re: NumberTools

2005-03-21 Thread Chuck Williams


: One annoyance I have run across is the impedance mismatch between
: range queries and sorting.
:
: If your terms are  indexed as standard numbers, then integer sorting
: is fast, but range queries don't work (for negative values).  If you
: format the terms such that range queries work for any integer, then
: you have to use the slower string (or custom) sorting.
:
: Is there a way around this besides writing my own custom sorting hit collector?
 

I solve this problem by using two separate fields:  one for range 
queries and one for sorting, each formatted appropriately.  Adds very 
little space to the index.  A bit ugly, but better than writing a custom 
hit collector.
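
A rough sketch of the two-field idea in plain Java, without any Lucene classes. The field names price_sort and price_range, the class TwoFieldNumber, and the zero-padded offset encoding are all hypothetical; the sketch only shows why the plain form breaks string range checks for negative values while the padded form does not.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TwoFieldNumber {
    /** Fixed-width form: value shifted to be non-negative, zero-padded to 10 digits. */
    public static String rangeForm(int value) {
        return String.format("%010d", (long) value - Integer.MIN_VALUE);
    }

    /** The two field values that would be indexed for one number. */
    public static Map<String, String> fieldsFor(int value) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("price_sort", Integer.toString(value)); // parseable for int sorting
        fields.put("price_range", rangeForm(value));       // safe for string ranges
        return fields;
    }

    /** Inclusive range check using plain String comparison, as a term range does. */
    public static boolean inRange(String term, String lower, String upper) {
        return lower.compareTo(term) <= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor(-7));
        // Is -7 inside [-10, -5]?
        System.out.println("plain:  " + inRange("-7", "-10", "-5"));
        System.out.println("padded: "
            + inRange(rangeForm(-7), rangeForm(-10), rangeForm(-5)));
    }
}
```

With plain decimal strings, "-7" compares greater than "-5", so the range check wrongly excludes it; the padded form gives the expected answer.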

A better solution that unified these formats, perhaps along the lines 
Hoss suggests, would be appreciated.

Chuck


Re: new added documents not showing

2005-03-21 Thread roy-lucene-user

> When do you open the index writer? Where is the code?

Ah, sorry.  That last section is in a method that gets called in a loop.

IndexWriter writer = null;
try {
    writer = new IndexWriter( mainindex, new StandardAnalyzer(), false );
    for ( int i = 0; i < directories.length; i++ ) {
        moveDocumentsOver( writer, directories[i] );
        // delete dir
    }
}
catch ( Exception e ) {
    // log error
}
finally {
    if ( writer != null ) try { writer.close(); } catch ( Exception e ) {}
}

Roy.


RE: new added documents not showing

2005-03-21 Thread Pasha Bizhan

Hi, 

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 

> Ah, sorry.  That last section is in a method that gets called 
> in a loop.

The shortest version of your code is:
-
 void mainFunction() {
     IndexWriter writer = null;
     writer = new IndexWriter( mainindex, new StandardAnalyzer(), false );
     moveDocumentsOver( writer, oldDirectory );
     writer.close();
 }

 void moveDocumentsOver( IndexWriter writer, String oldDirectory ) {
     IndexReader r = null;
     r = IndexReader.open( oldDirectory );
     int num = r.numDocs();
     for ( int i = 0; i < num; i++ ) {
         if ( !r.isDeleted( i ) ) {
             Document d = r.document( i );
             Document nd = new Document();
             // fill nd by d
             writer.addDocument( nd );
         }
     }
     r.close();
 }
-

And then you execute the search (using mainindex) and you don't see the new
documents. Yes?

Pasha Bizhan
http://lucenedotnet.com




Re: new added documents not showing

2005-03-21 Thread roy-lucene-user

Correct. We also can't see the new documents when we open an IndexReader on
the main index.

Roy.



using Expression language for lucene api

2005-03-21 Thread Omar Didi

I have the following expression : 



results is of type Hits. I want to know if there is a way, using Expression
Language or JSTL, to access, for example, result.doc(i).

boosting?

2005-03-21 Thread Stefan Groschupf

Hi there,
How can I get the real boost value of a field or document?
The Javadoc says that it _may_ not be returned correctly when reading a
document with an index reader.
Any hints on how to get the boost when reading a document?
Thanks.
Stefan 


Re: boosting?

2005-03-21 Thread Erik Hatcher

Stefan,
Boosts are not necessarily stored directly.  Each field has an
associated normalization factor, into which the boost is multiplied.
This value is precomputed at indexing time, so getting the boost back
isn't possible unless the length normalization is 1.0 (which is not
usually a good idea).

Erik
On Mar 21, 2005, at 4:35 PM, Stefan Groschupf wrote:
> Hi there,
> How can I get the real boost value of a field or document?
> The Javadoc says that it _may_ not be returned correctly when reading a
> document with an index reader.
> Any hints on how to get the boost when reading a document?
> Thanks.
> Stefan


Re: new added documents not showing

2005-03-21 Thread Otis Gospodnetic

Hello,

Sorry if this is stating the obvious, but have you used Luke to verify
that the new documents were indexed in the first place?  Sorry if
you've already mentioned this.

Otis


--- [EMAIL PROTECTED] wrote:
> > When do you open the index writer? Where is the code?
> 
> Ah, sorry.  That last section is in a method that gets called in a
> loop.
> 
> IndexWriter writer = null;
> try {
>     writer = new IndexWriter( mainindex, new StandardAnalyzer(), false );
>     for ( int i = 0; i < directories.length; i++ ) {
>         moveDocumentsOver( writer, directories[i] );
>         // delete dir
>     }
> }
> catch ( Exception e ) {
>     // log error
> }
> finally {
>     if ( writer != null ) try { writer.close(); } catch ( Exception e ) {}
> }
> 
> Roy.
> 

Re: using Expression language for lucene api

2005-03-21 Thread Otis Gospodnetic

I think there are some taglibs that let you call functions on objects,
but you could also consider wrapping Hits in something that is JSTL
friendly, perhaps a List that JSTL knows how to handle.
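
A possible shape for such a wrapper, sketched in plain Java. The interface IndexedResults is a hypothetical stand-in for the two Hits methods needed (length() and doc(int)), so the sketch compiles without Lucene; an adapter over the real Hits class would follow the same pattern.

```java
import java.io.IOException;
import java.util.AbstractList;
import java.util.List;

public class HitsListAdapter {
    /** Hypothetical stand-in for the parts of Hits used here. */
    public interface IndexedResults<T> {
        int length();
        T doc(int n) throws IOException;
    }

    /** Exposes the results as a read-only List that tag libraries can iterate. */
    public static <T> List<T> asList(final IndexedResults<T> results) {
        return new AbstractList<T>() {
            @Override
            public T get(int index) {
                if (index < 0 || index >= size()) {
                    throw new IndexOutOfBoundsException("index: " + index);
                }
                try {
                    return results.doc(index); // loaded lazily, on access
                } catch (IOException e) {
                    throw new IllegalStateException("could not load result " + index, e);
                }
            }

            @Override
            public int size() {
                return results.length();
            }
        };
    }

    public static void main(String[] args) {
        final String[] titles = { "first", "second" };
        List<String> list = asList(new IndexedResults<String>() {
            public int length() { return titles.length; }
            public String doc(int n) { return titles[n]; }
        });
        for (String title : list) {
            System.out.println(title);
        }
    }
}
```

Placed in request scope, such a list could be walked with a standard forEach tag instead of calling doc(i) from the page.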

Otis

--- Omar Didi <[EMAIL PROTECTED]> wrote:
> I have the following expression : 
> 
> 
> 
> results is of type Hits, i want to know if there is a way using
> Expression language or jstl to access for example: result.doc(i).
> 
>  
> 
> 


Re: boosting?

2005-03-21 Thread Paul Elschot

Stefan,

On Monday 21 March 2005 22:35, Stefan Groschupf wrote:
> Hi there,
> how to get the real boost value of a field or document?
> The Javadoc says that it _may_ not be returned correctly when reading a
> document with an index reader.
>   Any hints how to get the boost when reading a document?

The javadoc of Field.setBoost() has meanwhile been extended a bit
(source from the trunk at
http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/document/):

   * The boost is multiplied by {@link Document#getBoost()} of the document
   * containing this field.  If a document has multiple fields with the same
   * name, all such values are multiplied together.  This product is then
   * multiplied by the value {@link Similarity#lengthNorm(String,int)}, and
   * rounded by {@link Similarity#encodeNorm(float)} before it is stored in the
   * index.  One should attempt to ensure that this product does not overflow
   * the range of that encoding.

One feature of Similarity.encodeNorm(float) is that it returns a byte, so
at most 256 different values can be stored, which is a lot less than
the number of possible floating point values. 
encodeNorm() rounds to a representable value close to the given float,
and decodeNorm() returns that representable value, normally
used in TermScorer.
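
The loss of precision can be seen with a small one-byte float format. The sketch below is an assumption for illustration only: NormByteSketch, its 5-bit exponent and 3-bit mantissa layout, and the method names are hypothetical, and the actual encodeNorm() implementation may differ in detail. What it shows is that nearby floats collapse onto the same byte, so the original boost cannot be read back exactly.

```java
public class NormByteSketch {
    /** Packs a non-negative float into one byte (lossy). */
    public static byte encode(float f) {
        if (f <= 0.0f) return 0;
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;          // keep 3 bits
        int exponent = ((bits >> 24) & 0x7f) - 63 + 15;  // rebias to 0..31
        if (exponent > 31) { exponent = 31; mantissa = 7; } // clamp overflow
        if (exponent < 0) { exponent = 0; mantissa = 1; }   // clamp underflow
        return (byte) ((exponent << 3) | mantissa);
    }

    /** Expands the byte back to the representable float it stands for. */
    public static float decode(byte b) {
        if (b == 0) return 0.0f;
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = { 1.0f, 1.1f, 1.5f, 2.0f };
        for (float f : samples) {
            System.out.println(f + " is stored as " + decode(encode(f)));
        }
    }
}
```

In this sketch 1.0 and 1.1 map to the same byte, which is the kind of rounding that makes the stored value differ from the boost originally set.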

Regards,
Paul Elschot


Re: NumberTools

2005-03-21 Thread John Patterson

Chris Hostetter writes:

> 
>  So why couldn't a user-specified NumberFormat object be used to
>  convert that string into an Integer?  Allowing people to format
>  their numbers in a way that sorts lexicographically for Range Filters,
>  but still get the good numeric sort efficiency?
> 
> I can see in FieldDocSortedHitQueue where the case statement deals with
> the various types of SortField, but at that point it's comparing FieldDoc
> objects whose fields[i] is expected to already be an "Integer" object.
> Where is that "Integer" object parsed from the String value of the field?
> 

Surely, by using the number -> string algorithm I showed earlier this would not
be a problem.  Did I miss something?





