Re: Removing similar documents from search results
On Sun, 2005-03-20 at 00:49 -0800, Chris Hostetter wrote:
> Actually, your "Split across several pages" comment implies that you want
> a system which can tell that page 1 of a multipage article should be
> grouped with page 2 -- which may be radically different content. Most
> multipage documents have very different text on subsequent pages, so i'm
> not sure that a programmatic solution is going to be able to spot that.

Actually I added that in after I saw that Google does it. You're right that the content is likely to be completely different, so I guess they do it through some URL matching.

> I may also be reading too much into your message, but it sounds like you
> aren't trying to index generic content -- it sounds like you are trying to
> index content under your control (ie: content on your own web site).
>
> if that's the case, then presumably you know something about the
> source data and the URL structure -- maybe you could solve this problem
> when you build your index.
>
> for example, if i look at a site like perl.com, i can see a pattern in the
> way the article URLs look...
>
> page 1...
>   http://www.perl.com/pub/a/2005/02/17/3d_engine.html
> page 2, etc...
>   http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
> printable...
>   http://www.perl.com/lpt/a/2005/02/17/3d_engine.html
>
> So instead of putting all of those URLs in the index as separate docs, why
> not create a single doc, with all of those URLs?

I have to index several sites, and I used some examples of the problems I've come across so far. I don't control the content for any of them, and they get picked up by a spider, so excluding pages requires adding special cases.

I'll probably adopt a two-stage approach:

1. Prevent duplicate documents from getting into the index in the first place, e.g. compare MD5 hashes and file sizes, maybe make the spider configurable to spot certain URL patterns, etc.
2. Try out the various techniques suggested in this thread to spot similar pages at query time and hide them.

--
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
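Stage 1 of the approach above (comparing MD5 hashes and sizes before a page reaches the index) could be sketched roughly as follows. This is only an illustration in plain Java: the DuplicateFilter class and its method names are hypothetical and not part of any spider or of Lucene, and it catches only byte-identical copies, not the "similar" pages that stage 2 is meant to handle.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers a fingerprint of every page body seen so far. */
final class DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a body is offered, false for byte-identical repeats. */
    boolean isNew(byte[] body) {
        // The key combines the size and the MD5 digest, mirroring the two checks in stage 1.
        return seen.add(body.length + ":" + md5Hex(body));
    }

    /** Lower-case hexadecimal MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DuplicateFilter filter = new DuplicateFilter();
        byte[] page = "<html>article text</html>".getBytes(StandardCharsets.UTF_8);
        byte[] copy = "<html>article text</html>".getBytes(StandardCharsets.UTF_8);
        byte[] other = "<html>another article</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(filter.isNew(page));  // first copy is accepted
        System.out.println(filter.isNew(copy));  // identical bytes are rejected
        System.out.println(filter.isNew(other)); // different content is accepted
    }
}
```

A spider would call isNew on each fetched body and skip indexing when it returns false. Pages that differ only in navigation or advertising markup would still pass, which is why the URL-pattern rules are listed as a separate measure.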
Multiple Field Queries
Hello,

At the moment I cannot search the mailing list archives, so I will bother you instead. I want to search over multiple fields, for example content and filename. The MultiFieldQueryParser is not applicable for me, so I build the query programmatically. I parse the query string with the QueryParser twice, once for content and once for filename, and then combine the two resulting queries with a BooleanQuery. The resulting query string is, for example, +content:test +filename:test. The problem is that I would like to construct a query like (+content:test) OR (+filename:test). Is the only alternative to convert the boolean query to a string, do some string manipulation on it, and pass it through the QueryParser again?

Thanks,
Stefan
Re: Multiple Field Queries
Perhaps I misunderstand, but it seems to me that if you call add with false for both boolean arguments, you will end up with the required result:

(content:test) (filename:test)

which is equivalent to your requested query.

Hope this helps,
Aad Nales

Gusenbauer Stefan wrote:
> The problem here is i would like to construct a query like (+content:test) OR (+filename:test).
RE: how to detect index integrity?
> From: [EMAIL PROTECTED]
> Sent: Fri 3/18/2005 11:34 PM
> Is there any way to detect the index's integrity?
> Sometimes I came upon exceptions like these. If it happens, my only way
> is to delete the corrupted index.
>* Exception in thread "main" java.io.IOException : read past EOF
>* java.lang.ArrayIndexOutOfBoundsException
> [ ... ]

I did too, which is why I wrote NullDirectory. You can find the sources and a description in Bugzilla:

http://issues.apache.org/bugzilla/show_bug.cgi?id=33851

Look at the tests for examples of use. I would value your feedback.

-- Ravi/
Re: Multiple Field Queries
Aad Nales wrote:
> Perhaps i misunderstand but it seems to me that if you execute the add with two times a false value you will end up with the required result. (content:test) (filename:test) which is equivalent to your requested query.

Thanks, add(query, false, false) works now. It failed before because I had added a field to all documents for searching with a datefile, so all documents were always returned.

Stefan
Re: new added documents not showing
On Sat, 19 Mar 2005 22:43:44 +0300, Pasha Bizhan <[EMAIL PROTECTED]> wrote:
> Could you provide the code snippets for your process?

Sure (thanks for helping, btw).

I just realized that the way I described our process was off a little bit. Here's the process again:

1. grab all index Directories (index parts)
2. loop newest to oldest and make documents unique (by deleting older documents)
3. get list of documents from index parts to delete from our main index
4. delete documents from main index
5. add all documents from index parts into the main index

I apologize for the amount of code below.

Here is the code that loops through all the index parts, from newest to oldest, and then deletes the documents from any older index parts. The unique ID we use as a Key Field is "ReceivedDate".

    IndexReader reader = null;
    IndexReader reader2 = null;
    try {
        /*
         * Loop backwards (latest to oldest) through parts
         */
        for ( int i = ( directories.length - 1 ); i >= 0; i-- ) {
            reader = IndexReader.open( FSDirectory.getDirectory( directories[i], false ) );
            int numDocuments = reader.numDocs();
            /*
             * Loop forward (oldest to latest) up to the current part
             * being looked at.
             * Delete any messages from the older parts that exist in the
             * current part.
             */
            for ( int x = 0; x < i; x++ ) {
                String partName = directories[x].getName();
                reader2 = IndexReader.open( FSDirectory.getDirectory( directories[x], false ) );
                for ( int h = 0; h < numDocuments; h++ ) {
                    if ( !reader.isDeleted( h ) ) {
                        Document d = reader.document( h );
                        String receivedDate = d.get( "ReceivedDate" );
                        Term term = new Term( "ReceivedDate", receivedDate );
                        int num = reader2.delete( term );
                    }
                }
                reader2.close();
                reader2 = null;
            }
            reader.close();
            reader = null;
        }
    }
    catch ( Exception e ) {
        // log error
    }
    finally {
        try {
            if ( reader != null ) reader.close();
            if ( reader2 != null ) reader2.close();
        }
        catch ( IOException e ) {
            // log error
        }
    }

Here we build up a list of ReceivedDates to help us delete from the main index. I just realized that we could build this list from the previous section.

    List list = new ArrayList();
    for ( int i = 0; i < directories.length; i++ ) {
        IndexReader r = null;
        try {
            r = IndexReader.open( directories[i] );
            int num = r.numDocs();
            for ( int x = 0; x < num; x++ ) {
                if ( !r.isDeleted( x ) ) {
                    Map map = new HashMap();
                    Document d = r.document( x );
                    map.put( "ReceivedDate", d.get( "ReceivedDate" ) );
                    list.add( map );
                }
            }
        }
        catch ( Exception e ) {
            e.printStackTrace();
        }
        finally {
            if ( r != null ) try { r.close(); } catch ( Exception e ) {}
        }
    }
    return list;

Here we actually go through and delete the documents from the main index.

    IndexReader reader = null;
    Map message;
    try {
        reader = IndexReader.open( mainindex );
        Iterator it = indexList.iterator(); // returned from previous section
        /*
         * Loop through messages to clear from the index
         */
        while ( it.hasNext() ) {
            message = (Map)it.next();
            /*
             * Delete based on received date
             */
            String receivedDate = (String)message.get( "ReceivedDate" );
            Term term = new Term( "ReceivedDate", receivedDate );
            int num = reader.delete( term );
Re: NumberTools
: One annoyance I have run across is the impedance mismatch between
: range queries and sorting.
:
: If your terms are indexed as standard numbers, then integer sorting
: is fast, but range queries don't work (for negative values). If you
: format the terms such that range queries work for any integer, then
: you have to use the slower string (or custom) sorting.
:
: Is there a way around this besides writing my own custom sorting hit collector?

Yeah, this is something that's never really made sense to me. I've tried digging into the code to understand this a couple of times, but I've never had much success; maybe my assumptions/understanding are wrong...

1) Lucene stores all fields as Strings
2) You can construct a "Sort" object with a SortField of type "INT"
3) according to tribal wisdom (and Lucene in Action), sorting by a numeric field caches the numeric value and is more efficient than sorting by a string field (in which the string value needs to be cached)

1+2+3 tells me that at some point, when the search/sort code sees a SortField of type "INT" (or of type AUTO and the value of that field in the first doc looks like an INT), a single pass is done to convert the string value of the field from disk into a numeric value for caching (and sorting).

So why couldn't a user-specified NumberFormat object be used to convert that string into an Integer? That would allow people to format their numbers in a way that sorts lexicographically for Range Filters, but still get the good numeric sort efficiency.

I can see in FieldDocSortedHitQueue where the case statement deals with the various types of SortField, but at that point it's comparing FieldDoc objects whose fields[i] is expected to already be an "Integer" object. Where is that "Integer" object parsed from the String value of the field?

-Hoss
RE: new added documents not showing
Hi,

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> I just realized that the way I described our process was off a little
> bit.
>
> Here's the process again:
>
> I apologize for the amount of code below.

When do you open the index writer? Where is the code?

Pasha Bizhan
http://lucenedotnet.com
Re: NumberTools
: One annoyance I have run across is the impedance mismatch between
: range queries and sorting.
:
: If your terms are indexed as standard numbers, then integer sorting
: is fast, but range queries don't work (for negative values). If you
: format the terms such that range queries work for any integer, then
: you have to use the slower string (or custom) sorting.
:
: Is there a way around this besides writing my own custom sorting hit collector?

I solve this problem by using two separate fields: one for range queries and one for sorting, each formatted appropriately. This adds very little space to the index. It is a bit ugly, but better than writing a custom hit collector. A better solution that unified these formats, perhaps along the lines Hoss suggests, would be appreciated.

Chuck
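For the range-query field, one possible formatting is a fixed-width string whose lexicographic order matches numeric order, including for negative values. The sketch below is only an illustration of that idea in plain Java; the SortableInt class is hypothetical and is not the NumberTools implementation. With plain decimal text, "-1" sorts before "-2" and "10" before "9", which is exactly the mismatch described in the question.

```java
import java.util.Locale;

/** Hypothetical encoder: maps every int to a 10-digit term that sorts numerically. */
final class SortableInt {

    /** Fixed-width string whose lexicographic order matches numeric order. */
    static String encode(int value) {
        // Shift the signed range [-2^31, 2^31 - 1] onto [0, 2^32 - 1] ...
        long shifted = (long) value - Integer.MIN_VALUE;
        // ... then zero-pad so that every term has the same number of digits.
        return String.format(Locale.ROOT, "%010d", shifted);
    }

    /** Inverse of encode. */
    static int decode(String term) {
        return (int) (Long.parseLong(term) + Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        int[] samples = { -10, -2, -1, 0, 9, 10 };
        for (int n : samples) {
            System.out.println(n + " -> " + encode(n));
        }
    }
}
```

Under the two-field approach, the second field would keep the ordinary decimal form so that the integer sort can still parse it.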
Re: new added documents not showing
> When do you open the index writer? Where is the code?

Ah, sorry. That last section is in a method that gets called in a loop.

    IndexWriter writer = null;
    try {
        writer = new IndexWriter( mainindex, new StandardAnalyzer(), false );
        for ( int i = 0; i < directories.length; i++ ) {
            moveDocumentsOver( writer, directories[i] );
            // delete dir
        }
    }
    catch ( Exception e ) {
        // log error
    }
    finally {
        if ( writer != null ) try { writer.close(); } catch ( Exception e ) {}
    }

Roy.
RE: new added documents not showing
Hi,

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Ah, sorry. That last section is in a method that gets called
> in a loop.

The shortest version of your code is:

    void mainFunction() {
        IndexWriter writer = null;
        writer = new IndexWriter( mainindex, new StandardAnalyzer(), false );
        moveDocumentsOver( writer, oldDirectory );
        writer.close();
    }

    void moveDocumentsOver( IndexWriter writer, String oldDirectory ) {
        IndexReader r = null;
        r = IndexReader.open( oldDirectory );
        int num = r.numDocs();
        for ( int i = 0; i < num; i++ ) {
            if ( !r.isDeleted( i ) ) {
                Document d = r.document( i );
                Document nd = new Document();
                // fill nd by d
                writer.addDocument( nd );
            }
        }
        r.close();
    }

And then you execute the search (using mainindex) and you don't see the new documents. Yes?

Pasha Bizhan
http://lucenedotnet.com
Re: new added documents not showing
Correct, we also can't see the new documents when we open an IndexReader on the main index.

Roy.
using Expression language for lucene api
I have the following expression:

results is of type Hits. I want to know whether there is a way, using the Expression Language or JSTL, to access, for example, result.doc(i).
boosting?
Hi there,

How can I get the real boost value of a field or document? The Javadoc says that it _may_ not be returned correctly when reading a document with an IndexReader. Any hints on how to get the boost when reading a document?

Thanks.
Stefan
Re: boosting?
Stefan,

Boosts are not necessarily stored directly. Each field has an associated normalization factor, into which the boost is multiplied. This value is precomputed at indexing time, so getting the boost back isn't possible unless the length normalization is 1.0 (which is not usually a good idea).

Erik

On Mar 21, 2005, at 4:35 PM, Stefan Groschupf wrote:
> how to get the real boost value of a field or document?
> [ ... ]
Re: new added documents not showing
Hello,

Sorry if this is stating the obvious, but have you used Luke to verify that the new documents were indexed in the first place? Sorry if you've already mentioned this.

Otis

--- [EMAIL PROTECTED] wrote:
> > When do you open the index writer? Where is the code?
>
> Ah, sorry. That last section is in a method that gets called in a
> loop.
> [ ... ]
Re: using Expression language for lucene api
I think there are some taglibs that let you call functions on objects, but you could also consider wrapping Hits in something that is JSTL-friendly, perhaps a List that JSTL knows how to handle.

Otis

--- Omar Didi <[EMAIL PROTECTED]> wrote:
> I have the following expression :
>
> results is of type Hits, i want to know if there is a way using
> Expression language or jstl to access for example: result.doc(i).
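The wrapping idea could look roughly like the sketch below: a read-only List view over anything that exposes a length and an indexed getter. To keep the example self-contained it does not reference Lucene at all; IndexedSource is a hypothetical stand-in interface, and the assumption is that a real adapter would implement it by delegating to the length and doc(i) methods of Hits.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.AbstractList;
import java.util.List;

/** Read-only List view over an indexed source, so that a forEach-style tag can iterate it. */
final class IndexedSourceList<T> extends AbstractList<T> {
    private final IndexedSource<T> source;

    IndexedSourceList(IndexedSource<T> source) {
        this.source = source;
    }

    @Override
    public int size() {
        return source.length();
    }

    @Override
    public T get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        try {
            return source.get(index);
        } catch (IOException e) {
            // List.get cannot throw checked exceptions, so rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for search results.
        IndexedSource<String> fake = new IndexedSource<String>() {
            public int length() { return 3; }
            public String get(int i) { return "doc" + i; }
        };
        List<String> results = new IndexedSourceList<>(fake);
        for (String doc : results) {
            System.out.println(doc);
        }
    }
}

/** Hypothetical stand-in for the two operations the adapter needs from the result object. */
interface IndexedSource<T> {
    int length();
    T get(int index) throws IOException;
}
```

Because the view is lazy, each element is fetched only when the page asks for it, which avoids loading every hit up front.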
Re: boosting?
Stefan,

On Monday 21 March 2005 22:35, Stefan Groschupf wrote:
> Hi there,
> how to get the real boost value of a field or document?
> The java doc says that it is _may_ not correct returned when reading a
> document with a index reader.
> Any hints how to get the boost when reading a document?

The javadoc of Field.setBoost() has meanwhile been extended a bit (source from the trunk at http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/document/):

   * The boost is multiplied by {@link Document#getBoost()} of the document
   * containing this field. If a document has multiple fields with the same
   * name, all such values are multiplied together. This product is then
   * multiplied by the value {@link Similarity#lengthNorm(String,int)}, and
   * rounded by {@link Similarity#encodeNorm(float)} before it is stored in the
   * index. One should attempt to ensure that this product does not overflow
   * the range of that encoding.

One feature of Similarity.encodeNorm(float) is that it returns a byte, so at most 256 different values can be stored, which is far fewer than the number of possible floating point values. encodeNorm() rounds to a representable value close to the given float, and decodeNorm() returns that representable value, normally used in TermScorer.

Regards,
Paul Elschot
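To make the precision loss concrete, here is a small sketch of a one-byte float encoding with 3 mantissa bits and 5 exponent bits. The bit layout is an assumption chosen for illustration, and the ByteFloat class is hypothetical; it should not be read as a copy of what Similarity.encodeNorm() and decodeNorm() do. This sketch also truncates toward zero rather than rounding to the nearest value.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical one-byte float encoding: 5 exponent bits, 3 mantissa bits, no sign. */
final class ByteFloat {

    /** Encodes a non-negative float into one byte, dropping the low mantissa bits. */
    static byte encode(float f) {
        if (f <= 0.0f) {
            return 0; // negative values and zero all map to byte 0
        }
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;            // top 3 of the low 24 bits
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;  // re-biased to fit 5 bits
        if (exponent > 31) { exponent = 31; mantissa = 7; } // clamp overflow to the maximum
        if (exponent < 0) { exponent = 0; mantissa = 1; }   // clamp underflow to the minimum
        return (byte) ((exponent << 3) | mantissa);
    }

    /** Decodes a byte back into the representable float it stands for. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = { 1.0f, 0.9f, 2.0f, 3.3f };
        for (float f : samples) {
            System.out.println(f + " -> " + decode(encode(f)));
        }
        Set<Float> distinct = new HashSet<>();
        for (int i = 0; i < 256; i++) {
            distinct.add(decode((byte) i));
        }
        System.out.println("distinct values: " + distinct.size());
    }
}
```

With an encoding like this, a stored product of boost and length norm can only be recovered to within the nearest representable step, which is why the original boost generally cannot be read back from the index.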
Re: NumberTools
Chris Hostetter writes:
>
> So why couldn't a user specified NumberFormat object be used to
> convert that string into an Integer? Allowing people to format
> their numbers in a way that sorts lexicographically for Range Filters,
> but still get the good numeric sort efficiency?
>
> I can see in FieldDocSortedHitQueue where the case statement deals with
> the various types of SortField, but at that point it's comparing FieldDoc
> objects whose fields[i] is expected to already be an "Integer" object.
> where is that "Integer" object parsed from the String value of the field?
>

Surely, by using the number -> string algorithm I showed earlier, this would not be a problem. Did I miss something?