[ANN] First of a kind open-source effort for advancing Hebrew IR

2010-06-07 Thread Itamar Syn-Hershko
Hi all, Indexing Hebrew texts for later retrieval is not a trivial task. Of all languages, Hebrew seems to be the toughest to handle. Although several solutions exist, they do not necessarily provide the best results in terms of relevance. Either way, there is no freely available solution allowi

Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-07 Thread Ahmet Arslan
> I need to index HTML documents and one of the requirements is to highlight documents while maintaining all of the original formatting. The documents are relatively simple HTML, meaning no JavaScript code that changes elements at runtime or too fancy CSS styling. > > I think it should

[webinar] Rapid Prototyping Search Applications with Solr

2010-06-07 Thread Erik Hatcher
Marketing blurb below. My personal hype here... I'm going to be showcasing a straightforward document search engine, from files through indexing through usable user interface in no time. Come check it out. There'll be similarities to my EuroCon presentation[1], though this will be an ent

Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-07 Thread Hans Merkl
Hi, I need to index HTML documents and one of the requirements is to highlight documents while maintaining all of the original formatting. The documents are relatively simple HTML, meaning no JavaScript code that changes elements at runtime or too fancy CSS styling. I think it should be possible t

Using field cache on real time index

2010-06-07 Thread Woolf, Ross
I'm looking for some clarification on the use of the field cache in a real-time index situation. We are using Lucene in a real-time fashion, but we update our reader via IndexReader.reopen() rather than using IndexWriter.getReader(). After opening a new reader, the old reader is closed. In the

Re: How to add "tokens" to a stream

2010-06-07 Thread Aad Nales
Yes it does. The key bit is the part with the termAttribute... Thanks a lot. Cheers, Aad On Mon, Jun 7, 2010 at 4:53 PM, Simon Willnauer wrote: > Hey there, > in Lucene 3.0 / 2.9 the Token class has been removed / replaced with an > Attribute-based API. A TokenStream operates on Attributes it ha

Re: How to add "tokens" to a stream

2010-06-07 Thread Simon Willnauer
Hey there, in Lucene 3.0 / 2.9 the Token class has been removed / replaced with an Attribute-based API. A TokenStream operates on Attributes it has declared, which are eventually accessed by the IndexWriter to create the inverted index. There are Attributes like TermAttribute, PositionIncrementAttribu

synonym highlighter

2010-06-07 Thread Aad Nales
Hi All, We are mixing Lucene with a commercial service giving us all kinds of synonyms. We add these synonyms to the index and we can search with them. The problem we have is 'highlighting' the original word when a synonym is found. We were thinking of the following approach: 1. Get a term 2.

How to add "tokens" to a stream

2010-06-07 Thread Aad Nales
Hi All, Years ago we implemented a Lucene solution which we are updating today, and I am a bit lost on the following. In Lucene 1.x and 2.x it was possible to add a token in a Filter simply by returning an extra Token when next() was called. What I cannot find is an equivalent possibility for

Re: index field used for boosting rank

2010-06-07 Thread Ian Lea
Your understanding is correct. There is no way to just add a new field to an existing document, or to update one field in an existing document. -- Ian. On Mon, Jun 7, 2010 at 1:46 PM, andynuss wrote: > > Hi, > > I want to add a rank field to my index with numbers 1 thru 10, and apply a > boos

index field used for boosting rank

2010-06-07 Thread andynuss
Hi, I want to add a rank field to my index with numbers 1 thru 10, and apply a boost appropriate for each of the values. One of the other indexed fields is huge, about 40,000 chars. My understanding is that if I change the new "rank" field from 1 to 2, the huge field is reindexed. Is there any

Re: indexWriter.addIndexes, Disk space, and open files

2010-06-07 Thread Regan Heath
>> That's pretty much exactly what I suspected was happening. I've had the same problem myself on another occasion... out of interest is there any way to force the file closed without flushing? > No, IndexOutput has no such method. We could consider adding one... That sounds useful in ge

Re: indexWriter.addIndexes, Disk space, and open files

2010-06-07 Thread Michael McCandless
On Mon, Jun 7, 2010 at 6:18 AM, Regan Heath wrote: > That's pretty much exactly what I suspected was happening. I've had the same problem myself on another occasion... out of interest is there any way to force the file closed without flushing? No, IndexOutput has no such method. We could

Re: indexWriter.addIndexes, Disk space, and open files

2010-06-07 Thread Regan Heath
That's pretty much exactly what I suspected was happening. I've had the same problem myself on another occasion... out of interest is there any way to force the file closed without flushing? From memory I tried everything I could think of at the time but couldn't manage it. Best I could do was

Re: indexWriter.addIndexes, Disk space, and open files

2010-06-07 Thread Michael McCandless
This is a bug in how Lucene handles IOException while closing files. Look at SegmentMerger's sources, for 2.3.2: https://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_3_2/src/java/org/apache/lucene/index/SegmentMerger.java Look at the finally clause in mergeTerms: } finally {
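The snippet above is cut off before the finally clause is shown, so the following is not Lucene's code. It is only a generic sketch, in plain Java, of the kind of failure the message describes: when several files are closed one after another and the first close() throws an IOException, the remaining files are never closed. The class and method names (CloseAllSketch, Tracked, closeNaively, closeAll) are hypothetical and exist only for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class CloseAllSketch {

    /** Hypothetical resource that records whether close() was ever called. */
    static final class Tracked implements Closeable {
        final boolean failOnClose;
        boolean closeCalled = false;

        Tracked(boolean failOnClose) {
            this.failOnClose = failOnClose;
        }

        @Override
        public void close() throws IOException {
            closeCalled = true;
            if (failOnClose) {
                // Simulates a flush failing during close, e.g. on a full disk.
                throw new IOException("simulated failure while closing");
            }
        }
    }

    /** Naive version: the first failing close() aborts the loop. */
    static void closeNaively(List<? extends Closeable> resources) throws IOException {
        for (Closeable c : resources) {
            c.close();
        }
    }

    /** Attempts every close(), then rethrows the first IOException seen. */
    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable c : resources) {
            try {
                c.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        Tracked a = new Tracked(true);
        Tracked b = new Tracked(false);
        try {
            closeNaively(Arrays.asList(a, b));
        } catch (IOException e) {
            System.out.println("naive: second resource closed? " + b.closeCalled);
        }

        Tracked c = new Tracked(true);
        Tracked d = new Tracked(false);
        try {
            closeAll(Arrays.asList(c, d));
        } catch (IOException e) {
            System.out.println("closeAll: second resource closed? " + d.closeCalled);
        }
    }
}
```

With closeNaively the second resource is left open after the first close() fails, which on Windows can keep a file handle alive and block later deletes. closeAll still reports the failure but only after every resource has had its close() attempted.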

Re: indexWriter.addIndexes, Disk space, and open files

2010-06-07 Thread Regan Heath
If you don't want to use the ImDisk software, a small flash drive will do just as well... Regan Heath wrote: > Windows XP. > > The problem occurs on the local file system, but to replicate it more easily I am using http://www.ltr-data.se/opencode.html#ImDisk to mount a virtual 10mb dis