Re: PDF text extracted without spaces

2010-12-03 Thread Hans Merkl

Index strategy for tagged documents where tags can change often

2010-07-28 Thread Hans Merkl
Hi, In addition to text content, my documents have tags which can be searched too. The problem is that the tags change quite often, and every time a tag gets added or removed I have to call UpdateDocument, which is quite slow when done for hundreds of documents. Are there any well performing str

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Hans Merkl
Just curious. What commercial alternatives are out there? On Wed, Jun 23, 2010 at 04:01, jm wrote: > Hi, > > I am trying to compile some arguments in favour of lucene as > management is deciding whether to standardize on lucene or a competing > commercial product (we have a couple of produc, one

Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-08 Thread Hans Merkl
Hi Ahmet, I am using Lucene.NET with C# so I can't test this quickly. Will HTMLStripCharFilter maintain the character offsets or does it just extract the plain text? Hans > You can use org.apache.solr.analysis.HTMLStripCharFilter. It is possible to > add one or more org.apache.lucene.analysis.C

Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-07 Thread Hans Merkl
Hi, I need to index HTML documents and one of the requirements is to highlight documents while maintaining all of the original formatting. The documents are relatively simple HTML, meaning no JavaScript code that changes elements at runtime and no overly fancy CSS styling. I think it should be possible t
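
The idea behind this question (strip the tags for analysis, but remember where each character came from so hits can be highlighted in the original HTML) can be illustrated with a small sketch. This is not Lucene's or Solr's implementation and does not use their APIs; the function names strip_tags_with_offsets and highlight are hypothetical, and the code assumes simple HTML with no entities, comments, or ">" characters inside attribute values. A match that spans a tag boundary would produce badly nested markup, which this sketch does not handle.

```python
def strip_tags_with_offsets(html):
    """Return (plain_text, offsets), where offsets[i] is the index in
    `html` of the character plain_text[i]. Hypothetical helper."""
    text_chars = []
    offsets = []
    in_tag = False
    for i, ch in enumerate(html):
        if ch == "<":
            in_tag = True            # start of a tag: skip until ">"
        elif ch == ">" and in_tag:
            in_tag = False           # end of the tag
        elif not in_tag:
            text_chars.append(ch)    # visible text character
            offsets.append(i)        # remember its original position
    return "".join(text_chars), offsets


def highlight(html, term, open_mark="<mark>", close_mark="</mark>"):
    """Wrap the first occurrence of `term` (searched in the stripped text)
    with markers inserted at the corresponding original offsets."""
    if not term:
        return html
    text, offsets = strip_tags_with_offsets(html)
    start = text.find(term)
    if start < 0:
        return html
    first = offsets[start]                      # original index of first char
    last = offsets[start + len(term) - 1] + 1   # one past the last char
    return html[:first] + open_mark + html[first:last] + close_mark + html[last:]


if __name__ == "__main__":
    page = "<p>Hello <b>world</b></p>"
    print(strip_tags_with_offsets(page)[0])
    print(highlight(page, "world"))
```

The search runs against the stripped text, while the markers are inserted at positions taken from the offset table, so the original formatting is left untouched.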