RE: best html parser for html documents generated by microsoft products

2005-12-03 Thread Mark Benussi
I also use JTidy, but not for Lucene parsing. There is no easy way to handle this; you simply have to remove all the crappy Microsoft inserts as they come. -----Original Message----- From: Gaston [mailto:[EMAIL PROTECTED] Sent: 03 December 2005 13:49 To: java-user@lucene.apache.org Subject: best ht

Re: Distributed sort

2005-12-03 Thread Erik Hatcher
On Dec 3, 2005, at 1:26 PM, Jeff Rodenburg wrote: In one of the Google Labs whitepapers (http://labs.google.com/papers/mapreduce-osdi04.pdf), a programming construct known as MapReduce is used in a variety of jobs/tasks within Google's operation. As an example of the application of MapRedu

Re: Lucene performance bottlenecks

2005-12-03 Thread Paul Elschot
On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote: Paul Elschot wrote: In somewhat more readable layout: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^

Distributed sort

2005-12-03 Thread Jeff Rodenburg
In one of the Google Labs whitepapers (http://labs.google.com/papers/mapreduce-osdi04.pdf), a programming construct known as MapReduce is used in a variety of jobs/tasks within Google's operation. As an example of the application of MapReduce, the whitepaper refers to Distributed Sorting. Essent
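
For readers unfamiliar with the idea, here is a minimal sketch of one way sorted results from several nodes can be combined. It is not the whitepaper's method and not a Lucene API: it assumes each node has already sorted its own slice locally, and it only shows the final k-way merge step. The class name ShardMerge and the use of plain integers as sort keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper, for illustration only. */
final class ShardMerge {
    /** Merges already-sorted per-shard lists into one sorted list (k-way merge). */
    static List<Integer> merge(List<List<Integer>> sortedShards) {
        // Each queue entry is {value, shardIndex, positionInShard}, ordered by value.
        PriorityQueue<int[]> heads =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < sortedShards.size(); s++) {
            if (!sortedShards.get(s).isEmpty()) {
                heads.add(new int[] {sortedShards.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            // Take the smallest head, then advance within the shard it came from.
            int[] head = heads.poll();
            merged.add(head[0]);
            int next = head[2] + 1;
            List<Integer> shard = sortedShards.get(head[1]);
            if (next < shard.size()) {
                heads.add(new int[] {shard.get(next), head[1], next});
            }
        }
        return merged;
    }
}
```

The merge only ever holds one pending element per shard, so the combining node does not need to load every shard's full result at once.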

best html parser for html documents generated by microsoft products

2005-12-03 Thread Gaston
Hello, JTidy is a very good HTML parser, but for HTML websites made with Microsoft Office products such as Word it is not optimal, because it returns "Microsoft specific HTML tags" instead of only text. Or how should I handle HTML pages whose source begins like this: " http://ww
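
As a rough illustration of the "remove the Microsoft inserts" approach mentioned in the reply, the sketch below pre-cleans Word-generated HTML before it is handed to a parser. It is an assumption-laden example, not a JTidy or Lucene feature: the class name OfficeHtmlCleaner is hypothetical, and it only targets two constructs that Word's HTML export is known to emit, namely conditional comments and namespaced tags such as <o:p>. Real documents may need more rules.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-cleaner for Word-exported HTML; illustration only. */
final class OfficeHtmlCleaner {
    // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->,
    // removed together with everything inside them.
    private static final Pattern CONDITIONAL_COMMENT =
        Pattern.compile("<!--\\[if[^\\]]*\\]>.*?<!\\[endif\\]-->", Pattern.DOTALL);

    // Namespaced tags such as <o:p> or </w:View>; only the tags are removed,
    // any text between them is kept.
    private static final Pattern NAMESPACED_TAG =
        Pattern.compile("</?[A-Za-z][A-Za-z0-9]*:[^>]*>");

    static String clean(String html) {
        String withoutComments = CONDITIONAL_COMMENT.matcher(html).replaceAll("");
        return NAMESPACED_TAG.matcher(withoutComments).replaceAll("");
    }
}
```

The cleaned string can then be passed to whichever HTML parser is already in use; ordinary tags such as <p> or <a href="..."> are left untouched.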

Re: Lucene performance bottlenecks

2005-12-03 Thread Andrzej Bialecki
Paul Elschot wrote: In somewhat more readable layout: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 t
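
To make the shape of the required clauses above easier to see, here is a small sketch that builds one "+(field:term^boost ...)" group from a term and a table of per-field boosts. It is only an illustration of the query string layout quoted in this thread, not the code that produced it; the class name BoostedClauseBuilder is hypothetical, and the phrase clauses with slop are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical builder for one required, multi-field clause; illustration only. */
final class BoostedClauseBuilder {
    /**
     * Returns a clause such as "+(url:foo^4.0 content:foo)".
     * A boost of 1.0 is omitted. The term is not escaped, so this
     * sketch assumes it contains no query-syntax characters.
     */
    static String requiredClause(String term, Map<String, Float> fieldBoosts) {
        StringBuilder sb = new StringBuilder("+(");
        boolean first = true;
        for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
            if (!first) {
                sb.append(' ');
            }
            first = false;
            sb.append(e.getKey()).append(':').append(term);
            if (e.getValue() != 1.0f) {
                sb.append('^').append(e.getValue());
            }
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Field order and boosts copied from the query shown in the thread.
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("url", 4.0f);
        boosts.put("anchor", 2.0f);
        boosts.put("content", 1.0f);
        boosts.put("title", 1.5f);
        boosts.put("host", 2.0f);
        System.out.println(requiredClause("term1", boosts));
    }
}
```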

Re: Lucene performance bottlenecks

2005-12-03 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up). Ar

Re: Wildcard

2005-12-03 Thread Erik Hatcher
On Dec 2, 2005, at 6:21 PM, John Powers wrote: Hello, Lucene only lets you use a wildcard after a term, not before, correct? What workarounds are there for that? If I have an item 108585-123 and another 332323-123, how can I look for all the -123 family of items? To clarify something that no
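
One general workaround for "ends with" lookups, independent of what the reply in this thread goes on to say, is to store each term reversed as well, so that a suffix search becomes an ordinary prefix search. The sketch below shows the idea with plain Java collections; ReversedTermIndex is a hypothetical class, not a Lucene API, and the TreeMap merely stands in for a sorted term dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of the reversed-term trick; not a Lucene API. */
final class ReversedTermIndex {
    // Sorted map from reversed term to original term.
    private final TreeMap<String, String> reversedToOriginal = new TreeMap<>();

    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    void add(String term) {
        reversedToOriginal.put(reverse(term), term);
    }

    /** Finds all terms ending in the suffix via a prefix scan over reversed terms. */
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> matches = new ArrayList<>();
        // Start at the first key >= prefix and stop once keys no longer share it.
        for (Map.Entry<String, String> e
                : reversedToOriginal.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            matches.add(e.getValue());
        }
        return matches;
    }
}
```

With 108585-123 and 332323-123 added, endingWith("-123") returns both item numbers. In a search index the same effect would come from an extra field holding the reversed value, queried with a trailing wildcard on the reversed suffix.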