Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Sreejith S
I suggest the Jsoup HTML parser, which is fast, easy, and simple. I have used many HTML parsers, and Jsoup is the one I am comfortable with. http://jsoup.org/ IBM ICU provides the best tokenizers. On 3/11/11, Bill Janssen wrote: > shrinath.m wrote: > >> Consider we've offline HTML pages

[GSoC] Apache Lucene @ Google Summer of Code 2011 [STUDENTS READ THIS]

2011-03-11 Thread Simon Willnauer
Hey folks, Google Summer of Code 2011 is very close and the Project Applications Period has started recently. Now it's time to get some excited students on board for this year's GSoC. I encourage students to submit an application to the Google Summer of Code web-application. Lucene & Solr are ama

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Bill Janssen
shrinath.m wrote: > Consider we've offline HTML pages, no parsing while crawling, now what ? > Any tokenizer someone has built for this ? In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages by selecting only text between certain tags, before indexing them. These are offline
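A rough Java counterpart to the approach described above (keeping only the text found between certain tags before handing it to Lucene) can be sketched with the HTML parser that ships in the JDK's Swing packages. This is not the UpLib/BeautifulSoup code; the class name TagTextExtractor and the method extract are made up for illustration, and the Swing parser targets HTML 3.2, so one of the dedicated parsers named in this thread may cope better with modern pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text that appears inside a chosen set of tags.
final class TagTextExtractor {

    static String extract(String html, Set<HTML.Tag> wanted) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Number of currently open wanted tags; text is kept while this is > 0.
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (wanted.contains(tag)) {
                    depth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (wanted.contains(tag) && depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    // Separate text chunks with a single space.
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(data);
                }
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Calling TagTextExtractor.extract(page, Set.of(HTML.Tag.H1, HTML.Tag.P)) would keep heading and paragraph text only; the resulting string could then be added to a Lucene document as an analyzed field. Set.of assumes Java 9 or later.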

Re: Indexing of multilingual labels

2011-03-11 Thread Stephane Fellah
Erick, I am trying to index multilingual taxonomies such as SKOS, WordNet, EuroWordNet. Taxonomies are composed of concepts which have preferred and alternative labels in different languages. Some labels have the same lexical form in different languages. I want to be able to index these concepts in

Re: overall number of hits

2011-03-11 Thread Ian Lea
There are search methods that don't require a filter, but you are right that there is nothing quite as simple as search(q). From http://www.gossamer-threads.com/lists/lucene/java-user/95032 you can use TopDocs tp = ms.search(lucquery, 1); And then the total count is in tp.totalHits -- Ian. O

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
I used plain text and sent successfully. thanks. 2011/3/11 Erick Erickson : > What mail client are you using? I also had this problem and it's > solved in Gmail by sending the mail as "plain text" rather than > "Rich formatting". > > Best > Erick > > On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote:

overall number of hits

2011-03-11 Thread Michael Wiegand
Hi, I am currently mainly interested in the overall number of matches in a document collection (several GBs) given a particular query. At the moment I am not interested in the matching documents themselves; just the number would be sufficient. In previous versions of Lucene the Searcher class h

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
I don't use any client but browser. 2011/3/11 Erick Erickson > What mail client are you using? I also had this problem and it's > solved in Gmail by sending the mail as "plain text" rather than > "Rich formatting". > > Best > Erick > > On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote: > > hi > >

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread shrinath.m
On Fri, Mar 11, 2011 at 6:27 PM, Erick Erickson [via Lucene] < ml-node+2664607-1236630615-376...@n3.nabble.com> wrote: > Solr doesn't do it. There exist various tokenizers/filters that just strip > the HTML tags, but there's nothing built into Solr that I know of that > understands HTML, HTML-awar

Check Numeric Fields

2011-03-11 Thread Thomas Rewig
Hello, I use an index with a numeric field: doc.add(new Field(...)); doc.add(new Field(...)); doc.add(new NumericField(fieldName, Field.Store.YES, true).setIntValue(intFieldValue)); indexWriter.addDocument(doc);

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Erick Erickson
Solr doesn't do it. There exist various tokenizers/filters that just strip the HTML tags, but there's nothing built into Solr that I know of that understands HTML; HTML-aware operations are outside Solr's purview. Best Erick On Fri, Mar 11, 2011 at 6:50 AM, shrinath.m wrote: > On Fri, Mar 11, 20

Re: Indexing of multilingual labels

2011-03-11 Thread Erick Erickson
It's not so much a matter of problems with indexing/searching as it is with search behavior. The reason these strategies are implemented is that using English stemming, say, on other languages will produce "interesting" results. There's no a-priori reason you can't index multiple languages in the

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Erick Erickson
What mail client are you using? I also had this problem and it's solved in Gmail by sending the mail as "plain text" rather than "Rich formatting". Best Erick On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote: > hi >    it seems my mail is judged as spam. >    Technical details of permanent failure:

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread shrinath.m
On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] < ml-node+2664380-1940163870-376...@n3.nabble.com> wrote: > But I think the parser will most be used when crawling. So you can use > these parsers when crawling and save parsed result only. > Consider we've offline HTML pages, no parsing while

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Li Li
1. Parsing is a preprocessing step for documents; Lucene will not know anything about it. 2. I have only used NekoHtmlParser. Cobra is a Java browser and it seems a little heavy. VietSpider is very heavy because it embeds the Mozilla browser via SWT. MozillaParser is similar but embedding by itself (which nee

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Ivan Krišto
Hello! On Fri, Mar 11, 2011 at 12:03 PM, shrinath.m wrote: > I am trying to index content within certain HTML tags, how do I index it? > Which is the best parser/tokenizer available to do this? As a general HTML parser I would recommend "Jericho HTML Parser" - http://jericho.htmlparser.net/do

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread shrinath.m
Thank you Li Li. Two questions: 1. Is there anything *in* *Lucene* that I need to know of? Some contrib module or anything as such? 2. You ran a search in java-source.net for me, thanks for that, but do you mind telling me which is the easiest and fastest? On Fri, Mar 11, 2011 at 4:38 PM, L

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Li Li
http://java-source.net/open-source/html-parsers 2011/3/11 shrinath.m > I am trying to index content within certain HTML tags, how do I index it? > Which is the best parser/tokenizer available to do this? > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Which-is-the-

Re: document object

2011-03-11 Thread Ian Lea
You've been told several times that the searcher.doc() call can be expensive and given suggestions as to how to improve it. You have provided no evidence that you have tried any of these suggestions. I know nothing about CLucene and you have not provided any evidence as to whether your comparison

Re: document object

2011-03-11 Thread Ian Lea
If I've read this right you are saying that you need to look at fields A and D for 1000 docs but B, C and E for just one. If that is right then lazy loading/FieldSelector will help. But even loading just A and D for 1000 hits will inevitably take time. As already suggested, you could look at cac

RE: document object

2011-03-11 Thread suman.holani
Hi, In Java I am using a RAM-based index. For a small case: for (int i = 0; i < hits.length; ++i) { //Document D = searcher.doc(hits[i].doc); } Found 37 hits. 0 total milliseconds. == In case I uncomment the lines: for (int i = 0; i < hits.length;

Re: index enforcing query terms to appear within the same sentence

2011-03-11 Thread Ian Lea
The example code in http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-td1501269.html reads custom standard analyzer: public class MyStandardAnalyzer extends StandardAnalyzer implements IndexFields { public MyStandardAnalyzer(Version matchVersion) {

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
Problem of Replication Reservation Duration. Hi all, I tried to send this mail to the Solr dev mailing list but it was rejected as spam, so I am sending it again, and to Lucene dev too. The replication handler in Solr 1.4, which we use, seems to be a little problematic in some extreme situations. The

I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
hi it seems my mail is judged as spam. Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other

Re: Is ConcurrentMergeScheduler useful for multiple running IndexWriter's?

2011-03-11 Thread David Causse
On Fri, Mar 04, 2011 at 07:02:48AM -0800, Jason Rutherglen wrote: > ConcurrentMergeScheduler is tied to a specific IndexWriter, however if > we're running in an environment (such as Solr's multiple cores, and > other similar scenarios) then we'd have a CMS per IW. I think this > effectively disabl