I suggest the Jsoup HTML parser, which is a fast, easy, and simple HTML
parser. I have used many HTML parsers, and of them I am most comfortable
with Jsoup.
http://jsoup.org/
IBM ICU provides the best tokenizers.
On 3/11/11, Bill Janssen wrote:
> shrinath.m wrote:
>
>> Consider we've offline HTML pages
Hey folks,
Google Summer of Code 2011 is very close and the Project Applications
Period has started recently. Now it's time to get some excited students
on board for this year's GSoC.
I encourage students to submit an application to the Google Summer of Code
web-application. Lucene & Solr are ama
shrinath.m wrote:
> Consider we've offline HTML pages, no parsing while crawling, now what ?
> Any tokenizer someone has built for this ?
In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
by selecting only text between certain tags, before indexing them.
These are offline
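For anyone who wants to try the same tag-scoped extraction without a third-party dependency, here is a minimal sketch using only Python's standard-library html.parser; the tag whitelist and the sample page are invented for illustration, not taken from UpLib:

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect only text that appears inside a whitelist of tags."""
    KEEP = {"title", "h1", "h2", "p"}  # assumed whitelist, adjust as needed

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many whitelisted tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TagTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

html = ("<html><head><title>Demo</title></head>"
        "<body><p>Keep me</p><script>skip()</script></body></html>")
print(extract_text(html))  # -> "Demo Keep me"
```

The script body is dropped because `script` is not in the whitelist, which is the point of selecting text between certain tags rather than stripping all markup.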
Erick,
I am trying to index multilingual taxonomies such as SKOS, Wordnet,
Eurowordnet. Taxonomies are composed of concepts which have preferred and
alternative labels in different languages. Some labels are the same lexical
form in different languages. I want to be able to index these concepts in
There are search methods that don't require a filter, but you are
right that there is nothing quite as simple as search(q).
From http://www.gossamer-threads.com/lists/lucene/java-user/95032 you can use
TopDocs tp = ms.search(lucquery, 1);
And then the total count is in tp.totalHits
--
Ian.
I used plain text and it sent successfully. Thanks.
2011/3/11 Erick Erickson :
> What mail client are you using? I also had this problem and it's
> solved in Gmail by sending the mail as "plain text" rather than
> "Rich formatting".
>
> Best
> Erick
>
> On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote:
Hi,
I am currently mainly interested in the overall number of matches in a
document collection (several GBs) given a particular query.
At the moment I am not interested in the matching documents themselves;
just the number would be sufficient.
In previous versions of lucene the Searcher class h
I don't use a mail client, just the browser.
2011/3/11 Erick Erickson
> What mail client are you using? I also had this problem and it's
> solved in Gmail by sending the mail as "plain text" rather than
> "Rich formatting".
>
> Best
> Erick
>
> On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote:
> > hi
> >
On Fri, Mar 11, 2011 at 6:27 PM, Erick Erickson [via Lucene] <
ml-node+2664607-1236630615-376...@n3.nabble.com> wrote:
> Solr doesn't do it. There exist various tokenizers/filters that just strip
> the HTML tags, but there's nothing built into Solr that I know of that
> understands HTML, HTML-awar
Hello,
I use an index with a numeric field:

doc.add(new Field(...));
doc.add(new Field(...));
doc.add(new NumericField(fieldName, Field.Store.YES, true)
        .setIntValue(intFieldValue));
indexWriter.addDocument(doc);
Solr doesn't do it. There exist various tokenizers/filters that just strip
the HTML tags, but there's nothing built into Solr that I know of that
understands HTML; HTML-aware operations are outside Solr's purview.
Best
Erick
On Fri, Mar 11, 2011 at 6:50 AM, shrinath.m wrote:
> On Fri, Mar 11, 20
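The "just strip the HTML tags" filters Erick mentions can be illustrated with a rough, regex-based Python sketch (this is a stand-in, not any actual Solr filter); because it only deletes anything tag-shaped, it mishandles script bodies, comments, and CDATA, which is exactly why such filters are not HTML-aware:

```python
import re

def strip_tags(html):
    """Naive strip-tags filter: delete <...> runs, collapse whitespace."""
    text = re.sub(r"<[^>]*>", " ", html)   # drop anything that looks like a tag
    return re.sub(r"\s+", " ", text).strip()

print(strip_tags("<p>Hello <b>world</b></p>"))  # -> "Hello world"
# Script bodies survive, showing the approach is not HTML-aware:
print(strip_tags("<script>var x=1;</script>ok"))  # -> "var x=1; ok"
```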
It's not so much a matter of problems with indexing/searching
as it is with search behavior. The reason these strategies
are implemented is that using English stemming, say, on
other languages will produce "interesting" results.
There's no a priori reason you can't index multiple languages
in the
What mail client are you using? I also had this problem and it's
solved in Gmail by sending the mail as "plain text" rather than
"Rich formatting".
Best
Erick
On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote:
> hi
> it seems my mail is judged as spam.
> Technical details of permanent failure:
On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] <
ml-node+2664380-1940163870-376...@n3.nabble.com> wrote:
> But I think the parser will mostly be used when crawling. So you can use
> these parsers when crawling and save only the parsed result.
>
Consider we've offline HTML pages, no parsing while
1. Parsing is a preprocessing step for documents; Lucene will not know anything
about it.
2. I have only used NekoHtmlParser. Cobra is a Java browser and it seems a
little heavy. VietSpider is very heavy because it embeds the Mozilla browser via
SWT. MozillaParser is similar but embeds it itself (which nee
Hello!
On Fri, Mar 11, 2011 at 12:03 PM, shrinath.m wrote:
> I am trying to index content within certain HTML tags, how do I index it ?
> Which is the best parser/tokenizer available to do this ?
As a general HTML parser I would recommend "Jericho HTML Parser" -
http://jericho.htmlparser.net/do
Thank you Li Li.
Two questions :
1. Is there anything *in* *Lucene* that I need to know of ? Some contrib
module or anything as such ?
2. You ran a search on java-source.net for me, thanks for that, but do you
mind telling me which is the easiest and fastest ?
On Fri, Mar 11, 2011 at 4:38 PM, L
http://java-source.net/open-source/html-parsers
2011/3/11 shrinath.m
> I am trying to index content within certain HTML tags, how do I index it ?
> Which is the best parser/tokenizer available to do this ?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Which-is-the-
You've been told several times that the searcher.doc() call can be
expensive and given suggestions as to how to improve it. You have
provided no evidence that you have tried any of these suggestions.
I know nothing about clucene and you have not provided any evidence as
to whether your comparison
If I've read this right you are saying that you need to look at fields
A and D for 1000 docs but B, C and E for just one. If that is right
then lazy loading/FieldSelector will help.
But even loading just A and D for 1000 hits will inevitably take time.
As already suggested, you could look at cac
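The caching suggestion above can be sketched generically. This is not Lucene's API; `load_doc` and its cost counter are invented stand-ins for an expensive `searcher.doc()` call, used only to show the shape of the idea with Python's `functools.lru_cache`:

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the expensive load actually runs

@lru_cache(maxsize=1024)
def load_doc(doc_id):
    """Stand-in for an expensive searcher.doc(doc_id) call."""
    CALLS["n"] += 1
    return {"id": doc_id, "fields": {"A": "...", "D": "..."}}

# First hit per id pays the cost; repeats are served from the cache.
for doc_id in [1, 2, 1, 1, 2]:
    load_doc(doc_id)
print(CALLS["n"])  # -> 2
```

Whether this helps depends on how often the same documents recur across searches; for 1000 distinct hits per query it only amortizes cost across repeated queries, not within one.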
Hi,
In Java I am using a RAM-based index.
For a small case:

for (int i = 0; i < hits.length; ++i) {
    // Document D = searcher.doc(hits[i].doc);
}
Found 37 hits.
0 total milliseconds
==
In case I uncomment the lines
for (int i = 0; i < hits.length;
The example code in
http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-td1501269.html
reads
custom standard analyzer:
public class MyStandardAnalyzer extends StandardAnalyzer implements
IndexFields {
public MyStandardAnalyzer(Version matchVersion) {
Problem of Replication Reservation Duration

hi all,
I tried to send this mail to the solr dev mailing list but it tells me it is
spam. So I am sending it again, and to lucene dev too.
The replication handler in solr 1.4 which we use seems to be a little
problematic in some extreme situations.
The
hi
it seems my mail is judged as spam.
Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient
domain. We recommend contacting the other email provider for further
information about the cause of this error. The error that the other
On Fri, Mar 04, 2011 at 07:02:48AM -0800, Jason Rutherglen wrote:
> ConcurrentMergeScheduler is tied to a specific IndexWriter, however if
> we're running in an environment (such as Solr's multiple cores, and
> other similar scenarios) then we'd have a CMS per IW. I think this
> effectively disabl