Note that the current code doesn't actually do anything with the wiki
syntax, but I would think as long as the other language is in the same
format you should be fine.
Just incidentally -- do you know of something that would parse the Wikipedia
markup (to plain text, for example)?
D.
Hi Erick,
Thanks for the great idea, it's exactly the kind of suggestion I was looking
for!
Lucifer
On Dec 12, 2007 2:34 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:
> I faced a very similar requirement and solved it by indexing multiple
> tokens at the same place. For instance, say you're ind
Erick Erickson wrote:
I don't believe you can compare scores across queries in any meaningful
way.
I actually investigated this to some degree in my thesis, comparing
different participating systems from the TREC campaigns. It turns out
that some systems' scores (e.g. the top scores for a gi
Michael McCandless wrote:
Ruslan Sivak wrote:
This seems to be problematic, though. There are other things that
depend on the reader that are not so obvious. For example,
IndexReader reader=getReader();
IndexSearcher searcher=new IndexSearcher(reader);
Hits hits=searcher.search(query);
searc
Seems that PerFieldAnalyzerWrapper would be convenient here?
Doron
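A minimal sketch of how that could be wired up, assuming made-up field names, analyzers, and index path (Lucene 2.x API):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class PerFieldAnalyzerExample {
    public static void main(String[] args) throws Exception {
        // Default analyzer for any field not registered explicitly.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        // Each field gets its own tokenizer/analyzer.
        analyzer.addAnalyzer("title", new KeywordAnalyzer());
        analyzer.addAnalyzer("tags", new WhitespaceAnalyzer());

        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/tmp/test-index"), analyzer, true);
        Document doc = new Document();
        doc.add(new Field("title", "Indexing Wikipedia dumps",
            Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("tags", "lucene wikipedia indexing",
            Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("body", "Full text goes here",
            Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc); // each field is analyzed by its own analyzer
        writer.close();
    }
}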
On Dec 12, 2007 10:41 PM, ts01 <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> We have a requirement to index as well as store multiple fields in a
> document, each with its own special tokenizer. The following seems to
> provide a way to
Ruslan Sivak wrote:
This seems to be problematic, though. There are other things that
depend on the reader that are not so obvious. For example,
IndexReader reader=getReader();
IndexSearcher searcher=new IndexSearcher(reader);
Hits hits=searcher.search(query);
searcher.close();
reader.close(
This seems to be problematic, though. There are other things that depend
on the reader that are not so obvious. For example,
IndexReader reader=getReader();
IndexSearcher searcher=new IndexSearcher(reader);
Hits hits=searcher.search(query);
searcher.close();
reader.close();
Iterator i=hits.itera
You would probably get a better and quicker answer on the
Nutch mailing lists:
http://lucene.apache.org/nutch/mailing_lists.html
Doron
On Dec 12, 2007 11:16 PM, Developer Developer <[EMAIL PROTECTED]>
wrote:
> I believe Nutch stores parsed content somewhere. Can you please let me
> know
> how I can
I believe Nutch stores parsed content somewhere. Can you please let me know
how I can access the parsed content given a URL?
Thanks !
Hi,
We have a requirement to index as well as store multiple fields in a
document, each with its own special tokenizer. The following seems to
provide a way to index multiple fields each with its own tokenizer:
Field(String name, Reader reader)
The following seems to provide a way to Index and
You need to keep a reader open so long as you plan to use any of its
methods from any thread.
The reader does close exactly when you ask it to (when you call
reader.close()).
You should not have to "open a new reader for every method call" --
you only need to open a new reader (and in y
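In other words, the close calls have to come after the last use of the Hits. A minimal reordering of the fragment quoted above (getReader() and query are the poster's own, assumed to exist):

IndexReader reader = getReader();
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(query);

// Use the Hits while the reader is still open: Hits loads
// documents lazily through the underlying reader.
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // ... work with doc ...
}

// Close only once nothing will touch the reader any more.
searcher.close();
reader.close();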
Thank you to everyone for your comments. I didn't realize that readers
need to be kept open while they are still in use and can't simply be
closed right after the search call. I have restructured my code to keep
the RAMDirectory cached and to open a new reader for every method call.
This seems to be working fine.
Russ
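A rough sketch of that arrangement, with made-up class and method names, assuming a reload() that is called roughly once a minute:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class CachedRamIndex {
    private volatile RAMDirectory ramDir;

    // Called about once a minute: copy the (small) on-disk index into RAM.
    public void reload(String indexPath) throws Exception {
        ramDir = new RAMDirectory(FSDirectory.getDirectory(indexPath));
    }

    // Every call opens its own reader/searcher over the cached RAMDirectory
    // and closes them only after the Hits have been consumed.
    public int countHits(Query query) throws Exception {
        IndexReader reader = IndexReader.open(ramDir);
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            Hits hits = searcher.search(query);
            return hits.length();
        } finally {
            searcher.close();
            reader.close();
        }
    }
}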
Erick
Hi all,
I would like to extract some information about a given word in a field.
Below is the info I would like to have:
1. the frequency count of that word
2. the word after it has been analyzed...
Any chance I can use Lucene to do that?
spking
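Both are doable; a small sketch under the assumption of an existing index, a StandardAnalyzer, and a field called "contents" (all placeholders):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class WordInfo {
    public static void main(String[] args) throws Exception {
        // 2. What the word looks like after analysis (lower-cased etc.).
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("contents", new StringReader("Foxes"));
        Token token = ts.next(); // assumes the word yields at least one token
        String analyzed = token.termText();
        System.out.println("analyzed form: " + analyzed);

        // 1. Frequency of that (analyzed) word in the field "contents".
        IndexReader reader = IndexReader.open("/path/to/index");
        Term term = new Term("contents", analyzed);
        System.out.println("docs containing it: " + reader.docFreq(term));

        // Per-document frequency via TermDocs, summed into a total count.
        int total = 0;
        TermDocs td = reader.termDocs(term);
        while (td.next()) {
            total += td.freq();
        }
        System.out.println("total occurrences: " + total);
        td.close();
        reader.close();
    }
}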
I faced a very similar requirement and solved it by indexing multiple
tokens at the same place. For instance, say you're indexing
the word "foxes". Index something like fox$ and foxes at the same
position (see SynonymAnalyzer in Lucene In Action for an example).
You probably MUST index the multiple
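A rough sketch of the kind of filter being described (this is not the SynonymAnalyzer from Lucene in Action itself, and the "$" marker and class name are only illustrative): for each incoming token it also emits a marked variant at the same position, so both forms are searchable; whether the marked token is the stem or the original is a policy choice left out here.

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ExactAndStemFilter extends TokenFilter {
    private Token pending; // the "$"-marked variant, waiting to be emitted

    public ExactAndStemFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (pending != null) {
            Token t = pending;
            pending = null;
            return t;
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        // Queue a marked copy with position increment 0, so it lands
        // in the exact same position as the original token.
        Token marked = new Token(token.termText() + "$",
                                 token.startOffset(), token.endOffset());
        marked.setPositionIncrement(0);
        pending = marked;
        return token;
    }
}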
Hi,
We've got a requirement that we need to give our users the ability to
search on exact phrases within a field, or, if they prefer, they can match
on plurals (either via stems or another plural algorithm). However, the
cases are mutually exclusive, for example given the following field in the
My firm uses a parser based on javax.xml.stream.XMLStreamReader to
break (English and non-English) Wikipedia XML dumps into Lucene-style
"documents and fields." We use Wikipedia to test our
language-specific code, so we've probably indexed 20 Wikipedia dumps.
- andy g
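Not andy's actual code, but a minimal StAX sketch of the idea: pull <title> and <text> out of the dump and turn each page into a Lucene Document (field names here are made up; the element names come from the standard Wikipedia export format):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class WikipediaDumpReader {
    public static void parse(String dumpFile) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream(dumpFile));
        String title = null;
        while (xml.hasNext()) {
            if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String name = xml.getLocalName();
            if ("title".equals(name)) {
                title = xml.getElementText();
            } else if ("text".equals(name)) {
                // One Lucene document per wiki page: title plus raw wiki text.
                Document doc = new Document();
                doc.add(new Field("title", title,
                    Field.Store.YES, Field.Index.TOKENIZED));
                doc.add(new Field("body", xml.getElementText(),
                    Field.Store.NO, Field.Index.TOKENIZED));
                // hand doc off to an IndexWriter here
            }
        }
        xml.close();
    }
}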
On Dec 11, 2007 9:35 PM, Oti
Mark, Russ, thanks for the replies.
Mark, this looks great, I think it's exactly what I was looking for. I
think this should definitely be added to Lucene when it is stable
enough. I suspect there are others that would find it useful.
JLuna
Mark Miller wrote:
Take a look at: https://issue
On Dec 12, 2007, at 06:35, Otis Gospodnetic wrote:
I need to index a Wikipedia dump. I know there is code in
contrib/benchmark for indexing *English* Wikipedia for benchmarking
purposes. However, I'd like to index a non-English dump, and I
actually don't need it for benchmarking, I just want
Even if you could tell a reader is closed, you'd wind up with
unmaintainable code. I envision you have a bunch of places
where you'd do something like
if (reader.isClosed()) {
    reader = openNewReader(); // hypothetically re-open it somehow
}
But practically, you'd be opening a new reader someplace,
closing it someplace else,
Probably want a combination of extractWikipedia.alg and wikipedia.alg?
You want the EnwikiDocMaker from extractWikipedia.alg which reads the
uncompressed xml file but rather than using WriteLineDoc, you want to go
ahead and index as wikipedia.alg does. (Ditch the query part.)
You'll need an accep
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote:
> I've now made trial runs with no norms on the two indexed fields, and
> also tried with varying TermIndexIntervals. Omitting the norms saves
> about 4MB on 50 million entries, much less than I expected.
Seems there's a reason we still use
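For reference, a hedged sketch of what omitting norms looks like in the 2.x field API (field names and values are illustrative only):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NoNormsExample {
    static Document makeDoc(String key, String text) {
        Document doc = new Document();
        // Tokenized field with norms turned off; norms otherwise cost one
        // byte per document per indexed field once a reader is open.
        Field body = new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED);
        body.setOmitNorms(true);
        doc.add(body);
        // Untokenized variant: Field.Index.NO_NORMS skips norms as well.
        doc.add(new Field("id", key, Field.Store.YES, Field.Index.NO_NORMS));
        return doc;
    }
}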
Note that the current code doesn't actually do anything with the wiki
syntax, but I would think as long as the other language is in the same
format you should be fine.
-Grant
On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote:
I haven't actually tried it, but I think very likely the cu
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote:
> Increasing
> the TermIndexInterval by a factor of 4 gave no measurable savings.
Following up on myself because I'm not 100% sure that the indexes have
the term index intervals I expect, and I'd like to check. Where can I
see what term ind
On Tue, 2007-11-13 at 07:26 -0800, Chris Hostetter wrote:
> : > Can it be right that memory usage depends on size of the index rather
> : > than size of the result?
> :
> : Yes, see IndexWriter.setTermIndexInterval(). How much RAM are you giving to
> : the JVM now?
>
> and in general: yes. Luc
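For what it's worth, a small sketch of where that knob sits; it has to be set on the writer before the segments are (re)written, and the value shown is arbitrary:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TermIndexIntervalExample {
    static IndexWriter openWriter(String path) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        // Default is 128; a larger interval means a reader keeps fewer
        // terms in memory, at the cost of slightly slower term lookups.
        writer.setTermIndexInterval(512);
        return writer;
    }
}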
Ruslan Sivak wrote:
Michael McCandless wrote:
Ruslan Sivak wrote:
I have an index of about 10mb. Since it's so small, I would like
to keep it loaded in memory, and reload it about every minute or
so, assuming that it has changed on disk. I have the following
code, which works, except
I haven't actually tried it, but I think very likely the current code
in contrib/benchmark might be able to extract a non-English Wikipedia
dump as well?
Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think
if you just change the docs.file to reference your downloaded XML
f
Otis, I've used this to index Wikipedia from XML before now:
http://schmidt.devlib.org/software/lucene-wikipedia.html
Cheers
Mark
- Original Message
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 12 December, 2007 8:18:49 AM
Subject: Re: Inde
I'm not even sure if it can be considered Named Entity Recognition, but what
the hell...
so here's my problem...
I was asked to retrieve the named entities out of a collection of
documents, and I've thought of two ways of doing so (not sure if either of
them works)...
a) index the documents by w
Hi All,
I am parsing this query: "Auto* machine"~4.
Should it work? If so, it's not working for me right now. Can
anyone help with this?
Thanks & Regards
Shakti Sareen
Database? I imagine I can avoid that: Wiki dump.gz -> gunzip -> parse ->
index, no?
Otis
- Original Message
From: Chris Lu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, December 12, 2007 1:55:02 AM
Subject: Re: Indexing Wikipedia dumps
For a quick java approa