Hi Mike,
I'd simply store a field "doctype" with values "pdf", "txt", "html"
and perform a separate search for each type, along the lines of the
sketch below. That said, I'd be interested if anyone has a cooler way
of doing this.
Cheers,
Phil
On Thu, Oct 1, 2009 at 9:56 AM, Michael Masters wrote:
> I was wondering if there is any w
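A minimal sketch of that doctype idea (2.4-era Field API; doc,
userQuery and searcher are assumed to be in scope):

// Index time: tag each document with its type (names are illustrative).
doc.add(new Field("doctype", "pdf", Field.Store.YES, Field.Index.NOT_ANALYZED));

// Search time: one search per type, ANDing a TermQuery onto the user's query.
BooleanQuery pdfOnly = new BooleanQuery();
pdfOnly.add(userQuery, BooleanClause.Occur.MUST);
pdfOnly.add(new TermQuery(new Term("doctype", "pdf")), BooleanClause.Occur.MUST);
TopDocs pdfHits = searcher.search(pdfOnly, 10);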
Hi,
I'm not sure why my IndexReader.reopen() call is not working.
The latest results are not coming back, meaning the reader / searcher
has not been re-opened for the new Documents that have been added.
IndexReader openReader = searcher.getIndexReader();
searcher.close();
openReader.reopen();
Sorry, just realised my mistake. I should read the docs more
carefully. IndexReader.reopen() does not reopen the existing
IndexReader, but returns a new one.
Phil
On Mon, Sep 14, 2009 at 3:20 PM, Phil Whelan wrote:
> Hi,
>
> I'm not sure why my IndexReader.reopen() call is not working.
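For the archive, the corrected pattern (reopen() hands back a new
reader when the index has changed; the old one still has to be closed
by the caller):

IndexReader oldReader = searcher.getIndexReader();
IndexReader newReader = oldReader.reopen();
if (newReader != oldReader) {
    // an IndexSearcher built from a reader does not close it,
    // so close the old reader explicitly
    oldReader.close();
    searcher = new IndexSearcher(newReader);
}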
Hi Mark,
Are there any Lucene 2.9 versions of this in development that I could
get my hands on? I'd be happy to be an alpha tester.
Cheers,
Phil
> LucidGaze for Lucene works as a drop-in replacement for the Lucene JAR;
> it requires no changes to the source code of the application, or even
> recompilation.
Hi Uwe,
Thanks for the explanation! It really helps. It makes sense that for
a small number of values, such as "hour", NumericField is not going to
help me. I'm experimenting with using an epoch NumericField for
sorting, which funnily enough is where I started with 2.4.1, before
going down the usual TooManyClauses route.
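A sketch of that epoch-sorting experiment (Lucene 2.9 names; doc,
query and searcher are assumed to be in scope):

// Store the epoch as a NumericField and sort on it with SortField.LONG.
doc.add(new NumericField("epoch", Field.Store.NO, true)
        .setLongValue(System.currentTimeMillis() / 1000L));
// ...
Sort byEpoch = new Sort(new SortField("epoch", SortField.LONG, true)); // newest first
TopDocs hits = searcher.search(query, null, 50, byEpoch);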
Hi,
I've used NumericField to store my "hour" field.
Example...
doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12")));
Before, I was using a plain string Field and enumerating the values
with TermEnum, which worked fine. Now that I'm using NumericFields,
I'm not sure how to port this enumeration.
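For reference, a hedged sketch of how a single hour value is matched
once the field is trie-encoded (Lucene 2.9): a degenerate
NumericRangeQuery stands in for the old TermQuery. If you still need
to enumerate the raw terms, the prefix-coded decoding helpers live in
NumericUtils.

// Matches exactly hour == 12 on a trie-encoded "hour" field.
Query hourIs12 = NumericRangeQuery.newIntRange("hour", 12, 12, true, true);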
the "" part.
>>
>> That's why I said in my original post that I was kind of surprised that
>> doing a web query for "path:.yyy" succeeded, i.e, in the path field in
>> the index, there is no ".yyy", just "".
Hi Jim,
Are you using the same Analyzer for indexing and searching? .yyy
will be seen as a hostname by StandardAnalyzer, which will keep it as
one term, whereas another analyzer might split it into 2 terms. This
should not matter either way, as long as you are using the same
Analyzer for both indexing and searching.
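A quick way to check what StandardAnalyzer actually emits, as a
sketch (Lucene 2.9 attribute API; the sample string is illustrative):

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
TokenStream ts = analyzer.tokenStream("path", new StringReader("www.xxx.yyy"));
TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = ts.addAttribute(TypeAttribute.class);
while (ts.incrementToken()) {
    // prints each term with its token type, e.g. <HOST>
    System.out.println(termAtt.term() + " [" + typeAtt.type() + "]");
}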
(sorry, tangent. I'll be quick)
On Tue, Aug 4, 2009 at 8:42 AM, Shai Erera wrote:
> Interesting ... I don't have access to a Japanese dictionary, so I just
> extract bi-grams.
Shai - if you're interested in parsing Japanese, check out Kakasi. It
can split text into words and convert Kanji -> Katakana/Hiragana.
On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera wrote:
> Hi Darren,
>
> The question was, how given a string "aboutus" in a document, you can return
> that document as a result to the query "about us" (note the space). So we're
> mostly discussing how to detect and then break the word "aboutus" to two
>
On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote:
> 2) Use a dictionary (real dictionary), and search it for every substring,
> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there.
> This needs some fine tuning, like checking if the rest is also a word and if
> the full strin
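A toy sketch of the quoted dictionary approach, with a plain Set
standing in for a real dictionary:

// Probe every split point; accept the first one where both halves
// are dictionary words, per Shai's fine-tuning note above.
static String splitCompound(String s, Set<String> dictionary) {
    for (int i = 1; i < s.length(); i++) {
        String head = s.substring(0, i);
        String tail = s.substring(i);
        if (dictionary.contains(head) && dictionary.contains(tail)) {
            return head + " " + tail;   // "aboutus" -> "about us"
        }
    }
    return s; // no split found
}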
Hi Prashant,
Take a look at this...
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Cheers,
Phil
On Sun, Aug 2, 2009 at 9:33 PM, prashant
ullegaddi wrote:
> Hi,
>
> I've a single index of size 87GB containing around 50M documents. When I
> search for any query,
> best search time I obse
Hi Jim,
On Sun, Aug 2, 2009 at 12:12 PM, wrote:
> i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps
> the TermEnum to the 2nd term, initially).
Great! Glad you found the problem. I couldn't see it.
Phil
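For anyone hitting the same thing, the idiom that avoids the
off-by-one looks like this (a sketch; field name illustrative).
IndexReader.terms(Term) returns an enum already positioned on the
first matching term, so use do..while rather than while:

TermEnum te = reader.terms(new Term("path", ""));
try {
    do {
        Term t = te.term();
        if (t == null || !"path".equals(t.field())) break;
        System.out.println(t.text());
    } while (te.next());
} finally {
    te.close();
}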
On Sun, Aug 2, 2009 at 10:58 AM, Andrzej Bialecki wrote:
> Thank you Phil for spotting this bug - this fix will be included in the next
> release of Luke.
Glad to help. Thanks for building this great tool!
Phil
Hi Prashant,
I agree with Shai that using Luke, and printing out what the Document
looks like before it goes into the index, are going to be your best
bets for debugging this problem.
The problem you're having is that StandardAnalyzer does not break up
the hostname into separate terms, as it has a hostname token type
that keeps the whole thing as one term.
Hi Jim,
On Sun, Aug 2, 2009 at 9:08 AM, Phil Whelan wrote:
>
>> So then, I reviewed the index using Luke, and what I saw with that was that
>> there were indeed only 12 "path" terms (under "Term Count" on the left),
>> but, when I clicked the "Show
Hi Jim,
On Sun, Aug 2, 2009 at 1:32 AM, wrote:
> I first noticed the problem that I'm seeing while working on this latter app.
> Basically, what I noticed was that while I was adding 13 documents to the
> index, when I listed the "path" terms, there were only 12 of them.
Field text (the whole
Hi Jim,
I cannot see anything obvious, but both open() and terms() throw
IOExceptions. You could try putting them in separate try..catch
blocks to see which one it's coming from. Using e.printStackTrace()
in the catch block will also give more info to help you debug what's
happening.
On Sat, Aug
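A sketch of what that could look like (the path is a placeholder):

void debugOpenAndTerms() {
    IndexReader reader = null;
    try {
        reader = IndexReader.open("/path/to/index"); // placeholder path
    } catch (IOException e) {
        e.printStackTrace(); // problem opening the index
        return;
    }
    try {
        TermEnum terms = reader.terms();
        terms.close();
    } catch (IOException e) {
        e.printStackTrace(); // problem enumerating terms
    }
}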
Hi Mike,
It's Jibo, not me, having the problem. But thanks for the link. I was
interested to look at the code. Will be buying the book soon.
Phil
On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless
wrote:
>
> (Please note that ThreadedIndexWriter is source code available with
> the upcoming revi
Hi,
I know you can use Field.Store.YES, but I want to inspect the terms /
tokens, and their order, for a given field at search time. Is this
possible? Obviously this information is stored in the index, but I
cannot find any API to access it. I'm guessing the answer might be
that Terms point to Documents, and not the other way around.
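One possible route, assuming term vectors are acceptable at index
time (field name and variables are illustrative):

// Index the field with term vectors so per-document terms and
// positions can be pulled back later.
doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS));
// ... later, for a hit:
TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
if (tfv instanceof TermPositionVector) {
    TermPositionVector tpv = (TermPositionVector) tfv;
    // tpv.getTerms() and tpv.getTermPositions(i) give the terms and their order
}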
Hi Jibo,
Your mergeFactor is different, and the resulting numFiles (segment
files) is different. Maybe each thread is responsible for a segment
file. Just curious - do you have 3 threads?
Phil
Hi Jibo,
Have you tried optimizing the indexes? I do not know anything about
the implementation of ThreadedIndexWriter, but if they both optimize
down to the same size, it could just mean that the index
ThreadedIndexWriter produces is simply less merged to begin with.
Thanks,
Phil
On Fri, Jul 31, 2009 at 11:38 AM, Jibo John wrote:
>
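For comparison purposes, that would just be (2.x API; writer assumed
in scope):

// Merge each index down to a single segment, then compare sizes.
writer.optimize();
writer.close();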
Hi Jim,
There should not be much difference, from the Lucene end, between a
new index and an index you want to update (add more documents to). As
stated in the Lucene docs, IndexWriter will create the index "if it
does not already exist".
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/in
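So the open-or-create case is just the plain constructor (Lucene 2.4
API; dir and analyzer assumed in scope):

// Appends if an index already exists at 'dir', creates one otherwise.
IndexWriter writer = new IndexWriter(dir, analyzer,
                                     IndexWriter.MaxFieldLength.UNLIMITED);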
u do, just don't include stop word removal in the
> processing of your token stream.
>
> Matt
>
> Phil Whelan wrote:
>>
>> Hi Matthew / Paul,
>>
>> On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan wrote:
>>
>>>
>>> Matthew Hall wrote:
On Thu, Jul 30, 2009 at 7:12 PM, wrote:
> I was wonder if there is a list of special characters for the standard
> analyzer?
>
> What I mean by "special" is characters that the analyzer considers break
> characters.
> For example, if I have something like "foo=something", apparently the analyzer
Hi Matthew / Paul,
On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan wrote:
> Matthew Hall wrote:
>>
>> Place a delimiter between the email addresses that doesn't get removed in
>> your analyzer. (preferably something you know will never be searched on)
>
> Or add them separately (rather than:
> doc.a
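Paul's "add them separately" suggestion would look something like
this sketch (field name and the addresses collection are
illustrative):

// One Field instance per address, all under the same field name.
// Lucene treats repeated fields as one logical field, with a position
// gap (Analyzer.getPositionIncrementGap) keeping phrases from
// spanning two addresses.
for (String addr : addresses) {
    doc.add(new Field("email", addr, Field.Store.NO, Field.Index.ANALYZED));
}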
On Thu, Jul 30, 2009 at 11:22 AM, Matthew Hall
wrote:
>
> 1. Sure, just have an analyzer that splits on all non letter characters.
> 2. Phrase queries keep the order intact. (And yes, the positional
> information for the terms is kept, which is what allows span queries to work)
>
> So searching
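Point 2 in a sketch: a phrase query leans on those stored positions
(the terms are placeholders for whatever your analyzer emits):

// Matches only documents where the two terms appear adjacent and in order.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("email", "bob"));
pq.add(new Term("email", "example"));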
Hi,
We have a very large Lucene index that we're developing that has a
field of email addresses. (Actually, there are multiple fields with
multiple email addresses, but I'll simplify here.)
Each document will have one "email" field containing multiple email addresses.
I am indexing email addresses only usi
Hi Don,
On Wed, Jul 29, 2009 at 1:42 PM, Donal Murtagh wrote:
> Course.name    Attendance.mandatory    Student.name
> ----------------------------------------------------
> cooking        N                       Bob
> art            Y
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall wrote:
> Not sure if this helps you, but some of the issue you are facing seem
> similar to those in the "real time" search threads.
Hi Matthew,
Do you have a pointer to where I can find the "real time" threads?
Thanks,
Phil
-
Hi Ganesh,
I'm not sure whether this will work for you, but one way I got around
this was with multiple searches. I only needed the first 50 results,
but wanted to sort by date, hour, min, sec. A query could return 5
results or millions of results.
I added the date to the query, so I'd search for r
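A sketch of that widening-window idea (field name, dates and the
userQuery/sortByDate variables are placeholders; ConstantScoreRangeQuery
is the 2.4-era range query):

// Try a narrow date range first...
BooleanQuery narrow = new BooleanQuery();
narrow.add(userQuery, BooleanClause.Occur.MUST);
narrow.add(new ConstantScoreRangeQuery("date", "20090720", "20090722", true, true),
           BooleanClause.Occur.MUST);
TopDocs hits = searcher.search(narrow, null, 50, sortByDate);
// ...and widen only if we came up short.
if (hits.totalHits < 50) {
    BooleanQuery wide = new BooleanQuery();
    wide.add(userQuery, BooleanClause.Occur.MUST);
    wide.add(new ConstantScoreRangeQuery("date", "20090601", "20090722", true, true),
             BooleanClause.Occur.MUST);
    hits = searcher.search(wide, null, 50, sortByDate);
}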
On Wed, Jul 22, 2009 at 5:46 AM, m.harig wrote:
> Is there any article or forum for using Hadoop with lucene? Please any1 help
> me
Hi M,
Katta is a project that is combining Lucene and Hadoop. Check it out here...
http://katta.sourceforge.net/
Thanks,
Phil
If there are only a few thousand documents, and the number of results
is quite small, is this a case where post-search filtering can be
done?
I have not done anything like this myself with Lucene, so is this a
bad idea? If not, what would be the best way to do this?
org.apache.lucene.search.Filter
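A minimal sketch of search-time filtering (field name illustrative;
QueryWrapperFilter wraps any query as a filter):

// With only a few thousand docs, applying a Filter at search time is cheap.
Filter pdfFilter = new QueryWrapperFilter(
        new TermQuery(new Term("doctype", "pdf")));
TopDocs hits = searcher.search(userQuery, pdfFilter, 50);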