Instead of using a stored field, I would recommend using *payloads*.
If you store the field's value as payload on a custom term, you basically
get a posting-list of the field value, which can be (theoretically, at least)
efficiently skipped on one hand - and read in sequence on the other.
By the way, the method is called "deleteDocuments" - doesn't that imply
that it's perfectly acceptable to delete many documents with one term?
--
Nadav Har'El| Sunday, Jul 8 2007, 22 Tammuz 5767
IBM Haifa Research Lab
> ...the count that has the same field values.
You need just the counts? And you want to do just whole-field matching, not
word matching? In that case, Lucene might be overkill for you. Or, if you
do use Lucene, make sure to use "keyword" (untokenized) fields, not
"tokenized" fields.
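If whole-field counting really is all that's needed, a plain hash map does the job with no index at all; a minimal sketch (the class name and sample values below are my own invention, not from the original post):

```java
import java.util.HashMap;
import java.util.Map;

public class FieldValueCounter {
    // Counts how many records share each exact (whole-field) value.
    // This is the "Lucene might be overkill" alternative: no tokenization,
    // no index, just exact matching of the entire field value.
    static Map<String, Integer> countValues(String[] fieldValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : fieldValues) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] values = {"New York", "Haifa", "New York"};
        Map<String, Integer> c = countValues(values);
        // "New York" matches only as a whole value; the single word
        // "York" matches nothing, unlike with a tokenized field.
        System.out.println(c);
    }
}
```

This mirrors what an untokenized ("keyword") field gives you in Lucene: the entire value is one term, so there is no partial word matching.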
the implications.
If we had also a "TOKENIZED_NO_NORMS", why would new users accidentally
use it? I guess the javadoc of this parameter could also warn against its
use (something like "not recommended for general use", or similar).
Maybe the documentation should refer to setOmitNorms()? (Or I should
learn to search the documentation better :-)).
--
Nadav Har'El| Tuesday, Jan 23 2007, 4 Shevat 5767
IBM Haifa Research Lab
What if I want to index the field's value *with* an Analyzer, but
still disable the storing of norms (because the field length should not be
considered in scoring)? Can't I do that? Was this intentional, or is this
an oversight and a fifth option should be added?
Thanks,
Nadav.
--
Nadav Har'El
On Mon, Jan 22, 2007, John Haxby wrote about "Re: Websphere and Dark Matter":
> Nadav Har'El wrote:
> Are you implying that the process memory shrinks, that memory is
> returned to the kernel? I didn't read the page you referenced that way.
> I know that if I a
parameters as well. A good combination I once used is this:
-XX:NewRatio=2 -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=30
But your mileage may vary.
[1] http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp
--
Nadav Har'El
In a typical application, where
there is barely a handful of numeric fields, this slow encoding is shadowed
by the much slower process of indexing the document itself. Not to mention
that what usually really matters is the speed of the search or sort, not the
speed of the one-time indexing.
--
Nadav Har'El
... OR "2.41" OR "2.42" OR "2.43" OR "2.44"
(note that this is an OR of just 7 posting lists, even if this range contains
thousands of distinct values).
I wonder if anybody has ever done such a thing (or come up with a better
solution) in Lucene.
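The term enumeration itself is straightforward; here is a sketch in plain Java (the helper name and the fixed two-decimal precision are my assumptions, not from the original post):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RangeTerms {
    // Enumerate every distinct two-decimal value in [lo, hi], so the
    // numeric range can be rewritten as a small OR of exact terms --
    // one posting list per value, no matter how many documents fall
    // inside the range.
    static List<String> termsInRange(double lo, double hi) {
        List<String> terms = new ArrayList<>();
        int start = (int) Math.round(lo * 100);
        int end = (int) Math.round(hi * 100);
        for (int cents = start; cents <= end; cents++) {
            terms.add(String.format(Locale.US, "%.2f", cents / 100.0));
        }
        return terms;
    }

    public static void main(String[] args) {
        // The range [2.38, 2.44] becomes an OR of just 7 terms.
        System.out.println(termsInRange(2.38, 2.44));
    }
}
```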
--
Nadav Har'El
I raised the idea of having a search() method which returns a Hits and
calls a HitCollector, but was convinced that TopDocs+HitCollector is
actually better. See:
http://www.gossamer-threads.com/lists/lucene/java-dev/37277
Maybe this should be in the FAQ.
--
Nadav Har'El
Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 12/06/2006 04:36:45 PM:
> Nadav,
>
> Look up one of my onjava.com Lucene articles, where I talk about
> this. You may also want to tell Lucene to merge segments on disk
> less frequently, which is what mergeFactor does.
Thanks. Can you please point me to it?
"Michael D. Curtin" <[EMAIL PROTECTED]> wrote on 12/06/2006 03:49:53 PM:
> Nadav Har'El wrote:
>
> > What I couldn't figure out how to use, however, was the abundant memory
> > (2 GB) that this machine has.
> >
> > I tried playing with Inde
the speed of huge merges, for example?
Thanks,
Nadav.
--
Nadav Har'El
[EMAIL PROTECTED]
+972-4-829-6326
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Use "ConstantScoreRangeQuery" instead of "RangeQuery". It is still very
inefficient, and you still need to remember to pad all your numbers so
they sort properly *lexicographically* (e.g., 00-100), but at
least you should not have exceptions any more.
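The padding requirement is easy to demonstrate in plain Java (the width of 5 below is an arbitrary choice of mine):

```java
import java.util.Arrays;

public class PaddedNumbers {
    // Zero-pad numbers to a fixed width so that lexicographic (string)
    // order agrees with numeric order; Lucene compares terms as strings,
    // so the unpadded "2" would sort after "100".
    static String pad(int n) {
        return String.format("%05d", n);  // width 5 is an arbitrary choice
    }

    public static void main(String[] args) {
        String[] padded = {pad(100), pad(2), pad(30)};
        Arrays.sort(padded);  // lexicographic sort
        System.out.println(Arrays.toString(padded));
    }
}
```

With padding, the sorted string order is 00002, 00030, 00100, matching the numeric order; without it, "100" would sort before "2".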
--
Nadav Har'El
You can't do this with the QueryParser,
but you can do it with the SpanFirstQuery: for example if we index
Jason Bateman as the three tokens
Jason Bateman $
then we can search for it using something like
SpanQuery[] terms = {
    new SpanTermQuery(new Term("actor", "Jason")),
    new SpanTermQuery(new Term("actor", "Bateman")),
    new SpanTermQuery(new Term("actor", "$"))
};
// anchored at the field start; 3 is the position just past the last
// token, so only a field that is exactly "Jason Bateman" matches
Query query = new SpanFirstQuery(new SpanNearQuery(terms, 0, true), 3);
This has an impact on the index size, and it may
be possible to get similar results with no impact on index size
and just a small run-time slowdown by using something like
SpanNearQuery, or a variation on this idea. Again, I didn't yet
try to do this myself, so I'm not sure how successful that would be.
...appear there, they actually appear very close, and in this
case even in order.
This sort of proximity-influenced scoring is missing from
Lucene's QueryParser, and I've been wondering recently
how it is best to add it, and whether it is possible to
do it easily with existing Lucene machinery.
// Note that we create a new DecimalFormat object
// every time, because that object is not thread safe. This may
// be a performance bottleneck.
DecimalFormat mantFormatter = new DecimalFormat(".##");
// "mantissa" is defined in the surrounding (elided) code
String result = mantFormatter.format(mantissa);
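One way to avoid paying the allocation cost on every call, while still respecting the thread-safety constraint, is a per-thread formatter via ThreadLocal; a sketch (the fixed US symbols are my addition, for deterministic output):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class SafeFormatter {
    // DecimalFormat is not safe for concurrent use, but it is perfectly
    // reusable within a single thread, so each thread gets its own copy.
    private static final ThreadLocal<DecimalFormat> MANT_FORMATTER =
        ThreadLocal.withInitial(() ->
            new DecimalFormat(".##", DecimalFormatSymbols.getInstance(Locale.US)));

    static String formatMantissa(double d) {
        return MANT_FORMATTER.get().format(d);
    }

    public static void main(String[] args) {
        System.out.println(formatMantissa(0.4567));
    }
}
```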
on), and the parsing will fail if these
features are used (or, alternatively, think of what else you can
do in this case).
--
Nadav Har'El
In addition to breaking up the text on white spaces, StandardAnalyzer
also breaks it up in other logical places (like punctuation, but not in
every case), and more importantly for you, it indexes the text in
lowercase.
You should use StandardAnalyzer both during indexing and during searching.
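The pitfall is easy to reproduce with a toy analyzer (plain Java, no Lucene; the lowercase step below stands in for what StandardAnalyzer does):

```java
import java.util.ArrayList;
import java.util.List;

public class LowercaseMatch {
    // Illustrates why the SAME analysis must run at both index and query
    // time: if indexing lowercases the text but the query side doesn't,
    // the query token "Hello" will never match the indexed token "hello".
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("Hello, World!");
        System.out.println(indexed.contains("hello"));  // analyzed query term
        System.out.println(indexed.contains("Hello"));  // raw, un-analyzed term
    }
}
```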
            ir.deleteDocument(doctodelete);
            doctodelete = docs.doc();
        }
    }
    idsReplaced.clear();
    ir.close();
}
I have not tested this idea much, but in some initial experiments it
seemed to work.
--
Nadav Har'El
Document doc = hits.doc(i);
TokenStream tokenStream = analyzer.tokenStream("storedContent",
    new StringReader(doc.get("storedContent")));
// the fragment count (3) and separator here are illustrative choices
summary = highlighter.getBestFragments(tokenStream,
    doc.get("storedContent"), 3, "...");
The exact format of this delete key isn't defined by Lucene,
but I believe that the concept of such a key was "officially"
sanctioned by Lucene with the deleteDocuments(Term) method (whose
documentation even mentions the "unique ID string" scenario).
After adding the modified document, if we search for
the term again we'll find two documents.
What about this idea? Does an implementation of something similar
already exist?
--
Nadav Har'El