On 23.04.2013 16:17, Alan Woodward wrote:
> It doesn't sound as though an inverted index is really what you want to be
> querying here, if I'm reading you right. You want to get the payloads for
> spans at a specific position, but you don't particularly care about the
> actual term at that position. …
On 23.04.2013 15:27, Alan Woodward wrote:
> There's the SpanPositionCheckQuery family - SpanPositionRangeQuery, SpanFirstQuery,
> etc. Is that the sort of thing you're looking for?
Hi Alan,
thanks for the pointer, this is the right direction indeed. However,
these queries are based on a SpanQuery which …
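For reference, a minimal sketch of the position-check family Alan
mentions (Lucene 4.x; field and term names here are made up for
illustration):

SpanTermQuery term = new SpanTermQuery(new Term("text", "fox"));
// match only if the span ends within the first 5 positions:
SpanQuery first = new SpanFirstQuery(term, 5);
// match only if the span falls between positions 3 and 10:
SpanQuery range = new SpanPositionRangeQuery(term, 3, 10);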
On 23.04.2013 13:47, Carsten Schnober wrote:
> I'm trying to figure out a way to use a query as Uwe suggested. My
> scenario is to perform a query and then retrieve some of the payloads
> upon user request, so there is no obvious way to wrap this into a query as
> I can't know …
On 23.04.2013 13:21, Michael McCandless wrote:
> Actually, term vectors can store payloads now (LUCENE-1888), so if that
> field was indexed with FieldType.setStoreTermVectorPayloads they should be
> there.
>
> But I suspect the TokenSources.getTokenStream API (which I think un-inverts
> the term vectors) …
Hi,
I'm trying to extract payloads from an index for specific tokens the
following way (inserting sample document number and term):
Terms terms = reader.getTermVector(16504, "term");
TokenStream tokenstream = TokenSources.getTokenStream(terms);
tokenstream.reset();
while (tokenstream.incrementToken()) {
  OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
  // ...
}
Hi,
tying in with the previous thread "Statically store sub-collections for
search", I'm trying to focus on the root of the problem that has
occurred to me.
At first, I generate a TermsFilter with potentially many terms in one field:
-
List<String> docnames = new ArrayList<String>(); …
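For the archive, a sketch of how such a TermsFilter might be built from
the docnames list (Lucene 4.1+; the field name "docname" is an
illustrative assumption):

List<Term> terms = new ArrayList<Term>();
for (String name : docnames) {
  terms.add(new Term("docname", name));
}
// TermsFilter has no clause-count limit, unlike BooleanQuery
TermsFilter filter = new TermsFilter(terms);
TopDocs hits = searcher.search(new MatchAllDocsQuery(), filter, 10);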
On 15.04.2013 13:43, Uwe Schindler wrote:
Hi,
> Passing NULL means all documents are allowed, if this would not be the case,
> whole Lucene queries and filters would not work at all, so if you get 0 docs,
> you must have missed something else. If this is not the case, your filter may
> behave …
On 15.04.2013 11:27, Uwe Schindler wrote:
Hi again,
>>> You are somehow "misusing" acceptDocs and DocIdSet here, so you have
>>> to take care, semantics are different:
>>> - For acceptDocs "null" means "all documents allowed" -> no deleted
>>>   documents
>>> - For DocIdSet "null" means "no documents" …
On 15.04.2013 10:42, Uwe Schindler wrote:
> Not every DocIdSet supports bits(). If it returns null, then bits are not
> supported. To enforce that a bitset is available, use CachingWrapperFilter
> (which internally uses a BitSet to cache).
> It might also happen that Filter.getDocIdSet() returns null, which …
On 15.04.2013 10:04, Uwe Schindler wrote:
> The limit also applies for filters. If you have a list of terms ORed
> together, the fastest way is not to use a BooleanQuery at all, but instead a
> TermsFilter (which has no limits).
Hi Uwe,
thanks for the pointer, this looks promising! The only mi…
On 12.04.2013 20:08, SUJIT PAL wrote:
> Hi Carsten,
>
> Why not use your idea of the BooleanQuery but wrap it in a Filter instead?
> Since you are not doing any scoring (only filtering), the max boolean clauses
> limit should not apply to a filter.
Hi Sujit,
thanks for your suggestion! I wasn't …
Dear list,
I would like to create a sub-set of the documents in an index that is to
be used for further searches. However, the criteria that lead to the
creation of that sub-set are not predefined so I think that faceted
search cannot be applied to this use case.
For instance:
A user searches for
Hi,
I have the following scenario: I have an index of very large size (I'm
testing with around 200,000 documents, but it should scale to many
millions) and I want to perform a search on a certain field. Based on
that search, I would like to manipulate a different field for all the
matching …
On 13.03.2013 10:23, dizh wrote:
> I just recompiled it.
>
> Luckily, it didn't need much work. Only a few modifications according to
> the Lucene 4.1 API change doc.
That's great news. Are you going to publish a ready-made version somewhere?
Also, my experience has been that Luke 4.0.0-ALPHA …
Hi,
here's another question involving MultiTermQuerys. My aim is to get a
frequency count for a MultiTermQuery without executing the query. The
naive approach would be to create the Query, extract the terms, and get
each term's frequency, approximately as follows:
IndexSearcher searcher = new IndexSearcher(reader); …
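Spelled out, that naive approach might look like this (a sketch only; it
assumes the MultiTermQuery rewrites to a BooleanQuery, and note that
docFreq() counts documents, not total occurrences):

MultiTermQuery mtq = new WildcardQuery(new Term("text", "abc*"));
mtq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
Query rewritten = searcher.rewrite(mtq);
Set<Term> terms = new HashSet<Term>();
rewritten.extractTerms(terms);
long total = 0;
for (Term t : terms) {
  total += reader.docFreq(t); // per-term document frequency
}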
On 12.03.2013 10:39, Uwe Schindler wrote:
> I would suggest using my example code with the fake query and custom
> rewrite. This does not have the overhead of BooleanQuery and, more
> importantly, you don't need to change the *global* and *static* default
> in BooleanQuery. Otherwise you could …
On 11.03.2013 18:22, Michael McCandless wrote:
> On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober
> wrote:
>> On 11.03.2013 13:38, Michael McCandless wrote:
>>> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote:
>>>
>>>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this …
On 11.03.2013 13:38, Michael McCandless wrote:
> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote:
>
>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this
>> should work (after rewrite your query is a BooleanQuery, which supports
>> extractTerms()).
>
> ... as long as …
On 11.03.2013 14:13, Uwe Schindler wrote:
>> Regarding the application of IndexSearcher.rewrite(Query) instead: I don't
>> see a way to set the rewrite method there because the Query's rewrite
>> method does not seem to apply to IndexSearcher.rewrite().
>
> Replace:
>> BooleanQuery bq = (BooleanQuery) …
On 11.03.2013 12:08, Uwe Schindler wrote:
> This works for this query, but in general you have to rewrite until it is
> completely rewritten: A while loop that exits when the result of the rewrite
> is identical to the original query. IndexSearcher.rewrite() does this for
> you.
>
>> 3. Write …
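The loop Uwe describes, sketched under the assumption that reader is an
open IndexReader:

Query q = new RegexpQuery(new Term("text", "abc.*"));
Query rewritten = q.rewrite(reader);
while (rewritten != q) { // rewrite() returns the same instance once done
  q = rewritten;
  rewritten = q.rewrite(reader);
}
// q is now fully rewritten; IndexSearcher.rewrite() wraps this loop for you.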
Hi,
I'm trying to get the terms that match a certain RegexpQuery. My (naive)
approach:
1. Create a RegexpQuery from the queryString (e.g. "abc.*"):
Query q = new RegexpQuery(new Term("text", queryString));
2. Rewrite the Query using the IndexReader reader:
q = q.rewrite(reader);
3. Write the terms …
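An alternative sketch that sidesteps rewriting entirely and enumerates
the matching terms straight from the terms dictionary (Lucene 4.x
automaton API; the field name is an assumption):

Terms terms = MultiFields.getTerms(reader, "text");
CompiledAutomaton automaton =
    new CompiledAutomaton(new RegExp("abc.*").toAutomaton());
TermsEnum matching = automaton.getTermsEnum(terms);
BytesRef term;
while ((term = matching.next()) != null) {
  System.out.println(term.utf8ToString() + " docFreq=" + matching.docFreq());
}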
Hi,
I'm interested in the functionality supposedly implemented through
ProximityQueryNode. Currently, it seems like it is not used by the
default QueryParser or anywhere else in Lucene, right? This makes
perfect sense since I don't see a Lucene index storing any notion of
sentences, paragraphs, etc. …
On 29.01.2013 00:24, Trejkaz wrote:
> On Tue, Jan 29, 2013 at 3:42 AM, Andrew Gilmartin
> wrote:
>> When I first started using Lucene, Lucene's Query classes were not suitable
>> for use with the Visitor pattern and so I created my own query class
>> equivalents and other more specialized ones.
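Since Lucene's Query classes (still, in 4.x) offer no accept() hook,
such a traversal has to dispatch on the concrete type; a minimal sketch:

void visit(Query q) {
  if (q instanceof BooleanQuery) {
    for (BooleanClause clause : ((BooleanQuery) q).clauses()) {
      visit(clause.getQuery());
    }
  } else {
    // leaf query: TermQuery, WildcardQuery, ...
    System.out.println(q.getClass().getSimpleName() + ": " + q);
  }
}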
…JavaCC would probably be a valuable source, but going
through that is very tedious. That is why I would like to know whether
you might know of a tutorial or less complex examples.
Thank you very much!
Carsten Schnober
--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP
On 23.12.2012 12:11, vitaly_arte...@mcafee.com wrote:
> This means that we need to index millions of documents with terabytes of
> content and search in it.
> For now we want to define only one indexed field, containing the content of
> the documents, with the possibility to search terms and retrieve …
On 13.12.2012 18:00, Jack Krupansky wrote:
> Can you provide some examples of terms that don't work and the index
> token stream they fail on?
>
> Make sure that the Analyzer you are using doesn't do any magic on the
> indexed terms - your query term is unanalyzed. Maybe multiple, but
> distinct
Hi,
I have a search scenario in which I search for multiple terms and retain
only those matches that share a common payload. I'm using this to
search for multiple terms that all occur in one sentence; I've stored a
sentence ID in the payload for each token.
So far, I've done so by specifying a list …
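A sketch of the per-term collection step described above (Lucene 4.x;
it assumes the sentence ID was written as a 4-byte int payload):

SpanTermQuery stq = new SpanTermQuery(new Term("text", "fox"));
Set<Integer> sentenceIds = new HashSet<Integer>();
for (AtomicReaderContext atomic : reader.leaves()) {
  Spans spans = stq.getSpans(atomic, atomic.reader().getLiveDocs(),
      new HashMap<Term, TermContext>());
  while (spans.next()) {
    if (spans.isPayloadAvailable()) {
      for (byte[] payload : spans.getPayload()) {
        sentenceIds.add(ByteBuffer.wrap(payload).getInt());
      }
    }
  }
}
// Intersecting the sets collected per term keeps only common sentences.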
On 18.12.2012 12:36, Michael McCandless wrote:
> On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
> wrote:
>> This is a relatively easy example, but how would deal with e.g.
>> annotations that include multiple tokens (as in spans), such as chunks,
>> or relations between …
On 17.12.2012 11:54, Carsten Schnober wrote:
> Might this have to do with the docBase? I collect the document IDs from
> the BooleanQuery through a Collector, adding the actual ID to the
> current AtomicReaderContext.docBase. In the corresponding SpanQuery, I
> pass these documents …
On 13.12.2012 18:00, Jack Krupansky wrote:
> Can you provide some examples of terms that don't work and the index
> token stream they fail on?
The index I'm testing with is German Wikipedia and I've been testing
with different (arbitrarily chosen) terms. I'm listing some results, the
first number …
Hi,
I'm following Grant's advice on how to combine BooleanQuery and
SpanQuery
(http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E).
The strategy is to perform a BooleanQuery, get the document ID set and
perform a SpanQuery restricted to those documents. …
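A sketch of the first half of that strategy: collect the matching IDs
with each segment's docBase applied (Lucene 4.x; booleanQuery, searcher
and reader are assumed from context):

final FixedBitSet matched = new FixedBitSet(reader.maxDoc());
searcher.search(booleanQuery, new Collector() {
  private int docBase;
  @Override public void setScorer(Scorer scorer) {}
  @Override public void collect(int doc) { matched.set(docBase + doc); }
  @Override public void setNextReader(AtomicReaderContext context) {
    docBase = context.docBase;
  }
  @Override public boolean acceptsDocsOutOfOrder() { return true; }
});
// Caveat: 'matched' holds index-wide IDs; acceptDocs passed to getSpans()
// are per-segment, so shift back by docBase when filtering the spans.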
On 13.12.2012 12:27, Michael McCandless wrote:
>> For example:
>> - part of speech of a token.
>> - syntactic parse subtree (over a span).
>> - semantically normalized phrase (to canonical text or ontological code).
>> - semantic group (of a span).
>> - coreference link.
>
> So for example
Hi,
I have a problem understanding and applying the BitSets concept in
Lucene 4.0. Unfortunately, there does not seem to be a lot of
documentation about the topic.
The general task is to extract Spans matching a SpanQuery which works
with the following snippet:
for (AtomicReaderContext atomic : reader.leaves()) { …
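Filled out, that snippet might continue like this (a sketch; spanQuery
is the SpanQuery in question, and a null getLiveDocs() means the
segment has no deletions, i.e. all documents are allowed):

for (AtomicReaderContext atomic : reader.leaves()) {
  Bits acceptDocs = atomic.reader().getLiveDocs();
  Spans spans = spanQuery.getSpans(atomic, acceptDocs,
      new HashMap<Term, TermContext>());
  while (spans.next()) {
    int globalDoc = atomic.docBase + spans.doc();
    // spans.start() and spans.end() give the match positions
  }
}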
Hi,
I'm indexing names in a dedicated Lucene field and I wonder which
analyzer to use for that purpose. Typically, the names are in the format
"John Smith", so the WhitespaceAnalyzer is likely the best in most
cases. The field type to choose seems to be the TextField.
Or, would you rather recommend
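A minimal sketch of that combination (field name illustrative):

Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_40);
Document doc = new Document();
doc.add(new TextField("name", "John Smith", Field.Store.YES));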
Hi,
I use a custom analyzer and tokenizer. The analyzer is very basic and it
merely comprises the method createComponents():
-
@Override
protected TokenStreamComponents createComponents(String fieldName,
    Reader reader) {
  return new TokenStreamComponents(new KoraTokenizer(reader)); // tokenizer name illustrative
}
On 20.11.2012 10:22, Uwe Schindler wrote:
Hi,
> The createComponents() method of Analyzers is only called *once* for each
> thread and the Tokenstream is *reused* for later documents. The Analyzer will
> call the final method Tokenizer#setReader() to notify the Tokenizer of a new
> Reader (the …
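Given that reuse contract, a custom Tokenizer must re-initialize its
per-document state in reset(); a skeleton sketch (Lucene 4.x, all names
illustrative):

public final class MyTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt =
      addAttribute(CharTermAttribute.class);

  public MyTokenizer(Reader input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    // read from this.input and fill termAtt; return true per emitted token
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    // re-initialize all per-document state here: the same instance is
    // reused after setReader() hands it the next document's Reader
  }
}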
On 19.11.2012 17:44, Carsten Schnober wrote:
Hi,
> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
>> un-tokenized, although …
On 19.11.2012 17:44, Carsten Schnober wrote:
Hi again,
just a little update:
> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
>> is tokenized properly. The others do appear in the index, but …
Hi,
I have recently updated to Lucene 4.0, but I am having problems with my
custom Analyzer/Tokenizer.
In the days of Lucene 3.6, it would work like this:
0. define constants lucene_version and indexdir
1. create an Analyzer: analyzer = new KoraAnalyzer() (our custom Analyzer)
2. create an IndexWriter
On 29.10.2012 13:40, Carsten Schnober wrote:
> Now, I'd like to add the option to filter the resulting Spans object by
> another WildcardQuery on a different field that contains document
> titles. My intuitive approach would have been to use a filter like this:
I'd like to c…
Hi,
I've got a setup in which I would like to perform an arbitrary query
over one field (typically realised through a WildcardQuery) and the
matches are returned as a SpanQuery because the result payloads are
further processed using Spans.next() and Spans.getPayload(). This works
fine with the following …
Hi,
in case someone is interested in an application of the Lucene indexing
engine in the field of corpus linguistics rather than information
retrieval: we have worked on that subject for some time and have
recently published a conference paper about it:
http://korap.ids-mannheim.de/2012/09/kon
On 14.08.2012 11:00, Uwe Schindler wrote:
> You have to rewrite the wrapper query.
Thanks, Uwe! I had tried that way but it failed because the rewrite()
method would return a Query (not a SpanQuery) object. A cast seems to
solve the problem; I'm re-posting the code snippet to the list for the
sake of …
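The resulting pattern, for the archive (works in 3.6 and 4.x; field and
term are illustrative):

WildcardQuery wildcard = new WildcardQuery(new Term("text", "foo*"));
SpanQuery spanQuery = new SpanMultiTermQueryWrapper<WildcardQuery>(wildcard);
// rewrite() is declared to return Query, hence the cast back to SpanQuery:
spanQuery = (SpanQuery) spanQuery.rewrite(reader);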
Dear list,
I am trying to combine a WildcardQuery and a SpanQuery because I need to
extract spans from the index for further processing. I realise that
there have been a few public discussions about this topic around, but I
still fail to get what I am missing here. My code is this (Lucene 3.6.0):
Hi Danil,
>> Just transform your input like "brown fox" into "ADJ:brown|<payload>
>> NOUN:fox|<payload>"
>
> I understand that this denotes "ADJ" and "NOUN" to be interpreted as the
> actual token and "brown" and "fox" as payloads (followed by <payload>), right?
Sorry for replying to myself, but I've realised …
On 07.08.2012 10:20, Danil ŢORIN wrote:
Hi Danil,
> If you do intersection (not join), maybe it makes sense to put
> everything into one index?
Just a note on that: my application performs intersections and joins
(unions) on the results, depending on the query. So the index structure
has to be r…
On 06.08.2012 20:29, Mike Sokolov wrote:
Hi Mike,
> There was some interesting work done on optimizing queries including
> very common words (stop words) that I think overlaps with your problem.
> See this blog post
> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-wo
On 31.07.2012 12:10, Ian Lea wrote:
Hi Ian,
> Lucene 4.0 allows you to use custom codecs and there may be one that
> would be better for this sort of data, or you could write one.
>
> In your tests is it the searching that is slow or are you reading lots
> of data for lots of docs? The latter
…whether this approach is
promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?
Thank you very much,
Carsten Schnober
--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
On 16.07.2012 13:07, karsten-s...@gmx.de wrote:
Dear Karsten,
> abstract of your post:
> you need the offset to perform your search/ranking, just as the position is
> needed for phrase queries.
> You are using reader.getTermFreqVector to get the offset.
> This is too slow for your application and …
…the problem?
Thank you very much,
Carsten Schnober
--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
the Field
> parameter Field.Index. The Field.Store parameter has nothing to do with
> indexing: if a field is marked as "stored", the full and unchanged string /
> binary is stored in the stored fields file (".fdt"). Stored fields are used
Thanks for that clarification! …
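A sketch contrasting the two flags in the 3.x-style API quoted above
(values illustrative):

// indexed and stored: searchable, and the original string is kept in .fdt
doc.add(new Field("title", "Faust", Field.Store.YES, Field.Index.ANALYZED));
// indexed only: searchable, but the original text cannot be retrieved
doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.ANALYZED));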
…erm) attributes
generated by the TokenStream?
Thank you very much!
Best,
Carsten
--
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-43740789
not seem to make sense since I don't want to use a
Lucene built-in analyzer and I'm not quite clear about what I should use
for the value in the latter approach.
Any help is very welcome! Thank you very much!
Best regards,
Carsten
--
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
be read from an external file.
In general, I am afraid that Lucene almost hardwires the analysis
process. Even though it does allow for custom tokenizers to be
implemented, it does not seem intended that one comes up with a
completely self-made text analysis process, does it?
Thank you very much!