Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 16:17, Alan Woodward wrote: > It doesn't sound as though an inverted index is really what you want to be > querying here, if I'm reading you right. You want to get the payloads for > spans at a specific position, but you don't particularly care about the > actual term at that p

Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 15:27, Alan Woodward wrote: > There's the SpanPositionCheckQuery family - SpanRangeQuery, SpanFirstQuery, > etc. Is that the sort of thing you're looking for? Hi Alan, thanks for the pointer, this is the right direction indeed. However, these queries are based on a SpanQuery whic
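
The class Alan presumably means for position ranges is SpanPositionRangeQuery (there is no literal SpanRangeQuery in the spans package). A minimal Lucene 4.x sketch, with field and term made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanFirstQuery;
    import org.apache.lucene.search.spans.SpanPositionRangeQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Match "fox" only if it occurs within the first 5 positions of the field.
    SpanFirstQuery first =
        new SpanFirstQuery(new SpanTermQuery(new Term("text", "fox")), 5);

    // Match "fox" only between positions 10 and 20.
    SpanPositionRangeQuery range =
        new SpanPositionRangeQuery(new SpanTermQuery(new Term("text", "fox")), 10, 20);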

Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 13:47, Carsten Schnober wrote: > I'm trying to figure out a way to use a query as Uwe suggested. My > scenario is to perform a query and then retrieve some of the payloads > upon user request, so there is no obvious way to wrap this into a query as > I can't know

Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 13:21, Michael McCandless wrote: > Actually, term vectors can store payloads now (LUCENE-1888), so if that > field was indexed with FieldType.setStoreTermVectorPayloads they should be > there. > > But I suspect the TokenSources.getTokenStream API (which I think un-inverts > the ter
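
A sketch of the indexing-side setup Mike describes; the field name is the one from the post below, and setStoreTermVectorPayloads() requires a 4.x release that includes LUCENE-1888:

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;

    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setStoreTermVectors(true);
    ft.setStoreTermVectorPositions(true); // payloads require positions
    ft.setStoreTermVectorPayloads(true);  // LUCENE-1888
    ft.freeze();
    Field field = new Field("term", "some analyzed text", ft);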

Reading Payloads

2013-04-23 Thread Carsten Schnober
Hi, I'm trying to extract payloads from an index for specific tokens the following way (inserting sample document number and term): Terms terms = reader.getTermVector(16504, "term"); TokenStream tokenstream = TokenSources.getTokenStream(terms); while (tokenstream.incrementToken()) { OffsetAttrib
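
A hedged completion of that snippet: in Lucene 4.x the stream must be reset() before the first incrementToken(), and, per Mike's caveat in the reply above, the payload attribute may come back null if TokenSources does not carry payloads through:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.search.highlight.TokenSources;
    import org.apache.lucene.util.BytesRef;

    Terms terms = reader.getTermVector(16504, "term");
    TokenStream tokenstream = TokenSources.getTokenStream(terms);
    OffsetAttribute offsetAttr = tokenstream.addAttribute(OffsetAttribute.class);
    PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
    tokenstream.reset(); // mandatory in Lucene 4.x before incrementToken()
    while (tokenstream.incrementToken()) {
      BytesRef payload = payloadAttr.getPayload(); // may be null, see above
      int start = offsetAttr.startOffset();
      int end = offsetAttr.endOffset();
      // ... select the wanted token by offset/position and use its payload ...
    }
    tokenstream.end();
    tokenstream.close();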

No documents in TermsFilter.getDocIdSet()

2013-04-15 Thread Carsten Schnober
Hi, tying in with the previous thread "Statically store sub-collections for search", I'm trying to focus on the root of the problem I have encountered. First, I generate a TermsFilter with potentially many terms in one field: - List docnames = new Ar
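
A sketch of how such a filter can be built and applied, assuming an IndexSearcher named searcher; the field name "docname" is a guess, and releases before the immutable TermsFilter of Lucene 4.2 used addTerm() instead of a constructor argument:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TopDocs;

    List<Term> terms = new ArrayList<Term>();
    for (String docname : docnames) {
      terms.add(new Term("docname", docname));
    }
    TermsFilter filter = new TermsFilter(terms);
    TopDocs hits = searcher.search(new MatchAllDocsQuery(), filter, 10);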

Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 15.04.2013 13:43, Uwe Schindler wrote: Hi, > Passing NULL means all documents are allowed; if this were not the case, > whole Lucene queries and filters would not work at all, so if you get 0 docs, > you must have missed something else. If this is not the case, your filter may > behav

Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 15.04.2013 11:27, Uwe Schindler wrote: Hi again, >>> You are somehow "misusing" acceptDocs and DocIdSet here, so you have >> to take care, semantics are different: >>> - For acceptDocs "null" means "all documents allowed" -> no deleted >>> documents >>> - For DocIdSet "null" means "no docume
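
A sketch of a custom Filter that respects both conventions; matches() is a hypothetical per-document predicate:

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.FixedBitSet;

    public class MyFilter extends Filter {
      @Override
      public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
          throws IOException {
        FixedBitSet result = new FixedBitSet(context.reader().maxDoc());
        for (int doc = 0; doc < result.length(); doc++) {
          // acceptDocs == null means "all documents allowed": do NOT skip everything
          if (acceptDocs != null && !acceptDocs.get(doc)) continue;
          if (matches(doc)) result.set(doc);
        }
        return result; // returning null here would mean "no documents match"
      }

      private boolean matches(int doc) { return true; } // hypothetical predicate
    }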

Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 15.04.2013 10:42, Uwe Schindler wrote: > Not every DocIdSet supports bits(). If it returns null, then bits are not > supported. To enforce an available bitset, use CachingWrapperFilter (which > internally uses a BitSet to cache). > It might also happen that Filter.getDocIdSet() returns null, w
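
In code, the two null cases Uwe mentions look roughly like this (filter, context and acceptDocs assumed to exist):

    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.Bits;

    DocIdSet docIdSet = filter.getDocIdSet(context, acceptDocs);
    if (docIdSet == null) {
      // no documents match in this segment
    } else {
      Bits bits = docIdSet.bits();
      if (bits == null) {
        // random access not supported; fall back to the iterator
        DocIdSetIterator it = docIdSet.iterator(); // may itself be null (empty set)
      }
    }
    // To guarantee that bits() is available, cache the filter in a bit set:
    Filter cached = new CachingWrapperFilter(filter);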

Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 15.04.2013 10:04, Uwe Schindler wrote: > The limit also applies for filters. If you have a list of terms ORed > together, the fastest way is not to use a BooleanQuery at all, but instead a > TermsFilter (which has no limits). Hi Uwe, thanks for the pointer, this looks promising! The only mi

Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 12.04.2013 20:08, SUJIT PAL wrote: > Hi Carsten, > > Why not use your idea of the BooleanQuery but wrap it in a Filter instead? > Since you are not doing any scoring (only filtering), the max boolean clauses > limit should not apply to a filter. Hi Sujit, thanks for your suggestion! I wasn
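
For reference, the suggested wrapping would look as below; note, though, that BooleanQuery.add() itself throws TooManyClauses once maxClauseCount is exceeded, so the limit bites before any filter is applied, which is why the thread moves on to TermsFilter:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("docname", "doc1")), Occur.SHOULD);
    // Adding more than BooleanQuery.getMaxClauseCount() clauses throws
    // BooleanQuery.TooManyClauses here, filter or not.
    QueryWrapperFilter filter = new QueryWrapperFilter(bq);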

Statically store sub-collections for search (faceted search?)

2013-04-12 Thread Carsten Schnober
Dear list, I would like to create a sub-set of the documents in an index that is to be used for further searches. However, the criteria that lead to the creation of that sub-set are not predefined, so I think that faceted search cannot be applied to this use case. For instance: A user searches for

Update a bunch of documents

2013-04-11 Thread Carsten Schnober
Hi, I have the following scenario: I have a very large index (I'm testing with around 200,000 documents, but it should scale to many millions) and I want to perform a search on a certain field. Based on that search, I would like to manipulate a different field for all the matchin
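
Lucene 4.x has no in-place update of a single field; the standard approach is to re-add the whole document under its identifying term. A sketch with hypothetical field names, assuming an open IndexWriter writer:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;

    // updateDocument() atomically deletes all documents matching the term
    // and adds the new version; every field must be supplied again.
    Document newVersion = new Document();
    // ... re-add all original fields plus the manipulated one ...
    writer.updateDocument(new Term("id", "doc-42"), newVersion);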

Re: Luke?

2013-03-14 Thread Carsten Schnober
On 13.03.2013 10:23, dizh wrote: > I just recompiled it. > > Luckily, it doesn't need much work. Only a few modifications according > to the Lucene 4.1 API change doc. That's great news. Are you going to publish a ready-made version somewhere? Also, my experience has been that Luke 4.0.0-ALP

Term Statistics for MultiTermQuery

2013-03-12 Thread Carsten Schnober
Hi, here's another question involving MultiTermQuerys. My aim is to get a frequency count for a MultiTermQuery without executing the query. The naive approach would be to create the Query, extract the terms, and get each term's frequency, approximately as follows: IndexSearcher searche
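
A sketch of that naive approach in Lucene 4.x, assuming searcher and reader exist; summing docFreq counts documents rather than occurrences and double-counts documents matching several terms, so treat it as an upper bound:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RegexpQuery;

    MultiTermQuery mtq = new RegexpQuery(new Term("text", "abc.*"));
    mtq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    Query rewritten = searcher.rewrite(mtq); // rewrites until a fixed point
    Set<Term> expanded = new HashSet<Term>();
    rewritten.extractTerms(expanded);
    long docCount = 0;
    for (Term t : expanded) {
      docCount += reader.docFreq(t);
      // reader.totalTermFreq(t) would count term occurrences instead
    }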

Re: Rewrite for RegexpQuery

2013-03-12 Thread Carsten Schnober
On 12.03.2013 10:39, Uwe Schindler wrote: > I would suggest using my example code with the fake query and custom > rewrite. This does not have the overhead of BooleanQuery and, more importantly: > you don't need to change the *global* and *static* default in BooleanQuery. > Otherwise you could

Re: Rewrite for RegexpQuery

2013-03-12 Thread Carsten Schnober
On 11.03.2013 18:22, Michael McCandless wrote: > On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober > wrote: >> On 11.03.2013 13:38, Michael McCandless wrote: >>> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote: >>> >>>> Set the rewrite method t

Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 13:38, Michael McCandless wrote: > On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote: > >> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this >> should work (after rewrite your query is a BooleanQuery, which supports >> extractTerms()). > > ... as long a

Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 14:13, Uwe Schindler wrote: >> Regarding the application of IndexSearcher.rewrite(Query) instead: I don't >> see a way to set the rewrite method there because the Query's rewrite >> method does not seem to apply to IndexSearcher.rewrite(). > > Replace: >> BooleanQuery bq = (Boolea

Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 12:08, Uwe Schindler wrote: > This works for this query, but in general you have to rewrite until it is > completely rewritten: a while loop that exits when the result of the rewrite > is identical to the original query. IndexSearcher.rewrite() does this for > you. > >> 3. Wri
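
The loop Uwe describes, and which IndexSearcher.rewrite() performs internally, is essentially this (reader assumed open):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RegexpQuery;

    Query q = new RegexpQuery(new Term("text", "abc.*"));
    Query rewritten = q.rewrite(reader);   // may return a new Query instance
    while (rewritten != q) {               // rewrite() returns the same instance once done
      q = rewritten;
      rewritten = q.rewrite(reader);
    }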

Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
Hi, I'm trying to get the terms that match a certain RegexpQuery. My (naive) approach: 1. Create a RegexpQuery from the queryString (e.g. "abc.*"): Query q = new RegexpQuery(new Term("text", queryString)); 2. Rewrite the Query using the IndexReader reader: q = q.rewrite(reader); 3. Write the ter

ProximityQueryNode

2013-02-21 Thread Carsten Schnober
Hi, I'm interested in the functionality supposedly implemented through ProximityQueryNode. Currently, it seems like it is not used by the default QueryParser or anywhere else in Lucene, right? This makes perfect sense since I don't see a Lucene index storing any notion of sentences, paragraphs, etc

Re: ANTLR and Custom Query Syntax/Parser

2013-01-30 Thread Carsten Schnober
On 29.01.2013 00:24, Trejkaz wrote: > On Tue, Jan 29, 2013 at 3:42 AM, Andrew Gilmartin > wrote: >> When I first started using Lucene, Lucene's Query classes were not suitable >> for use with the Visitor pattern, and so I created my own query class >> equivalents and other more specialized ones.

Custom Query Syntax/Parser

2013-01-28 Thread Carsten Schnober
…JavaCC would probably be a valuable source, but going through that is very tedious. That is why I would like to know whether you might know of a tutorial or less complex examples. Thank you very much! Carsten Schnober -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP

Re: Lucene 4.0 scalability and performance.

2012-12-24 Thread Carsten Schnober
On 23.12.2012 12:11, vitaly_arte...@mcafee.com wrote: > This means that we need to index millions of documents with terabytes of > content and search in it. > For now we want to define only one indexed field, containing the content of > the documents, with the possibility to search terms and retrie

Re: Boolean and SpanQuery: different results

2012-12-19 Thread Carsten Schnober
On 13.12.2012 18:00, Jack Krupansky wrote: > Can you provide some examples of terms that don't work and the index > token stream they fail on? > > Make sure that the Analyzer you are using doesn't do any magic on the > indexed terms - your query term is unanalyzed. Maybe multiple, but > distinct

Match intersection by Payload

2012-12-19 Thread Carsten Schnober
Hi, I have a search scenario in which I search for multiple terms and retain only those matches that share a common payload. I'm using this to search for multiple terms that all occur in one sentence; I've stored a sentence ID in the payload for each token. So far, I've done so by specifying a li
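
A sketch of checking sentence-id payloads on span matches; it assumes the ids were encoded with PayloadHelper.encodeInt(), and that reader and spanNearQuery already exist:

    import java.util.HashMap;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermContext;
    import org.apache.lucene.search.spans.Spans;

    for (AtomicReaderContext atomic : reader.leaves()) {
      Spans spans = spanNearQuery.getSpans(atomic, atomic.reader().getLiveDocs(),
          new HashMap<Term, TermContext>());
      while (spans.next()) {
        if (!spans.isPayloadAvailable()) continue;
        int firstId = -1;
        boolean sameSentence = true;
        for (byte[] payload : spans.getPayload()) {
          int sentenceId = PayloadHelper.decodeInt(payload, 0);
          if (firstId == -1) firstId = sentenceId;
          else if (sentenceId != firstId) sameSentence = false;
        }
        // keep the match only if sameSentence is still true
      }
    }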

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Carsten Schnober
On 18.12.2012 12:36, Michael McCandless wrote: > On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober > wrote: >> This is a relatively easy example, but how would you deal with e.g. >> annotations that include multiple tokens (as in spans), such as chunks, >> or relations b

Re: Boolean and SpanQuery: different results

2012-12-17 Thread Carsten Schnober
On 17.12.2012 11:54, Carsten Schnober wrote: > Might this have to do with the docBase? I collect the document IDs from > the BooleanQuery through a Collector, adding the actual ID to the > current AtomicReaderContext.docBase. In the corresponding SpanQuery, I > pass these docum
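
The docBase handling on the collecting side looks like this in Lucene 4 (booleanQuery and searcher assumed); the easy-to-miss inverse step is that Spans report segment-local ids, so the global ids collected here need the segment's docBase subtracted again before comparing:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    final List<Integer> globalIds = new ArrayList<Integer>();
    searcher.search(booleanQuery, new Collector() {
      private int docBase;
      @Override public void setNextReader(AtomicReaderContext context) throws IOException {
        docBase = context.docBase;            // offset of this segment
      }
      @Override public void collect(int doc) throws IOException {
        globalIds.add(docBase + doc);         // segment-local -> global
      }
      @Override public void setScorer(Scorer scorer) throws IOException {}
      @Override public boolean acceptsDocsOutOfOrder() { return true; }
    });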

Re: Boolean and SpanQuery: different results

2012-12-17 Thread Carsten Schnober
On 13.12.2012 18:00, Jack Krupansky wrote: > Can you provide some examples of terms that don't work and the index > token stream they fail on? > > Make sure that the Analyzer you are using doesn't do any magic on the > indexed terms - your query term is unanalyzed. Maybe multiple, but > distinct

Re: Boolean and SpanQuery: different results

2012-12-13 Thread Carsten Schnober
On 13.12.2012 18:00, Jack Krupansky wrote: > Can you provide some examples of terms that don't work and the index > token stream they fail on? The index I'm testing with is German Wikipedia and I've been testing with different (arbitrarily chosen) terms. I'm listing some results, the first numbe

Boolean and SpanQuery: different results

2012-12-13 Thread Carsten Schnober
Hi, I'm following Grant's advice on how to combine BooleanQuery and SpanQuery (http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E). The strategy is to perform a BooleanQuery, get the document ID set and perform a SpanQuery re

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Carsten Schnober
On 13.12.2012 12:27, Michael McCandless wrote: >> For example: >> - part of speech of a token. >> - syntactic parse subtree (over a span). >> - semantically normalized phrase (to canonical text or ontological code). >> - semantic group (of a span). >> - coreference link. > > So for example

SpanQuery and Bits

2012-12-06 Thread Carsten Schnober
Hi, I have a problem understanding and applying the BitSets concept in Lucene 4.0. Unfortunately, there does not seem to be a lot of documentation about the topic. The general task is to extract Spans matching a SpanQuery which works with the following snippet: for (AtomicReaderContext atomic : r
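
A sketch of the per-segment loop with the Bits semantics spelled out, assuming an open IndexReader reader and an already rewritten SpanQuery spanQuery:

    import java.util.HashMap;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermContext;
    import org.apache.lucene.search.spans.Spans;
    import org.apache.lucene.util.Bits;

    for (AtomicReaderContext atomic : reader.leaves()) {
      // getLiveDocs() returns null when the segment has no deletions,
      // and null acceptDocs means "all documents allowed"
      Bits liveDocs = atomic.reader().getLiveDocs();
      Spans spans = spanQuery.getSpans(atomic, liveDocs, new HashMap<Term, TermContext>());
      while (spans.next()) {
        int globalDoc = atomic.docBase + spans.doc();
        // spans.start() / spans.end() are token positions inside that document
      }
    }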

Specialized Analyzer for names

2012-11-23 Thread Carsten Schnober
Hi, I'm indexing names in a dedicated Lucene field and I wonder which analyzer to use for that purpose. Typically, the names are in the format "John Smith", so the WhitespaceAnalyzer is likely the best in most cases. The field type to choose seems to be the TextField. Or, would you rather recommend

Potential Resource Leak warning in Analyzer.createComponents()

2012-11-21 Thread Carsten Schnober
Hi, I use a custom analyzer and tokenizer. The analyzer is very basic and it merely comprises the method createComponents(): - @Override protected TokenStreamComponents createComponents(String fieldName, Reader reader) { return new Toke
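
The warning is usually an IDE false positive: the Tokenizer must not be closed inside createComponents(), because the Analyzer manages its lifecycle through the returned TokenStreamComponents. A sketch (the Tokenizer class name is made up):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Closed later by the Analyzer through the returned components,
        // so the "potential resource leak" warning is spurious here.
        @SuppressWarnings("resource")
        Tokenizer tokenizer = new KoraTokenizer(reader); // hypothetical custom Tokenizer
        return new TokenStreamComponents(tokenizer);
      }
    };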

Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Carsten Schnober
On 20.11.2012 10:22, Uwe Schindler wrote: Hi, > The createComponents() method of Analyzers is only called *once* for each > thread and the TokenStream is *reused* for later documents. The Analyzer will > call the final method Tokenizer#setReader() to notify the Tokenizer of a new > Reader (t
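
This reuse contract is the usual cause of the "only the first document is tokenized properly" symptom from the posts above: per-document state set up in the constructor is never re-initialized for the next Reader. A sketch of a reuse-safe Tokenizer (class name hypothetical):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Tokenizer;

    public final class ExampleTokenizer extends Tokenizer {
      private int position;                  // example of per-document state

      public ExampleTokenizer(Reader input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        clearAttributes();
        // ... read from the inherited "input" Reader and fill the attributes ...
        return false; // placeholder
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        position = 0; // re-initialize ALL per-document state here, not in the constructor
      }
    }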

Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Carsten Schnober
On 19.11.2012 17:44, Carsten Schnober wrote: Hi, > However, after switching to Lucene 4 and TokenStreamComponents, I'm > getting a strange behaviour: only the first document in the collection > is tokenized properly. The others do appear in the index, but > un-tokenized, alth

Re: TokenStreamComponents in Lucene 4.0

2012-11-19 Thread Carsten Schnober
On 19.11.2012 17:44, Carsten Schnober wrote: Hi again, just a little update: > However, after switching to Lucene 4 and TokenStreamComponents, I'm > getting a strange behaviour: only the first document in the collection > is tokenized properly. The others do appear in the i

TokenStreamComponents in Lucene 4.0

2012-11-19 Thread Carsten Schnober
Hi, I have recently updated to Lucene 4.0, but am having problems with my custom Analyzer/Tokenizer. In the days of Lucene 3.6, it would work like this: 0. define the constants lucene_version and indexdir 1. create an Analyzer: analyzer = new KoraAnalyzer() (our custom Analyzer) 2. create an IndexWriter

Re: SpanQuery, Filter, BooleanQuery

2012-10-30 Thread Carsten Schnober
On 29.10.2012 13:40, Carsten Schnober wrote: > Now, I'd like to add the option to filter the resulting Spans object by > another WildcardQuery on a different field that contains document > titles. My intuitive approach would have been to use a filter like this: I'd like to c

SpanQuery, Filter, BooleanQuery

2012-10-29 Thread Carsten Schnober
Hi, I've got a setup in which I would like to perform an arbitrary query over one field (typically realised through a WildcardQuery) and the matches are returned as a SpanQuery because the result payloads are further processed using Spans.next() and Spans.getPayload(). This works fine with the follow
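
One way to combine the two is to turn the title query into a filter, materialize its DocIdSet as a bit set, and pass that to getSpans() as acceptDocs. A sketch, assuming reader and a rewritten spanQuery exist; deleted documents would additionally need intersecting with getLiveDocs():

    import java.util.HashMap;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermContext;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.search.spans.Spans;
    import org.apache.lucene.util.Bits;

    Filter titleFilter = new CachingWrapperFilter(
        new QueryWrapperFilter(new WildcardQuery(new Term("title", "luc*"))));
    for (AtomicReaderContext atomic : reader.leaves()) {
      DocIdSet docs = titleFilter.getDocIdSet(atomic, null);
      if (docs == null) continue;   // nothing matches in this segment
      // CachingWrapperFilter caches in a bit set, so bits() should be non-null here
      Bits accept = docs.bits();
      Spans spans = spanQuery.getSpans(atomic, accept,
          new HashMap<Term, TermContext>());
      while (spans.next()) { /* process spans and payloads */ }
    }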

Lucene in Corpus Linguistics

2012-09-26 Thread Carsten Schnober
Hi, in case someone is interested in an application of the Lucene indexing engine in the field of corpus linguistics rather than information retrieval: we have worked on that subject for some time and have recently published a conference paper about it: http://korap.ids-mannheim.de/2012/09/kon

Re: UnsupportedOperationException: Query should have been rewritten

2012-08-14 Thread Carsten Schnober
On 14.08.2012 11:00, Uwe Schindler wrote: > You have to rewrite the wrapper query. Thanks, Uwe! I had tried that, but it failed because the rewrite() method returns a Query (not a SpanQuery) object. A cast seems to solve the problem; I'm re-posting the code snippet to the list for the sa
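
Since the excerpt is cut off, here is a reconstruction of the pattern described, for Lucene 3.6/4.x (field and pattern made up, reader assumed open):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
    import org.apache.lucene.search.spans.SpanQuery;

    WildcardQuery wildcard = new WildcardQuery(new Term("text", "abc*"));
    SpanQuery wrapper = new SpanMultiTermQueryWrapper<WildcardQuery>(wildcard);
    // rewrite() is declared to return Query, hence the cast back to SpanQuery
    SpanQuery rewritten = (SpanQuery) wrapper.rewrite(reader);
    // rewritten can now be used with getSpans()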

UnsupportedOperationException: Query should have been rewritten

2012-08-14 Thread Carsten Schnober
Dear list, I am trying to combine a WildcardQuery and a SpanQuery because I need to extract spans from the index for further processing. I realise that there have been a few public discussions about this topic, but I still fail to see what I am missing here. My code is this (Lucene 3.6.0):

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
Hi Danil, >> Just transform your input like "brown fox" into "ADJ:brown|<payload> NOUN:fox|<payload>" > > I understand that this denotes "ADJ" and "NOUN" to be interpreted as the > actual tokens and "brown" and "fox" as payloads (followed by <payload>), right? Sorry for replying to myself, but I've reali
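
Lucene ships a ready-made filter for this kind of token-plus-payload input; a sketch, where the '|' delimiter and the encoder choice are assumptions and input is an assumed java.io.Reader:

    import java.io.Reader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.IdentityEncoder;
    import org.apache.lucene.util.Version;

    // Input "ADJ|brown NOUN|fox" indexes the tokens "ADJ" and "NOUN",
    // carrying "brown" and "fox" as their payloads.
    TokenStream stream = new DelimitedPayloadTokenFilter(
        new WhitespaceTokenizer(Version.LUCENE_40, input),
        '|', new IdentityEncoder());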

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 07.08.2012 10:20, Danil ŢORIN wrote: Hi Danil, > If you do intersection (not join), maybe it makes sense to put > everything into 1 index? Just a note on that: my application performs intersections and joins (unions) on the results, depending on the query. So the index structure has to be r

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 06.08.2012 20:29, Mike Sokolov wrote: Hi Mike, > There was some interesting work done on optimizing queries including > very common words (stop words) that I think overlaps with your problem. > See this blog post > http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-wo

Re: Small Vocabulary

2012-08-02 Thread Carsten Schnober
On 31.07.2012 12:10, Ian Lea wrote: Hi Ian, > Lucene 4.0 allows you to use custom codecs and there may be one that > would be better for this sort of data, or you could write one. > > In your tests is it the searching that is slow or are you reading lots > of data for lots of docs? The latter

Small Vocabulary

2012-07-30 Thread Carsten Schnober
…whether this approach is promising at all. Does Lucene (4.0?) provide optimization techniques for extremely small vocabulary sizes? Thank you very much, Carsten Schnober -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-

Re: Offsets in 3.6/4.0

2012-07-17 Thread Carsten Schnober
On 16.07.2012 13:07, karsten-s...@gmx.de wrote: Dear Karsten, > abstract of your post: > you need the offsets to perform your search/ranking, like positions are > needed for phrase queries. > You are using reader.getTermFreqVector to get the offsets. > This is too slow for your application and
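
For reference, Lucene 4.0 can index offsets directly into the postings, which avoids the per-document term vector lookup entirely. A sketch with placeholder field and term, assuming an open IndexReader reader:

    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DocsAndPositionsEnum;
    import org.apache.lucene.index.FieldInfo.IndexOptions;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    // Index time: offsets go into the postings, no term vectors needed.
    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

    // Search time: read offsets per position.
    DocsAndPositionsEnum dpe = MultiFields.getTermPositionsEnum(
        reader, MultiFields.getLiveDocs(reader), "text", new BytesRef("fox"));
    if (dpe != null) {
      while (dpe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < dpe.freq(); i++) {
          dpe.nextPosition();
          int start = dpe.startOffset(); // -1 if offsets were not indexed
          int end = dpe.endOffset();
        }
      }
    }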

Offsets in 3.6/4.0

2012-07-13 Thread Carsten Schnober
…the problem? Thank you very much, Carsten Schnober -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generat

Re: Field value vs TokenStream

2012-04-20 Thread Carsten Schnober
…the Field > parameter Field.Index. The Field.Store parameter has nothing to do with > indexing: if a field is marked as "stored", the full and unchanged string / > binary is stored in the stored fields file (".fdt"). Stored fields are used Thanks for that clarificatio

Field value vs TokenStream

2012-04-18 Thread Carsten Schnober
…(term) attributes generated by the TokenStream? Thank you very much! Best, Carsten -- Carsten Schnober Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP -- Korpusanalyseplattform der nächsten Generation http://korap.ids-mannheim.de/ | Tel.: +49-

Indexing Pre-analyzed Field

2012-04-11 Thread Carsten Schnober
…does not seem to make sense since I don't want to use a Lucene built-in analyzer, and I'm not quite clear about what I should use for the value in the latter approach. Any help is very welcome! Thank you very much! Best regards, Carsten -- Carsten Schnober Institut für
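
For what it's worth, a field can be fed an already-built TokenStream directly, which bypasses the IndexWriter's Analyzer for that field; myExternalPipeline() stands in for whatever produces the pre-analyzed tokens, and writer is an assumed IndexWriter:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;

    TokenStream pre = myExternalPipeline(); // hypothetical external analysis result
    Document doc = new Document();
    doc.add(new TextField("text", pre));    // indexed as-is, analyzer not consulted
    writer.addDocument(doc);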

Apply custom tokenization

2012-03-06 Thread Carsten Schnober
…be read from an external file. In general, I am afraid that Lucene almost hardwires the analysis process. Even though it does allow custom tokenizers to be implemented, it does not seem intended that one comes up with a completely self-made text analysis process, is it? Thank y