A full-text tokenizer for the NGramTokenFilter

2010-07-17 Thread Martin

Hi there,

I have recently been trying to build a Lucene index out of ngrams and 
seem to have stumbled onto a number of issues. I first tried to use the 
NGramTokenizer, but that apparently only tokenizes the first 1024 
characters of the input. Searching around the web, I found this issue 
discussed a couple of years ago, and the proposed solution was to use 
the NGramTokenFilter. That filter certainly works, but it needs an 
underlying tokenizer to feed it, and I'm wondering whether there is a 
tokenizer that would return me the whole text as a single token. 
The reason I can't use something like the StandardTokenizer is that the 
ngrams should really include spaces, and pretty much every tokenizer gets 
rid of them.


Thank you very much in advance for any suggestions.

Regards,
Martin

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A full-text tokenizer for the NGramTokenFilter

2010-07-17 Thread Martin
Ahh, I knew I saw it somewhere, then I lost it again... :) I guess the 
name is not quite intuitive, but anyway thanks a lot!

> > and I'm just wondering if there is a tokenizer
> > that would return me the whole text.
>
> KeywordTokenizer does this.
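
A minimal (untested) sketch of the resulting chain: KeywordTokenizer emits the
whole input, spaces included, as one token, and NGramTokenFilter then cuts it
into ngrams. Lucene 3.x API; the class name and gram sizes are only placeholders.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

public class WholeTextNGramAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // The whole text becomes a single token (no 1024-character limit),
    // which the filter then turns into 2- and 3-character ngrams.
    return new NGramTokenFilter(new KeywordTokenizer(reader), 2, 3);
  }
}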




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



   






JTRES 2012 Call for Paper

2012-02-21 Thread Martin Schoeberl

==

CALL FOR PAPERS

  The 10th Workshop on
   Java Technologies for Real-Time and Embedded Systems
   JTRES 2012

Technical University of Denmark
DTU Informatics

   24-26 October 2012
   Copenhagen, Denmark

   http://jtres2012.imm.dtu.dk/

==

Overview


Over 90 percent of all microprocessors are now used for real-time and
embedded applications, and the behavior of many of these applications
is constrained by the physical world. Higher-level programming
languages and middleware are needed to robustly and productively
design, implement, compose, integrate, validate, and enforce
real-time constraints along with conventional functional requirements
and reusable components. It is essential that the production of
real-time embedded systems can take advantage of languages, tools,
and methods that enable higher software productivity. The Java
programming language has become an attractive choice because of its
safety, productivity, its relatively low maintenance costs, and the
availability of well trained developers.

Although it features good software engineering characteristics,
standard Java is unsuitable for developing real-time embedded
systems, mainly due to under-specification of thread scheduling and
the presence of garbage collection. These problems are addressed by
the Real-Time Specification for Java (RTSJ). The intent of this
specification is the development of real-time applications by
providing several additions such as extending the Java memory model
and providing stronger semantics in thread scheduling.

Interest in real-time Java in both the research community and
industry has recently increased significantly, because of its
challenges and its potential impact on the development of embedded
and real-time applications. The goal of the proposed workshop is to
gather researchers working on real-time and embedded Java to identify
the challenging problems that still need to be solved in order to
assure the success of real-time Java as a technology, and to report
results and experiences gained by researchers.

Submission Requirements
---

Participants are expected to submit a paper of at most 10 pages (ACM
Conference Format, i.e., two-columns, 10 point font - see formatting
instructions at http://www.acm.org/sigs/publications/proceedings-templates). 
Accepted papers will be published in the ACM International Conference
Proceedings Series via the ACM Digital Library and must be presented by
one of the authors at JTRES. Papers should be submitted through EasyChair.
Please use the submission link:

https://www.easychair.org/account/signin.cgi?conf=jtres2012

Papers describing open source projects shall include, in an appendix, a
description of how to obtain the source and how to run the experiments.
The source version for the published paper will be hosted
at the JTRES web site.

Accepted papers will be invited for submission to a special issue of
the Journal on Concurrency and Computation: Practice and Experience.

Topics of interest to this workshop include, but are not limited to:

* New real-time programming paradigms and language features

* Industrial experience and practitioner reports

* Open source solutions for real-time Java

* Real-time design patterns and programming idioms

* High-integrity and safety critical system support

* Java-based real-time operating systems and processors

* Extensions to the RTSJ

* Virtual machines and execution environments

* Memory management and real-time garbage collection

* Compiler analysis and implementation techniques

* Scheduling frameworks, feasibility analysis, and timing analysis

* Reproduction studies

* Multiprocessor and distributed real-time Java


Important Dates
---

* Paper Submission: July  1, 2012
* Notification of Acceptance: August  5, 2012
* Camera Ready Paper Due: August 20, 2012
* Workshop:   October 24-26, 2012


Program Chair:
--

Andy Wellings, University of York


Workshop Chair:
--

Martin Schoeberl, Technical University of Denmark


Program Committee:
--

Pablo Basanta-Val, Universidad Carlos III de Madrid
Theresa Higuera, Universidad Complutense de Madrid
James Hunt, Aicas
Doug Jensen, MITRE
Tomas Kalibera, University of Kent
Doug Locke, LC Systems Services
Kelvin Nilsen, Aonix
Damien Masson, Laboratoire d'informatique Gaspard-Monge
Ales Plsek, INRIA Lille
Marek Prochazka, European Space Agency
Wolfgang Puffitsch, Vienna University of Technology
Anders Ravn, Aalborg University

Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

2012-09-04 Thread Martin O'Shea
If a Lucene ShingleFilter can be used to tokenize a string into shingles, or
ngrams, of different sizes, e.g.:

 

"please divide this sentence into shingles"

 

Becomes:

 

shingles "please divide", "divide this", "this sentence", "sentence
into", and "into shingles"

 

Does anyone know if this can be used in conjunction with other analyzers to
return the frequencies of the bigrams or trigrams found, e.g.:

 

"please divide this please divide sentence into shingles"

 

Would return 2 for "please divide"?

 

I'm currently using Lucene 3.0.2 to extract frequencies of unigrams from a
string using a combination of a TermVectorMapper and Standard/Snowball
analyzers.

 

I should add that my strings are built up from a database and then indexed
by Lucene in memory and are not persisted beyond this. Use of other products
like Solr is not intended.

 

Thanks

 

Mr Morgan.

 

 



RE: Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

2012-09-06 Thread Martin O'Shea
Thanks for that piece of advice.

I ended up passing my snowballAnalyzer and standardAnalyzer instances as parameters to 
ShingleAnalyzerWrappers and processing the outputs via a TermVectorMapper. 

It seems to work quite well.

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: 05 Sep 2012 01:53
To: java-user@lucene.apache.org
Subject: Re: Using a Lucene ShingleFilter to extract frequencies of bigrams in 
Lucene

On Tue, Sep 4, 2012 at 12:37 PM, Martin O'Shea  wrote:
>
> Does anyone know if this can be used in conjunction with other 
> analyzers to return the frequencies of the bigrams or trigrams found, e.g.:
>
>
>
> "please divide this please divide sentence into shingles"
>
>
>
> Would return 2 for "please divide"?
>
>
>
> I'm currently using Lucene 3.0.2 to extract frequencies of unigrams 
> from a string using a combination of a TermVectorMapper and 
> Standard/Snowball analyzers.
>
>
>
> I should add that my strings are built up from a database and then 
> indexed by Lucene in memory and are not persisted beyond this. Use of 
> other products like Solr is not intended.
>

The bigrams etc generated by shingles are terms just like the unigrams. So you 
can wrap any other analyzer with a ShingleAnalyzerWrapper if you want the 
shingles.

If you just want to use Lucene's analyzers to tokenize the text and compute 
within-document frequencies for a one-off purpose, I think indexing and 
creating term vectors could be overkill: you could just consume the tokens from 
the Analyzer and make a hashmap or whatever you need...

There are examples in the org.apache.lucene.analysis package javadocs.
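
For the archive, a minimal sketch of that "consume the tokens directly"
approach, using the Lucene 3.0.x API; the class name and the sample text are
only illustrative:

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShingleCounter {
  public static void main(String[] args) throws Exception {
    // Wrap any analyzer; the bigrams then come out as ordinary terms.
    Analyzer analyzer = new ShingleAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_30), 2);

    Map<String, Integer> counts = new HashMap<String, Integer>();
    TokenStream ts = analyzer.tokenStream("f",
        new StringReader("please divide this please divide sentence into shingles"));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      Integer c = counts.get(term.term());
      counts.put(term.term(), c == null ? 1 : c + 1);
    }
    ts.end();
    ts.close();

    System.out.println(counts.get("please divide"));   // expected: 2
  }
}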

--
lucidworks.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







RE: Using stop words with snowball analyzer and shingle filter

2012-09-20 Thread Martin O'Shea
Thanks for the responses. They've given me much food for thought.

-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 20 Sep 2012 02:19
To: java-user@lucene.apache.org
Subject: RE: Using stop words with snowball analyzer and shingle filter

Hi Martin,

SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in
Lucene 5.0.

Looks like you're using Lucene 3.X; here's an (untested) Analyzer based on
the Lucene 3.6 EnglishAnalyzer (except substituting SnowballFilter for the
Porter stemmer, disabling stopword holes' position increments, and adding
ShingleFilter) that should basically do what you want:

--
// Declared first (and final) so the anonymous Analyzer below can use them.
final Version matchVersion = Version.LUCENE_36;
final String[] stopWords = new String[] { ... };
final Set<String> stopSet = StopFilter.makeStopSet(matchVersion, stopWords);
final String[] stemExclusions = new String[] { ... };
final Set<String> stemExclusionsSet = new HashSet<String>(Arrays.asList(stemExclusions));

Analyzer analyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // Prior to 3.1 we get the classic behavior; StandardFilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopSet);
    ((StopFilter) result).setEnablePositionIncrements(false);  // disable holes' position increments
    if (stemExclusionsSet.size() > 0) {
      result = new KeywordMarkerFilter(result, stemExclusionsSet);
    }
    result = new SnowballFilter(result, "English");
    result = new ShingleFilter(result, getnGramLength());
    return new TokenStreamComponents(source, result);
  }
};
--

Steve

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, September 19, 2012 7:16 PM
To: java-user@lucene.apache.org
Subject: Re: Using stop words with snowball analyzer and shingle filter

The underscores are due to the fact that the StopFilter defaults to "enable
position increments", so there are no terms at the positions where the stop
words appeared in the source text.

Unfortunately, SnowballAnalyzer does not pass that in as a parameter and is
"final" so you can't subclass it to override the "createComponents" method
that creates the StopFilter, so you would essentially have to copy the
source for SnowballAnalyzer and then add in the code to invoke
StopFilter.setEnablePositionIncrements the way StopFilterFactory does.

-- Jack Krupansky

-Original Message-
From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for ngram frequencies. Typically, this is done as
follows:



snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English",
stopWords);

shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer,
this.getnGramLength());



stopWords is set either to a full list of words to include in the ngrams or
to the words to remove from them. this.getnGramLength() simply returns the
current ngram length, up to a maximum of three.



If I use stopwords in filtering text "satellite is definitely falling to
Earth" for trigrams, the output is:



No=1, Key=to, Freq=1

No=2, Key=definitely, Freq=1

No=3, Key=falling to earth, Freq=1

No=4, Key=satellite, Freq=1

No=5, Key=is, Freq=1

No=6, Key=definitely falling to, Freq=1

No=7, Key=definitely falling, Freq=1

No=8, Key=falling, Freq=1

No=9, Key=to earth, Freq=1

No=10, Key=satellite is, Freq=1

No=11, Key=is definitely, Freq=1

No=12, Key=falling to, Freq=1

No=13, Key=is definitely falling, Freq=1

No=14, Key=earth, Freq=1

No=15, Key=satellite is definitely, Freq=1



But if I don't use stopwords for trigrams, the output is this:



No=1, Key=satellite, Freq=1

No=2, Key=falling _, Freq=1

No=3, Key=satellite _ _, Freq=1

No=4, Key=_ earth, Freq=1

No=5, Key=falling, Freq=1

No=6, Key=satellite _, Freq=1

No=7, Key=_ _, Freq=1

No=8, Key=_ falling _, Freq=1

No=9, Key=falling _ earth, Freq=1

No=10, Key=_, Freq=3

No=11, Key=earth, Freq=1

No=12, Key=_ _ falling, Freq=1

No=13, Key=_ falling, Freq=1



Why am I seeing underscores? I would have expected to see the simple unigrams,
plus "satellite falling", "falling earth", and "satellite falling earth".








-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


---

NPE while decrement ref count

2012-11-12 Thread Martin Sachs
Hi ,

I'm stuck on an NPE problem. It occurs only in our production
environment, from day to day.

Does anyone know anything about this?

java.lang.NullPointerException
at
org.apache.lucene.index.SegmentNorms.decRef(SegmentNorms.java:102)
at
org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:394)
at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:222)
at
org.apache.lucene.index.DirectoryReader.doClose(DirectoryReader.java:904)
at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:222)


Martin

-- 
** Dipl. Inform. Martin Sachs
** Senior Software-Developer / Software-Architect
T  +49 (30) 443 50 99 - 33
F  +49 (30) 443 50 99 - 99
E  martin.sa...@artnology.com
 Google+: martin.sachs.artnol...@gmail.com
   skype: ms

** artnology GmbH
A  Milastraße 4 / D-10437 Berlin
T  +49 (30) 443 50 99 - 0
F  +49 (30) 443 50 99 - 99
E  i...@artnology.com
I  http://www.artnology.com 

Geschäftsführer: Ekkehard Blome (CEO), Felix Kuschnick (CCO)
Registergericht: Amtsgericht Berlin Charlottenburg HRB 76376
UST-Id. DE 217652550



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Java HotSpot problem with search and 64-bit JVM

2012-11-12 Thread Martin Sachs
Hi,

I know this can be a little off-topic, but maybe someone knows something.

Background: I'm using RHEL 5.8 with a 64-bit JVM. With a 32-bit JVM the
searcher works fine.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x2b060334, pid=12235, tid=1522518336
#
# JRE version: 6.0_33-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.8-b03 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

---  T H R E A D  ---

Current thread (0x2aaab8c01000):  JavaThread
"ParallelMultiSearcher-17618-thread-3" [_thread_in_Java, id=23763,
stack(0x5aafc000,0x5abfd000)]

siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR),
si_addr=0x2aaad8495267

Registers:
RAX=0x2aaad849526e, RBX=0x00259267, RCX=0x0007,
RDX=0x
RSP=0x5abfb730, RBP=0x5abfb730, RSI=0xb7581547,
RDI=0x2aaad849525f
R8 =0x00145e05, R9 =0x000e, R10=0x2b060aa0,
R11=0x0010
R12=0x, R13=0x001e, R14=0x2aaad849526e,
R15=0x2aaab8c01000
RIP=0x2b060334, EFLAGS=0x00010202,
CSGSFS=0x0033, ERR=0x0004
  TRAPNO=0x000e


I have also contacted Red Hat support, but with no results yet.

martin


-- 
** Dipl. Inform. Martin Sachs
** Senior Software-Developer / Software-Architect
T  +49 (30) 443 50 99 - 33
F  +49 (30) 443 50 99 - 99
E  martin.sa...@artnology.com
 Google+: martin.sachs.artnol...@gmail.com
   skype: ms

** artnology GmbH
A  Milastraße 4 / D-10437 Berlin
T  +49 (30) 443 50 99 - 0
F  +49 (30) 443 50 99 - 99
E  i...@artnology.com
I  http://www.artnology.com 

Geschäftsführer: Ekkehard Blome (CEO), Felix Kuschnick (CCO)
Registergericht: Amtsgericht Berlin Charlottenburg HRB 76376
UST-Id. DE 217652550



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: NPE while decrement ref count

2012-11-12 Thread Martin Sachs
Oh yes, I missed the version:

I'm using lucene 3.6.1

Martin

Am 12.11.2012 09:40, schrieb Uwe Schindler:
> Which Lucene version?
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Martin Sachs [mailto:martin.sa...@artnology.com]
>> Sent: Monday, November 12, 2012 9:18 AM
>> To: java-user@lucene.apache.org
>> Subject: NPE while decrement ref count
>>
>> Hi ,
>>
>> i'm hanging with a NPE Problem. This occurs only on production environment
>> from day to day.
>>
>> Do anyone know some thing about this ?
>>
>> java.lang.NullPointerException
>> at
>> org.apache.lucene.index.SegmentNorms.decRef(SegmentNorms.java:102)
>> at
>> org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:394)
>> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:222)
>> at
>> org.apache.lucene.index.DirectoryReader.doClose(DirectoryReader.java:904
>> )
>> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:222)
>>
>>
>> Martin
>>
>> --
>> ** Dipl. Inform. Martin Sachs
>> ** Senior Software-Developer / Software-Architect T  +49 (30) 443 50 99 - 33
>> F  +49 (30) 443 50 99 - 99 E  martin.sa...@artnology.com
>>  Google+: martin.sachs.artnol...@gmail.com
>>skype: ms
>>
>> ** artnology GmbH
>> A  Milastraße 4 / D-10437 Berlin
>> T  +49 (30) 443 50 99 - 0
>> F  +49 (30) 443 50 99 - 99
>> E  i...@artnology.com
>> I  http://www.artnology.com
>>
>> Geschäftsführer: Ekkehard Blome (CEO), Felix Kuschnick (CCO)
>> Registergericht: Amtsgericht Berlin Charlottenburg HRB 76376 UST-Id. DE
>> 217652550
>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
** Dipl. Inform. Martin Sachs
** Senior Software-Developer / Software-Architect
T  +49 (30) 443 50 99 - 33
F  +49 (30) 443 50 99 - 99
E  martin.sa...@artnology.com
 Google+: martin.sachs.artnol...@gmail.com
   skype: ms

** artnology GmbH
A  Milastraße 4 / D-10437 Berlin
T  +49 (30) 443 50 99 - 0
F  +49 (30) 443 50 99 - 99
E  i...@artnology.com
I  http://www.artnology.com 

Geschäftsführer: Ekkehard Blome (CEO), Felix Kuschnick (CCO)
Registergericht: Amtsgericht Berlin Charlottenburg HRB 76376
UST-Id. DE 217652550



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: NPE while decrement ref count

2012-11-12 Thread Martin Sachs
Hi,

thanks for your fast response! I'm using:

java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) Server VM (build 20.0-b11, mixed mode)

This is a 32-bit Oracle JVM (JDK), not the 64-bit one on Red Hat, but maybe
it's a bug in RHEL.

While I write this, I am downloading the newest Oracle version to try it. I
have also just enabled assertions in the JVM; now I have to wait for the next
occurrence.

martin

Am 12.11.2012 09:56, schrieb Uwe Schindler:
> Hi,
>
> I opened the code, the NPE occurs here:
>
>   if (bytes != null) {
> assert bytesRef != null;
> bytesRef.decrementAndGet(); // <-- LINE 102, NPE occurs here
> bytes = null;
> bytesRef = null;
>   } else {
> assert bytesRef == null;
>   }
>
> This is completely impossible - can you enable assertions in your running JVM 
> to be sure ("-ea:org.apache.lucene..." JVM parameter)? It cannot be a 
> threading issue, because all this code is synchronized.
>
> As you reported another issue with your JVM has segmentation faults, could it 
> be that you have a buggy JVM. Are you using the official Oracle ones, or is 
> this a modified one, e.g. by RedHat? If it is OpenJDK, it is better to use 
> Java 7, as the Java 6 ones may not behave identical to Oracle's official JDKs 
> and may be more buggy.
>
> Can you also post your full JVM version (java -version) and where you got it 
> from?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Martin Sachs [mailto:martin.sa...@artnology.com]
>> Sent: Monday, November 12, 2012 9:43 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: NPE while decrement ref count
>>
>> oh yes i missed the version:
>>
>> I'm using lucene 3.6.1
>>
>> Martin
>>
>> Am 12.11.2012 09:40, schrieb Uwe Schindler:
>>> Which Lucene version?
>>>
>>> -
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>
>>>> -Original Message-
>>>> From: Martin Sachs [mailto:martin.sa...@artnology.com]
>>>> Sent: Monday, November 12, 2012 9:18 AM
>>>> To: java-user@lucene.apache.org
>>>> Subject: NPE while decrement ref count
>>>>
>>>> Hi ,
>>>>
>>>> i'm hanging with a NPE Problem. This occurs only on production
>>>> environment from day to day.
>>>>
>>>> Do anyone know some thing about this ?
>>>>
>>>> java.lang.NullPointerException
>>>> at
>>>>
>> org.apache.lucene.index.SegmentNorms.decRef(SegmentNorms.java:102)
>>>> at
>>>>
>> org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:394)
>>>> at
>> org.apache.lucene.index.IndexReader.decRef(IndexReader.java:222)
>>>> at
>>>> org.apache.lucene.index.DirectoryReader.doClose(DirectoryReader.java:
>>>> 904
>>>> )
>>>> at
>>>> org.apache.lucene.index.IndexReader.decRef(IndexReader.java:222)
>>>>
>>>>
>>>> Martin
>>>>
>>>> --
>>>> ** Dipl. Inform. Martin Sachs
>>>> ** Senior Software-Developer / Software-Architect T  +49 (30) 443 50
>>>> 99 - 33 F  +49 (30) 443 50 99 - 99 E  martin.sa...@artnology.com
>>>>  Google+: martin.sachs.artnol...@gmail.com
>>>>skype: ms
>>>>
>>>> ** artnology GmbH
>>>> A  Milastraße 4 / D-10437 Berlin
>>>> T  +49 (30) 443 50 99 - 0
>>>> F  +49 (30) 443 50 99 - 99
>>>> E  i...@artnology.com
>>>> I  http://www.artnology.com
>>>>
>>>> Geschäftsführer: Ekkehard Blome (CEO), Felix Kuschnick (CCO)
>>>> Registergericht: Amtsgericht Berlin Charlottenburg HRB 76376 UST-Id.
>>>> DE
>>>> 217652550
>>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>> --
>> ** Dipl. Inform. Martin Sachs
>> ** Senior Software-Developer / Software-Architect T  +49 (30) 443 50 99 - 33
>> F  +49 (30) 443 50 99 - 99 E  martin.sa...@artnology.com
>>  Go

on-the-fly "filters" from docID lists

2010-07-21 Thread Martin J
Hello, we are trying to implement a query type for Lucene (with eventual
target being Solr) where the query string passed in needs to be "filtered"
through a large list of document IDs per user. We can't store the user ID
information in the lucene index per document so we were planning to pull the
list of documents owned by user X from a key-value store at query time and
then build some sort of filter in memory before doing the Lucene/Solr query.
For example:

content:"cars" user_id:X567

would first pull the list of docIDs that user_id:X567 has "access" to from a
key-value store, and then we'd query the main index with content:"cars" but
only allow the docIDs that came back to be part of the response. The list of
docIDs can approach the hundreds of thousands.

What should I be looking at to implement such a feature?

Thank you
Martin
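
(A note for the archive: one untested way to build such a filter at query time
is the contrib TermsFilter, assuming every Lucene document stores its external
ID in an indexed, not-analyzed field. The "doc_id" field name and the
fetchAccessibleIds() call below are placeholders, not part of the original setup.)

List<String> allowedIds = fetchAccessibleIds("X567");    // from the key-value store (placeholder)

TermsFilter aclFilter = new TermsFilter();               // org.apache.lucene.search.TermsFilter (contrib queries)
for (String id : allowedIds) {
    aclFilter.addTerm(new Term("doc_id", id));           // the document's stored external ID
}

Query query = new TermQuery(new Term("content", "cars"));
TopDocs results = searcher.search(query, aclFilter, 10); // filter restricts hits to the allowed docs

With hundreds of thousands of IDs the filter is expensive to rebuild per query,
so caching it per user (for example with CachingWrapperFilter) is usually worth
considering.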


RE: Use of Lucene to store data from RSS feeds

2010-10-15 Thread Martin O'Shea
@Pulkit Singhal: Thanks for the reply. Just to clarify my post yesterday, I'm 
not sure if each row in the database table would form a document or not because 
I do not know if Lucene works in this manner. In my case, each row of the table 
represents a single polling of an RSS feed to retrieve any new postings over a 
given number of hours. If Lucene allows a document to have separate time-based 
entries, then I am happy to use it for indexing. But if a separate document is 
needed per row of the table, then I'm uncertain. I always do have the option of 
using Lucene for in-memory indexing of postings to calculate the keyword 
frequencies. This I know how to do.

The individual columns of my table represent the only two elements of each RSS 
item that I'm interested in retrieving text from, i.e. the title and 
description.

-Original Message-
From: Pulkit Singhal [mailto:pulkitsing...@gmail.com] 
Sent: 15 Oct 2010 13:36
To: java-user@lucene.apache.org
Subject: Re: Use of Lucene to store data from RSS feeds

When you ask:
a) will each feed would form a Lucene document, or
b) will each database row would form a lucene document
I'm inclined to say that really depends on what type of aggregation
tool or logic you are using.

I don't know if "Tika" does it, but if there is a tool out there that
can be pointed at a feed and tweaked to spit out documents with each
field having the settings that you want, then you can go with that
approach. But if you are already parsing the feed and storing the raw
data in a database table, then there is no reason you can't leverage
that. From a database-row perspective, you have already done a good deal
of work collecting the data and breaking it down into chunks
that Lucene can happily index as separate fields in a document.

By the way I think there are tools that read from the database
directly too but I won't try to make things too complicated.

The way I see it, if you were to use the row at this moment and index
the 4 columns as fields ... plus you could set the feed body to be
ANALYZED (why don't I see the feed body in your database table?) ...
then Lucene range queries on the date/time field could possibly return
some results. I am not sure how to get keyword frequencies, but if the
analyzed tokens that Lucene is keeping in its index roughly represent
the keywords that you are talking about, then I do know that Lucene
keeps an inverted index per token recording how many occurrences of it
there are ... maybe someone else on the list can comment on how to
extract that info in a query.

Sounds doable.
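
For what it's worth, an untested sketch of the one-document-per-row idea
(Lucene 3.x API; the variable names just stand in for the row's columns):

Document doc = new Document();
doc.add(new Field("feed_url", feedUrl, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("title", titleText, Field.Store.YES,
                  Field.Index.ANALYZED, Field.TermVector.YES));
doc.add(new Field("description", descriptionText, Field.Store.YES,
                  Field.Index.ANALYZED, Field.TermVector.YES));
// A sortable / range-queryable form of the polling timestamp.
doc.add(new Field("polling_date_time",
                  DateTools.dateToString(pollingDate, DateTools.Resolution.MINUTE),
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

The term vectors on title and description are what would later give you
per-document word frequencies.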

On Thu, Oct 14, 2010 at 10:17 AM,   wrote:
> Hello
>
> I would like to store data retrieved hourly from RSS feeds in a database or 
> in Lucene so that the text can be easily
> indexed for word frequencies.
>
> I need to get the text from the title and description elements of RSS items.
>
> Ideally, for each hourly retrieval from a given feed, I would add a row to a 
> table in a dataset made up of the
> following columns:
>
> feed_url, title_element_text, description_element_text, polling_date_time
>
> From this, I can look up any element in a feed and calculate keyword 
> frequencies based upon the length of time required.
>
> This can be done as a database table and hashmaps used to calculate word 
> frequencies. But can I do this in Lucene to
> this degree of granularity at all? If so, would each feed form a Lucene 
> document or would each 'row' from the
> database table form one?
>
> Can anyone advise?
>
> Thanks
>
> Martin O'Shea.
> --
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
Hello

 

I am trying to use a TermFreqVector to get a count of all words in a
Document as follows:

 

    // Search.
    int hitsPerPage = 10;
    IndexSearcher searcher = new IndexSearcher(index, true);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    // Display results.
    int docId = 0;
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title"));
        IndexReader trd = IndexReader.open(index);
        TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
        System.out.println(tfv.getTerms().toString());
        System.out.println(tfv.getTermFrequencies().toString());
    }

 

The code is very rough as it's only an experiment, but I'm under the
impression that the getTerms and getTermFrequencies methods of a
TermFreqVector should allow each word and its frequency in the document to
be displayed. All I get, though, is a NullPointerException. The index
consists of a single document made up of a simple string:

 

IndexWriter w = new IndexWriter(index, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);

addDoc(w, "Lucene for Dummies"); 

 

And the queryString being used is simply "dummies".  

 

Thanks

 

Martin O'Shea.



RE: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
Uwe

Thanks - I figured that bit out. I'm a Lucene 'newbie'.

What I would like to know though is if it is practical to search a single
document of one field simply by doing this:

IndexReader trd = IndexReader.open(index);
TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
String[] terms = tfv.getTerms();
int[] freqs = tfv.getTermFrequencies();
for (int i = 0; i < tfv.getTerms().length; i++) {
System.out.println("Term " + terms[i] + " Freq: " + freqs[i]);
}
trd.close();

where docId is set to 0.

The code works but can this be improved upon at all?

My situation is that I don't want to calculate the number of documents
containing a particular string. Rather, I want to get counts of individual
words in a field in a document, so I can concatenate the strings before
passing them to Lucene.

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: 20 Oct 2010 19:40
To: java-user@lucene.apache.org
Subject: RE: Using a TermFreqVector to get counts of all words in a document

TermVectors are only available when enabled for the field/document.
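
For example, a minimal sketch of an addDoc that turns them on for the "title"
field (Lucene 3.0.x flags; the helper simply mirrors the addDoc used in the
original message):

private static void addDoc(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    // Field.TermVector.YES is what makes getTermFreqVector() return a vector
    // instead of null for this field.
    doc.add(new Field("title", value, Field.Store.YES,
                      Field.Index.ANALYZED, Field.TermVector.YES));
    w.addDocument(doc);
}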

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Wednesday, October 20, 2010 8:23 PM
> To: java-user@lucene.apache.org
> Subject: Using a TermFreqVector to get counts of all words in a document
> 
> Hello
> 
> 
> 
> I am trying to use a TermFreqVector to get a count of all words in a
Document
> as follows:
> 
> 
> 
>// Search.
> 
> int hitsPerPage = 10;
> 
> IndexSearcher searcher = new IndexSearcher(index, true);
> 
> TopScoreDocCollector collector =
> TopScoreDocCollector.create(hitsPerPage, true);
> 
> searcher.search(q, collector);
> 
> ScoreDoc[] hits = collector.topDocs().scoreDocs;
> 
> 
> 
> // Display results.
> 
> int docId = 0;
> 
> System.out.println("Found " + hits.length + " hits.");
> 
> for (int i = 0; i < hits.length; ++i) {
> 
> docId = hits[i].doc;
> 
> Document d = searcher.doc(docId);
> 
> System.out.println((i + 1) + ". " + d.get("title"));
> 
> IndexReader trd = IndexReader.open(index);
> 
> TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
> 
> System.out.println(tfv.getTerms().toString());
> 
> System.out.println(tfv.getTermFrequencies().toString());
> 
> }
> 
> 
> 
> The code is very rough as its only an experiment but I'm under the
impression
> that the getTerms and getTermFrequencies methods for a TermFreqVector
> should allow each word and its frequency in the document to be displayed.
All I
> get though is a NullPointerError. The index consists of a single document
made
> up of a simple string:
> 
> 
> 
> IndexWriter w = new IndexWriter(index, analyzer, true,
> IndexWriter.MaxFieldLength.UNLIMITED);
> 
> addDoc(w, "Lucene for Dummies");
> 
> 
> 
> And the queryString being used is simply "dummies".
> 
> 
> 
> Thanks
> 
> 
> 
> Martin O'Shea.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







RE: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201010.mbox/%3c128
7065863.4cb7110774...@netmail.pipex.net%3e will give you a better idea of
what I'm moving towards.

It's all a bit grey at the moment so further investigation is inevitable.

I expect that a combination of MySQL database storage and Lucene indexing is
going to be the end result.



-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org] 
Sent: 20 Oct 2010 21:20
To: java-user@lucene.apache.org
Subject: Re: Using a TermFreqVector to get counts of all words in a document


On Oct 20, 2010, at 2:53 PM, Martin O'Shea wrote:

> Uwe
> 
> Thanks - I figured that bit out. I'm a Lucene 'newbie'.
> 
> What I would like to know though is if it is practical to search a single
> document of one field simply by doing this:
> 
> IndexReader trd = IndexReader.open(index);
>TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
>String[] terms = tfv.getTerms();
>int[] freqs = tfv.getTermFrequencies();
>for (int i = 0; i < tfv.getTerms().length; i++) {
>System.out.println("Term " + terms[i] + " Freq: " + freqs[i]);
>}
>trd.close();
> 
> where docId is set to 0.
> 
> The code works but can this be improved upon at all?
> 
> My situation is where I don't want to calculate the number of documents
with
> a particular string. Rather I want to get counts of individual words in a
> field in a document. So I can concatenate the strings before passing it to
> Lucene.

Can you describe the bigger problem you are trying to solve?  This looks
like a classic XY problem: http://people.apache.org/~hossman/#xyproblem

What you are doing above will work OK for what you describe (up to the
"passing it to Lucene" part), but you probably should explore the use of the
TermVectorMapper which provides a callback mechanism (similar to a SAX
parser) that will allow you to build your data structures on the fly instead
of having to serialize them into two parallel arrays and then loop over
those arrays to create some other structure.
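
A minimal sketch of such a mapper (Lucene 3.0.x API; the class name is only
illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class FrequencyMapper extends TermVectorMapper {
    private final Map<String, Integer> frequencies = new HashMap<String, Integer>();

    @Override
    public void setExpectations(String field, int numTerms,
                                boolean storeOffsets, boolean storePositions) {
        // Nothing to prepare: only term text and frequency are needed.
    }

    @Override
    public void map(String term, int frequency,
                    TermVectorOffsetInfo[] offsets, int[] positions) {
        frequencies.put(term, frequency);   // called once per term in the vector
    }

    public Map<String, Integer> getWordFrequencies() {
        return frequencies;
    }
}

It is then handed straight to the reader, e.g. reader.getTermFreqVector(docId, "title", mapper);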


--
Grant Ingersoll
http://www.lucidimagination.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







Use of hyphens in StandardAnalyzer

2010-10-24 Thread Martin O'Shea
Hello

 

I have a StandardAnalyzer working which retrieves words and frequencies from
a single document using a TermVectorMapper which is populating a HashMap.

 

But if I use the following text as a field in my document, i.e. 

 

addDoc(w, "lucene Lawton-Browne Lucene");

 

The word frequencies returned in the HashMap are:

 

browne 1

lucene 2

lawton 1

 

The problem is the words 'lawton' and 'browne'. If this is an actual
'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where the
name is actually a single word?

 

I've tried combinations of:

 

addDoc(w, "lucene \"Lawton-Browne\" Lucene");

 

And single quotes but without success.

 

Thanks

 

Martin O'Shea.

 

 

 



RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Martin O'Shea
A good suggestion. But I'm using Lucene 3.0.2, and the constructor for a 
StandardAnalyzer has Version.LUCENE_30 as its highest value.

-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 24 Oct 2010 21:31
To: java-user@lucene.apache.org
Subject: RE: Use of hyphens in StandardAnalyzer

Hi Martin,

StandardTokenizer and -Analyzer have been changed, as of future version 3.1 
(the next release) to support the Unicode segmentation rules in UAX#29.  My 
(untested) guess is that your hyphenated word will be kept as a single token if 
you set the version to 3.1 or higher in the constructor.

Steve

> -Original Message-
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Sunday, October 24, 2010 3:59 PM
> To: java-user@lucene.apache.org
> Subject: Use of hyphens in StandardAnalyzer
> 
> Hello
> 
> 
> 
> I have a StandardAnalyzer working which retrieves words and frequencies
> from
> a single document using a TermVectorMapper which is populating a HashMap.
> 
> 
> 
> But if I use the following text as a field in my document, i.e.
> 
> 
> 
> addDoc(w, "lucene Lawton-Browne Lucene");
> 
> 
> 
> The word frequencies returned in the HashMap are:
> 
> 
> 
> browne 1
> 
> lucene 2
> 
> lawton 1
> 
> 
> 
> The problem is the words 'lawton' and 'browne'. If this is an actual
> 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
> the
> name is actually a single word?
> 
> 
> 
> I've tried combinations of:
> 
> 
> 
> addDoc(w, "lucene \"Lawton-Browne\" Lucene");
> 
> 
> 
> And single quotes but without success.
> 
> 
> 
> Thanks
> 
> 
> 
> Martin O'Shea.
> 
> 
> 
> 
> 
> 




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



FW: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Martin O'Shea
A good suggestion. But I'm using Lucene 3.0.2, and the constructor for a 
StandardAnalyzer has Version.LUCENE_30 as its highest value. Do you know when 
3.1 is due?

-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 24 Oct 2010 21:31
To: java-user@lucene.apache.org
Subject: RE: Use of hyphens in StandardAnalyzer

Hi Martin,

StandardTokenizer and -Analyzer have been changed, as of future version 3.1 
(the next release) to support the Unicode segmentation rules in UAX#29.  My 
(untested) guess is that your hyphenated word will be kept as a single token if 
you set the version to 3.1 or higher in the constructor.

Steve

> -Original Message-
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Sunday, October 24, 2010 3:59 PM
> To: java-user@lucene.apache.org
> Subject: Use of hyphens in StandardAnalyzer
> 
> Hello
> 
> 
> 
> I have a StandardAnalyzer working which retrieves words and frequencies
> from
> a single document using a TermVectorMapper which is populating a HashMap.
> 
> 
> 
> But if I use the following text as a field in my document, i.e.
> 
> 
> 
> addDoc(w, "lucene Lawton-Browne Lucene");
> 
> 
> 
> The word frequencies returned in the HashMap are:
> 
> 
> 
> browne 1
> 
> lucene 2
> 
> lawton 1
> 
> 
> 
> The problem is the words 'lawton' and 'browne'. If this is an actual
> 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
> the
> name is actually a single word?
> 
> 
> 
> I've tried combinations of:
> 
> 
> 
> addDoc(w, "lucene \"Lawton-Browne\" Lucene");
> 
> 
> 
> And single quotes but without success.
> 
> 
> 
> Thanks
> 
> 
> 
> Martin O'Shea.
> 
> 
> 
> 
> 
> 






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Combining analyzers in Lucene

2011-03-05 Thread Martin O'Shea
Hello
I have a situation where I'm using two methods in a Java class to implement
a StandardAnalyzer in Lucene to index text strings and return their word
frequencies as follows:

public void indexText(String suffix, boolean includeStopWords) {

    StandardAnalyzer analyzer = null;

    if (includeStopWords) {
        analyzer = new StandardAnalyzer(Version.LUCENE_30);
    }
    else {
        // Get Stop_Words to exclude them.
        Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
        analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
    }

    try {
        // Index text.
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        this.addTextToIndex(w, this.getTextToIndex());
        w.close();

        // Read index.
        IndexReader ir = IndexReader.open(index);
        Text_TermVectorMapper ttvm = new Text_TermVectorMapper();

        int docId = 0;
        ir.getTermFreqVector(docId, "text", ttvm);

        // Set output.
        this.setWordFrequencies(ttvm.getWordFrequencies());
        ir.close();
    }
    catch (Exception ex) {
        logger.error("Error indexing elements of RSS_Feed for object " +
                suffix + "\n", ex);
    }
}

private void addTextToIndex(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field("text", value, Field.Store.YES,
            Field.Index.ANALYZED, Field.TermVector.YES));
    w.addDocument(doc);
}

Which works perfectly well but I would like to combine this with stemming
using a SnowballAnalyzer as well. 

This class also has two instance variables shown in a constructor below:

public Text_Indexer(String textToIndex) {
this.textToIndex = textToIndex;
this.wordFrequencies = new HashMap();
}

Can anyone tell me how best to achieve this with the code above? Should I
re-index the text when it is returned by the above code or can the stemming
be introduced into the above at all?
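
An untested option, for reference: the contrib SnowballAnalyzer stems and also
accepts a stop-word set, so the analyzer construction above could become
something like the following, with the rest of indexText() unchanged. (Note
that SnowballAnalyzer built without a stop set applies no stop words at all,
whereas StandardAnalyzer's no-argument form uses its default English set.)

Analyzer analyzer;
if (includeStopWords) {
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
} else {
    // Same stop-word set as in the StandardAnalyzer version.
    Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
}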

Thanks

Mr Morgan.




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



JTRES 2011 Call for Papers

2011-04-25 Thread Martin Schoeberl
---

* Paper Submission:           June 26, 2011
* Notification of Acceptance: August 4, 2011
* Camera Ready Paper Due:     August 20, 2011
* Workshop:                   September 26-28, 2011


Program Chair:
--

Anders P. Ravn, University of Aalborg, Denmark


Workshop Chair:
--

Andy Wellings, University of York 


Steering Committee:
---

Andy Wellings, University of York
Angelo Corsaro, PrismTech
Corrado Santoro, University of Catania
Doug Lea, State University of New York at Oswego
Gregory Bollella, Oracle
Jan Vitek, Purdue University
Peter Dibble, TimeSys


Program Committee:
--

Ted Baker, Florida State University
Angelo Corsaro, PrismTech
Peter Dibble, TimeSys
Rene R. Hansen, Aalborg University
Theresa Higuera, Universidad Complutense de Madrid
Tomas Kalibera, University of Kent
Christoph Kirsch, University of Salzburg
Gary T. Leavens, University of Central Florida
Doug Locke, LC Systems Services
Kelvin Nilsen, Aonix
Marek Prochazka, European Space Agency
Anders Ravn, Aalborg University
Corrado Santoro, University of Catania
Martin Schoeberl, Technical University of Denmark
Fridtjof Siebert, Aicas
Jan Vitek, Purdue University
Andy Wellings, University of York
Lukasz Ziarek, Fiji Systems


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



SpanTermQuery getSpans

2014-04-01 Thread Martin Líška
Dear all,

I'm experiencing trouble with the SpanTermQuery.getSpans(AtomicReaderContext
context, Bits acceptDocs, Map<Term, TermContext> termContexts) method in
version 4.6. I want to use it to retrieve the payloads of matched spans.

First, I search the index with IndexSearcher.search(query, limit) and I get
TopDocs. In these TopDocs there is a certain document A. I know that the
query is an instance of PayloadTermQuery, so I search for spans
using query.getSpans(indexReader.leaves().get(0), null, new HashMap()); but
this won't return the spans for document A.

I observed that the getSpans method won't return any spans for documents with
IDs greater than, say, 900, even though documents with IDs greater than 900
were returned in the original search. All other documents below ID 900 are
returned successfully from the getSpans method.

I also tried passing all the leaves of indexReader to getSpans with no
effect.

Please help.

Thank you
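
(A sketch for anyone hitting the same symptom: getSpans() works one index
segment at a time and returns segment-local document IDs, so iterating every
leaf and adding docBase is usually needed to line the spans up with the global
IDs in TopDocs. Lucene 4.6 API; spanQuery and indexReader are the objects from
the message above.)

Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
for (AtomicReaderContext ctx : indexReader.leaves()) {
    Spans spans = spanQuery.getSpans(ctx, ctx.reader().getLiveDocs(), termContexts);
    while (spans.next()) {
        int globalDoc = ctx.docBase + spans.doc();   // map the segment-local ID back
        if (spans.isPayloadAvailable()) {
            Collection<byte[]> payloads = spans.getPayload();
            // ... use globalDoc and payloads ...
        }
    }
}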


Re: SpanTermQuery getSpans

2014-04-02 Thread Martin Líška
Gregory,

that was indeed my problem. Thank you very much for your support.

Martin

This is a reply to
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201404.mbox/%3CCAASL1-8jRbEG%3DLi96eDLY-Pr_zwev6vk4vk4BW_ryKF1Dnb4KA%40mail.gmail.com%3E


On 1 April 2014 23:52, Martin Líška  wrote:
>
> Dear all,
>
> I'm experiencing troubles with SpanTermQuery.getSpans(AtomicReaderContext 
> context, Bits acceptDocs, Map termContexts) method in 
> version 4.6. I want to use it to retrieve payloads of matched spans.
>
> First, I search the index with IndexSearcher.search(query, limit) and I get 
> TopDocs. In these TopDocs, there is a certain document A. I know, that the 
> query is an instance of PayloadTermQuery, so I search for spans using 
> query.getSpans(indexReader.leaves().get(0), null, new HashMap()); but this 
> wont return the spans for the document A.
>
> I observed, that getSpans method won't return any spans for documents with 
> IDs greater than say 900, even though documents with IDs greater than 900 
> were returned in the original search. All other documents below ID 900 are 
> returned sucessfully from getSpans method.
>
> I also tried passing all the leaves of indexReader to getSpans with no effect.
>
> Please help.
>
> Thank you

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: A really hairy token graph case

2014-10-24 Thread Will Martin
Hi Benson:

This is the case with n-gramming (though you have a more complicated start 
chooser than most I imagine).  Does that help get your ideas unblocked?

Will

-Original Message-
From: Benson Margulies [mailto:bimargul...@gmail.com] 
Sent: Friday, October 24, 2014 4:43 PM
To: java-user@lucene.apache.org
Subject: A really hairy token graph case

Consider a case where we have a token which can be subdivided in several ways. 
This can happen in German. We'd like to represent this with 
positionIncrement/positionLength, but it does not seem possible.

Once the position has moved out from one set of 'subtokens', we see no way to 
move it back for the second set of alternatives.

Is this something that was considered?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






RE: A really hairy token graph case

2014-10-24 Thread Will Martin
Benson: I'm in danger of trying to remember CPL's German decompounder and how 
we used it. That would be a very unreliable memory.

However, at the link below David and Rupert have a resoundingly informative 
discussion about making something similar work for synonyms. It might bear 
reading through the knowledge-base info captured there.

https://github.com/OpenSextant/SolrTextTagger/issues/10




-Original Message-
From: Benson Margulies [mailto:ben...@basistech.com] 
Sent: Friday, October 24, 2014 5:54 PM
To: java-user@lucene.apache.org; Richard Barnes
Subject: Re: A really hairy token graph case

I don't think so ... Let me be specific:

First, consider the case of one 'analysis': an input token maps to a lemma and 
a sequence of components.

So, we produce

 surface form
 lemma    PI 0
 comp1    PI 0
 comp2    PI 1
 ...

with PL set appropriately to cover the pieces. All the information is there.

Now, if we have another analysis, we want to 'rewind' position, and deliver 
another lemma and another set of components, but, of course, we can't do that.

The best we could do is something like:

surface form
lemma1   PI 0
lemma2   PI 0
...
lemmaN   PI 0

comp0-1  PI 0
comp1-1  PI 0
...
comp0-N
...
compM-N

That is, group all the first-components, and all the second-components.

But now the bits and pieces of the compounds are interspersed. Maybe that's OK.


On Fri, Oct 24, 2014 at 5:44 PM, Will Martin  wrote:

> HI Benson:
>
> This is the case with n-gramming (though you have a more complicated 
> start chooser than most I imagine).  Does that help get your ideas unblocked?
>
> Will
>
> -Original Message-
> From: Benson Margulies [mailto:bimargul...@gmail.com]
> Sent: Friday, October 24, 2014 4:43 PM
> To: java-user@lucene.apache.org
> Subject: A really hairy token graph case
>
> Consider a case where we have a token which can be subdivided in 
> several ways. This can happen in German. We'd like to represent this 
> with positionIncrement/positionLength, but it does not seem possible.
>
> Once the position has moved out from one set of 'subtokens', we see no 
> way to move it back for the second set of alternatives.
>
> Is this something that was considered?
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Martin O'Shea
I realise that 3.0.2 is an old version of Lucene but if I have Java code as
follows:

 

int nGramLength = 3;

Set<String> stopWords = new HashSet<String>();
stopWords.add("the");
stopWords.add("and");
...

SnowballAnalyzer snowballAnalyzer =
    new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);

ShingleAnalyzerWrapper shingleAnalyzer =
    new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);

 

This will generate the frequencies of ngrams from a particular string of
text without stop words. How can I disable the LowerCaseFilter which forms
part of the SnowballAnalyzer? I want to preserve the case of the ngrams
generated, so that I can perform various counts according to the presence or
absence of upper-case characters in the ngrams.

 

I am something of a Lucene newbie. And I should add that upgrading the
version of Lucene is not an option here.



RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Martin O'Shea
Uwe

Thanks for the reply. Given that SnowBallAnalyzer is made up of a series of 
filters, I was thinking about something like this where I 'pipe' output from 
one filter to the next:

standardTokenizer = new StandardTokenizer(...);
standardFilter = new StandardFilter(standardTokenizer, ...);
stopFilter = new StopFilter(standardFilter, ...);
snowballFilter = new SnowballFilter(stopFilter, ...);

But ignore LowerCaseFilter. Does this make sense?
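
In full, something like this untested sketch (stopWords and nGramLength as in
my earlier snippet), so that the original case survives into the shingles:

Analyzer caseSensitiveSnowball = new Analyzer() {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        // ignoreCase = true so stop words are still caught whatever their case.
        result = new StopFilter(true, result, stopWords, true);
        // Note: the English snowball stemmer expects lowercased input, so stems
        // of capitalised words may come out differently.
        result = new SnowballFilter(result, "English");
        return result;
    }
};
ShingleAnalyzerWrapper shingleAnalyzer =
    new ShingleAnalyzerWrapper(caseSensitiveSnowball, nGramLength);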

Thanks

Martin O'Shea.
-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: 10 Nov 2014 14:06
To: java-user@lucene.apache.org
Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in 
Lucene 3.0.2

Hi,

In general, you cannot change Analyzers; they are "examples" and can be seen as 
"best practice". If you want to modify them, write your own Analyzer subclass 
which uses the Tokenizers and TokenFilters you want. You can, for example, clone 
the source code of the original and remove the LowercaseFilter. Analyzers are 
very simple; there is no logic in them, just some "configuration" (which 
Tokenizer and which TokenFilters). In later Lucene 3 and Lucene 4 this is very 
simple: you just need to override createComponents in your Analyzer class and 
add your "configuration" there.

If you use Apache Solr or Elasticsearch you can create your analyzers by XML or 
JSON configuration.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> Sent: Monday, November 10, 2014 2:54 PM
> To: java-user@lucene.apache.org
> Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in 
> Lucene 3.0.2
> 
> I realise that 3.0.2 is an old version of Lucene but if I have Java 
> code as
> follows:
> 
> 
> 
> int nGramLength = 3;
> 
> Set stopWords = new Set();
> 
> stopwords.add("the");
> 
> stopwords.add("and");
> 
> ...
> 
> SnowballAnalyzer snowballAnalyzer = new 
> SnowballAnalyzer(Version.LUCENE_30,
> "English", stopWords);
> 
> ShingleAnalyzerWrapper shingleAnalyzer = new 
> ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> 
> 
> 
> Which will generate the frequency of ngrams from a particular a string 
> of text without stop words, how can I disable the LowerCaseFilter 
> which forms part of the SnowBallAnalyzer? I want to preserve the case 
> of the ngrams generated so that I can perform various counts according 
> to the presence / absence of upper case characters in the ngrams.
> 
> 
> 
> I am something of a Lucene newbie. And I should add that upgrading the 
> version of Lucene is not an option here.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Martin O'Shea
Thanks Uwe.

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: 10 Nov 2014 14:43
To: java-user@lucene.apache.org
Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in 
Lucene 3.0.2

Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
> series of filters, I was thinking about something like this where I 
> 'pipe' output from one filter to the next:
> 
> standardTokenizer =new StandardTokenizer (...); standardFilter = new 
> StandardFilter(standardTokenizer,...);
> stopFilter = new StopFilter(standardFilter,...); snowballFilter = new 
> SnowballFilter(stopFilter,...);
> 
> But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in 
your own package and remove LowercaseFilter. But be aware, it could be that 
snowball needs lowercased terms to correctly do stemming!!! I don't know about 
this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You should 
make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set 
stopWords, boolean ignoreCase)

Uwe

> Martin O'Shea.
> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: 10 Nov 2014 14 06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers, they are "examples" and can 
> be seen as "best practise". If you want to modify them, write your own 
> Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
> you like. You can for example clone the source code of the original 
> and remove LowercaseFilter. Analyzers are very simple, there is no 
> logic in them, it's just some "configuration" (which Tokenizer and 
> which TokenFilters). In later Lucene 3 and Lucene 4, this is very 
> simple: You just need to override createComponents in Analyzer class and add 
> your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene but if I have Java 
> > code as
> > follows:
> >
> >
> >
> > int nGramLength = 3;
> >
> > Set<String> stopWords = new HashSet<String>();
> >
> > stopWords.add("the");
> >
> > stopWords.add("and");
> >
> > ...
> >
> > SnowballAnalyzer snowballAnalyzer = new
> > SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
> >
> > ShingleAnalyzerWrapper shingleAnalyzer = new
> > ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> >
> >
> > This will generate the frequency of ngrams from a particular string of text 
> > without stop words. How can I disable the LowerCaseFilter which forms part 
> > of the SnowballAnalyzer? I want to preserve the case of the generated ngrams 
> > so that I can perform various counts according to the presence / absence of 
> > upper case characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading 
> > the version of Lucene is not an option here.
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

"NOTE: SnowballFilter expects lowercased text." [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/anal
ysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler  wrote:
Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
> series of filters, I was thinking about something like this where I 
> 'pipe' output from one filter to the next:
> 
> standardTokenizer =new StandardTokenizer (...); standardFilter = new 
> StandardFilter(standardTokenizer,...);
> stopFilter = new StopFilter(standardFilter,...); snowballFilter = new 
> SnowballFilter(stopFilter,...);
> 
> But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You
should make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set
stopWords, boolean ignoreCase)

Uwe

> Martin O'Shea.
> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: 10 Nov 2014 14 06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers, they are "examples" and can 
> be seen as "best practise". If you want to modify them, write your own 
> Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
> you like. You can for example clone the source code of the original 
> and remove LowercaseFilter. Analyzers are very simple, there is no 
> logic in them, it's just some "configuration" (which Tokenizer and 
> which TokenFilters). In later Lucene 3 and Lucene 4, this is very 
> simple: You just need to override createComponents in Analyzer class and
add your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene but if I have Java 
> > code as
> > follows:
> >
> >
> >
> > int nGramLength = 3;
> >
> > Set stopWords = new Set();
> >
> > stopwords.add("the");
> >
> > stopwords.add("and");
> >
> > ...
> >
> > SnowballAnalyzer snowballAnalyzer = new 
> > SnowballAnalyzer(Version.LUCENE_30,
> > "English", stopWords);
> >
> > ShingleAnalyzerWrapper shingleAnalyzer = new 
> > ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> >
> >
> > Which will generate the frequency of ngrams from a particular a 
> > string of text without stop words, how can I disable the 
> > LowerCaseFilter which forms part of the SnowBallAnalyzer? I want to 
> > preserve the case of the ngrams generated so that I can perform 
> > various counts according to the presence / absence of upper case
characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading 
> > the version of Lucene is not an option here.
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



> 
> 



RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
Ahmet, 

Yes that is quite true. But as this is only a proof of concept application,
I'm prepared for things to be 'imperfect'.

Martin O'Shea.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 11 Nov 2014 18:26
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

With that analyser, your searches (for the same word, but with different
capitalisation) could return different results.

Ahmet


On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea 
wrote:
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

"NOTE: SnowballFilter expects lowercased text." [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/anal
ysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler  wrote:
Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
> series of filters, I was thinking about something like this where I 
> 'pipe' output from one filter to the next:
> 
> standardTokenizer =new StandardTokenizer (...); standardFilter = new 
> StandardFilter(standardTokenizer,...);
> stopFilter = new StopFilter(standardFilter,...); snowballFilter = new 
> SnowballFilter(stopFilter,...);
> 
> But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You
should make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set
stopWords, boolean ignoreCase)

Uwe

> Martin O'Shea.
> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: 10 Nov 2014 14 06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers, they are "examples" and can 
> be seen as "best practise". If you want to modify them, write your own 
> Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
> you like. You can for example clone the source code of the original 
> and remove LowercaseFilter. Analyzers are very simple, there is no 
> logic in them, it's just some "configuration" (which Tokenizer and 
> which TokenFilters). In later Lucene 3 and Lucene 4, this is very
> simple: You just need to override createComponents in Analyzer class 
> and
add your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene but if I have Java 
> > code as
> > follows:
> >
> >
> >
> > int nGramLength = 3;
> >
> > Set stopWords = new Set();
> >
> > stopwords.add("the");
> >
> > stopwords.add("and");
> >
> > ...
> >
> > SnowballAnalyzer snowballAnalyzer = new 
> > SnowballAnalyzer(Version.LUCENE_30,
> > "English", stopWords);
> >
> > ShingleAnalyzerWrapper shingleAnalyzer = new 
> > ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> >
> >
> > Which will generate the frequency of ngrams from a particular a 
> > string of text without stop words, how can I disable the 
> > LowerCaseFilter which forms part of the SnowBallAnalyzer? I want to 
> > preserve the case of the ngrams generated so that I can perform 
> > various counts according to the presence / absence of upper case
characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newb

CfP: ISORC 2015 - IEEE International Symposium On Real-Time Computing

2014-11-19 Thread Martin Schoeberl

==

CALL FOR PAPERS

 The 18th IEEE International Symposium On Real-Time Computing
  ISORC 2015




 Auckland, New Zealand
  April 13 - 17, 2015

 http://isorc2015.org/

==


Scope and Topics of Interest


ISORC has become established as the leading event devoted to state-of-the-art 
research in the field of object/component/service-oriented real-time 
distributed computing (ORC) technology. We invite original submissions from 
academia and industry pertaining to all aspects of ORC technology. These 
include, but are not limited to:

* Programming and system engineering: ORC paradigms, object/component 
models, languages, synchronous languages, RT CORBA, Embedded .NET, RT RMI, RT 
Java, UML, model-driven development, specification, design, verification, 
validation, testing, maintenance, system of systems, time-predictable systems 
and hardware.

* Distributed computing and communication infrastructures: real-time 
communication, networked platforms, protocols, Internet QoS, peer-to-peer 
computing, sensor networks, trusted and dependable systems.

* System software: real-time kernels and OS, middleware support for ORC, 
QoS management, extensibility, synchronization, resource allocation, 
scheduling, fault tolerance, security.

* Applications: embedded systems (automotive, avionics, consumer 
electronics, building systems, sensors, etc), multimedia processing, RT 
Web-based applications, real-time object-oriented simulations.

* System evaluation: output accuracy, timing, dependability, end-to-end 
QoS, overhead, fault detection and recovery time.

* Cyber-physical systems: mobile systems, wireless sensor networks, 
real-time analytics, autonomous automotive systems, process control systems, 
distributed robotics.


Guidelines for Manuscripts
--

* Research Papers:

Papers should describe original work and be maximum 8 pages in length using the 
IEEE paper format (link to templates). A maximum of two extra pages may be 
purchased.

* Papers presenting Industrial Advances:

Industrial papers and practitioner reports, describing experiences of using ORC 
technology in application or tool development projects, are an integral part of 
the technical program of ISORC. A majority of these papers are expected to be 
shorter and less formal than research papers. They should clearly identify, and 
discuss in detail, the issues that represent notable industrial advances. 
Reports with project metrics supporting their claims are particularly sought, 
as well as those that show both benefits and drawbacks of the approaches used 
in the given project.

* Short Synopses:

Short papers (4 pages or less using the IEEE format) on substantial real-time 
applications are also invited, and should contain enough information for the 
program committee to understand the scope of the project and evaluate the 
novelty of the problem or approach.

According to program committee guidelines, papers presenting practical 
techniques, ideas, or evaluations will be favored. Papers reporting 
experimentation results and industrial experiences are particularly welcome. 
Originality will not be interpreted too narrowly.

Papers that are based on severely unrealistic assumptions will not be accepted 
however mathematically or logically sophisticated the discussion may be.

All accepted submissions will appear in the proceedings published by IEEE. A 
person will not be allowed to present more than 2 papers at the symposium. 

Papers are submitted through Easychair. Please use the submission link:

https://www.easychair.org/conferences/?conf=isorc2015


Journal Special Issue
-

The best papers from ISORC 2015 will be invited for submission to the ISORC 
special issue of the ACM Transactions on Embedded Computing Systems (TECS).


Important Dates
---

* Paper Submission: 12 December, 2014
* Notification of Acceptance: 30 January, 2015
* Camera Ready Paper Due: 20 February, 2015
* Conference: 13-17 April, 2015

General Co-Chairs:
--

Anirudda Gokhale, Vanderbilt University, USA
Parthasarathi Roop, University of Auckland, New Zealand 
Paul Townend, University of Leeds, United Kingdom


Program Co-Chairs:
--

Martin Schoeberl, Technical University of Denmark, Denmark
Chunming Hu, Beihang University, China


Workshop Chair:
---

Marco Aurelio Wehrmeister, Federal Univ. Technology - Parana, Brazil

Program Committee:
--

Sidharta Andalam, TUM Create, Singapore
Takuya Azumi, Ritsumeikan University, Japan
Farokh Bastani, University of Texas Dallas, USA
Uwe Brinkschulte, University


ISORC 2015 - Deadline Extension: 28/12/2014

2014-12-12 Thread Martin Schoeberl
Extended submission deadline: 28 December 2014
Submission at: https://www.easychair.org/conferences/?conf=isorc2015

==

CALL FOR PAPERS

 The 18th IEEE International Symposium On Real-Time Computing
  ISORC 2015




 Auckland, New Zealand
  April 13 - 17, 2015

 http://isorc2015.org/

==


Scope and Topics of Interest


ISORC has become established as the leading event devoted to state-of-the-art 
research in the field of object/component/service-oriented real-time 
distributed computing (ORC) technology. We invite original submissions from 
academia and industry pertaining to all aspects of ORC technology. These 
include, but are not limited to:

* Programming and system engineering: ORC paradigms, object/component 
models, languages, synchronous languages, RT CORBA, Embedded .NET, RT RMI, RT 
Java, UML, model-driven development, specification, design, verification, 
validation, testing, maintenance, system of systems, time-predictable systems 
and hardware.

* Distributed computing and communication infrastructures: real-time 
communication, networked platforms, protocols, Internet QoS, peer-to-peer 
computing, sensor networks, trusted and dependable systems.

* System software: real-time kernels and OS, middleware support for ORC, 
QoS management, extensibility, synchronization, resource allocation, 
scheduling, fault tolerance, security.

* Applications: embedded systems (automotive, avionics, consumer 
electronics, building systems, sensors, etc), multimedia processing, RT 
Web-based applications, real-time object-oriented simulations.

* System evaluation: output accuracy, timing, dependability, end-to-end 
QoS, overhead, fault detection and recovery time.

* Cyber-physical systems: mobile systems, wireless sensor networks, 
real-time analytics, autonomous automotive systems, process control systems, 
distributed robotics.


Guidelines for Manuscripts
--

* Research Papers:

Papers should describe original work and be maximum 8 pages in length using the 
IEEE paper format (link to templates). A maximum of two extra pages may be 
purchased.

* Papers presenting Industrial Advances:

Industrial papers and practitioner reports, describing experiences of using ORC 
technology in application or tool development projects, are an integral part of 
the technical program of ISORC. A majority of these papers are expected to be 
shorter and less formal than research papers. They should clearly identify, and 
discuss in detail, the issues that represent notable industrial advances. 
Reports with project metrics supporting their claims are particularly sought, 
as well as those that show both benefits and drawbacks of the approaches used 
in the given project.

* Short Synopses:

Short papers (4 pages or less using the IEEE format) on substantial real-time 
applications are also invited, and should contain enough information for the 
program committee to understand the scope of the project and evaluate the 
novelty of the problem or approach.

According to program committee guidelines, papers presenting practical 
techniques, ideas, or evaluations will be favored. Papers reporting 
experimentation results and industrial experiences are particularly welcome. 
Originality will not be interpreted too narrowly.

Papers that are based on severely unrealistic assumptions will not be accepted 
however mathematically or logically sophisticated the discussion may be.

All accepted submissions will appear in the proceedings published by IEEE. A 
person will not be allowed to present more than 2 papers at the symposium. 

Papers are submitted through Easychair. Please use the submission link:

https://www.easychair.org/conferences/?conf=isorc2015


Journal Special Issue
-

The best papers from ISORC 2015 will be invited for submission to the ISORC 
special issue of the ACM Transactions on Embedded Computing Systems (TECS).


Important Dates
---

* Paper Submission (extended deadline): 28 December, 2014
* Notification of Acceptance: 30 January, 2015
* Camera Ready Paper Due: 20 February, 2015
* Conference: 13-17 April, 2015

General Co-Chairs:
--

Anirudda Gokhale, Vanderbilt University, USA
Parthasarathi Roop, University of Auckland, New Zealand 
Paul Townend, University of Leeds, United Kingdom


Program Co-Chairs:
--

Martin Schoeberl, Technical University of Denmark, Denmark
Chunming Hu, Beihang University, China


Workshop Chair:
---

Marco Aurelio Wehrmeister, Federal Univ. Technology - Parana, Brazil

Program Committee:
--

Sidharta Andalam, TUM Create

Upgrading Lucene from 3.5 to 4.10 - how to handle Java API changes

2015-01-11 Thread Martin Wunderlich
Hi all, 



I am currently in the process of upgrading a search engine application from 
Lucene 3.5.0 to version 4.10.3. There have been some substantial API changes in 
version 4 that break backward compatibility. I have managed to fix most of 
them, but a few issues remain that I could use some help with:

"cannot override final method from Analyzer"
The original code extended the Analyzer class and the overrode tokenStream(...).

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    CharStream charStream = CharReader.get(reader);
    return
        new LowerCaseFilter(version,
            new SeparationFilter(version,
                new WhitespaceTokenizer(version,
                    new HTMLStripFilter(charStream))));
}
But this method is final now and I am not sure how to understand the following 
note from the change log:

"ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer 
implementations must now use Analyzer.TokenStreamComponents, rather than 
overriding .tokenStream() and .reusableTokenStream() (which are now final). "

There is another problem in the method quoted above:

"The method get(Reader) is undefined for the type CharReader"
There seem to have been some considerable changes here, too.

"TermPositionVector cannot be resolved to a type"
This class is gone now in Lucene 4. Are there any simple fixes for this? From 
the change log: "The term vectors APIs (TermFreqVector, TermPositionVector, 
TermVectorMapper) have been removed in favor of the above flexible indexing 
APIs, presenting a single-document inverted index of the document from the term 
vectors."

Probably related to this: 4. "The method getTermFreqVector(int, String) is 
undefined for the type IndexReader."

Both problems occur here, for instance:

TermPositionVector termVector = (TermPositionVector) 
reader.getTermFreqVector(...);
("reader" is of Type IndexReader)

I would appreciate any help with these issues. Thanks a lot in advance.

Cheers,

Martin


PS: FYI, I have posted the same question on Stackoverflow: 
http://stackoverflow.com/questions/27881296/upgrading-lucene-from-3-5-to-4-10-how-to-handle-java-api-changes?noredirect=1#comment44166161_27881296


signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: Upgrading Lucene from 3.5 to 4.10 - how to handle Java API changes

2015-01-11 Thread Martin Wunderlich
Hi Uwe, 

Thanks a lot for the detailed reply. I'll see how far I get with it, but being 
quite new to Lucene, it seems I am lacking a bit of background information to 
fully understand the response below. In particular, I need to do some 
background reading on how token streams and readers work, I guess. 

Cheers, 

Martin
 

Am 11.01.2015 um 11:05 schrieb Uwe Schindler :

> Hi, 
> 
> 
> 
> First, there is also a migrate guide next to the changes log: 
> http://lucene.apache.org/core/4_10_3/MIGRATE.html
> 
> 
> 
> 1. If you implement analyzer, you have to override createComponents() which 
> return TokenStreamComponents objects. See other Analyzer’s source code to 
> understand how to use it. One simple example is in the Javadocs: 
> http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/Analyzer.html
> 
> 
> 
> 2. Use initReader() to wrap filters around readers. This method is protected 
> and can be overridden. CharFilter implements Reader, so you can wrap any 
> CharFilter there. Your HTMLStripCharsFilter has to be wrapped around the given 
> reader here.
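
Roughly, the 3.5 analyzer above could be restructured for 4.x like this (class 
name hypothetical; HTMLStripCharFilter from analyzers-common stands in for the 
original HTMLStripFilter/CharReader pair, and the custom SeparationFilter would 
slot in where the comment indicates):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public final class HtmlStrippingAnalyzer extends Analyzer {
    private final Version version;

    public HtmlStrippingAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // CharFilters wrap the Reader before tokenization; this replaces CharReader.get(...).
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(version, reader);
        // insert the custom SeparationFilter from the original chain here, e.g.:
        // TokenStream result = new SeparationFilter(version, source);
        TokenStream result = new LowerCaseFilter(version, source);
        return new TokenStreamComponents(source, result);
    }
}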
> 
> 
> 
> 3./4. Term vectors are different in Lucene 4. Basically term vectors are a 
> small index for each document. And this is how it's implemented. You get back 
> Fields/Terms instances, which are basically like AtomicReader's backend – 
> you can even execute a Query on the vectors:
> 
> IndexReader#getTermVector() returns Terms for a specific field:
> 
> <http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/index/IndexReader.html#getTermVector(int,%20java.lang.String)>
> 
> For all Fields (harder to use, unwrapping for a specific field is done above 
> – this one is more to execute Querys and so on):
> 
> <http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/index/IndexReader.html#getTermVectors(int)>
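
For the term vector part, a rough 4.10-style sketch that replaces the old 
TermPositionVector usage (the "body" field name is an assumption; the field 
must have been indexed with term vectors and positions):

import java.io.IOException;

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public final class TermVectorDump {

    // Prints each term of the document's "body" term vector with its
    // in-document frequency and positions.
    static void dump(IndexReader reader, int docId) throws IOException {
        Terms vector = reader.getTermVector(docId, "body"); // null if no vector was stored
        if (vector == null) {
            return;
        }
        TermsEnum termsEnum = vector.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            DocsAndPositionsEnum postings = termsEnum.docsAndPositions(null, null);
            if (postings == null) {
                continue; // positions were not indexed into the vector
            }
            postings.nextDoc(); // a term vector behaves like a one-document index
            int freq = postings.freq();
            System.out.print(term.utf8ToString() + " (" + freq + "):");
            for (int i = 0; i < freq; i++) {
                System.out.print(" " + postings.nextPosition());
            }
            System.out.println();
        }
    }
}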
> 
> 
> 
> Uwe
> 
> 
> 
> -
> 
> Uwe Schindler
> 
> H.-H.-Meier-Allee 63, D-28213 Bremen
> 
> <http://www.thetaphi.de/> http://www.thetaphi.de
> 
> eMail: u...@thetaphi.de
> 
> 
> 
> From: Martin Wunderlich [mailto:martin.wunderl...@gmx.net] 
> Sent: Sunday, January 11, 2015 9:18 AM
> To: java-user@lucene.apache.org
> Subject: Upgrading Lucene from 3.5 to 4.10 - how to handle Java API changes
> 
> 
> 
> Hi all, 
> 
> 
> 
> I am currently in the process of upgrading a search engine application from 
> Lucene 3.5.0 to version 4.10.3. There have been some substantial API changes 
> in version 4 that break backward compatibility. I have managed to fix most of 
> them, but a few issues remain that I could use some help with:
> 
> 1."cannot override final method from Analyzer"
> 
> The original code extended the Analyzer class and overrode
> tokenStream(...).
> 
> @Override
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     CharStream charStream = CharReader.get(reader);
>     return
>         new LowerCaseFilter(version,
>             new SeparationFilter(version,
>                 new WhitespaceTokenizer(version,
>                     new HTMLStripFilter(charStream))));
> }
> 
> But this method is final now and I am not sure how to understand the 
> following note from the change log: 
> 
> "ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer 
> implementations must now use Analyzer.TokenStreamComponents, rather than 
> overriding .tokenStream() and .reusableTokenStream() (which are now final). "
> 
> There is another problem in the method quoted above: 
> 
> 2."The method get(Reader) is undefined for the type CharReader"
> 
> There seem to have been some considerable changes here, too. 
> 
> 3."TermPositionVector cannot be resolved to a type"
> 
> This class is gone now in Lucene 4. Are there any simple fixes for this? From 
> the change log: "The term vectors APIs (TermFreqVector, TermPositionVector, 
> TermVectorMapper) have been removed in favor of the above flexible indexing 
> APIs, presenting a single-document inverted index of the document from the 
> term vectors."
> 
> Probably related to this: 4. "The method getTermFreqVector(int, String) is 
> undefined for the type IndexReader."
> 
> Both problems occur here, for instance: 
> 
> TermPositionVector termVector = (TermPositionVector) 
> reader.getTermFreqVector(...);
> 
> ("reader" is of Type IndexReader)
> 
> I would appreciate any help with these issues. Thanks a lot in advance.
> 
> Cheers, 
> 
> Martin
> 
> 
> 
> PS: FYI, I have posted the same question on Stackoverflow: 
> http://stackoverflow.com/questions/27881296/upgrading-lucene-from-3-5-to-4-10-how-to-handle-java-api-changes?noredirect=1#comment44166161_27881296
> 



signature.asc
Description: Message signed with OpenPGP using GPGMail


RE: hello,I have a problem about lucene,please help me to explain ,thank you

2015-09-22 Thread will martin
Hi: 
Would you mind doing a web search and cataloging the relevant pages into a
primer?
Thx,
Will
-Original Message-
From: 王建军 [mailto:jianjun200...@163.com] 
Sent: Tuesday, September 22, 2015 4:02 AM
To: java-user@lucene.apache.org
Subject: hello,I have a problem about lucene,please help me to explain
,thank you

There is a class org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter
which has two parameters: DEFAULT_MIN_BLOCK_SIZE and DEFAULT_MAX_BLOCK_SIZE,
with default values of 25 and 48. When I make their values bigger, for example
200 and 398, and then build the index, memory use becomes lower and, what's
more, performance is good.
Can you tell me why? Also, if I change this, will it cause other problems?

Thank you very much.
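
For what it's worth, rather than editing those constants, the block sizes can be 
passed to the postings format and plugged in through a codec. A rough sketch 
against the 5.x API (the codec name is hypothetical, the 4.x equivalent is 
Lucene41PostingsFormat with the same two block-size arguments, and a custom codec 
has to be registered via SPI so the index can be opened again):

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;
import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;

public class BigBlockCodec extends FilterCodec {

    // Larger min/max term-dictionary block sizes than the defaults (25/48).
    private final PostingsFormat postings = new Lucene50PostingsFormat(200, 398);

    public BigBlockCodec() {
        super("BigBlockCodec", Codec.getDefault());
    }

    @Override
    public PostingsFormat postingsFormat() {
        return new PerFieldPostingsFormat() {
            @Override
            public PostingsFormat getPostingsFormatForField(String field) {
                return postings;
            }
        };
    }
}

// usage: indexWriterConfig.setCodec(new BigBlockCodec());
// note: list the class in META-INF/services/org.apache.lucene.codecs.Codec
// so the codec can be looked up again when the index is opened for reading.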


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Solr java.lang.OutOfMemoryError: Java heap space

2015-09-28 Thread will martin
http://opensourceconnections.com/blog/2014/07/13/reindexing-collections-with-solrs-cursor-support/
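
That post relies on Solr's cursorMark paging, which avoids the deep-paging cost 
of large start offsets; note it needs Solr 4.7 or later, so 4.4.0 would have to 
be upgraded first. A rough SolrJ sketch (URL, collection name, uniqueKey field 
and page size are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class DumpAllDocs {

    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(10000);
        q.setSort(SolrQuery.SortClause.asc("id")); // cursor paging needs a total order on the uniqueKey
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = solr.query(q);
            // ... append rsp.getResults() to the export file here ...
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) {
                break; // getting the same cursor twice means we are done
            }
            cursor = next;
        }
        solr.close();
    }
}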



-Original Message-
From: Ajinkya Kale [mailto:kaleajin...@gmail.com] 
Sent: Monday, September 28, 2015 2:46 PM
To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
Subject: Solr java.lang.OutOfMemoryError: Java heap space

Hi,

I am trying to retrieve all the documents from a solr index in a batched manner.
I have 100M documents. I am retrieving them using the method proposed here 
https://nowontap.wordpress.com/2014/04/04/solr-exporting-an-index-to-an-external-file/
I am dumping 10M document splits in each file. I get "OutOfMemoryError" if 
start is at 50M. I get the same error even if rows=10 for start=50M.
Curl on start=0 rows=50M in one go works fine too. But things go bad when start 
is at 50M.
My Solr version is 4.4.0.

Caused by: java.lang.OutOfMemoryError: Java heap space at
org.apache.lucene.search.TopDocsCollector.topDocs(TopDocsCollector.java:146)
at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1502)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:434)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)

--aj


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
So, if it's new, it adds to the pre-existing time? So it is a cost that needs to be 
understood, I think.

 

And, I'm really curious, what happens to the result of the post merge 
checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean if 
you let it merge anyway could you get a false positive for integrity?  [see the 
concept of lazy-evaluation]

 

These are, imo, the kinds of engineering questions Selva's post raised in my 
triage mode of the scenario.

 

 

-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com] 
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

 

Indeed this is new but I'm a bit surprised this is the source of your issues as 
it should be much faster than the merge itself. I don't understand your 
proposal to check the index after merge: the goal is to make sure that we do 
not propagate corruptions so it's better to check the index before the merge 
starts so that we don't even try to merge if there are corruptions?

 

Le mar. 15 sept. 2015 à 00:40, Selva Kumar < 
 selva.kumar.at.w...@gmail.com> a écrit :

 

> it appears Lucene 5.2 index merge is running checkIntegrity on 

> existing index prior to merging additional indices.

> This seems to be new.

> 

> We have an existing checkIndex but this is run post index merge.

> 

> Two follow up questions :

> * Is there way to turn off built-in checkIntegrity? Just for my understand.

> No plan to turn this off.

> * Is running checkIntegrity prior to index merge better than running 

> post merge?

> 

> 

> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar < 

>   selva.kumar.at.w...@gmail.com

> > wrote:

> 

> > We observe some merge slowness after we migrated from 4.10 to 5.2.

> > Is this expected? Any new tunable merge parameters in Lucene 5 ?

> >

> > -Selva

> >

> >

> 



RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
This sounds robust. Is the index batch creation workflow a separate process?
Distributed shared filesystems?

--will

-Original Message-
From: McKinley, James T [mailto:james.mckin...@cengage.com] 
Sent: Tuesday, September 29, 2015 2:22 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with
other things, so I'll add some more context to his question in an attempt to
improve clarity.

The merge in question is part of our batch indexing workflow wherein we
index new content for a given partition and then merge this new index with
the big index of everything that was previously loaded on the given
partition.  The increase in merge time we've seen since upgrading from 4.10
to 5.2 is on the order of 25%.  It varies from partition to partition, but
25% is a good ballpark estimate I think.  Maybe our case is non-standard, we
have a large number of fields (> 200).

The reason we perform an index check after the merge is that this is the
final index state that will be used for a given batch.  Since we have a
batch-oriented workflow we are able to roll back to a previous batch if we
find a problem with a given batch (Lucene or other problem).  However due to
disk space constraints we can only keep a couple batches.  If our indexing
workflow completes without errors but the index is corrupt, we may not know
right away and we might delete the previous good batch thinking the latest
batch is OK, which would be very bad requiring a full reload of all our
content.

Checking the index prior to the merge would no doubt catch many issues, but
it might not catch corruption that occurs during the merge step itself, so
we implemented a check step once the index is in its final state to ensure
that it is OK.

So, since we want to do the check post-merge, is there a way to disable the
check during merge so we don't have to do two checks?

Thanks!

Jim 

____
From: will martin 
Sent: 29 September 2015 12:08
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

So, if its new, it adds to pre-existing time? So it is a cost that needs to
be understood I think.



And, I'm really curious, what happens to the result of the post merge
checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean
if you let it merge anyway could you get a false positive for integrity?
[see the concept of lazy-evaluation]



These are, imo, the kinds of engineering questions Selva's post raised in my
triage mode of the scenario.





-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?



Indeed this is new but I'm a bit surprised this is the source of your issues
as it should be much faster than the merge itself. I don't understand your
proposal to check the index after merge: the goal is to make sure that we do
not propagate corruptions so it's better to check the index before the merge
starts so that we don't even try to merge if there are corruptions?



Le mar. 15 sept. 2015 à 00:40, Selva Kumar <
<mailto:selva.kumar.at.w...@gmail.com> selva.kumar.at.w...@gmail.com> a
écrit :



> it appears Lucene 5.2 index merge is running checkIntegrity on

> existing index prior to merging additional indices.

> This seems to be new.

>

> We have an existing checkIndex but this is run post index merge.

>

> Two follow up questions :

> * Is there way to turn off built-in checkIntegrity? Just for my
understand.

> No plan to turn this off.

> * Is running checkIntegrity prior to index merge better than running

> post merge?

>

>

> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar <

>  <mailto:selva.kumar.at.w...@gmail.com> selva.kumar.at.w...@gmail.com

> > wrote:

>

> > We observe some merge slowness after we migrated from 4.10 to 5.2.

> > Is this expected? Any new tunable merge parameters in Lucene 5 ?

> >

> > -Selva

> >

> >

>


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
Ok So I'm a little confused:

The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on a
flag to setCheckIntegrityAtMerge ... 

Method states it controls pre-merge cost.

Ref: 

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/LiveIndex
WriterConfig.html#setCheckIntegrityAtMerge%28boolean%29

And it seems to be gone in 5.3, folks? Meaning Adrien's comment is a whole
lot more significant? Merges ALWAYS run checkIntegrity before merging? Is this a 5.0
feature drop? You can't deprecate, um, er, totally remove an index-time audit
feature on a point release of any level IMHO.


-Original Message-
From: McKinley, James T [mailto:james.mckin...@cengage.com] 
Sent: Tuesday, September 29, 2015 2:42 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Yes, the indexing workflow is completely separate from the runtime system.
The file system is EMC Isilon via NFS.

Jim

____
From: will martin 
Sent: 29 September 2015 14:29
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

This sounds robust. Is the index batch creation workflow a separate process?
Distributed shared filesystems?

--will

-Original Message-
From: McKinley, James T [mailto:james.mckin...@cengage.com]
Sent: Tuesday, September 29, 2015 2:22 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with
other things, so I'll add some more context to his question in an attempt to
improve clarity.

The merge in question is part of our batch indexing workflow wherein we
index new content for a given partition and then merge this new index with
the big index of everything that was previously loaded on the given
partition.  The increase in merge time we've seen since upgrading from 4.10
to 5.2 is on the order of 25%.  It varies from partition to partition, but
25% is a good ballpark estimate I think.  Maybe our case is non-standard, we
have a large number of fields (> 200).

The reason we perform an index check after the merge is that this is the
final index state that will be used for a given batch.  Since we have a
batch-oriented workflow we are able to roll back to a previous batch if we
find a problem with a given batch (Lucene or other problem).  However due to
disk space constraints we can only keep a couple batches.  If our indexing
workflow completes without errors but the index is corrupt, we may not know
right away and we might delete the previous good batch thinking the latest
batch is OK, which would be very bad requiring a full reload of all our
content.

Checking the index prior to the merge would no doubt catch many issues, but
it might not catch corruption that occurs during the merge step itself, so
we implemented a check step once the index is in its final state to ensure
that it is OK.

So, since we want to do the check post-merge, is there a way to disable the
check during merge so we don't have to do two checks?

Thanks!

Jim

____
From: will martin 
Sent: 29 September 2015 12:08
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

So, if its new, it adds to pre-existing time? So it is a cost that needs to
be understood I think.



And, I'm really curious, what happens to the result of the post merge
checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean
if you let it merge anyway could you get a false positive for integrity?
[see the concept of lazy-evaluation]



These are, imo, the kinds of engineering questions Selva's post raised in my
triage mode of the scenario.





-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?



Indeed this is new but I'm a bit surprised this is the source of your issues
as it should be much faster than the merge itself. I don't understand your
proposal to check the index after merge: the goal is to make sure that we do
not propagate corruptions so it's better to check the index before the merge
starts so that we don't even try to merge if there are corruptions?



Le mar. 15 sept. 2015 à 00:40, Selva Kumar <
<mailto:selva.kumar.at.w...@gmail.com> selva.kumar.at.w...@gmail.com> a
écrit :



> it appears Lucene 5.2 index merge is running checkIntegrity on

> existing index prior to merging additional indices.

> This seems to be new.

>

> We have an existing checkIndex but this is run post index merge.

>

> Two follow up questions :

> * Is there way to turn off built-in checkIntegrity? Just for my
understand.

> No p

RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-30 Thread will martin
Thanks Mike. This is very informative. 



-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Tuesday, September 29, 2015 3:22 PM
To: Lucene Users
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

No, it is not possible to disable, and, yes, we removed that API in 5.x because 
1) the risk of silent index corruption is too high to warrant this small 
optimization and 2) we re-worked how merging works so that this checkIntegrity 
has IO locality with what's being merged next.

There were other performance gains for merging in 5.x, e.g. using much less 
memory in the many-fields case, not decompressing + recompressing stored fields 
and term vectors, etc.

As Adrien pointed out, the cost should be much lower than 25% for a local 
filesystem ... I suspect something about your NFS setup is making it more 
costly.

NFS is in general a dangerous filesystem to use with Lucene (no delete on last 
close, locking is tricky to get right, incoherent client file contents and 
directory listing caching).

If you want to also checkIntegrity of the merged segment you could e.g. install 
an IndexReaderWarmer in your IW and call IndexReader.checkIntegrity.

Mike McCandless

http://blog.mikemccandless.com
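
A minimal sketch of that warmer suggestion against the 5.x API (directory path 
and analyzer choice are placeholders):

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public final class CheckedMergeWriter {

    // Opens a writer whose freshly merged segments are checksummed by a warmer.
    static IndexWriter open(String indexPath) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
            @Override
            public void warm(LeafReader reader) throws IOException {
                reader.checkIntegrity(); // verify the just-merged segment
            }
        });
        return new IndexWriter(dir, iwc);
    }
}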


On Tue, Sep 29, 2015 at 9:00 PM, will martin  wrote:
> Ok So I'm a little confused:
>
> The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on 
> a flag to setCheckIntegrityAtMerge ...
>
> Method states it controls pre-merge cost.
>
> Ref:
>
> https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/Liv
> eIndex
> WriterConfig.html#setCheckIntegrityAtMerge%28boolean%29
>
> And it seems to be gone in 5.3 folks? Meaning Adrien's comment is a 
> whole lot significant? Merges ALWAYS pre-merge CheckIntegrity? Is this 
> a 5.0 feature drop? You can't deprecate, um, er totally remove an 
> index time audit feature on a point release of any level IMHO.
>
>
> -Original Message-
> From: McKinley, James T [mailto:james.mckin...@cengage.com]
> Sent: Tuesday, September 29, 2015 2:42 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?
>
> Yes, the indexing workflow is completely separate from the runtime system.
> The file system is EMC Isilon via NFS.
>
> Jim
>
> 
> From: will martin 
> Sent: 29 September 2015 14:29
> To: java-user@lucene.apache.org
> Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?
>
> This sounds robust. Is the index batch creation workflow a separate process?
> Distributed shared filesystems?
>
> --will
>
> -Original Message-
> From: McKinley, James T [mailto:james.mckin...@cengage.com]
> Sent: Tuesday, September 29, 2015 2:22 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?
>
> Hi Adrien and Will,
>
> Thanks for your responses.  I work with Selva and he's busy right now 
> with other things, so I'll add some more context to his question in an 
> attempt to improve clarity.
>
> The merge in question is part of our batch indexing workflow wherein 
> we index new content for a given partition and then merge this new 
> index with the big index of everything that was previously loaded on 
> the given partition.  The increase in merge time we've seen since 
> upgrading from 4.10 to 5.2 is on the order of 25%.  It varies from 
> partition to partition, but 25% is a good ballpark estimate I think.  
> Maybe our case is non-standard, we have a large number of fields (> 200).
>
> The reason we perform an index check after the merge is that this is 
> the final index state that will be used for a given batch.  Since we 
> have a batch-oriented workflow we are able to roll back to a previous 
> batch if we find a problem with a given batch (Lucene or other 
> problem).  However due to disk space constraints we can only keep a 
> couple batches.  If our indexing workflow completes without errors but 
> the index is corrupt, we may not know right away and we might delete 
> the previous good batch thinking the latest batch is OK, which would 
> be very bad requiring a full reload of all our content.
>
> Checking the index prior to the merge would no doubt catch many 
> issues, but it might not catch corruption that occurs during the merge 
> step itself, so we implemented a check step once the index is in its 
> final state to ensure that it is OK.
>
> So, since we want to do the check post-merge, is there a way to 
> disable the check during merge so we don't have to do two checks?
>
> Thanks!
>
> Jim
>
> 
> Fro

Re: debugging growing index size

2015-11-13 Thread will martin
Hi Rob:


Doesn't this look like the known JDK issue JDK-4724038, discussed by Peter Levart 
and Uwe Schindler on a lucene-dev thread on 9/9/2015?

MappedByteBuffer …. what OS are you on Rob? What JVM?

http://bugs.java.com/view_bug.do?bug_id=4724038

http://mail-archives.apache.org/mod_mbox/lucene-dev/201509.mbox/%3c55f0461a.2070...@gmail.com%3E

hth 
-will



> On Nov 13, 2015, at 11:23 AM, Rob Audenaerde  wrote:
> 
> I'm currently running using NIOFS. It seems to prevent the issue from
> appearing.
> 
> This is a second run (with applied deletes etc)
> 
> raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd
> -rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd
> 
> raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd
> -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
> -rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd
> 
> 
> 
> On Thu, Nov 12, 2015 at 3:40 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
>> Hi Rob,
>> 
>> A couple more things:
>> 
>> Can you print the value of MMapDirectory.UNMAP_SUPPORTED?
>> 
>> Also, can you try your test using NIOFSDirectory instead?  Curious if
>> that changes things...
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 
>> On Thu, Nov 12, 2015 at 7:28 AM, Rob Audenaerde
>>  wrote:
>>> Curious indeed!
>>> 
>>> I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate the
>>> logs. Will get back with them in a day hopefully.
>>> 
>>> Thanks for the extra logging!
>>> 
>>> -Rob
>>> 
>>> On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless <
>>> luc...@mikemccandless.com> wrote:
>>> 
 Hmm, curious.
 
 I looked at the [large] infoStream output and I see segment _3ou7
 present on init of IW, a few getReader calls referencing it, then a
 forceMerge that indeed merges it away, yet I do NOT see IW attempting
 deletion of its files.
 
 And indeed I see plenty (too many: many times per second?) of commits
 after that, so the index itself is no longer referencing _3ou7.
 
 If you are failing to close all NRT readers then I would expect _3ou7
 to be in the lsof output, but it's not.
 
 The NRT readers close method has logic that notifies IndexWriter when
 it's done "needing" the files, to emulate "delete on last close"
 semantics for filesystems like HDFS that don't do that ... it's
 possible something is wrong here.
 
 Can you set the (public, static) boolean
 IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this
 log?  This causes IW to log the ref count of each file it's tracking
 ...
 
 I'll also add a bit more verbosity to IW when NRT readers are opened
 and close, for 5.4.0.
 
 Mike McCandless
 
 http://blog.mikemccandless.com
 
 
 On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde
  wrote:
> Hi all,
> 
> I'm still debugging the growing-index size. I think closing index
>> readers
> might help (work in progress), but I can't really see them holding on
>> to
> files (at least, using lsof ). Restarting the application sheds some
 light,
> I see logging on files that are no longer referenced.
> 
> What I see is that there are files in the index-directory, that seem
>> to
> longer referenced..
> 
> I put the output of the infoStream online, because is it rather big
>> (30MB
> gzipped):  http://www.audenaerde.org/lucene/merges.log.gz
> 
> Output of lsof:  (executed 'sudo lsof *' in the index directory  ).
>> This
 is
> on an CentOS box (maybe that influences stuff as well?)
> 
> COMMAND   PID   USER   FD   TYPE DEVICE   SIZE/OFF NODE NAME
> java30581 apache  memREG  253,0 3176094924 18880508
> _4gs5_Lucene50_0.dvd
> java30581 apache  memREG  253,0  505758610 18880546 _4gs5.fdt
> java30581 apache  memREG  253,0  369563337 18880631
> _4gs5_Lucene50_0.tim
> java30581 apache  memREG  253,0  176344058 18880623
> _4gs5_Lucene50_0.pos
> java30581 apache  memREG  253,0  378055201 18880606
> _4gs5_Lucene50_0.doc
> java30581 apache  memREG  253,0  372579599 18880400
> _4i5a_Lucene50_0.dvd
> java30581 apache  memREG  253,0   82017

Re: Jensen–Shannon divergence

2015-12-13 Thread will martin
expand your due diligence beyond wikipedia:
i.e.

http://ciir.cs.umass.edu/pubfiles/ir-464.pdf



> On Dec 13, 2015, at 8:30 AM, Shay Hummel  wrote:
> 
> LMDiricletbut its feasibilit


Re: Jensen–Shannon divergence

2015-12-13 Thread will martin
Sorry it was early.

If you go looking on the web, you can find, as I did, reputable work on 
implementing Dirichlet language models. However, at this hour you might get 
answers here. Extrapolating others' work into a Lucene implementation is only 
slightly different from getting answers here. IMO

g'luck


> On Dec 13, 2015, at 10:55 AM, Shay Hummel  wrote:
> 
> Hi
> 
> I am sorry but I didn't understand your answer. Can you please elaborate?
> 
> Shay
> 
> On Sun, Dec 13, 2015 at 3:41 PM will martin  wrote:
> 
>> expand your due diligence beyond wikipedia:
>> i.e.
>> 
>> http://ciir.cs.umass.edu/pubfiles/ir-464.pdf
>> 
>> 
>> 
>>> On Dec 13, 2015, at 8:30 AM, Shay Hummel  wrote:
>>> 
>>> LMDiricletbut its feasibilit
>> 
> -- 
> Regards,
> Shay Hummel


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Jensen–Shannon divergence

2015-12-14 Thread will martin
cool list. Thanks Uwe.

Opportunities to gain competitive advantage in selected domains.

> On Dec 14, 2015, at 6:02 PM, Uwe Schindler  wrote:
> 
> Hi,
> 
> Next to BM25 and TF-IDF, Lucene also provides many more similarity 
> implementations:
> 
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/IBSimilarity.html
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/DFRSimilarity.html
> 
> If you want to implement your own, choose the closest one and implement the 
> formula as you described. I'd start with SimilarityBase, which is an ideal base 
> class for models like Dirichlet / DFR / ..., because it has a default 
> implementation for things like phrases.
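
To make that concrete, a skeleton SimilarityBase subclass (5.x method 
signatures); the body below is only a Dirichlet-style placeholder score built 
from the available statistics, not the JS-divergence itself, which would have to 
be assembled from the same quantities:

import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

public class CustomLMSimilarity extends SimilarityBase {

    private final float mu = 2000f; // Dirichlet smoothing parameter (assumed value)

    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        // p(t|C): term probability in the whole collection (add-one smoothed)
        float collectionProb = (stats.getTotalTermFreq() + 1f)
                / (stats.getNumberOfFieldTokens() + 1f);
        // Placeholder: Dirichlet-smoothed log-likelihood of the term in this document.
        // A divergence-based score would be built from these same quantities instead.
        float score = (float) (Math.log(1 + freq / (mu * collectionProb))
                + Math.log(mu / (docLen + mu)));
        return score > 0 ? score : 0;
    }

    @Override
    public String toString() {
        return "CustomLMSimilarity(mu=" + mu + ")";
    }
}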
> 
>> LMDiricletbut its feasibilit
> 
> I am not sure what you want to say with this mistyped sentence fragment.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
>> -Original Message-
>> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
>> Sent: Monday, December 14, 2015 11:21 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Jensen–Shannon divergence
>> 
>> Is there any particular reason that you find Lucene's builtin TF/IDF and
>> BM25 similarity models insufficient for your needs? In any case,
>> examination of their source code should get you started if you wish to do
>> your own:
>> 
>> https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/simi
>> larities/TFIDFSimilarity.html
>> https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/simi
>> larities/BM25Similarity.html
>> 
>> -- Jack Krupansky
>> 
>> On Sun, Dec 13, 2015 at 8:30 AM, Shay Hummel 
>> wrote:
>> 
>>> Hi
>>> 
>>> I need help to implement similarity between query model and document
>> model.
>>> I would like to use the JS-Divergence
>>> 
>> for
>>> ranking documents. The documents and the query will be represented
>>> according to the language models approach - specifically the LMDiriclet.
>>> The similarity will be calculated using the JS-Div between the document
>>> model and the query model.
>>> Is it possible?
>>> if so how?
>>> 
>>> Thank you,
>>> Shay Hummel
>>> --
>>> Regards,
>>> Shay Hummel
>>> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
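For anyone wanting to follow Uwe's suggestion, below is a minimal, hedged sketch of a
SimilarityBase subclass (assuming the 5.x API, where score(BasicStats, float, float) and
toString() are the methods to override). The Dirichlet-style formula inside score() is only a
placeholder for wherever a JS-divergence-flavoured per-term score would go; it is not Lucene's
own implementation.

import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

public class JsDivLikeSimilarity extends SimilarityBase {
  private final float mu = 2000f; // smoothing parameter, an arbitrary assumption

  @Override
  protected float score(BasicStats stats, float freq, float docLen) {
    // collection language model estimate for this term (add-one to avoid division by zero)
    double pColl = (stats.getTotalTermFreq() + 1.0)
        / (stats.getNumberOfFieldTokens() + 1.0);
    // Dirichlet-smoothed document model, standing in for the per-term quantity
    // a JS-style formula would need
    double pDoc = (freq + mu * pColl) / (docLen + mu);
    return (float) Math.log(pDoc / pColl);
  }

  @Override
  public String toString() {
    return "JsDivLikeSimilarity(mu=" + mu + ")";
  }
}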


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any lucene query sorts docs by Hamming distance?

2015-12-22 Thread will martin
Yonghui:

Do you mean sort, rank or score?

Thanks,
Will



> On Dec 22, 2015, at 4:02 AM, Yonghui Zhao  wrote:
> 
> Hi,
> 
> Is there any query can sort docs by hamming distance if field values are
> same length,
> 
> Seems fuzzy query only works on edit distance.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: range query highlighting

2015-12-23 Thread will martin
Todd:

"This trick just converts the multi term queries like PrefixQuery or RangeQuery 
to boolean query by expanding the terms using index reader."

http://stackoverflow.com/questions/7662829/lucene-net-range-queries-highlighting

beware cost. (my comment)


g’luck
will

> On Dec 23, 2015, at 4:49 PM, Fielder, Todd Patrick  wrote:
> 
> I have a NumericRangeQuery and a TermQuery that I am combining into a Boolean 
> query.  I would then like to pass the Boolean query to the highlighter to 
> highlight both the range and term hits.  Currently, only the terms are being 
> highlighted.
> 
> Any help on how to get the range values to highlight would be appreciated
> 
> Thanks
> 
> -Todd
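To make the quoted trick concrete, here is a rough sketch assuming the classic contrib
Highlighter API. The names reader, analyzer, field and fieldText are placeholders, and whether
a numeric range actually produces highlightable terms after rewriting depends on how the field
was indexed.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

class RangeHighlightSketch {
  /** Rewrite multi-term parts (prefix/range) into plain term queries, then highlight. */
  static String highlight(Query booleanQuery, IndexReader reader,
                          Analyzer analyzer, String field, String fieldText) throws Exception {
    Query rewritten = booleanQuery.rewrite(reader);   // expands the terms via the index reader
    Highlighter highlighter =
        new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(rewritten, field));
    return highlighter.getBestFragment(analyzer, field, fieldText);
  }
}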


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any lucene query sorts docs by Hamming distance?

2015-12-24 Thread will martin
here’s a thought from the algorithm world:

Hamming is an upper bound on Levenshtein.

does that help you?

-w


> On Dec 24, 2015, at 4:10 AM, Yonghui Zhao  wrote:
> 
> I mean sort and filter.  I want to filter all documents within some
> hamming distance, say 3, and sort them from distance 0 to 3.
> 
> 2015-12-22 21:42 GMT+08:00 will martin :
> 
>> Yonghui:
>> 
>> Do you mean sort, rank or score?
>> 
>> Thanks,
>> Will
>> 
>> 
>> 
>>> On Dec 22, 2015, at 4:02 AM, Yonghui Zhao  wrote:
>>> 
>>> Hi,
>>> 
>>> Is there any query can sort docs by hamming distance if field values are
>>> same length,
>>> 
>>> Seems fuzzy query only works on edit distance.
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 
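Since nothing built-in scores by Hamming distance, one pragmatic option for the filter-and-sort
requirement quoted above is to pull the candidate field values for the top hits and re-rank them
client-side. A hedged sketch of the plain-Java part (fetching the values out of the index is
left aside):

import java.util.Comparator;
import java.util.List;

class HammingRerank {
  /** Hamming distance of two equal-length strings. */
  static int hamming(String a, String b) {
    int d = 0;
    for (int i = 0; i < a.length(); i++) {
      if (a.charAt(i) != b.charAt(i)) d++;
    }
    return d;
  }

  /** Keep candidates within maxDist of the query value and sort them from distance 0 to maxDist. */
  static void filterAndSort(List<String> candidates, String query, int maxDist) {
    candidates.removeIf(v -> v.length() != query.length() || hamming(v, query) > maxDist);
    candidates.sort(Comparator.comparingInt(v -> hamming(v, query)));
  }
}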


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SolrIndexSearcher throws Misleading Error Message When timeAllowed is Specified.

2016-01-08 Thread will martin
Please read the javadoc for System.nanoTime().  I won’t bore you with the 
details about how computer clocks work.
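In other words, the two values in that warning are only meaningful relative to each other. A
tiny illustration, using nothing but the numbers from the quoted log line:

import java.util.concurrent.TimeUnit;

class NanoTimeCheck {
  public static void main(String[] args) {
    long timeoutAt = 5804340135470L;   // deadline recorded earlier via System.nanoTime()
    long now       = 5804342454166L;   // System.nanoTime() at the moment of the check
    long overrun   = TimeUnit.NANOSECONDS.toMillis(now - timeoutAt);
    // prints roughly 2 ms: the deadline had just been exceeded; the absolute values mean nothing
    System.out.println("deadline exceeded by ~" + overrun + " ms");
  }
}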

> On Jan 8, 2016, at 4:14 AM, Vishnu Mishra  wrote:
> 
> I am using Solr 5.3.1 and we are facing OutOfMemory exception while doing
> some complex wildcard and proximity query (even for simple wildcard query).
> We are doing distributed solr search using shard across 20 cores. 
> 
> The problem description is given below.
> 
> For example simple query like
> 
> *q=Tile:(eleme* OR proces*)&timeAllowed=50*
> 
> It gives warning given below
> 
> *2016-01-08 14:14:03,874 WARN  org.apache.solr.search.SolrIndexSearcher  –
> Query: Tile:(eleme* OR proces*); The request took too long to iterate over
> terms. Timeout: timeoutAt: 5804340135470 (System.nanoTime(): 5804342454166),
> TermsEnum=org.apache.lucene.codecs.blocktree.IntersectTermsEnum@1d2d4fb*
> 
> I don't understand why the timeout thrown by SolrIndexSearcher is showing
> 5804340135470 nanoseconds (5804.342454166001 seconds), even though I already gave a
> timeout of 1000 ms (1 second). Is the log message correct? Help me to
> understand this problem.
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrIndexSearcher-throws-Misleading-Error-Message-When-timeAllowed-is-Specified-tp4249356.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to backup index files with Replicator

2016-01-23 Thread will martin
Hi Dancer:

Found this thread with good info that may be irrelevant to your scenario, but 
this in particular struck me:

 writer.waitForMerges();
 writer.commit();
 replicator.replicate(new IndexRevision(writer));
 writer.close();
—
even though writer.close() can trigger a commit. hmmm

thread:

http://grokbase.com/t/lucene/java-user/143dsnrxh8/replicator-how-to-use-it 


-will



> On Jan 23, 2016, at 4:39 AM, Dancer <462921...@qq.com> wrote:
> 
> Hi,
> here is my code to back up index files with the Lucene Replicator, but it doesn't 
> work well: no files were backed up.
> Could you check my code and give me your advice?
> 
> 
> public class IndexFiles {
> 
> 
>   private static Directory dir;
>   private static Path bakPath;
>   private static LocalReplicator replicator;
> 
> 
>   public static LocalReplicator getInstance() {
>   if (replicator == null) {
>   replicator = new LocalReplicator();
>   }
>   return replicator;
>   }
>   public static Directory getDirInstance() {
>   if (dir == null) {
>   try {
>   dir = FSDirectory.open(Paths.get("/tmp/index"));
>   } catch (IOException e) {
>   e.printStackTrace();
>   }
>   }
>   return dir;
>   }
>   public static Path getPathInstance() {
>   if (bakPath == null) {
>   bakPath = Paths.get("/tmp/indexBak");
>   }
>   return bakPath;
>   }
> 
> 
>   
>   /** Index all text files under a directory. */
>   public static void main(String[] args) {
>   String id = "-oderfilssdhsjs";
>   String title = "足球周刊";
>   String body = "今天野狗,我们将关注欧冠赛场,曼联在客场先进一球的情况下,遭对手沃尔夫斯堡以总比分3:2淘汰,"
>   + 
> "遗憾出局,将参加欧联杯的比赛,当红球星马夏尔贡献一球,狼堡进了一个乌龙球,狼堡十号球员德拉克斯勒" + 
> "表现惊艳,多次导演攻势,希望22岁的他能在足球之路上走的更远。";
>   try {
>   // Directory dir = 
> FSDirectory.open(Paths.get(indexPath));
>   Analyzer analyzer = new IKAnalyzer(true);
>   IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
>   iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>   SnapshotDeletionPolicy snapshotter = new 
> SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
>   iwc.setIndexDeletionPolicy(snapshotter);
>   IndexWriter writer = new 
> IndexWriter(IndexFiles.getDirInstance(), iwc);// the
>   LocalReplicator replicator = IndexFiles.getInstance();
> 
> 
>   Document doc = new Document();
>   Field articleId = new StringField("id", id, 
> Field.Store.YES);
>   doc.add(articleId);
>   Field articleTitle = new TextField("title", title, 
> Field.Store.YES);
>   doc.add(articleTitle);
>   Field articleBody = new TextField("body", body, 
> Field.Store.NO);
>   doc.add(articleBody);
>   Field tag1 = new TextField("tags", "野狗", 
> Field.Store.NO);
>   doc.add(tag1);
>   // Field tag2 = new TextField("tags", "运动", 
> Field.Store.NO);
>   // doc.add(tag2);
>   // Field tag3 = new TextField("tags", "国足", 
> Field.Store.NO);
>   // doc.add(tag3);
>   // Field tag4 = new TextField("tags", "席大大", 
> Field.Store.NO);
>   // doc.add(tag4);
> 
> 
>   writer.updateDocument(new Term("id", id), doc);
>   writer.commit();
>   ReplicatorThread p = new ReplicatorThread(); 
>   new Thread(p, "ReplicatorThread").start();
>   replicator.publish(new IndexRevision(writer));
>   Thread.sleep(5);
>   writer.close();
>   } catch (IOException e) {
>   System.out.println(" caught a " + e.getClass() + "\n 
> with message: " + e.getMessage());
>   } catch (InterruptedException e) {
>   e.printStackTrace();
>   }
>   }
> }
> 
> 
> class ReplicatorThread implements Runnable {
> 
> 
>   public void run() {
>   Callable callback = null; 
>   ReplicationHandler handler = null;
>   try {
>   handler = new 
> IndexReplicationHandler(IndexFiles.getDirInstance(), callback);
>   SourceDirectoryFactory factory = new 
> PerSessionDirectoryFactory(IndexFiles.getPathInstance());
>   ReplicationClient clien

Lucene 5.4 - scoring divided by number of search terms?

2016-03-13 Thread Martin Krämer
I have a simple setup with IndexSearcher, QueryParser, SimpleAnalyzer.
Running some queries I noticed that a query with more than one term
returns a different ScoreDoc[i].score than shown in the explain
output. Apparently it is the score shown in explain divided by the
number of search terms. Any explanation for this behaviour?

Running search(TERM1 TERM2 TERM3)
line:term1 line:term2 line:term3
2.167882 = sum of:
  0.6812867 = weight(line:term1 in 6594) [DefaultSimilarity], result of:
0.6812867 = score(doc=6594,freq=2.0), product of:
  0.5389907 = queryWeigh

totalHits 1
1678413725, TERM1 TERM2 TERM3, score: 0.72262734

I understand the coord() statement would be used to penalise documents
which include only a subset of the search terms provided. However this
document includes all terms. Any suggestions?
--

More details

These two scores are the result of the same query. Only the second one gets
divided:

0.114700586 = product of:
  0.34410176 = sum of:
0.34410176 = weight(line:term1 in 24) [DefaultSimilarity], result of:
  0.34410176 = score(doc=24,freq=1.0), product of:
0.5389907 = queryWeight, product of:
  8.17176 = idf(docFreq=14, maxDocs=19532)
  0.065957725 = queryNorm
0.63841873 = fieldWeight in 24, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  8.17176 = idf(docFreq=14, maxDocs=19532)
  0.078125 = fieldNorm(doc=24)
  0.3334 = coord(1/3)

item_id: 1495958818, item_name: term 1 dolor sit met, score: 0.114700586


0.18352094 = product of:
  0.5505628 = sum of:
0.5505628 = weight(line:term 1 in 6112) [DefaultSimilarity], result of:
  0.5505628 = score(doc=6112,freq=1.0), product of:
0.5389907 = queryWeight, product of:
  8.17176 = idf(docFreq=14, maxDocs=19532)
  0.065957725 = queryNorm
1.02147 = fieldWeight in 6112, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  8.17176 = idf(docFreq=14, maxDocs=19532)
  0.125 = fieldNorm(doc=6112)
  0.3334 = coord(1/3)

item_id: 1677761523, item_name: some text term 1, score: 0.061173648


-- 
Test Signature


Re: Searching in a bitMask

2016-08-27 Thread will martin
hi

aren’t we waltzing terribly close to the use of a bit vector in your field 
caches?
there’s no reason not to filter longword operations on a cache if alignment is 
consistent across multiple caches.

just be sure to abstract your operations away from individual bits…. imo



-will

> On Aug 27, 2016, at 2:30 PM, Cristian Lorenzetto 
>  wrote:
> 
> Yes, thinking a bit more about my question, I understood that making a query
> process every document would not be a good solution. I preferred to use
> boolean properties with a traditional inverted index.  Thanks for the
> confirmation :)
> 
> 2016-08-27 20:24 GMT+02:00 Mikhail Khludnev :
> 
>> My guess is that you need to implement your own MultiTermQuery, and I guess it's
>> gonna be slow.
>> 
>> On Sat, Aug 27, 2016 at 8:41 AM, Cristian Lorenzetto <
>> cristian.lorenze...@gmail.com> wrote:
>> 
>>> How is it possible to search in a bitmask to satisfy a request such as
>>> 
>>> bitmask & 0xf == 0xf ?
>>> 
>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> 
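For the record, a hedged sketch of the boolean-properties approach Cristian mentions; the field
and term names are made up, and it assumes a Lucene version that has BooleanQuery.Builder:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class BitmaskAsTerms {
  /** Index each set bit of the mask as its own term. */
  static Document toDoc(long bitmask) {
    Document doc = new Document();
    for (int bit = 0; bit < 64; bit++) {
      if ((bitmask & (1L << bit)) != 0) {
        doc.add(new StringField("flags", "bit" + bit, Field.Store.NO));
      }
    }
    return doc;
  }

  /** Match documents whose mask has all bits of 'required' set, e.g. required = 0xF. */
  static Query allBitsSet(long required) {
    BooleanQuery.Builder q = new BooleanQuery.Builder();
    for (int bit = 0; bit < 64; bit++) {
      if ((required & (1L << bit)) != 0) {
        q.add(new TermQuery(new Term("flags", "bit" + bit)), BooleanClause.Occur.MUST);
      }
    }
    return q.build();
  }
}

The price is one extra term per set bit, but the query then runs against the normal inverted
index instead of evaluating a bitwise expression on every document.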


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multi-field IDF

2016-11-17 Thread Will Martin
Are you familiar with pivoted normalized document length practice or 
theory? Or Croft's recent work on relevance algorithms accounting for 
structured field presence?




On 11/17/2016 5:20 PM, Nicolás Lichtmaier wrote:
That depends on what you want. In this case I want to use a 
discrimination power based on all the body text, not just the titles. 
Because otherwise terms that are really not that relevant end up ranking 
very high!



El 17/11/16 a las 18:25, Ahmet Arslan escribió:

Hi Nicholas,

IDF, among others, is a measure of term specificity. If 'or' is not 
so usual in titles, then it has some discrimination power in that 
domain.


I think it's OK 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:

IDF measures the selectivity of a term. But the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: if I don't delete stop words, then "or", "and",
etc. should be dealt with via low IDF values; however, "or" is, perhaps, not
so usual in titles. Then, "or" will have a high IDF value and be treated
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like stats. However this is
not trivial because I can't just add document frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need to OR the bit-vectors that signal the
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking 
wrong?


Should I also try the developers' list?

Thanks!

Nicolás.-
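Not a fix for the per-document OR-ing problem described above, but as a cheap approximation the
per-field document frequencies can at least be combined by hand before being turned into an IDF.
A hedged sketch (field names are placeholders):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

class CrossFieldIdf {
  /**
   * Rough cross-field document frequency: the per-field counts can't simply be added
   * (a document containing the term in several fields would be counted more than once),
   * so taking the maximum is a cheap lower bound that already damps the
   * "rare only in the title" effect described above.
   */
  static long approxDocFreq(IndexReader reader, String term, String... fields) throws IOException {
    long max = 0;
    for (String field : fields) {
      max = Math.max(max, reader.docFreq(new Term(field, term)));
    }
    return max;
  }
}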

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: Multi-field IDF

2016-11-18 Thread Will Martin

In this work, we aim to improve the field weighting for structured document
retrieval. We first introduce the notion of field relevance as the
generalization of field weights, and discuss how it can be estimated using
relevant documents, which effectively implements relevance feedback for
field weighting. We then propose a framework for estimating field relevance
based on the combination of several sources. Evaluation on several
structured document collections show that field weighting based on the
suggested framework improves retrieval effectiveness significantly.


https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1051




On 11/18/2016 3:57 AM, Ahmet Arslan wrote:

Hi Nicholas,

Aha, I see that you are into field-based scoring, which is an unsolved problem.

Then, you might find BlendedTermQuery and SynonymQuery relevant.

Ahmet




On Friday, November 18, 2016 12:22 AM, Nicolás Lichtmaier 
 wrote:
That depends on what you want. In this case I want to use a
discrimination power based on all the body text, not just the titles.
Because otherwise terms that are really not that relevant end up ranking
very high!


El 17/11/16 a las 18:25, Ahmet Arslan escribió:

Hi Nicholas,

IDF, among others, is a measure of term specificity. If 'or' is not so usual in 
titles, then it has some discrimination power in that domain.

I think it's OK 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:
IDF measures the selectivity of a term. But the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: if I don't delete stop words, then "or", "and",
etc. should be dealt with via low IDF values; however, "or" is, perhaps, not
so usual in titles. Then, "or" will have a high IDF value and be treated
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like stats. However this is
not trivial because I can't just add document frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need to OR the bit-vectors that signal the
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: Explain Scoring function in LMJelinekMercerSimilarity Class

2016-12-20 Thread Will Martin

https://doi.org/10.3115/981574.981579



On 12/20/2016 12:21 PM, Dwaipayan Roy wrote:

Hello,

Can anyone help me understand the scoring function in the
LMJelinekMercerSimilarity class?

The scoring function in LMJelinekMercerSimilarity is shown below:

float score = stats.getTotalBoost() *
(float)Math.log(1 + ((1 - lambda) * freq / docLen) / (lambda *
((LMStats)stats).getCollectionProbability()));


Can anyone help explain the equation? I can understand the scoring effect
when calculating the stat in the document, i.e. ((1 - lambda) * freq /
docLen).

I hope getCollectionProbability() returns col_freq(t) / col_size. Am I
right?

Also the boosting part is not clear to me (stats.getTotalBoost()).

I want to reproduce the result of the scoring using LM-JM. Hence I want the
details.

Thanks.
Dwaipayan Roy..
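For what it's worth, the quoted code reads as the usual Jelinek-Mercer mixture. A hedged
stand-alone restatement, assuming getCollectionProbability() is totalTermFreq /
numberOfFieldTokens (which matches the guess in the question):

class LmJmSketch {
  /** boost * log(1 + ((1 - lambda) * tf/|d|) / (lambda * P(t|C))), with P(t|C) = ctf / |C|. */
  static double score(double boost, double lambda, double freq, double docLen,
                      double collectionTermFreq, double collectionLength) {
    double pColl = collectionTermFreq / collectionLength;  // collection model P(t|C)
    return boost * Math.log(1 + ((1 - lambda) * freq / docLen) / (lambda * pColl));
  }
}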




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Format of Wikipedia Index

2018-01-22 Thread Will Martin

From the javadoc for DocMaker:


 * *doc.stored* - specifies whether fields should be stored (default
   *false*).
 * *doc.body.stored* - specifies whether the body field should be
   stored (default = *doc.stored*).

So ootb you won't get content stored. Does this help?

regards
-will
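If that is the issue, the change is presumably just flipping those properties in the .alg /
properties file used for the run. A hedged two-line sketch, with every other required property
(content source, docs file, etc.) omitted:

doc.stored=true
doc.body.stored=true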


On 1/22/2018 10:27 PM, Armins Stepanjans wrote:

Hi,

I have a question regarding the format of the Index created by DocMaker,
from EnWikiContentSource.

After creating the Index from dump of all Wikipedia's articles (
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-
pages-articles-multistream.xml.bz2), I'm having trouble understanding the
format of Documents created, because when I get a document from the Index,
its only field is docid.
Is this an indicator of incorrect indexing and if not, how should I use
the index in order to search for occurrences of a term within an article
(I was imagining doing a boolean query, with one sub-query being the
article's name and the other the term I'm searching for within the article)?

Regards,
Armīns





Re: How groupingSearch specifies SortedNumericDocValuesField

2019-05-14 Thread Martin Grigorov
Hi,

On Tue, May 14, 2019 at 8:28 PM 顿顿  wrote:

> When I use groupingSearch specified as SortedNumericDocValuesField,
> I got an "unexpected docvalues type NUMERIC for field 'id'
> (expected=SORTED)" Exception.
>
> My code is as follows:
>  String indexPath = "tmp/grouping";
> Analyzer standardAnalyzer = new StandardAnalyzer();
> Directory indexDir = FSDirectory.open(Paths.get(indexPath));
> IndexWriterConfig indexWriterConfig = new
> IndexWriterConfig(standardAnalyzer);
> indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
> IndexWriter masterIndex = new IndexWriter(indexDir,
> indexWriterConfig);
>
> String name = "Tom";
> for (int i = 1; i < 5; i++) {
> Document doc = new Document();
> doc.add(new StringField("name", name + "_" + i,
> Field.Store.YES));
> doc.add(new SortedNumericDocValuesField("id", i));
> doc.add(new StoredField("id", i));
>

are you sure both fields should have the same name ("id") ?


> masterIndex.addDocument(doc);
>
> }
> masterIndex.commit();
> masterIndex.commit();
>
> IndexReader reader =
> DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
> IndexSearcher searcher = new IndexSearcher(reader);
>
> GroupingSearch groupingSearch = new GroupingSearch("id");
> TopGroups topGroups = groupingSearch.search(searcher, new
> MatchAllDocsQuery(), 0, 100);
>
> System.out.println(topGroups.totalHitCount);
> reader.close();
>
>
> The exception is as follows:
> Exception in thread "main" java.lang.IllegalStateException: unexpected
> docvalues type SORTED_NUMERIC for field 'id' (expected=SORTED). Re-index
> with correct docvalues type.
> at org.apache.lucene.index.DocValues.checkField(DocValues.java:317)
> at org.apache.lucene.index.DocValues.getSorted(DocValues.java:369)
> at
>
> org.apache.lucene.search.grouping.TermGroupSelector.setNextReader(TermGroupSelector.java:56)
> at
>
> org.apache.lucene.search.grouping.FirstPassGroupingCollector.doSetNextReader(FirstPassGroupingCollector.java:348)
> at
>
> org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:33)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:643)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
> at
>
> org.apache.lucene.search.grouping.GroupingSearch.groupByFieldOrFunction(GroupingSearch.java:141)
> at
>
> org.apache.lucene.search.grouping.GroupingSearch.search(GroupingSearch.java:113)
>
>
> The version of Lucene I am using is 8.0.0.
>
>
> Finally, I want to know how groupingSearch specifies three fields:
> NumericDocValuesField, SortedNumericDocValuesField,
> SortedSetDocValuesField?
>
>
>
>
> Thank you for your attention  to this  matter!
>
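A hedged sketch of one way to satisfy the (expected=SORTED) check: give the grouping key its own
SORTED doc-values field and group on that. The field name below is made up:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.util.BytesRef;

class GroupingDoc {
  /** Build a document whose grouping key is indexed as SORTED doc values. */
  static Document withGroupKey(int id) {
    Document doc = new Document();
    doc.add(new SortedDocValuesField("groupId", new BytesRef(Integer.toString(id))));
    doc.add(new StoredField("groupId", id)); // only needed if you want to read the value back
    return doc;
  }
}
// ... and then: GroupingSearch groupingSearch = new GroupingSearch("groupId");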


Re: AlphaNumeric analyzer/tokenizer

2019-08-19 Thread Martin Grigorov
Hi,


On Mon, Aug 19, 2019 at 9:31 AM Uwe Schindler  wrote:

> You already got many responses. Check you inbox.
>

"many" made me think that I've also missed something.
https://markmail.org/message/ohv5qcvxilj3n3fb


>
> Uwe
>
> Am August 19, 2019 6:23:20 AM UTC schrieb Abhishek Chauhan <
> abhishek.chauhan...@gmail.com>:
> >Hi,
> >
> >Can someone please check the above mail and provide some feedback?
> >
> >Thanks and Regards,
> >Abhishek
> >
> >On Fri, Aug 16, 2019 at 2:52 PM Abhishek Chauhan <
> >abhishek.chauhan...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> We have been using SimpleAnalyzer which keeps only letters in its tokens.
> >> This limits us to searching in strings that contain both letters and numbers,
> >> e.g. "axt1234". SimpleAnalyzer would only enable us to search for "axt"
> >> successfully, but search strings like "axt1", "axt123" etc. would give no
> >> results because while indexing it ignored the numbers.
> >>
> >> I can use StandardAnalyzer or WhitespaceAnalyzer but I want to tokenize on
> >> underscores also, which these analyzers don't do. I have also looked at
> >> WordDelimiterFilter, which will split "axt1234" into "axt" and "1234".
> >> However, using this also, I cannot search for "axt12" etc.
> >>
> >> Is there something like an Alphanumeric analyzer which would be very
> >> similar to SimpleAnalyzer but in addition to letters it would also keep
> >> digits in its tokens? I am willing to contribute such an analyzer if one
> >> is not available.
> >>
> >> Thanks and Regards,
> >> Abhishek
> >>
> >>
> >>
>
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de
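For readers landing on this thread later, a hedged sketch of the kind of analyzer being asked
about: essentially a CharTokenizer that keeps letters and digits. Package locations for
CharTokenizer and LowerCaseFilter differ between Lucene versions, so treat the imports as
approximate.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.CharTokenizer;

/** Letters-or-digits analyzer: like SimpleAnalyzer, but keeps digits inside tokens. */
class AlphaNumericAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tok = new CharTokenizer() {
      @Override
      protected boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c); // split on everything else, including '_'
      }
    };
    return new TokenStreamComponents(tok, new LowerCaseFilter(tok));
  }
}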


Re: Limitations of StempelStemmer

2019-09-24 Thread Martin Grigorov
Hi,

On Tue, Sep 10, 2019, 22:31 Maciej Gawinecki  wrote:

> Hi,
>
> I have just checked out the latest version of Lucene from Git master
> branch.
>
> I have tried to stem a few words using StempelStemmer for Polish.
> However, it looks it cannot handle some words properly, e.g.
>
> joyce -> ąć
> wielce -> ąć
> piwko -> ąć
> royce -> ąć
> pip -> ąć
> xyz -> xyz
>
> 1. I am surprised it cannot handle Polish words like wielce, piwko and
> royce. Is this a limitation of the stemming algorithm or of the training of
> the algorithm or something else? The latter would help improve the
> situation. How can I improve that behaviour?
> 2. I am surprised that for non-Polish words it returns "ać". I would
> expect that for words it has not been trained for it will return their
> original forms, as happens, for instance, when stemming words like
> "xyz".
>
> With kind regards,
> Maciej Gawinecki
>
> Here's minimal example to reproduce the issue:
>
> package org.apache.lucene.analysis;
>
> import java.io.InputStream;
> import org.apache.lucene.analysis.stempel.StempelStemmer;
>
> public class Try {
>
>   public static void main(String[] args) throws Exception {
> InputStream stemmerTabke = ClassLoader.getSystemClassLoader()
>
> .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_2.tbl");
> StempelStemmer stemmer = new StempelStemmer(stemmerTabke);
> String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
> for (String word : words) {
>   System.out.println(String.format("%s -> %s", word,
> stemmer.stem("piwko")));
>

You always pass "piwko" for stemming.
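Presumably the intended line was:

System.out.println(String.format("%s -> %s", word, stemmer.stem(word)));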

}
>
>   }
>
> }
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Translating Lucene Query Syntax to Traditional Boolean Syntax

2007-09-24 Thread Martin Bayly

We have an application that performs searches against a Lucene based index
and also against a Windows Desktop Search based index.

For simple queries we'd like to offer our users a consistent interface that
allows them to build basic Lucene style queries using the 'MUST HAVE' (+),
'MUST NOT HAVE' (-) and 'SHOULD HAVE' style of operators as this is probably
more intuitive for non 'Boolean Logic' literate users.  We would not allow
them to use any grouping (parenthesis).

Clearly we can pass this directly to Lucene, but for the Windows Desktop
Search we need to translate the Lucene style query into a more traditional
Boolean query.  So this is the opposite of the much discussed Boolean Query
to Lucene Query conversion.

I'm wondering if anyone has ever done this or whether there is a concept
mismatch in there somewhere that will make it difficult to do?

My thought was that you could take the standard Lucene operators and simply
group them together as follows:

e.g. (assuming the Lucene default OR operator)

Lucene: +a +b -c -d e f

would translate to:

(a AND b NOT c NOT d) OR (a AND b NOT c NOT d AND (e OR f))

If I put this back into Lucene (actually Lucene.NET but hopefully its the
same) I get back:

(+a +b -c -d)(+a +b -c -d +(e f))

which I think is equivalent but not as concise!  But I have not tested this
against a big index to see if it's equivalent and I have a suspicion that
Lucene might score the two versions of the Lucene representation
differently.  But that's probably not an issue provided the Boolean
representation is semantically equivalent to the first Lucene
representation.

Anyone ever tried this before or have any comments on whether my 'logic' is
flawed!

Thanks
Martin
-- 
View this message in context: 
http://www.nabble.com/Translating-Lucene-Query-Syntax-to-Traditional-Boolean-Syntax-tf4512730.html#a12871390
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Weird operator precedence with default operator AND

2007-10-09 Thread Martin Dietze
Hi,

 I've been going nuts trying to make the Lucene QueryParser parse query
strings using the default operator AND correctly:

String queryString = getQueryString();
QueryParser parser = new QueryParser("text", new StandardAnalyzer());
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
try {
  Query q = parser.parse(queryString);
  LOG.info("q: " + q.toString());
  /* [...] */

Here's two example queries and the results I get with and
without the `setDefaultOperator()' statetment:

Query: hose AND cat:Wohnen cat:Mode OR color:blau

- Default-Op OR:  (+text:hose +cat:Wohnen) cat:Mode color:blau
- Default-Op AND: +(+text:hose +cat:Wohnen) cat:Mode color:blau

Query: hose AND ( cat:Wohnen cat:Mode ) OR color:blau

- Default-Op OR:  (+text:hose +(cat:Wohnen cat:Mode)) color:blau
- Default-Op AND: (+text:hose +(+cat:Wohnen +cat:Mode)) color:blau

It seems like the parser handles the default case well, but what
I get with the default operator set to AND is completely
incorrect. I've seen this behaviour with both versions 2.1.0 and
2.2.0.

Any hints?

Cheers,

Martin

-- 
--- / http://herbert.the-little-red-haired-girl.org / -
=+= 
I got it good, I got it bad. I got the sweetest sadness I ever had.
  --- the The

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird operator precedence with default operator AND

2007-10-10 Thread Martin Dietze
On Tue, October 09, 2007, Daniel Naber wrote:

> The operator precedence is known to be buggy. You need to use parenthesis, 
> e.g. (aa AND bb) OR (cc AND dd)

This would be fine with me but unfortunately not for my users.
More precisely, I need to analyze a query string from one search
engine, filter out a blacklist of facet queries and pass the
result on to a second search engine. This means that I have no
control over the way people enter their queries.

Is there any known query parser which handles this correctly?

Also, how does solr do this? It uses a parser derived from the
Lucene QueryParser, and I found it produces the same output,
however the search queries are still handled correctly, i.e. the
results I get indicate that deep down inside it seems to get it
right in the end.

Cheers,

Martin

-- 
--- / http://herbert.the-little-red-haired-girl.org / -
=+= 
My name is spelled Luxury Yacht but it's pronounced Throatwabbler Mangrove.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird operator precedence with default operator AND

2007-10-10 Thread Martin Dietze
Mark,

 this reply was just in time :)

On Wed, October 10, 2007, Mark Miller wrote:

> Precedence QueryParser (I think its in Lucene contrib packages - I don't 
> believe its perfect but I have not tried it)

I checked that one out, and while it improves things with
default settings I found it to exhibit the same incorrect
behaviour with default operator AND.

> Qsol: myhardshadow.com/qsol (A query parser I wrote that has fully 
> customizable precedence support - don't be fooled by the stale website...I 
> am actually working on version 2 as i have time)

That sounds promising, I will check this out right now!

Thannk you!

Martin

-- 
--- / http://herbert.the-little-red-haired-girl.org / -
=+= 
Die Freiheit ist uns ein schoenes Weib. 
Sie hat einen Ober- und Unterleib.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird operator precedence with default operator AND

2007-10-10 Thread Martin Dietze
Mark,

On Wed, October 10, 2007, Martin Dietze wrote:

> > Qsol: myhardshadow.com/qsol (A query parser I wrote that has fully 
> > customizable precedence support - don't be fooled by the stale website...I 
> > am actually working on version 2 as i have time)
> 
> That sounds promising, I will check this out right now!

 as far as I can judge from what I've tested so far, it seems
like qsol does handle operator precedence correctly for my
test cases. However - excuse a possibly dumb question - how
do I get my query out in a form accepted by solr?

Cheers,

Martin

-- 
--- / http://herbert.the-little-red-haired-girl.org / -
=+= 
I now declare this bizarre open!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird operator precedence with default operator AND

2007-10-11 Thread Martin Dietze
On Wed, October 10, 2007, Chris Hostetter wrote:

> Eh ... not really.  it would be easier to just load the Qsol parser in 
> solr ... or toString() the query...

This would be nice, but unfortunately I do not have direct access
to the solr server in my application. I need to parse queries,
filter out blacklisted facets and then pass them on to solr
using solrj.

Maybe I am missing out on something obvious, and there's an
entirely simple way to accomplish this?

Cheers,

Martin

-- 
--- / http://herbert.the-little-red-haired-girl.org / -
=+= 
Yoda of Borg I am. Assimilated you will be.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird operator precedence with default operator AND

2007-10-11 Thread Martin Dietze
On Wed, October 10, 2007, Mark Miller wrote:

> Back in the day you might have been able to call Query.toString() as the 
> Query contract says that toString() should output valid QueryParser syntax. 
> This does not work for many queries though (most notably Span Queries -- 
> QueryParser knows nothing about Span queries).

I see, so my old code which was based on QueryParser was not
completely flawed :) Are there any other queries besides span
queries which can occur with qsol and do not produce valid
QueryParser syntax? 

Also I wonder why a facet query, like `foo:bar', results in a
SpanQuery `+spanNear([foo, bar], 0, true)' (I may not understand
the concept here).

Cheers,

Martin

-- 
--- / http://herbert.the-little-red-haired-girl.org / -
=+= 
Who the fsck is "General Failure", and why is he reading my disk?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Weird operator precedence with default operator AND

2007-10-12 Thread Martin Dietze
Chris,

On Thu, October 11, 2007, Chris Hostetter wrote:

> ... are you talking about preventing people from including field 
> specific queries in their query string? i'm guessing that you mean 
> something like this is okay...
> 
> solr title:bobby body:boy
> 
> ...but this isn't...
> 
>   solr title:bobby body:boy secret_field:xyzyq
> 
> ...is that the idea?

 yes that's just about it. We have two search engines for
different purposes. The first one indexes more fields than the
second and we want to prevent "good" search queries from failing
on the second. Supporting all these fields on the second SE is
not a good idea since indexing all this additional data would
have an impact on performance and index size.

> the easiest approach is to do your own simple pass over the query string, 
> and escape any metacharacters in clauses you don't like ... they'll be 
> treated as "terms" and either be ignored (if they are optional) or cause 
> the query to not match anything (if they are required)...

This is a very interesting idea. Yet I wonder how to deal with
such terms if they are part of an AND query (actually AND is our
default operator), so that a query "body:boy secret_field\:xyzyq"
would always fail. It seems obvious that in any case you end up
parsing the query in some way...

Cheers,

Martin

-- 
--- / http://herbert.the-little-red-haired-girl.org / -
=+= 
My family says I'm a psychopath, but the voices in my head disagree

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Most efficient way to find related terms

2008-02-29 Thread Martin Bayly
I'm wondering what the most efficient approach is to finding all terms
which are related to a specific term X.

 

By related I mean terms which appear in a specific document field that
also contains the target term X.

 

e.g. Document has a keyword field, field1 that can contain multiple
keywords.

 

document1 - field1 HAS key1, key2, key3

document2 - field1 HAS key2, key4

document3 - field1 HAS key5

 

If I want to find terms related to key2, I need to return key1, key3,
key4

 

Obviously I can do a search for key2, iterate all the docs and collect
their field1 terms manually.

 

But presumably a more efficient way is to use TermDocs:

 

1. TermDocs termdocs = IndexReader.termDocs(new Term("field1", "key2"))

2. Iterate term docs to get documents containing that term 

3. Now this is the bit I'm not sure of:

a. I could call Document doc = IndexReader.document(n), but that will
load all fields and I only want the field1

b. Presumably better to call Document doc = IndexReader.document(n,
fieldSelector)

c. Or would I be better off turning on term frequency vectors for this field
so I can call IndexReader.getTermFreqVector(n, "field1") - I don't
particularly care about the frequencies as it will always be 1 for a
particular doc.

 

Other approaches?

 

I'm going to perf test to see how (b) and (c) compare but would be glad
if anyone has any insights.

 

Thanks

Martin
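As a rough sketch of option (c) above, assuming the 2.x-era APIs (TermDocs, getTermFreqVector)
and that term vectors are enabled on field1:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermFreqVector;

class RelatedTerms {
  /** Collect the keywords that co-occur with 'key' in field1, via term vectors. */
  static Set<String> related(IndexReader reader, String key) throws IOException {
    Set<String> related = new HashSet<String>();
    TermDocs td = reader.termDocs(new Term("field1", key));
    while (td.next()) {
      TermFreqVector tfv = reader.getTermFreqVector(td.doc(), "field1");
      if (tfv != null) {
        for (String t : tfv.getTerms()) {
          if (!t.equals(key)) related.add(t);
        }
      }
    }
    td.close();
    return related;
  }
}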



Get BestFrequentKeywords

2008-08-04 Thread Martin vWysiecki
Hello to all,

Thanks for help in advance.

Example docs:

1,"car, volvo, dealer, tyres"
2,"car, mercedes, dealer, tyres"
3,"car, renault, export, tyres"

So if I look for "car", I would like to get, in addition to the normal
results, a list of the most frequent terms in the result set.
In my example this would be:

tyres, dealer

tyres 3x
dealer 2x


How can i do that?

THX
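A hedged sketch of the brute-force route, assuming the keyword field is stored and the 2008-era
Hits API is what the search returns (a field cache or term vectors would scale better):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

class FrequentKeywords {
  /** Count keyword occurrences over a result set; the field name is a placeholder. */
  static Map<String, Integer> count(Hits hits, String field) throws IOException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      String value = doc.get(field);          // e.g. "car, volvo, dealer, tyres"
      if (value == null) continue;
      for (String kw : value.split(",\\s*")) {
        Integer c = counts.get(kw);
        counts.put(kw, c == null ? 1 : c + 1);
      }
    }
    return counts;                            // sort by value to get "tyres 3x, dealer 2x"
  }
}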


-- 
 mit freundlichen Grüßen

Martin von Wysiecki
software development

aspedia GmbH
Roßlauer Weg 5
D-68309 Mannheim
Telefon +49 (0) 621 - 71600 33
Telefax +49 (0) 621 - 71600 10
[EMAIL PROTECTED]

Geschäftsführung:
Steffen Künster, Christoph Goldschmitt
Amtsgericht Mannheim HRB 9942
www.aspedia.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Term Based Meta Data

2008-08-05 Thread Martin Owens
Hello Users,

I'm working on a project which attempts to store data that comes from an
OCR process which describes the pixel co-ordinates of each term in the
document. It's used for hit highlighting.

What I would like to do is store this co-ordinate information alongside
the terms. I know there is existing meta data stored per term (word
offset and char offsets); the problem is that if I create a separate
index and try to use the word offset or char offsets, not only is it
slower but it doesn't match, because of the way the terms are processed
both inside Lucene and in the OCR program.

So, is it possible to store the data alongside the terms in lucene and
then recall them when doing certain searches? and how much custom code
needs to be written to do it?

Best Regards, Martin Owens

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term Based Meta Data

2008-08-05 Thread Martin Owens
Thank you very much, I'm using Solr so it's very relevant to me. Even
though the indexing is being done by a smaller RMI method (since Solr
doesn't support streaming of very large files and has term limits),
all the searching is done through Solr.

Thanks again,

Best Regards, Martin Owens

On Tue, 2008-08-05 at 11:14 -0600, Tricia Williams wrote:
> Hi Martin,
> 
> Take a look at what I've done with SOLR-380 
> (https://issues.apache.org/jira/browse/SOLR-380). It might solve your 
> problem, or at least give you a good starting point.
> 
> Tricia
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Unique list of keywords

2008-08-08 Thread Martin vWysiecki
Hello,

I have a lot of data, about 20GB of text, and need a unique list of
keywords based on the text of all docs in the whole index.

Any ideas?


THX

Martin
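A hedged sketch of the straightforward route: walking the term dictionary with a TermEnum
(2.x-era API assumed). This only reads the index, never the 20GB of text:

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

class UniqueTerms {
  /** Collect all distinct terms of one field from the term dictionary. */
  static Set<String> uniqueTerms(IndexReader reader, String field) throws IOException {
    Set<String> terms = new TreeSet<String>();
    TermEnum te = reader.terms(new Term(field, ""));
    try {
      do {
        Term t = te.term();
        if (t == null || !t.field().equals(field)) break;  // left the requested field
        terms.add(t.text());
      } while (te.next());
    } finally {
      te.close();
    }
    return terms;
  }
}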




-- 
 mit freundlichen Grüßen

Martin von Wysiecki
software development

aspedia GmbH
Roßlauer Weg 5
D-68309 Mannheim
Telefon +49 (0) 621 - 71600 33
Telefax +49 (0) 621 - 71600 10
[EMAIL PROTECTED]

Geschäftsführung:
Steffen Künster, Christoph Goldschmitt
Amtsgericht Mannheim HRB 9942
www.aspedia.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term Based Meta Data

2008-08-08 Thread Martin Owens
Dear Lucene Users and Tricia Williams,

The way we're operating our Lucene index is one where we index all the
terms but do not store the text. From your SOLR-380 patch example, Tricia, I
was able to get a very good idea of how to set things up. Historically I
have used TermPositionVector instead of TermPositions because that
data is available without storing the text in the index.

Is it possible to translate code which uses TermPositions to using
TermPositionVector with regards to payloads?

Best Regards, Martin Owens

On Tue, 2008-08-05 at 11:14 -0600, Tricia Williams wrote:
> Hi Martin,
> 
> Take a look at what I've done with SOLR-380 
> (https://issues.apache.org/jira/browse/SOLR-380). It might solve your 
> problem, or at least give you a good starting point.
> 
> Tricia
> 
> Michael McCandless wrote:
> >
> > I think you could use payloads (= arbitrary/opaque byte[]) for this?
> >
> > You can attach a payload to each term occurrence during tokenization 
> > (indexing), and then retrieve the payload during searching.
> >
> > Mike
> >
> > Martin Owens wrote:
> >
> >> Hello Users,
> >>
> >> I'm working on a project which attempts to store data that comes from an
> >> OCR process which describes the pixel co-ordinates of each term in the
> >> document. It's used for hit highlighting.
> >>
> >> What I would like to do is store this co-ordinate information alongside
> >> the terms. I know there is existing meta data stored per term (Word
> >> Offset and Char Offsets) the problem is that If I create a separate
> >> index and try and use the word offset or char offsets not only is it
> >> slower but it doesn't match because of the way the terms are processed
> >> both inside of lucene and the OCR program.
> >>
> >> So, is it possible to store the data alongside the terms in lucene and
> >> then recall them when doing certain searches? and how much custom code
> >> needs to be written to do it?
> >>
> >> Best Regards, Martin Owens
> >>
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term Based Meta Data

2008-08-11 Thread Martin Owens

> Following the history of Payloads from its beginnings 
> (https://issues.apache.org/jira/browse/LUCENE-755, 
> https://issues.apache.org/jira/browse/LUCENE-761, 
> https://issues.apache.org/jira/browse/LUCENE-834, 
> http://wiki.apache.org/lucene-java/Payload_Planning) it looks like 
> TermPositionVector was never considered as part of the Payload 
> functionality.  I think this is based on the underlying index file 
> structure???  I don't see any way to get at a Payload other than through 
> a TermPositions object.  I don't think there is a way to translate code 
> which uses TermPositions to using TermPositionVector with regards to 
> payloads  -- but I welcome someone to show me how they could.

Very interesting, and it fills in a few missing bits.

> Maybe there is some other work around.  What are you trying to 
> accomplish "historically" with TermPositionsVectors instead of 
> TermPositions?

Historically we've not been able to access the TermPositions object
because it seemed to require that the original text was stored and not
just indexed (although I can't see why). Perhaps I am mistaken?

We're not storing the text content because a) there is rather a lot of
it, b) we have the text files stored on special storage boxes mounted to
the webservers and they're used directly and c) it didn't seem worth
it.

Thoughts? So can I use the TermPositions object without the stored text?

Best Regards, Martin Owens
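For reference, a hedged sketch of reading per-occurrence payloads through TermPositions
(2.x-era API assumed; the field and term are placeholders). Note that it needs only the indexed
terms plus their payloads, not a stored copy of the text:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

class PayloadReadSketch {
  static void dumpPayloads(IndexReader reader, String field, String text) throws IOException {
    TermPositions tp = reader.termPositions(new Term(field, text));
    try {
      while (tp.next()) {                       // one entry per matching document
        int freq = tp.freq();
        for (int i = 0; i < freq; i++) {        // one position per occurrence
          int position = tp.nextPosition();
          if (tp.isPayloadAvailable()) {
            byte[] data = tp.getPayload(new byte[tp.getPayloadLength()], 0);
            System.out.println("doc=" + tp.doc() + " pos=" + position
                + " payloadBytes=" + data.length);
          }
        }
      }
    } finally {
      tp.close();
    }
  }
}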

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Results by unique id's

2008-08-12 Thread Martin vWysiecki
Hello,

thanks for help in advance.

my example docs:

two fields: company_id and content

doc1;1;"car volvo"
doc2;1;"car toyota"
doc3;2;"car mitsubishi"
doc4;2;"car skoda"

my search "car"

Now I would like to get only doc1 and doc3, because doc2 is from the same
company (same company_id); the same goes for doc4.

Is this possible?

Thank you




-- 
 mit freundlichen Grüßen

Martin von Wysiecki
software development

aspedia GmbH
Roßlauer Weg 5
D-68309 Mannheim
Telefon +49 (0) 621 - 71600 33
Telefax +49 (0) 621 - 71600 10
[EMAIL PROTECTED]

Geschäftsführung:
Steffen Künster, Christoph Goldschmitt
Amtsgericht Mannheim HRB 9942
www.aspedia.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Results by unique id's

2008-08-12 Thread Martin vWysiecki
Hello Chris,

Sorry, but this is not the solution for me, because I've got more
fields which are imported, for example url:

doc1;1;"car volvo","company1.com/volvo"
doc2;1;"car toyota","company1.com/toyota"
doc3;2;"car mitsubishi","company2.com/mitsubishi"
doc4;2;"car skoda","company2.com/skoda"

So if I search for skoda, I need the right result, doc4;
but if I search for car, I want to get only one result per company:
doc1 and doc3.

THX
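A hedged sketch of doing the collapsing on the application side, assuming the 2008-era Hits API
and that company_id is a stored field: keep only the first hit per company_id, which with
score-ordered hits is also the best-scoring one.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

class CollapseByCompany {
  /** Keep the best (first) hit per company_id. */
  static List<Document> collapse(Hits hits) throws IOException {
    Set<String> seen = new HashSet<String>();
    List<Document> result = new ArrayList<Document>();
    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      String companyId = doc.get("company_id");
      if (companyId == null || seen.add(companyId)) {  // add() is false if already seen
        result.add(doc);
      }
    }
    return result;
  }
}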



On Tue, Aug 12, 2008 at 3:11 PM, Chris Lu <[EMAIL PROTECTED]> wrote:
> Maybe re-organize the index structure as
>
> doc1:1; "car volvo", "car toyota"
> doc2;2;"car mitsubishi", "car skoda"
>
> You can add the content field twice for the same company_id.
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
> On Tue, Aug 12, 2008 at 6:05 AM, Martin vWysiecki <[EMAIL PROTECTED]>wrote:
>
>> Hello,
>>
>> thanks for help in advance.
>>
>> my example docs:
>>
>> two fileds company_id and content
>>
>> doc1;1;"car volvo"
>> doc2;1;"car toyota"
>> doc3;2;"car mitsubishi"
>> doc4;2;"car skoda"
>>
>> my search "car"
>>
>> Now i would like to get only doc 1 and 3 because doc2 is the same
>> company, same company_id, same for doc 4
>>
>> Is this possible?
>>
>> Thank you
>>
>>
>>
>>
>> --
>>  mit freundlichen Grüßen
>>
>> Martin von Wysiecki
>> software development
>>
>> aspedia GmbH
>> Roßlauer Weg 5
>> D-68309 Mannheim
>> Telefon +49 (0) 621 - 71600 33
>> Telefax +49 (0) 621 - 71600 10
>> [EMAIL PROTECTED]
>>
>> Geschäftsführung:
>> Steffen Künster, Christoph Goldschmitt
>> Amtsgericht Mannheim HRB 9942
>> www.aspedia.de
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>



-- 
 mit freundlichen Grüßen

Martin von Wysiecki
software development

aspedia GmbH
Roßlauer Weg 5
D-68309 Mannheim
Telefon +49 (0) 621 - 71600 33
Telefax +49 (0) 621 - 71600 10
[EMAIL PROTECTED]

Geschäftsführung:
Steffen Künster, Christoph Goldschmitt
Amtsgericht Mannheim HRB 9942
www.aspedia.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[Fwd: Spam filter for lucene project]

2006-10-05 Thread Martin Braun
Hello Rajiv,

perhaps CAPTCHAs will solve your problem:

http://en.wikipedia.org/wiki/CAPTCHA

many open-source PHP products are using this, like phpMyFAQ and phpBB,
so you can take a look at their code.

hth,
martin



 Original-Nachricht 
Von: Rajiv Roopan <[EMAIL PROTECTED]>
Betreff: Spam filter for lucene project
An: java-user@lucene.apache.org

Hello, I'm currently running a site which allows users to post. Lately posts
have been getting out of hand. I was wondering if anyone knows of an open
source spam filter that I can add to my project to scan the posts (which are
just plain text)  for spam?

thanks in advance.
Rajiv


-- 
Universitaetsbibliothek Heidelberg   Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg   Fax: +49 6221 54-2623

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



experiences with lingpipe

2006-10-23 Thread Martin Braun
hi all,

does anybody have practical experience with LingPipe's spell checker
(http://www.alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html)?

With Lucene's spellcheck contribution I am not really satisfied, because
the index has some (many?) misspelled words, so the did-you-mean class
(from the java.net example) is good at finding similar misspelled words.
With the similarWords function the correct word is only around position
2-5, though it should be more frequent in the index.


So for now I am thinking of switching to LingPipe, but I have a couple
of questions:

Is it better than Lucene's spell-check contribution?
What about performance?
What about the quality of suggestions?


Does anybody have a good idea how to find typos in the index?

tia,
martin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: experiences with lingpipe

2006-10-25 Thread Martin Braun
Hi Breck,

thanks for your answer.
>>
>> With lucenes spellcheck contribution I am not really satisfied because
>> the Index has some (many?) mispelled words, so the did you mean class
>> (from the jave.net example) is good in finding similar mispelled words.
>> With the similarWords  Function the correct word is only around Position
>> 2-5  - though it should be more frequent in the index.
> 
> Not quite sure I understand what the issue is here. Is it that the
> similarWords returns ranked words and the correct one is too far down
> the ranked list?

Yes, that is exactly the problem. The problem is even worse when
searching with multiple words, because the corrected query often has no
results.  Another part of the problem is that there are some (many?)
typos in the search_index.

>> What about performance?
> 
> Tuning params dominate the performance space. A small beam (16 active
> hypotheses) will be quite snappy (I have 200 queries/sec with a 32 beam.
> over a 80 gig text collection that with some pruning was 5 gig in memory
> running an 8 gram model)
> 

That's really impressive (though I didn't understand what you mean by
"beams").

Did I understand the license terms correctly, that I could use LingPipe
for free when I am building a search engine for an academic website (for
free use)?

thanks,
martin

> Tuning is a big deal and I need to write a tuning tutorial. I am doing
> more teaching/training now so that may happen.
> 
> 
> breck
> 
>>
>>
>> Does anybody have a good idea how to find typos in the index.
>>
>> tia,
>> martin
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-- 
Universitaetsbibliothek Heidelberg   Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg   Fax: +49 6221 54-2623

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: experiences with lingpipe

2006-11-02 Thread Martin Braun
Hi Breck,

I have tried your tutorial and built (hopefully) a successful
SpellCheck.model file of 49M.
My Lucene index directory is 2.4G. When I try to read the model with the
readmodel function,
I get an "Exception in thread "main" java.lang.OutOfMemoryError: Java
heap space", though I started java with -Xms1024m -Xmx1024m.

How much RAM will I need for the model (I only have 2 GB of physical
RAM, and Lucene is also using some memory)?

Is there a "rule of thumb" to calculate the needed amount of memory for
the model?

thanks in advance,

martin


>>> Tuning params dominate the performance space. A small beam (16 active
>>> hypotheses) will be quite snappy (I have 200 queries/sec with a 32 beam.
>>> over a 80 gig text collection that with some pruning was 5 gig in memory
>>> running an 8 gram model)
>>>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Update an existing index

2006-11-08 Thread Martin Braun
WATHELET Thomas schrieb:
> how to update a field in lucene?
> 
I think you'll have to delete the whole doc and add the doc with the new
field to the index...

hth,
martin
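A hedged sketch of that delete-then-re-add cycle, assuming the 2.0-era API; the field and term
names are placeholders, and the whole document (all fields) has to be re-added because Lucene
cannot patch a single field in place. Later versions add IndexWriter.updateDocument(Term,
Document), which does both steps in one call.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class UpdateFieldSketch {
  static void update(String indexDir, String id, Document freshDoc) throws Exception {
    // 1) remove the old version, identified by its id term
    IndexReader reader = IndexReader.open(indexDir);
    reader.deleteDocuments(new Term("id", id));
    reader.close();

    // 2) add the rebuilt document (all fields, including the changed one)
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    writer.addDocument(freshDoc);
    writer.close();
  }
}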


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Best approach for exact Prefix Field Query

2006-11-14 Thread Martin Braun
hi,

I would like to provide an exact "prefix field search", i.e. a search for
exactly the first words in a field.
I think I can't use a PrefixQuery because it would also find matches
further inside the field, e.g.
action* would find titles like "Action and knowledge" but also (and that's
what I don't want it to find)
"Lucene in Action".

As a regex it would be sth. like /^Action and.*/

Now the question for me is how to implement this functionality. I see a few
ways:

1) Some kind of TermEnum over all docs (or the PrefixQuery results?) and
string comparison
2) Using the regex contribution
3) A super-fast Lucene function I have overlooked :)

With 2) I am worried about performance; does anybody have experience with
regex queries?

.. but the same goes for 1): has anybody implemented this already and could
give some code samples / hints?

tia,


martin





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Best approach for exact Prefix Field Query

2006-11-14 Thread Martin Braun
Hi Erik,

> SpanFirstQuery is what you're after.

thanks for this hint (@Erick: thanks for the good explanation of my prob),

I read the chapter on the SpanFirstQuery in LIA, but what I don't
understand is: how do I do a "phrase" SpanFirstQuery?
I found a message with example code (
http://www.nabble.com/Speedup-indexing-process-tf1140025.html#a3034612 ):

here's my jruby snippet:

   SpanFirstQuery = org.apache.lucene.search.spans.SpanFirstQuery
   SpanTermQuery = org.apache.lucene.search.spans.SpanTermQuery
   Term = org.apache.lucene.index.Term

   sp = SpanFirstQuery.new(SpanTermQuery.new(Term.new("TI",search)),2)
   hits = searcher.search(sp)
   for i in 0...hits.length
  puts hits.doc(i).getField("kurz")
   end

I get no results for "action and" (there are some docs whose titles begin
with "action and") but I get (correct) results for "action".
What am I doing wrong here?

tia,
martin


> 
> Erik
> 
> 


-- 
Universitaetsbibliothek Heidelberg   Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg   Fax: +49 6221 54-2623




Re: Best approach for exact Prefix Field Query

2006-11-16 Thread Martin Braun
hi Erik,

> "action and" is likely not a single Term, so you'll want to create a
> SpanNearQuery of those individual terms (that match the way they were
> when analyzed and indexed, mind you) and use a SpanNearQuery inside a
> SpanFirstQuery.   Make sense?
Yes, it works (see below)!
... but in my Java app I have the problem that I need to combine this
SpanFirstQuery with a Query from the QueryParser,
i.e. from the web form I get the input for the SpanFirstQuery (which I am
just splitting on whitespace, as in the JRuby sample) and another input
field with a query I normally parse with the QueryParser.

Is there a way to merge these two query classes?

tia,
martin


   SpanFirstQuery = org.apache.lucene.search.spans.SpanFirstQuery
   SpanTermQuery  = org.apache.lucene.search.spans.SpanTermQuery
   SpanQuery  = org.apache.lucene.search.spans.SpanQuery
   SpanNearQuery  = org.apache.lucene.search.spans.SpanNearQuery

   Term = org.apache.lucene.index.Term

   qs = search.split(/\s/)

   spanq_ar = SpanQuery[].new(qs.length)

   for i in 0...qs.length
  spanq_ar[i] = SpanTermQuery.new( Term.new("TI", qs[i] ) )
   end

   sp = SpanFirstQuery.new(SpanNearQuery.new(spanq_ar,1,true),
spanq_ar.length)
   hits = searcher.search(sp)
   for i in 0...hits.length
  puts hits.doc(i).getField("kurz")
   end
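
Since SpanQuery extends Query, one way to merge the two on the Java side is
to drop both into a BooleanQuery; a sketch under that assumption (titleInput,
otherInput and the analyzer choice are illustrative, and the span terms must
match their analyzed, i.e. lower-cased, form):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.*;

// "title starts with" part, built from the first input field
String[] words = titleInput.toLowerCase().split("\\s+");
SpanQuery[] clauses = new SpanQuery[words.length];
for (int i = 0; i < words.length; i++) {
    clauses[i] = new SpanTermQuery(new Term("TI", words[i]));
}
Query startsWith = new SpanFirstQuery(new SpanNearQuery(clauses, 0, true), clauses.length);

// free-text part from the second input field (parse() throws ParseException)
Query parsed = new QueryParser("TI", new StandardAnalyzer()).parse(otherInput);

// require both
BooleanQuery combined = new BooleanQuery();
combined.add(startsWith, BooleanClause.Occur.MUST);
combined.add(parsed, BooleanClause.Occur.MUST);

The combined query can then be handed to searcher.search(...) as usual.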



-- 
Universitaetsbibliothek Heidelberg   Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg   Fax: +49 6221 54-2623




Search "C++" with Solrs WordDelimiterFilter

2006-11-17 Thread Martin Braun
hi all,

I would like to implement the possibility to search for "C++" and "C#" -
I found in the archive the hint to customize the appropriate *.jj  file
with the code in NutchAnalysis.jj:

 // irregular words
| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
| <#C_PLUS_PLUS: ("C"|"c") "++" >
| <#C_SHARP: ("C"|"c") "#" >

I am using a custom Analyzer with Yonik's WordDelimiterFilter:

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new WordDelimiterFilter(
            new WhitespaceTokenizer(reader), 1, 1, 1, 1, 1));
}


But as far as I can see, that setup only uses the WhitespaceTokenizer,
which is not generated from a JavaCC file.

What would be the best way to integrate this feature (preferably without
changing the Lucene source)?

Should I replace the WhitespaceTokenizer with a JavaCC-generated tokenizer
(are there any docs on doing this?)?
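
One workaround that avoids touching the tokenizer at all (a sketch, not from
this thread): rewrite the handful of symbol-bearing terms before analysis,
and run documents and user queries through the same rewrite, so the
WhitespaceTokenizer + WordDelimiterFilter chain never sees the '+' or '#'.
The method name and mappings below are purely illustrative:

/** Apply to the text before indexing and to the query string before parsing. */
public static String normalizeSymbolTerms(String text) {
    return text
        .replaceAll("(?i)\\bc\\+\\+", "cplusplus")   // "C++" -> "cplusplus"
        .replaceAll("(?i)\\bc#", "csharp");          // "C#"  -> "csharp"
}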

tia,
martin








Re: how to search string with words

2006-11-21 Thread Martin Braun
spinergywmy wrote:
> Hi Erick,
> 
>    I did take a look at the link that you provided, and I have tried it
> myself, but I get no results.
> 
>My search string is "third party license readme"
> 
Hmm, at a quick look I would suggest that you have to split the string
into individual terms and then make a SpanNearQuery for these terms:

String[] que_ary = system_query.split("\\s");
//=> Array with third,party,licens,readme
SpanQuery[] spanq_ar = new SpanQuery[que_ary.length];

for (int i=0; i < que_ary.length; i++) {
spanq_ar[i] = new SpanTermQuery( new Term("TI", que_ary[i]) );
}
// now we have an array of spantermquerys

// each term of the sentence should be in exact order => SpanNearQuery
// (I am not sure whether a slop of 0 would be better)
SpanFirstQuery sfq = new SpanFirstQuery(
        new SpanNearQuery(spanq_ar, 1, true), spanq_ar.length);


hth,
martin

>    Below is the code that I wrote; please point out where I have gone
> wrong.
> 
>   readerA = IndexReader.open(DsConstant.indexDir);
>   readerB = IndexReader.open(DsConstant.idxCompDir);
>   
>   //building the searchables
>   Searcher[] searchers = new Searcher[2];
>   
>   // VITAL STEP:adding the searcher for the empty index first, 
> before
> the searcher for the populated index
>   searchers[0] = new IndexSearcher(readerA);
>   searchers[1] = new IndexSearcher(readerB);
>   
>   Analyzer analyzer = new StandardAnalyzer();
>   QueryParser parser = new 
> QueryParser(DsConstant.idxFileContent,
> analyzer);
> 
>   SpanTermQuery stq = new SpanTermQuery(new Term(field,
> buff.toString())); //field = search base on what I have index
>   SpanFirstQuery sfq = new SpanFirstQuery(stq, 
> searchString1.length);
> //searchString1 = "third party license readme"
>   
>   sfq = (SpanFirstQuery) sfq.rewrite(readerA);
>   sfq = (SpanFirstQuery) sfq.rewrite(readerB);
>   
>   //creating the multiSearcher
>   Searcher mSearcher = 
> getMultiSearcherInstance(searchers);
>   
>   searchHits = mSearcher.search(sfq);
> 
>The sysout as below:
> 
>   span first query is ::: spanFirst(TestC:TestC:Third Party License
> Readme, 32)






Re: Index XML file

2006-12-14 Thread Martin Braun
Hi Wooi,
>    Just wondering, has anyone used Digester to extract XML content and
> index the XML file? Is there any source I can refer to on how to
> extract the XML contents? Or is there any other XML parser that is much
> easier to use?

Perhaps this article may help:

http://www-128.ibm.com/developerworks/java/library/j-lucene/
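
If Digester feels heavyweight, a plain JAXP/DOM pass is often enough; a
sketch under the assumption of a flat XML layout with <title> and <body>
elements (adjust to your schema; writer is an already-open IndexWriter):

import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.lucene.document.Field;
import org.w3c.dom.Element;

org.w3c.dom.Document xml = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new java.io.File("record.xml"));
Element root = xml.getDocumentElement();

org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
doc.add(new Field("title",
        root.getElementsByTagName("title").item(0).getTextContent(),
        Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("body",
        root.getElementsByTagName("body").item(0).getTextContent(),
        Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);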

regards,
martin

> 
>Thanks
> 
> regards,
> Wooi Meng






to boost or not to boost

2006-12-20 Thread Martin Braun
Hello all,

I am trying to boost more recent docs, i.e. docs with a greater year
value, like this:

if (title.getEJ() != null) {
titleDocument.setBoost(new Float("1." + title.getEJ()));
}
so a doc from 1973 should get a boost of 1.1973 and a doc of 1975 should
get a boost of 1.1975 .

I have indexed these two docs:


DOK 1:
Document
stored/uncompressed,indexed,termVector
indexed,tokenized
[...]

DOK 2:
Document
stored/uncompressed,indexed,termVector
indexed,tokenized
[...]

If I am Searching for AU:palandt

I get this:
Explain for 1042362: 1.6931472 = fieldWeight(AU:palandt in 0), product of:
  1.0 = tf(termFreq(AU:palandt)=1)
  1.6931472 = idf(docFreq=2)
  1.0 = fieldNorm(field=AU, doc=0)

Explain for 1043960: 1.6931472 = fieldWeight(AU:palandt in 1), product of:
  1.0 = tf(termFreq(AU:palandt)=1)
  1.6931472 = idf(docFreq=2)
  1.0 = fieldNorm(field=AU, doc=1)


so the "older" doc is better rated or with the same rank as the newer?


any ideas?

tia,
martin
















boosting instead of sorting WAS: to boost or not to boost

2006-12-21 Thread Martin Braun
Hi Daniel,

>> so a doc from 1973 should get a boost of 1.1973 and a doc of 1975 should
>> get a boost of 1.1975 .
> 
> The boost is stored with a limited resolution. Try boosting one doc by 10, 
> the other one by 20 or something like that.

You're right. I thought that with float values the resolution would be
good enough!
But the scores only start to differ once the boosts differ by about 0.2
(e.g. 1.7 vs. 1.9).
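
A small illustration of why such tiny per-document boosts collapse (method
names as in Lucene 2.x; a sketch, not from the original mail): the document
boost is multiplied into the field norm, and the norm is stored in a single
byte, so 1.1973 and 1.1975 encode to the same value:

import org.apache.lucene.search.Similarity;

float a = 1.1973f, b = 1.1975f;
byte ea = Similarity.encodeNorm(a);
byte eb = Similarity.encodeNorm(b);
// both lines print the same encoded byte and the same decoded norm,
// which is why the two docs end up with identical fieldNorm values
System.out.println(a + " -> " + ea + " -> " + Similarity.decodeNorm(ea));
System.out.println(b + " -> " + eb + " -> " + Similarity.decodeNorm(eb));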

I know there have been many questions on the list about scoring newer
documents higher.
But I want to avoid any query-time overhead like a "FunctionQuery",
and in my case I have some documents
which have the same values in many fields (=> same score) where the only
difference is the year.

However, I don't want to overboost so that the scoring for the
other criteria is no longer considered.

In short: as the result of a search I have a list of book titles, and
I want to sort by score AND by year of publication.

But for performance reasons I want to avoid this sorting at query time
by boosting at index time instead.

Is that possible?
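
If the one-byte norm resolution makes index-time year boosting too coarse, a
secondary sort key at query time is usually cheap once the year is indexed as
a plain, untokenized field; a sketch (the "EJ" field name, query and searcher
are assumptions):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

Sort sort = new Sort(new SortField[] {
    SortField.FIELD_SCORE,                      // primary: relevance
    new SortField("EJ", SortField.INT, true)    // secondary: newest year first
});
Hits hits = searcher.search(query, sort);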

thanks,
Martin






> 



-- 
Universitaetsbibliothek Heidelberg   Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg   Fax: +49 6221 54-2623




SpanFirstQuery and multiple field instances

2006-12-21 Thread Martin Braun
hello,

With a SpanFirstQuery I want to realize a "starts with" search -
that seems to work fine. But I have the problem that I have documents
with multiple titles, and I thought I could do an sfq search for each
title by adding multiple instances of the specific field:

for (String key : title.getTitel().split("\\n")) {
    titleDocument.add(new Field("TI", key,
            Field.Store.NO, Field.Index.TOKENIZED));
}

but that didn't work. The query only finds matches in the first value
added to that field of a document.

Is there a way to do a SpanFirstQuery for each title value?
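
All values added under one field name end up in a single token stream, and
SpanFirstQuery counts positions from the very start of that stream, which is
why only the first title can match. One workaround sketch (a suggestion, not
from the thread): prepend a sentinel token to every title value at index time
and anchor the query on it with a SpanNearQuery instead of a SpanFirstQuery.
It assumes the analyzer keeps the sentinel intact and does not drop any of
the query terms (stop words would break the contiguity):

import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// indexing: mark the beginning of every title value (sentinel name is made up)
for (String key : title.getTitel().split("\\n")) {
    titleDocument.add(new Field("TI", "titlestart " + key,
            Field.Store.NO, Field.Index.TOKENIZED));
}

// querying: sentinel followed immediately, in order, by the search terms
String[] words = search.toLowerCase().split("\\s+");
SpanQuery[] clauses = new SpanQuery[words.length + 1];
clauses[0] = new SpanTermQuery(new Term("TI", "titlestart"));
for (int i = 0; i < words.length; i++) {
    clauses[i + 1] = new SpanTermQuery(new Term("TI", words[i]));
}
Query startsWithAnyTitle = new SpanNearQuery(clauses, 0, true);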

tia,
martin





autocomplete with multiple terms

2007-02-22 Thread Martin Braun
Hello All,

I am implementing a query auto-complete function à la Google. Right now
I am using a TermEnum on a specific field and list the terms found.
That works well for searches with only one term, but when the user types
two or three words the function autocompletes each term individually -
and the problem is that the combination of those terms may well return
no results.
An autocomplete function should be really fast, so searching for all
possible combinations of the terms wouldn't be a good solution.

So my strategy has hit a dead end.

Does anybody know a better way?

I am not sure we get enough queries to build and search an index based
on past user queries.

The only thing I have found on the list concerning this subject
is http://issues.apache.org/jira/browse/LUCENE-625, but I'm not sure
it does what I want.
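
One alternative sketch (not from the thread): complete only the last term the
user is typing and require the earlier terms to be present, so only
combinations that actually occur in the index get suggested (userInput,
searcher and the "TI" field are assumptions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;

String[] typed = userInput.trim().toLowerCase().split("\\s+");
BooleanQuery q = new BooleanQuery();
for (int i = 0; i < typed.length - 1; i++) {                     // terms already completed
    q.add(new TermQuery(new Term("TI", typed[i])), BooleanClause.Occur.MUST);
}
q.add(new PrefixQuery(new Term("TI", typed[typed.length - 1])),  // term still being typed
      BooleanClause.Occur.MUST);
Hits hits = searcher.search(q);   // show a handful of matching field values as suggestions

A PrefixQuery rewrites into all matching terms, so on a very large vocabulary
it can get expensive (or hit the BooleanQuery clause limit); keeping the
suggestion field's vocabulary small helps.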

tia,
martin






recovering an index from RAM disk.

2007-02-27 Thread Martin Spamer

I generate my index on the file system and load that index into a
RAMDirectory for speed.  If my indexer fails, the directory-based index
can be left in an inadequate state for my needs.  I therefore wish to
flush the current index from the RAMDirectory back to the file system.
The RAMDirectory class doesn't seem to support this directly.  Is this
possible, and can anybody give me some pointers?
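
A sketch of one way to do it: merge the RAMDirectory into a fresh on-disk
index via addIndexes (the path, analyzer and the ramDir variable are
assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory onDisk = FSDirectory.getDirectory("/path/to/recovered-index");
IndexWriter writer = new IndexWriter(onDisk, new StandardAnalyzer(), true); // true = create new index
writer.addIndexes(new Directory[] { ramDir });  // ramDir is the populated RAMDirectory
writer.optimize();
writer.close();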







Re: similar contrib in lucene 2.1.0

2007-03-02 Thread Martin Braun
Hi Hans,
> 
> I'm in the process to upgrade from 2.0 to 2.1, but are missing the
> similar contrib (the jar only contains a Manifest). is this a bug or
> is that on purpose?

Take a look in:
 lucene-2.1.0/contrib/queries/
This is the new home; the changelog explains why the code moved...

hth,
martin


> 
> Cheers
> Hans Lund
> 





Spelt, for better spelling correction

2007-03-20 Thread Martin Haye

As part of XTF, an open source publishing engine that uses Lucene, I
developed a new spelling correction engine specifically to provide "Did you
mean..." links for misspelled queries. I and a small group are preparing
this for submission as a contrib module to Lucene. And we're inviting
interested people to join the discussion about it.

The new engine is being called "Spelt" and differs from the one currently in
Lucene contrib in the following ways:

- More accurate: Much better performance on single-word queries (90% correct
in #1 slot in my tests). On general list including multi-word queries, gets
80%+ correct.
- Multi-word: Handles and corrects multi-word queries such as "harrypotter"
-> "harry potter".
- Fast: In my tests, builds the dictionary more than 30 times faster.
- Small: Dictionary size is roughly a third of that built by the existing
engine.
- Other bells and whistles...

There is already a standalone test program that people can try out, and
we're interested in feedback. If you're interested in discussing, testing,
or previewing, consider joining the Google group:
http://groups.google.com/group/spelt/

--Martin


Re: Spelt, for better spelling correction

2007-03-21 Thread Martin Haye

The dictionary is generated from the corpus, with the result that a larger
corpus gives better results.

Words are queued up during an index run, and at the end are munged to create
an optimized dictionary. It also supports incremental building, though the
overhead would be too much for those applications that are continuously
adding things to an index. Happily, it's not as important to keep the
spelling dictionary absolutely up to date, so it would be fine to queue
words over several index runs, and refresh the dictionary less often.

--Martin

On 3/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


Sounds interesting Martin!
Is the dictionary static, or is it generated from the corpus or from
user queries?

-Yonik




Re: Spelt, for better spelling correction

2007-03-22 Thread Martin Haye

Otis,

I hadn't really thought about this, but it would be easy to build a
dictionary from an existing Lucene index. The main caveat is that it would
only work with "stored" fields. That's because this spellchecker boosts
accuracy using pair frequencies in addition to term frequencies, and Lucene
doesn't need or track pair frequencies to my knowledge. So any field which
you wanted to spellcheck would need to be indexed with Field.Store.YES.

Of course a side effect is that they'd have to be Analyzed again, with the
resulting time cost. Still, this could make sense for a lot of people.

I'll make sure the contribution includes an index-to-dictionary API, and
thank you very much for the input.

--Martin

On 3/21/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:


Martin,
This sounds like the spellchecker dictionary needs to be built in parallel
with the main Lucene index.  Is it possible to create a dictionary out of an
existing (and no longer modified) Lucene index?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share





phrases containing escaped quotes

2007-05-15 Thread Martin Kobele
Hi,

I tried to parse the following phrase: "foo \"bar\""
I get the following exception:
org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 
18.  Encountered: <EOF> after : "\") "

Am I mistaken that "foo \"bar\"" is a valid phrase?

Thanks!
Martin




Re: phrases containing escaped quotes

2007-05-15 Thread Martin Kobele
Thank you! I was indeed using Lucene 2.0, and it works very nicely with 2.1.
Thanks!

Martin
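
For reference, a sketch of the parse that works on 2.1 (field name and
analyzer are arbitrary):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
Query q = parser.parse("\"foo \\\"bar\\\"\"");  // the phrase "foo \"bar\"" as a Java string literal
System.out.println(q);                          // parse() throws ParseException on the old 2.0 lexer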

On Tuesday 15 May 2007 09:59:42 Michael Busch wrote:
> Martin Kobele wrote:
> > Hi,
> >
> > I tried to parse the following phrase: "foo \"bar\""
> > I get the following exception:
> > org.apache.lucene.queryParser.ParseException: Lexical error at line 1,
> > column 18.  Encountered: <EOF> after : "\") "
> >
> > Am I mistaken that "foo \"bar\"" is a valid phrase?
> >
> > Thanks!
> > Martin
>
> Hi Martin,
>
> which Lucene version are you using? This was a bug in Lucene 2.0 and
> earlier versions, but should be fixed in 2.1.
>
>
> Regards,
> Michael
>





Obtain Lock file timeout during deleteDocument()

2007-05-30 Thread Martin Kobele
Hi,

I am a little confused, probably because I missed some detail when looking
through the code of Lucene 2.1.

Scenario: deleting documents works for a while; eventually I get an
exception that obtaining the lock file has timed out.


I was trying to find an answer to this.
I call IndexReader.deleteDocument() for the _first_ time.
If my index has several segments, my IndexReader is actually a MultiReader.
Therefore the variable directoryOwner is set to true and as the first step, a 
lock file is created. After that, the document is marked as deleted.

If I call deleteDocument again, it may or may not work.
Now, just from reading the code (and I am sure I am missing some details), I am
wondering how I can successfully call deleteDocument again. The code will
try to obtain the lock file again, but since it is already there, it will time
out. That is the point where I am confused: after I have deleted a document, the
write.lock file is still there, and directoryOwner is still true.


Maybe knowing more about this will help me to find out why I get the 
exception "Lock obtain timed out" after a while and after several successful 
document deletions.

Thank you!

Regards,
Martin




Re: Obtain Lock file timeout during deleteDocument()

2007-05-30 Thread Martin Kobele

On Wednesday 30 May 2007 11:49:41 Michael McCandless wrote:
> "Martin Kobele" <[EMAIL PROTECTED]> wrote:
> > I was trying to find an answer to this.
> > I call IndexReader.deleteDocument() for the _first_ time.
> > If my index has several segments, my IndexReader is actually a
> > MultiReader. Therefore the variable directoryOwner is set to true and as
> > the first step, a lock file is created. After that, the document is
> > marked as deleted.
> >
> > If I call deleteDocument again, it may or may not work.
> > Now by just reading the code, and I am sure I am missing some details, I
> > am wondering, how can I successfully call deleteDocument again? The code
> > will try to obtain the lockfile again, but since it is already there, it
> > will time out. That is the point where I am confused. After I deleted a
> > document, the write.lock file is still there, and directoryOwner is still
> > true.
>
> The lock should be acquired only on the first call to deleteDocument.
> That method calls acquireWriteLock which only acquires the write lock
> if it hasn't already (ie, writeLock == null).  So the 2nd call to
> deleteDocument would see that the lock was already held and should not
> then try to acquire it again.

oh yeah, darn, I knew I missed that little detail! ;)

>
> You are only using a single instance of IndexReader, right?  If for
> example you try to make a new instance of IndexReader and then call
> deleteDocument on that new one, then you would hit the exception
> unless you had closed the first one.

Yeah, I use only one single instance of the IndexReader of this particular 
index. But I have like 20 IndexReaders of 20 different indexes open.


Thanks,
Martin





Re: Obtain Lock file timeout during deleteDocument()

2007-05-30 Thread Martin Kobele
On Wednesday 30 May 2007 11:53:09 Martin Kobele wrote:
> On Wednesday 30 May 2007 11:49:41 Michael McCandless wrote:
> > You are only using a single instance of IndexReader, right?  If for
> > example you try to make a new instance of IndexReader and then call
> > deleteDocument on that new one, then you would hit the exception
> > unless you had closed the first one.
>
> Yeah, I use only one single instance of the IndexReader of this particular
> index. But I have like 20 IndexReaders of 20 different indexes open.
>
Oh dear, it was my own fault.
I was hoping I could monitor the directory's last-modified time to
determine whether somebody has replaced the index. But Lucene itself modifies
the directory's timestamp when you delete documents.
I am wiser now ;)
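
A sketch of a more robust check than the directory's mtime: ask Lucene for the
index version itself (API as in 2.x; reader and indexPath are assumptions):

import org.apache.lucene.index.IndexReader;

long versionAtOpen = reader.getVersion();                  // version when this reader was opened
boolean changed = !reader.isCurrent();                     // true once the index on disk has moved on
long onDisk = IndexReader.getCurrentVersion(indexPath);    // version currently committed on disk
if (changed) {
    // close this reader and open a fresh one against the directory
}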

Thanks!

Martin


-- 
Martin Kobele
Software Developer
t. 519-826-5222 ext #224
f. 519-826-5228
[EMAIL PROTECTED]
Netsweeper Corporate Head Office
104 Dawson Road
Guelph, Ontario
N1H 1A7



