Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream

2011-12-16 Thread Thushara Wijeratna
I got this exception while indexing with Lucene 3.4:

Exception in thread "Thread-0" java.lang.IllegalArgumentException: Illegal shift value, must be 0..31
    at org.apache.lucene.util.NumericUtils.intToPrefixCoded(NumericUtils.java:157)
    at org.apache.lucene.analysis.NumericTokenStream.incrementToken(NumericTokenStream.java:217)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:185)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2067)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2041)
    at com.adxpose.affinity.IndexerHelper.index(IndexerHelper.java:797)
    at com.adxpose.affinity.IndexerHelper$Clerk.run(IndexerHelper.java:433)
    at java.lang.Thread.run(Thread.java:662)


It is not clear to me why the NumericTokenStream is being called here, as
my analyzer does not use it. Any clues would be much appreciated.


thx,

thushara


Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream

2011-12-16 Thread Thushara Wijeratna
Yes, there is one.

This is how the field is being created:

new NumericField("timestamp", Field.Store.NO, true);

Thus, the field is not stored, but indexed.

thx,
thushara


On Fri, Dec 16, 2011 at 3:28 PM, Uwe Schindler  wrote:

> Do you have NumericFields? If yes, how are they configured?
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream

2011-12-18 Thread Thushara Wijeratna
Yes, I use this field to set a timestamp (an int). And I'm not using the
special constructor, so I must be using the default precision step.
Java version : 1.6.0_24

mpire@seafcmr16:~$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Also: I have only seen this when multiple threads within the app are
writing to a single Lucene index. But it is rare.

I'm attaching the indexing code.

Could you also point me to the JVM bug you suspect to be the cause?

thx,
thushara

On Fri, Dec 16, 2011 at 4:07 PM, Uwe Schindler  wrote:

> Hi,
>
> Thanks, this *may* cause the exception, but it is impossible that the
> exception stack trace you are posting occurs in Lucene's code with a
> default precision step on a numeric field, as you use here. I assume it's
> a 32-bit integer (NumericField.setIntValue or setFloatValue)?
>
> Please provide us your full Java version (java -version) and ideally the
> full source code you use during indexing. The only way you can get this
> exception is through a JVM bug.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream

2011-12-18 Thread Thushara Wijeratna
This is difficult to repro. I'm not using any JVM flags. It does seem
that the following code could never call NumericUtils.intToPrefixCoded
with a shift > 31 (or shift < 0), so I tend to agree this must be a JVM
bug. Looking through all the logs I have for December, I found only one
instance of this issue. It seems to have nothing to do with
concurrency, so it must have to do with the value set in the
NumericField; the bug would then be triggered by a particular timestamp.


From:
http://javasourcecode.org/html/open-source/lucene/lucene-3.3.0/org/apache/lucene/analysis/NumericTokenStream.java.html


  public boolean incrementToken() {
    if (valSize == 0)
      throw new IllegalStateException("call set???Value() before usage");
    if (shift >= valSize)
      return false;

    clearAttributes();
    final char[] buffer;
    switch (valSize) {
      case 64:
        buffer = termAtt.resizeBuffer(NumericUtils.BUF_SIZE_LONG);
        termAtt.setLength(NumericUtils.longToPrefixCoded(value, shift, buffer));
        break;

      case 32:
        buffer = termAtt.resizeBuffer(NumericUtils.BUF_SIZE_INT);
        termAtt.setLength(NumericUtils.intToPrefixCoded((int) value, shift, buffer));
        break;

      default:
        // should not happen
        throw new IllegalArgumentException("valSize must be 32 or 64");
    }

    typeAtt.setType((shift == 0) ? TOKEN_TYPE_FULL_PREC : TOKEN_TYPE_LOWER_PREC);
    posIncrAtt.setPositionIncrement((shift == 0) ? 1 : 0);
    shift += precisionStep;
    return true;
  }
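
To check the claim that a single thread can never reach a shift above 31, the guard and the update can be modelled outside Lucene. This is only a minimal sketch, not Lucene code: the class name ShiftLoopSketch and the method shiftsFor are made up for illustration, and the precision step of 4 is assumed to be the default here. It mirrors just the `shift >= valSize` check and the `shift += precisionStep` update.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the loop in incrementToken(); not Lucene code.
class ShiftLoopSketch {

    // Returns every shift value the loop would hand to the prefix encoder
    // when a single thread drains the stream starting from shift = 0.
    static List<Integer> shiftsFor(int valSize, int precisionStep) {
        List<Integer> shifts = new ArrayList<Integer>();
        int shift = 0;
        while (shift < valSize) {     // mirrors: if (shift >= valSize) return false;
            shifts.add(shift);        // mirrors: intToPrefixCoded((int) value, shift, buffer)
            shift += precisionStep;   // mirrors: shift += precisionStep;
        }
        return shifts;
    }

    public static void main(String[] args) {
        // Assumed configuration: 32-bit value, precision step 4.
        System.out.println(shiftsFor(32, 4)); // prints [0, 4, 8, 12, 16, 20, 24, 28]
    }
}
```

Under these assumptions the largest shift passed to the encoder is 28, which is consistent with the observation that the code, run by one thread on its own stream, cannot trigger the exception.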


On Sun, Dec 18, 2011 at 2:50 PM, Uwe Schindler  wrote:

> Hi,
>
> Can you try 1.6.0_29, or disable HotSpot by using the "-Xint" JVM startup
> flag (just to test; I know it's slow then)? Are you *not* using
> "-XX:+AggressiveOpts" as a JVM parameter?
>
> The JVM bug which may lead to this is a sign-flip bug:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921 (see also
> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/004942.html
> )
>
> Otherwise, is everything fine if you remove the numeric field? The code you
> are using can never cause such behavior; this is extensively tested.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>

Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream

2011-12-19 Thread Thushara Wijeratna
Actually, a single timestamp field is being used by several threads.
Sorry, I missed that, and thanks to both Peter and Uwe for the explanations.
[In my code snippet I was trying to simplify, so I missed this. I'm
constructing one timestamp field and passing it to all threads in the ctor.]
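
For anyone hitting the same exception, here is one interleaving that would produce it when a field is shared. This is a hand-replayed sketch under assumptions, not Lucene code: it assumes the race is on the stream's shift counter, the class SharedFieldSketch and its methods are hypothetical, and the precision step of 4 is assumed to be the default. Thread A passes the guard at shift 28, thread B advances the shared counter to 32, and A then encodes with an out-of-range shift.

```java
// Hypothetical, simplified model of a token stream with a mutable shift counter.
// This is NOT Lucene code; it only replays one unlucky interleaving by hand.
class SharedFieldSketch {
    static final int VAL_SIZE = 32;        // 32-bit int value
    static final int PRECISION_STEP = 4;   // assumed default precision step

    private int shift;

    SharedFieldSketch(int startShift) {
        this.shift = startShift;
    }

    // First half of incrementToken(): the guard.
    boolean guardPasses() {
        return shift < VAL_SIZE;
    }

    // Second half: encode with the current shift, then advance it.
    void encodeAndAdvance() {
        if (shift < 0 || shift > 31) {
            throw new IllegalArgumentException("Illegal shift value, must be 0..31");
        }
        shift += PRECISION_STEP;
    }

    // Threads A and B interleave on the given instances; returns the outcome for A.
    static String replay(SharedFieldSketch a, SharedFieldSketch b) {
        boolean aMayProceed = a.guardPasses(); // A checks the guard
        b.encodeAndAdvance();                  // B runs a full step in between
        try {
            if (aMayProceed) {
                a.encodeAndAdvance();          // A resumes after B's update
            }
            return "ok";
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // One instance shared by both threads, both at the last token (shift 28).
        SharedFieldSketch shared = new SharedFieldSketch(28);
        System.out.println("shared:   " + replay(shared, shared));

        // One instance per thread: B's update cannot affect A.
        System.out.println("separate: " + replay(new SharedFieldSketch(28), new SharedFieldSketch(28)));
    }
}
```

In the indexing code this would correspond to constructing the timestamp field inside each thread (or per document) instead of passing one instance to every thread's ctor.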

On Mon, Dec 19, 2011 at 5:07 AM, Uwe Schindler  wrote:

> Hi,
>
> NumericUtils is an internal implementation class; you should not use it.
> What do you want to do? There is no need to call any of its methods during
> indexing or searching. Everything else is advanced. In the latter case you
> should RTFM of BytesRef and related classes (possibly watch the flexible
> indexing talk I gave in Berlin, Barcelona or San Francisco). Lucene
> moved to binary terms in 4.0 and no longer uses character-based terms, so
> the code is different. BytesRef is just a wrapper around a byte[].
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: Peter Karich [mailto:peat...@yahoo.de]
> > Sent: Monday, December 19, 2011 1:40 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene 3.4 : shift bug in possibly invalid use of
> > NumericTokenStream
> >
> > BTW: how can I use NumericUtils.longToPrefixCoded in 4.0 ?
> >
> > Peter.
> >
>
>


lucene gosen diff btn jars

2012-03-02 Thread Thushara Wijeratna
I'm testing lucene-gosen for Japanese tokenization and wondering what the
differences are between the two jars provided (ipadic / chasen).
In my preliminary testing, I'm not seeing any difference in tokenization
between these two jars. (The jar with no dictionary did not work; I assume I
need to make a custom dictionary, header.sen, available, which I did not try.)

I tried to tokenize this phrase:

ゴルフが大好きなあなた。
アメリカにあるベスト・ゴルフコース情報が満載のイエローページ・ジャパンでは、オンラインまたはガイド・ブックからもあらゆる情報が簡単に入手できます。
詳しい情報は


which google translates as


You love golf. Best golf course information in the United States is in the
Yellow Pages Japan is full of, any information can be obtained easily from
online or book guide. For more information


I'm getting identical tokenization from both jars, namely :


ゴルフ / Golf
大好き / I love
あなた / You
アメリカ / America
ベスト / best
ゴルフコース / Golf course
情報 / information
満載 / save
イエロ / Hierro
ページ / page
ジャパン / Japan
オンライン / online
ガイド / guide
ブック / book
あらゆる / all
情報 / information
簡単 / simple
入手 / obtaining
できる / able to
詳しい / detailed
情報 / information


Note: translations based on Google Translate


Any pointers you can provide as to the difference of the two methods of
tokenizing would be highly appreciated.


thx,

thushara