RE: umlauts / diacritic expansion

2019-04-16 Thread Markus Jelsma
Hello Michael, For the case of normalizing ü to ue, take a look at the German normalizer [1]. Regards, Markus [1] https://lucene.apache.org/core/7_6_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html -Original message- > From:Ralf Heyde > Sent: Tuesday
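
For readers skimming the archive, the expansion asked about in this thread (ü to ue) can be sketched in plain Java. This is only an illustration under assumptions: the class name, the method name and the mapping table are made up here, and the code neither calls nor reproduces GermanNormalizationFilter, whose heuristics are described in the Javadoc linked above.

```java
import java.util.Map;

class UmlautExpansionSketch {
    // Hypothetical mapping table for illustration only (not taken from Lucene).
    // Unicode escapes: \u00e4 = a-umlaut, \u00f6 = o-umlaut, \u00fc = u-umlaut,
    // \u00c4, \u00d6, \u00dc = the upper-case forms.
    private static final Map<Character, String> EXPANSIONS = Map.of(
            '\u00e4', "ae", '\u00f6', "oe", '\u00fc', "ue",
            '\u00c4', "Ae", '\u00d6', "Oe", '\u00dc', "Ue");

    // Replaces each mapped character by its two-letter expansion and
    // copies every other character unchanged.
    static String expand(String input) {
        StringBuilder out = new StringBuilder(input.length() + 4);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = EXPANSIONS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("M\u00fcller"));
    }
}
```

In a real analysis chain this kind of mapping would live in a char filter or token filter applied at both index and query time, so that both spellings end up as the same term.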

RE: 8.0.0 ClassCastException in ValueSource

2019-03-27 Thread Markus Jelsma
t: Re: 8.0.0 ClassCastException in ValueSource > > Hi Markus, > > Thanks for reporting this. It looks like a side-effect of the Scorable > refactoring, can you open a JIRA issue? > > On Wed, Mar 20, 2019 at 5:01 PM Markus Jelsma > wrote: > > > > Hello, > > >

8.0.0 ClassCastException in ValueSource

2019-03-20 Thread Markus Jelsma
Hello, Upgraded to Lucene and Solr 8.0 and ran all our unit tests, this one popped up: Caused by: java.lang.ClassCastException: org.apache.lucene.queries.function.ValueSource$ScoreAndDoc cannot be cast to org.apache.lucene.search.Scorer at org.apache.lucene.queries.function.ValueSource

RE: Query-of-Death Lucene/Solr 7.6

2019-02-08 Thread Markus Jelsma
Hello, I think i tracked it further down to LUCENE-8589 or SOLR-12243. When i leave Solr's edismax pf parameter empty, everything runs fast. When all fields are configured for pf, the node dies. I am now unsure whether i am on the right list, or if i should move to Solr's. Please let me know

Query-of-Death Lucene/Solr 7.6

2019-02-08 Thread Markus Jelsma
Hello, While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the remaining four, we stumbled upon a situation where the 7.6 nodes quickly succumb when a 'Query-of-Death' is issued; 7.2.1 up to 7.5 are all unaffected (tested and confirmed). Following Smiley's suggestion i used Eclips

RE: An example for creating SynonymMap Object?

2018-10-15 Thread Markus Jelsma
regards > > > > On 10/15/18 3:28 PM, Markus Jelsma wrote: > > Hello Baris, > > > > Check out the filter factory and the map parser for a more low level > > example: > >

RE: An example for creating SynonymMap Object?

2018-10-15 Thread Markus Jelsma
Hello Baris, Check out the filter factory and the map parser for a more low level example: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymGraphFilterFactory.java https://github.com/apache/lucene-solr/blob/master/lucene/a

RE: Lucene same search result for worlds with and without spaces

2018-06-20 Thread Markus Jelsma
Hi Egorlex, Set the tokenSeparator to "" and ShingleFilter will concatenate all shingles without whitespace. Keep in mind, this will greatly increase the size of the index so it might not be a good idea to concatenate all pairs of words. If you want to find "similarissues" with "sim
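
The effect of an empty token separator can be sketched in plain Java. This is a minimal illustration, not Lucene's ShingleFilter: the class and method names are assumptions, and only two-word shingles are produced, whereas the real filter has further options (for example whether the single words are emitted as well).

```java
import java.util.ArrayList;
import java.util.List;

class ShingleConcatSketch {
    // Builds two-word shingles by joining each token with its successor,
    // using the given separator ("" gives "similarissues" style terms).
    static List<String> bigramShingles(List<String> tokens, String separator) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + separator + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three input tokens yield two concatenated pairs.
        System.out.println(bigramShingles(List.of("similar", "issues", "found"), ""));
    }
}
```

Every adjacent pair becomes an extra term, which is where the index growth mentioned above comes from.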

RE: Rewrite SynonymQuery to support payloads

2018-05-24 Thread Markus Jelsma
ull Request is attached but it has not been reviewed yet. > Give it a look, and then we can continue the discussion here! > let me know if you feel your requirement is different ! > > Cheers > > On Wed, May 23, 2018 at 11:41 AM, Markus Jelsma > wrote: > > > Hel

Rewrite SynonymQuery to support payloads

2018-05-23 Thread Markus Jelsma
Hello, To support payloads we rewrite SynonymQuery to a pair of SpanTerm queries which we then can wrap in the PayloadScoreQuery. This is not the right way to do this because if both clauses match, both are also scored. We could try to rewrite SynonymQuery to a SpanOrQuery but i suppose that w

Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-14 Thread Markus Jelsma
Hello, First, apologies for the weird subject line, and apologies for cross-posting, but last week it got no replies on the Solr user mailing list. We index many languages and search over all those languages at once, but boost the language of the user's preference. To differentiate between ste

RE: German decompounding/tokenization with Lucene?

2017-09-16 Thread Markus Jelsma
quest. :) > > Uwe > > On 16 September 2017 12:42:30 CEST, Markus Jelsma > wrote: > >Hello Uwe, > > > >Thanks for getting rid of the compounds. The dictionary can be smaller, > >it still has about 1500 duplicates. It is also unsorted. > > > >R

RE: German decompounding/tokenization with Lucene?

2017-09-16 Thread Markus Jelsma
Hello Uwe, Thanks for getting rid of the compounds. The dictionary can be smaller, it still has about 1500 duplicates. It is also unsorted. Regards, Markus -Original message- > From:Uwe Schindler > Sent: Saturday 16th September 2017 12:16 > To: java-user@lucene.apache.org > Subject: R

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
sis/tokenattributes/TypeAttribute.html > [2] : > https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html > > On Wed 14 Jun 2017 at 23:33, Markus Jelsma < > markus.jel...@openindex.io> wrote: > > > Hello E

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
rs and the like to make sense of your > payload > > Best, > Erick (Erickson, not Hatcher) > > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma > wrote: > > Hello Erik, > > > > Using Solr, or actually more parts are Lucene, we have a CharFilter adding > >

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
23:03 > To: java-user@lucene.apache.org > Subject: Re: Using POS payloads for chunking > > Markus - how are you encoding payloads as bitsets and use them for scoring? > Curious to see how folks are leveraging them. > > Erik > > > On Jun 14, 2017, at 4:45 PM, Mar

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello, We use POS-tagging too, and encode them as payload bitsets for scoring, which is, as far as i know, the only possibility with payloads. So, instead of encoding them as payloads, why not index your treebank's POS-tags as tokens on the same position, like synonyms? If you do that, you can
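
The idea of indexing a tag as a token on the same position as its word, the way synonyms are stacked, can be sketched without any Lucene classes. Everything named here is an assumption made for illustration: the Token class, the interleave method and the "_pos_" prefix are hypothetical, and a real implementation would be a TokenFilter that sets a position increment of 0 on the tag token.

```java
import java.util.ArrayList;
import java.util.List;

class PosTokenSketch {
    // Minimal stand-in for a token with a position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Emits each word with increment 1, followed by its tag with increment 0,
    // so the tag occupies the same position as the word.
    static List<Token> interleave(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], 1));
            out.add(new Token("_pos_" + tags[i], 0)); // hypothetical tag prefix
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(interleave(new String[] {"drink", "water"}, new String[] {"VERB", "NOUN"}));
    }
}
```

With the tags on the same positions as the words, positional queries can mix words and tags in one phrase or span.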

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-04 Thread Markus Jelsma
Ok, we decided not to implement PositionLengthAttribute for now: either it is badly applied (how could one even misapply that attribute?), or Solr's QueryBuilder has a weird way of dealing with it, or... well. Thanks, Markus -Original message- > From:Markus Jelsma > Sent: Monday

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-01 Thread Markus Jelsma
Hello again, apologies for cross-posting and having to get back to this unsolved problem. Initially i thought this is a problem i have with, or in, Lucene. Maybe not, so is this problem in Solr? Is there anyone here who has seen this problem before? Many thanks, Markus -Original message- > Fr

Term no longer matches if PositionLengthAttr is set to two

2017-04-25 Thread Markus Jelsma
Hello, We have a decompounder and recently implemented the PositionLengthAttribute in it and set it to 2 for a two-word compound such as drinkwater (drinking water in Dutch). The decompounder runs both at index- and query-time on Solr 6.5.0. The problem is, q=content_nl:drinkwater no longer ret

RE: Lucene

2017-02-08 Thread Markus Jelsma
Hello - you are on the wrong list, this is the Lucene java-user list, not the Solr user mailing list. But this is what you are looking for: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika https://wiki.apache.org/solr/ExtractingRequestHandler First is offic

RE: question

2017-01-16 Thread Markus Jelsma
Yes, they should be the same unless the field is indexed with shingles; in that case order matters. Markus -Original message- > From:Julius Kravjar > Sent: Monday 16th January 2017 18:20 > To: java-user@lucene.apache.org > Subject: question > > May I have one question? One company - w

Offset bug in WordDelimiterFilter?

2016-12-06 Thread Markus Jelsma
Hello - i noticed something peculiar running Lucene/Solr 6.3.0. The plural vaccinatieprogramma's should have a startOffset of 0 and an endOffset of 21 when passed through WordDelimiterFilter and/or stemmers but it doesn't, slightly messing up highlighted terms. wdf = new WordDelimiterFilter(ne

Range query on date field

2016-11-24 Thread Markus Jelsma
Hi - i seem to be having trouble correctly executing a range query on a date field. The following Solr document is indexed via a unit test followed by a commit: view test_key 2013-01-09T17:11:40Z I can retrieve the document by simply wrapping term queries in a boolean query like

RE: Upgrade 6.2.x Char* API's

2016-09-21 Thread Markus Jelsma
> Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Wednesday, September 21, 2016 12

Upgrade 6.2.x Char* API's

2016-09-21 Thread Markus Jelsma
Hello - upgrading one of our libraries to 6.2.0 failed due to LUCENE-7318. This is fixed nicely on 6.2.1, many thanks for that! Upgrading to 6.2.1, however, still raises compile errors. I haven't seen any notice of this in CHANGES.txt or its API changes section for both 6.2.x versions. Any tips

RE: LowerCaseFilter gone in 6.2.0

2016-08-31 Thread Markus Jelsma
//www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Wednesday, August 31, 2016 11:08 AM > > To: java-user@lucene.apache.org > > Subject: LowerCaseFilter gone in 6.2.0

LowerCaseFilter gone in 6.2.0

2016-08-31 Thread Markus Jelsma
Hello - i'm upgrading a project that uses Lucene to 6.2.0 and get the compile error that LowerCaseFilter does not exist. And, so it seems, the JavaDoc is gone too. I've checked CHANGES.txt and there is no mention of it, not even in the API changes section. Any ideas? Thanks, Markus https://l

RE: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Markus Jelsma
or indices in which stop > words are eliminated. > Therefore, most of the term-weighting models have problems scoring common > terms. > By the way, DFI model does a decent job when handling common terms. > > Ahmet > > > > On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma

BlendedTermQuery causing negative IDF?

2016-04-19 Thread Markus Jelsma
Hello, I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 similarity and i have a very simple unit test to see if something is working at all. But to my surprise, one of the results has a negative score, caused by a negative IDF because docFreq is higher than docCount
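
The arithmetic behind a negative IDF can be checked with a few lines of Java. The formula below is the BM25 idf as i recall it from BM25Similarity in the 6.x line, log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); treat both the formula and the sample numbers as assumptions, since the message above does not state them.

```java
class Bm25IdfSketch {
    // BM25-style idf (assumed form, see the note above).
    // Positive while docFreq <= docCount; drops below zero once docFreq
    // exceeds docCount by enough to push the log argument below 1.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // Ordinary statistics: the term occurs in 1 of 10 documents.
        System.out.println(idf(1, 10));
        // Inconsistent statistics: docFreq (3) is higher than docCount (1).
        System.out.println(idf(3, 1));
    }
}
```

With docFreq = 3 and docCount = 1 the argument of the logarithm is 1 - 1.5/3.5, which is below 1, so the idf and with it the score turn negative.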

RE: Problem with porter stemming

2016-03-14 Thread Markus Jelsma
Hi - if you don't want specific words passed through a stemmer, you need to supply a CharArraySet with exclusions as the second argument to its constructor. Markus -Original message- > From:Dwaipayan Roy > Sent: Monday 14th March 2016 15:31 > To: java-user@lucene.apache.org > Subject: Pr

RE: Jira issue for possibly transient resource issue, or a Lucene or JVM bug?

2016-01-21 Thread Markus Jelsma
ug? > > LUCENE-6970 > > On Thu, Jan 21, 2016 at 4:07 PM, Markus Jelsma > wrote: > > > Hi - we get the above issue as well some times. I've noticed Lucene-dev > > mails on this issue [1] but i couldn't find a corresponding Jira issue? Any > > po

Jira issue for possibly transient resource issue, or a Lucene or JVM bug?

2016-01-21 Thread Markus Jelsma
Hi - we get the above issue as well sometimes. I've noticed Lucene-dev mails on this issue [1] but i couldn't find a corresponding Jira issue. Any pointer to that one? Many thanks, Markus [1] http://mail-archives.apache.org/mod_mbox/lucene-dev/201601.mbox/%3CCAPsWd+OWZpRLXCyXsvhjufvouM=haavxu

RE: propagate Query.rewrite call to super.rewrite after 5.4 upgrade

2015-12-17 Thread Markus Jelsma
that by > prepending the following to your rewrite(IndexReader) implementation: > > if (getBoost() != 1f) { return super.rewrite(reader); } > > > On Thu 17 Dec 2015 at 13:23, Markus Jelsma > wrote: > > > Hi, > > > > Apologies for the cross pos

propagate Query.rewrite call to super.rewrite after 5.4 upgrade

2015-12-17 Thread Markus Jelsma
Hi, Apologies for the cross post. We have a class overriding SpanPositionRangeQuery. It is similar to a SpanFirst query but it is capable of adjusting the boost value with regard to distance. With the 5.4 upgrade the unit tests suddenly threw the following exception: Query class org.GrSpanFir

RE: LUCENE-5388 AbstractMethodError

2014-01-30 Thread Markus Jelsma
ll the wrong thing to do. > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > S

RE: LUCENE-5388 AbstractMethodError

2014-01-30 Thread Markus Jelsma
D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Thursday, January 30, 2014 10:50 AM > > To: java-user@lucene.apache.org > > Subject: LUCENE-538

LUCENE-5388 AbstractMethodError

2014-01-30 Thread Markus Jelsma
Hi, Apologies for cross posting; i got no response on the Solr list. We have a development environment running trunk but have custom analyzers and token filters built on 4.6.1. Now the constructors have changed somewhat and stuff breaks. Here's a consumer trying to get a TokenStream from an An

Coordination factor disabled for BM25 and other new scoring models

2013-08-22 Thread Markus Jelsma
Hi, I know it is recommended to disable the coordination factor when using models other than the default TFIDFSimilarity. And out of curiosity i'd like to know the motivation behind it but it is not explained anywhere, not even in LUCENE-2959, the patches, wiki, PDFs or whatever. So, anyone here

Final token filters

2013-08-19 Thread Markus Jelsma
Hi, This has likely been discussed before but i couldn't find it. Why are most token filters final, or why are most or all of their members private and/or final? It is impossible to customize token filters by extending them; instead we need to copy code around. How do you customize for example some bits with

RE: "read past EOF" when merge

2012-11-05 Thread Markus Jelsma
to the new code that > uses Directory for replication. > > - Mark > > On Nov 2, 2012, at 6:53 AM, Markus Jelsma wrote: > > > Hi, > > > > For what it's worth, we have seen similar issues with Lucene/Solr from this > > week's trunk. The issue

RE: "read past EOF" when merge

2012-11-02 Thread Markus Jelsma
No this is not using NFS but EXT3 on SSD. Thanks -Original message- > From:Michael McCandless > Sent: Fri 02-Nov-2012 16:22 > To: java-user@lucene.apache.org > Subject: Re: "read past EOF" when merge > > On Fri, Nov 2, 2012 at 6:53 AM, Markus Jelsma

RE: "read past EOF" when merge

2012-11-02 Thread Markus Jelsma
Hi, For what it's worth, we have seen similar issues with Lucene/Solr from this week's trunk. The issue manifests itself when it wants to replicate. The servers have not been taken offline and did not crash when this happened. 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] - [h

RE: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter

2012-10-05 Thread Markus Jelsma
- > From:Thomas Matthijs > Sent: Thu 04-Oct-2012 15:55 > To: java-user@lucene.apache.org > Subject: Re: Highlighter IOOBE with modified > HyphenationCompoundWordTokenFilter > > And to include the code > > On Thu, Oct 4, 2012 at 3:52 PM, Markus Jelsma > wrote: >

RE: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter

2012-10-04 Thread Markus Jelsma
I forgot to add that this is with today's build of trunk. -Original message- > From:Markus Jelsma > Sent: Thu 04-Oct-2012 15:42 > To: java-user@lucene.apache.org > Subject: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter > > Hi, > > I've modified the HyphenationComp

Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter

2012-10-04 Thread Markus Jelsma
Hi, I've modified the HyphenationCompoundWordTokenFilter to emit fewer subtokens because the original filter can emit all kinds of subtokens that have a very different meaning on their own. I've modified it so no overlapping subtokens are emitted and no subtokens are emitted that can be found wi

Re: what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread Markus Jelsma
You should ask on the Droids list but there's some activity in Jira. And did you consider Apache Nutch? On Tuesday 23 August 2011 10:17:50 Li Li wrote: > hi all > I am interested in vertical crawler. But it seems this project is not > very active. It's last update time is 11/16/2009 ---

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Markus Jelsma
> [X] ASF Mirrors (linked in our release announcements or via the Lucene > website) > > [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) > > [X] I/we build them from source via an SVN/Git checkout. > > [] Other (someone in your company mirrors them internally or via a > downst