Hi Erick, I like the fortune cookie :-)
I came to the same solution as you did, but with a short Java program, trying different patterns by trial and error ;-)

This brings me to the question: is there now (with 4.0) any filter that does this job for me? I took a look at LengthFilter, but it has a different purpose, and TrimFilter is also meant for something else.

By the way, why does the TrimFilter option updateOffset default to false? Just to keep it backwards compatible?

Thanks for your help,
Bernd

On 13.11.2012 02:16, Erick Erickson wrote:
> Because your regex is wrong? (sorry, couldn't resist).
>
> Regexes always give me indigestion. But if you look at your results, your
> regex isn't working in any case at all. The second group is being removed
> from the end of the string. I _think_ what's happening is that the longest
> possible string is being matched (which will usually be your second group).
> Then from what's left, your first group is being captured. If you look at
> what you have above, none of the matches is 31 characters long. I don't
> think you need the second group at all.
>
> This works for me:
> <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30}).*"
>         replacement="$1"
>         replace="all"/>
>
> This pattern works too: pattern="^(.{1,30}).*"
>
> But like I said, I'm no expert with regexes, I usually have to fumble
> around quite a bit to get what I want.
>
> Found in a fortune cookie according to legend:
> "A programmer had a problem. He solved it with regular expressions. Now he
> has two problems".
>
>
> On Mon, Nov 12, 2012 at 9:04 AM, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:
>
>> Yes, it is the second PatternReplaceFilterFactory.
>>
>> The string "Arslanagic, Aida ; Siqveland, Elisabeth" is reduced to "a",
>> whereas the other strings are:
>> "Alexander, Kvam ; Bjørn, Nyland ; Bjørn, Reiten ; Øystein, Huse" -->
>> "alexanderkvambj"
>> "Brennmoen, Ingar ; Hauklien, Øystein ; Hedalen, Trond ; Kvam, Erik" -->
>> "brennmoeningarhauk"
>>
>> Now this explains the sorting (shit in --> shit out).
>>
>> But why is the first string reduced to "a"? Wrong regular expression?
>>
>> Bernd
>>
>>
>> On 12.11.2012 14:51, Bernd Fehling wrote:
>>> The field type is derived from the distributed alphaOnlySort as follows:
>>>
>>> <fieldType name="alphaOnlySortLim" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>>>   <analyzer>
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory" />
>>>     <filter class="solr.TrimFilterFactory" />
>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>             pattern="([\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x9F\u2000-\u206F\uFEFF\uFFF9-\uFFFD])"
>>>             replacement="" replace="all"/>
>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>             pattern="(.{1,30})(.{31,})"
>>>             replacement="$1" replace="all"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> It reduces long lists of author names (100 and more authors) to the
>>> first 30 chars for sorting and removes some illegal chars to keep
>>> sorting with utf8 solid.
>>>
>>> Don't see any problems there.
>>>
>>> Will check with admin/analysis page.
>>>
>>> Bernd
>>>
>>>
>>> On 12.11.2012 14:28, Erick Erickson wrote:
>>>> First, sorting on tokenized fields is undefined/unsupported. You _might_
>>>> get away with it if the author field always reduces to one token, i.e. if
>>>> you're always indexing only the last name.
>>>>
>>>> I should say unsupported/undefined when more than one token is the result
>>>> of analysis. You can do things like use the KeywordTokenizer followed by
>>>> transformations on the _entire_ input field (lowercasing is popular for
>>>> instance).
>>>>
>>>> So somehow the analysis chain you have defined for this field grabs
>>>> "Arslanagic" and translates it into "a". Synonyms? Stemming? Some
>>>> "interesting" sequence?
>>>>
>>>> The fastest way to look at that would be in Solr's admin/analysis page.
>>>> Just put Arslanagic into the index box and you should see which of the
>>>> steps does the translation. Although changing it to "a" is really weird,
>>>> it's almost certainly something you've defined in the indexing analysis
>>>> chain.
>>>>
>>>> FWIW,
>>>> Erick
>>>>
>>>>
>>>> On Mon, Nov 12, 2012 at 8:19 AM, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>
>>>>> Hi list,
>>>>> a user reported wrong sorting of our search service running on Solr.
>>>>> While chasing this issue I traced it back through Lucene into the index.
>>>>> I have a text field for sorting
>>>>> (stored,indexed,tokenized,omitNorms,sortMissingLast)
>>>>> and three docs with author names.
>>>>>
>>>>> If I trace at org.apache.lucene.document.Document.add(IndexableField) while
>>>>> indexing, I can see all three author names added as a field to each document.
>>>>>
>>>>> After searching with *:* for the three docs and doing a sort, the sorting
>>>>> is wrong because one of the author names is reduced to the first char;
>>>>> all other chars are lost.
>>>>>
>>>>> So having the author names (Alexander, Arslanagic, Brennmoen) indexed,
>>>>> the result of sorting ascending is (Arslanagic, Alexander, Brennmoen),
>>>>> which is wrong.
>>>>> But this happens because the author "Arslanagic" is reduced to "a" during
>>>>> indexing (???) and, when sorted, "a" comes before "alexander".
>>>>>
>>>>> Currently I use 4.0 but have the same issue with 3.6.1.
>>>>>
>>>>> Without tracing through tons of code:
>>>>> - which is the last breakpoint for debugging to see the docs right before
>>>>>   they go into the index
>>>>> - which is the first breakpoint for debugging to see the docs coming right
>>>>>   out of the index
>>>>>
>>>>> Regards
>>>>> Bernd
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
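
A note for readers of this thread: the results quoted above can be reproduced outside Solr. The sketch below is plain Java and assumes that PatternReplaceFilter with replace="all" behaves like Matcher.replaceAll from java.util.regex. The class name TruncatePatternDemo and the helpers normalize and truncate are made up for illustration and only approximate the analyzer chain (lowercase, trim, strip characters, truncate). Because (.{31,}) insists on at least 31 characters, the greedy first group has to backtrack until enough is left for the second group, so a normalized string of n >= 32 characters is cut to min(30, n - 31) characters: 32 - 31 = 1 for the first author list, 46 - 31 = 15 and 49 - 31 = 18 for the other two. Shorter strings do not match at all and pass through unchanged.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative sketch only: approximates the alphaOnlySortLim analyzer chain
// from the thread with plain java.util.regex calls (no Solr classes involved).
class TruncatePatternDemo {

    // Character ranges copied from the first PatternReplaceFilterFactory in the thread.
    private static final Pattern STRIP = Pattern.compile(
        "([\\x00-\\x2F\\x3A-\\x40\\x5B-\\x60\\x7B-\\x9F\\u2000-\\u206F\\uFEFF\\uFFF9-\\uFFFD])");

    // Pattern from the original field type, and the one suggested in the reply.
    static final String BROKEN = "(.{1,30})(.{31,})";
    static final String FIXED = "(.{1,30}).*";

    // Author lists from the thread; non-ASCII letters are written as Unicode
    // escapes so the example does not depend on the source file encoding.
    static final String[] AUTHORS = {
        "Arslanagic, Aida ; Siqveland, Elisabeth",
        "Alexander, Kvam ; Bj\u00f8rn, Nyland ; Bj\u00f8rn, Reiten ; \u00d8ystein, Huse",
        "Brennmoen, Ingar ; Hauklien, \u00d8ystein ; Hedalen, Trond ; Kvam, Erik"
    };

    // Stand-in for LowerCaseFilter, TrimFilter and the first pattern filter.
    static String normalize(String input) {
        String lowered = input.toLowerCase(Locale.ROOT).trim();
        return STRIP.matcher(lowered).replaceAll("");
    }

    // Stand-in for the second pattern filter with replacement="$1" and replace="all".
    static String truncate(String pattern, String input) {
        return Pattern.compile(pattern).matcher(normalize(input)).replaceAll("$1");
    }

    public static void main(String[] args) {
        for (String author : AUTHORS) {
            System.out.println(normalize(author).length() + " chars"
                + " | broken: " + truncate(BROKEN, author)
                + " | fixed: " + truncate(FIXED, author));
        }
    }
}
```

With the second group replaced by .* the first group can stay greedy, so each string is cut to its first 30 characters or left alone if it is shorter.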