Re: content disappears in the index

Jack Krupansky Mon, 12 Nov 2012 05:47:27 -0800

Maybe... the author names have middle or first initials? Like, maybe the "Arslanagic" dude has an "A" initial in his name, like "A. Arslanagic" or "Arslanagic, A.".

In any case, "string" is the proper type for a sorted field, although it would be nice if Lucene/Solr was more developer-friendly when this "mistake" is made.


The relevant doc is:

"Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)"

...

"The common situation for sorting on a field that you do want to be tokenized for searching is to use a <copyField> to clone your field. Sort on one, search on the other."


See:
http://wiki.apache.org/solr/CommonQueryParameters

For example, have an "author" field that is "text" and an "author_s" (or "author_sorted" or "author_string") field that you copy the name to:


   <copyField source="author" dest="author_s" />

Query on "author", but sort on "author_s".
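
As a rough sketch of what that could look like in schema.xml: the field type names "text_general" and "string" below are the ones from the stock Solr example schema and are assumptions here, so substitute whatever your own schema defines.

```xml
<!-- Assumed field type names; adjust to your own schema. -->
<!-- Tokenized field, used for searching. -->
<field name="author" type="text_general" indexed="true" stored="true" />
<!-- Untokenized copy of the same value, used only for sorting. -->
<field name="author_s" type="string" indexed="true" stored="false" />

<copyField source="author" dest="author_s" />
```

A request would then look something like q=author:arslanagic&sort=author_s asc. Note that a plain "string" field sorts case-sensitively.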

-- Jack Krupansky

-----Original Message-----
From: Erick Erickson

Sent: Monday, November 12, 2012 5:28 AM
To: java-user
Subject: Re: content disappears in the index

First, sorting on tokenized fields is undefined/unsupported. You _might_
get away with it if the author field always reduces to one token, i.e. if
you're always indexing only the last name.

I should say unsupported/undefined when more than one token is the result
of analysis. You can do things like use the KeywordTokenizer followed by
transformations on the _entire_ input field (lowercasing is popular for
instance).
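
For illustration, a field type along those lines might look like the following in schema.xml. The name "lowercase_sort" is made up for this example; the tokenizer and filter factories are the standard Solr ones.

```xml
<!-- "lowercase_sort" is a hypothetical name chosen for this sketch. -->
<fieldType name="lowercase_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Emits the whole field value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <!-- Lowercases that single token so sorting ignores case. -->
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Because the analysis chain still yields exactly one term per value, a field of this type should remain safe to sort on.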

So somehow the analysis chain you have defined for this field grabs
"Arslanagic"
and translates it into "a". Synonyms? Stemming? Some "interesting" sequence?

The fastest way to look at that would be in Solr's admin/analysis page.
Just put Arslanagic into the index box and you should see which of the
steps does the translation. Although changing it to "a" is really weird,
it's almost certainly something you've defined in the indexing analysis
chain.

FWIW,
Erick

On Mon, Nov 12, 2012 at 8:19 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

Hi list,
a user reported wrong sorting of our search service running on solr.
While chasing this issue I traced it back through lucene into the index.
I have a text field for sorting
(stored,indexed,tokenized,omitNorms,sortMissingLast)
and three docs with author names.

If I trace at org.apache.lucene.document.Document.add(IndexableField) while indexing, I can see all three author names added as a field to each document.


After searching with *:* for the three docs and doing a sort the sorting
is wrong
because one of the author names is reduced to the first char, all other
chars are lost.

So having the authors names (Alexander, Arslanagic, Brennmoen) indexed,
the result
of sorting ascending is (Arslanagic, Alexander, Brennmoen) which is wrong.
But this happens because the author "Arslanagic" is reduced to "a" during
indexing (???)
and if sorted "a" is before "alexander".

Currently I use 4.0 but have the same issue with 3.6.1.

Without tracing through tons of code:
- which is the last breakpoint for debugging to see the docs right before
they go into the index
- which is the first breakpoint for debugging to see the docs coming right
out of the index

Regards
Bernd

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


