Re: help finding docs, creating analyzer objects

Erik Hatcher Tue, 26 Dec 2006 20:58:20 -0800

Mark's message about LIA was very nice to see, but I want to reply and second Eric's comments about the Lucene distributable. I just downloaded and unpacked the 2.0 .zip to test out the experience of someone new to Lucene but Java savvy (our target audience).

I opened docs/index.html and the first thing I noticed was a broken image link to the ASF logo in the upper left corner.


Eric has some great points that I'll reply to below...

On Dec 26, 2006, at 4:36 PM, Haszlakiewicz, Eric wrote:

I'm sorry you are not finding what you need.   The snowball analyzers
come in a separate jar, in the release zip, under the contrib/snowball
directory.  You may also want/need the analyzers in contrib/analyzers
for other languages.  The README delivered w/ the release

uh.. maybe I'm being dense, but where exactly would I find this directory?

In the unpacked lucene-2.0.0.zip file, there is a contrib directory with lots of goodies hidden in plain sight. You're certainly right that there is very little documentation available for this stuff, even in javadocs. We should leverage the Java Lucene wiki (which needs to be moved from the jakarta-lucene URL structure) a lot more, a la Solr, to let the community contribute to the documentation much more freely on all these pieces.

It seems that all of the mirrors I look at don't have it, nor even does the
main(?) url  (i.e. http://www.apache.org/dist/lucene/java/)


        <http://www.apache.org/dyn/closer.cgi/lucene/java/>

which is "free download" from here <http://lucene.apache.org/java>

Have you gone through the demo and the "Getting Started" section:
http://lucene.apache.org/java/docs/gettingstarted.html ?

yeah, I did, but all I found there was some info about a demo application,

a link to the aforementioned download directory, and some links to the
online sources through svn.

Sadly, our demo application is pretty pathetic. A better index/search demo could fairly easily be whipped up that was actually real-world usable for basic file system document searching without even writing code. The current demo is barely usable in this capacity. Again, we should look to Solr to shed some light (heh) on this path.

A real-world usable demo would have, you know, command-line switches to control the indexer's parameters a lot more, customizable output from the command-line searcher (what fields, CSV/XML output), and a web application providing some computer-usable response (it doesn't have to be nearly as fancy as Solr here; even *gasp* just XML would do the trick).

There are a number of articles, presentations and books available,
many of which are listed at
http://wiki.apache.org/jakarta-lucene/Resources

I'll take a closer look. I just figured the best documentation would be on

the actual lucene site.


Point well taken, Eric.  I concur.

was hoping to find an online overview of how things are supposed to work,
i.e. something that explains what the important classes are, how to use
them, etc..  Also a definition of vocabulary would be nice.
Here's just a brief selection of questions that I had (have):
What is javacc? Why do I care? What is snowball? What is a stemmer, and
how/why would I use one?  What's a term vector?  What happens when you
add the same field to a document twice? How do I combine two queries?
(I figured some of these out, don't answer them now)

All very poignant points. We do have javadocs, which is ultimately where the low-level API stuff should go, and with decent summary pages we can guide users to the important classes. We have good stuff at the core Lucene API, but now that we are blending the contrib pieces in (as we'll see below) we've lowered our overall published documentation quality. We need to hold the contrib pieces to at least core API documentation standards for acceptance into the codebase.


A definition of vocabulary fits perfectly on the wiki.

javacc should be fairly well hidden in the documentation, as it is Lucene developer related, not end-user related (at least not for an initial user of Lucene; only after getting familiar with QueryParser should one really be venturing into javacc land).

Snowball would fit well into a glossary wiki area, as would term vector - with of course links to the appropriate API documentation.

Adding the same field multiple times ought to be on our FAQ wiki page (I just looked, it's not, and "searching" comes before "indexing" in the FAQ sections, awkwardly). Those FAQ URLs, *ugh*, we really need an overhaul of that area structurally.

 I was able to pick out some info from the faq, but a lot of
that seems to assume you already know what you're doing.


Mea culpa.

  I ended up
doing a lot of trial and error to get things going.

As we all did. And once most of us have it figured out it seems so easy and obvious that going back and writing documentation has little appeal. Your bringing these issues up to the list is an atypical and welcome step. It brings our weaknesses to the forefront and is likely to spark positive changes.

For instance, it took me forever just to figure out how to combine a couple
of queries together. The apparently appropriately named AndQuery class
isn't what it seems, and the javadocs don't say anything that would point me
towards the correct class (which seems to be BooleanQuery)

Now that _is_ confusing. Ouch. This comes from blending in the contrib javadocs, and sure enough AndQuery is exactly what I would have expected to use myself. That change to the API docs, having the contrib blended in, is new to 2.0, and our contrib pieces are not as well javadoc'd as the core. I can see having the contrib stuff javadoc'd separately, but it is also nice to see it all blended as well. I'd love to see others' thoughts on how to make this a better Newcene experience.

AndQuery is part of the surround query language, which has likely not gotten much usage in the field - only niche environments would use it, I think. Having it named that way, and near the top of the API list, is way misleading.
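
For the archives, here is a minimal sketch of combining two queries with BooleanQuery, which is the core class Eric landed on. It assumes the Lucene 2.0 core jar is on the classpath; the class name CombineQueries, the field name "contents" and the search terms are made up for illustration and are not part of Lucene or the demo.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper class, for illustration only.
public class CombineQueries {

    // Returns a query that matches documents containing BOTH terms
    // in the given field (a logical AND of two term queries).
    public static Query both(String field, String first, String second) {
        BooleanQuery combined = new BooleanQuery();
        // Occur.MUST marks a clause as required; Occur.SHOULD and
        // Occur.MUST_NOT give optional and prohibited clauses.
        combined.add(new TermQuery(new Term(field, first)), BooleanClause.Occur.MUST);
        combined.add(new TermQuery(new Term(field, second)), BooleanClause.Occur.MUST);
        return combined;
    }

    public static void main(String[] args) {
        // "contents" is an assumed field name.
        Query query = both("contents", "lucene", "search");
        // Print the query in Lucene's query syntax, using "contents" as the default field.
        System.out.println(query.toString("contents"));
    }
}
```

Any Query can be added as a clause, including another BooleanQuery, so nested combinations work the same way.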

I know much of what I just said is basically just complaining about the level of documentation, and I'd be happy to help, but I'm still feeling
a bit overwhelmed with the amount of implied knowledge that seems
to be necessary, so picking out specific places is a bit difficult.

I suppose the most useful thing would be a better getting started guide that actually explains how things work, rather than just saying "look at
this app".

We will learn from your experience, thanks to your forwardness as well as the specific details of where things are lacking. There are some easy steps we can take to get things improved for our next release:

* Leverage the wiki lots more for a glossary, quick start user guides, and FAQs (revamping the wiki structure and renaming the top-level URL would go a long way toward encouraging its use, as would learning from Solr's great lead).

* Tighten up our API docs, specifically on the contrib pieces.

* Tidy up and generalize the demo application, ship Luke too (if possible, licensing-wise).


        Erik


