Mark's message about LIA was very nice to see, but I want to reply to
and second Eric's comments about the Lucene distributable. I just
downloaded and unpacked the 2.0 .zip to try out the experience of a
user who is new to Lucene but Java savvy (our target audience).
I opened docs/index.html and the first thing I noticed was a broken
image link to the ASF logo in the upper left corner.
Eric has some great points, which I'll reply to below...
On Dec 26, 2006, at 4:36 PM, Haszlakiewicz, Eric wrote:
I'm sorry you are not finding what you need. The snowball analyzers
come in a separate jar, in the release zip, under the
contrib/snowball directory. You may also want/need the analyzers in
contrib/analyzers for other languages. The README delivered w/ the release
uh.. maybe I'm being dense, but where exactly would I find this
directory?
In the unpacked lucene-2.0.0.zip file, there is a contrib directory
with lots of goodies hidden in plain sight there. You're certainly
right that there is very little documentation available for this
stuff, even in javadocs. We should leverage the Java Lucene wiki
(which needs to be moved from the jakarta-lucene URL structure) a lot
more, à la Solr, to let the community contribute to the documentation
much more freely on all these pieces.
It seems that all of the mirrors I look at don't have it, nor even
does the
main(?) url (i.e. http://www.apache.org/dist/lucene/java/)
<http://www.apache.org/dyn/closer.cgi/lucene/java/>
which is "free download" from here <http://lucene.apache.org/java>
Have you gone through the demo and the "Getting Started" section:
http://lucene.apache.org/java/docs/gettingstarted.html ?
yeah, I did, but all I found there was some info about a demo
application,
a link to the aforementioned download directory, and some links to the
online sources through svn.
Sadly, our demo application is pretty pathetic. A better index/search
demo could fairly easily be whipped up, one that is actually usable in
the real world for basic file system document searching without
writing any code. The demo is barely usable in this capacity. Again, we
should look to Solr to shed some light (heh) on this path.
By real-world usable I mean things like command-line switches to
control the indexer's parameters a lot more, customizable output
from the command-line searcher (which fields, CSV/XML output), and
a web application that provides some machine-readable response (it
doesn't have to be nearly as fancy as Solr here; even *gasp* just XML
would do the trick).
There are a number of articles, presentations and books available,
many of which are listed at
http://wiki.apache.org/jakarta-lucene/Resources
I'll take a closer look. I just figured the best documentation
would be on
the actual lucene site.
Point well taken, Eric. I concur.
was hoping to find an online overview of how things are supposed
to work.
i.e. some thing that explains what the important classes are, how
to use
them, etc.. Also a definition of vocabulary would be nice.
Here's just a brief selection of questions that I had (have):
What is javacc? Why do I care? What is snowball? What is a stemmer,
and how/why would I use one? What's a term vector? What happens when
you add the same field to a document twice? How do I combine two
queries?
(I figured some of these out, don't answer them now)
All very pertinent points. We do have javadocs, which is ultimately
where the low-level API stuff should go, and with decent summary
pages we can guide users to the important classes. We have good
stuff at the core Lucene API, but now that we are blending the
contrib pieces in (as we'll see below) we've lowered our overall
published documentation quality. We need to hold the contrib pieces
to at least core API documentation standards for acceptance into the
codebase.
A definition of vocabulary fits perfectly on the wiki.
javacc should be fairly well hidden in the documentation as it is
Lucene developer related, not end-user related (at least not for an
initial user of Lucene, only after getting familiar with QueryParser
should one really be venturing into javacc land).
Snowball would fit well into a glossary wiki area, as would term
vector - with of course links to the appropriate API documentation.
Adding the same field multiple times ought to be on our FAQ wiki page
(I just looked, it's not, and "searching" comes before "indexing" in
the FAQ sections, awkwardly). Those FAQ URLs, *ugh*, we really need
an overhaul of that area structurally.
I was able to pick out some info from the faq, but a lot of
that seems to assume you already know what you're doing.
Mea culpa.
I ended up
doing a lot of trial and error to get things going.
As we all did. And once most of us have it figured out it seems so
easy and obvious that going back and writing documentation has little
appeal. Your bringing these issues up to the list is an atypical and
welcome step. It brings our weaknesses to the forefront and is
likely to spark positive changes.
For instance, it took me forever just to figure out how to combine
a couple of queries together. The apparently appropriately named
AndQuery class isn't what it seems, and the javadocs don't say
anything that would point me towards the correct class (which seems
to be BooleanQuery)
Now that _is_ confusing. Ouch. This comes from blending in the
contrib javadocs and sure enough AndQuery is exactly what I would
have expected to use myself. That change to the API docs, having the
contrib blended in, is new to 2.0, and our contrib pieces are not as
well javadoc'd as the core. I can see having the contrib stuff
javadoc'd separately, but it is also nice to see it all blended as
well. I'd love to see others' thoughts on how to make this a better
Newcene experience.
AndQuery is part of the surround query language, which has likely not
gotten much usage in the field - only niche environments would use
it, I think. Having it named that way, and sitting near the top of
the API list, is way misleading.
I know much of what I just said is basically just complaining about
the level of documentation, and I'd be happy to help, but I'm still
feeling a bit overwhelmed with the amount of implied knowledge that
seems to be necessary, so picking out specific places is a bit
difficult. I suppose the most useful thing would be a better getting
started guide that actually explains how things work, rather than
just saying "look at this app".
We will learn from your experience, thanks to your forthrightness as
well as the specific details of where things are lacking. There are
some easy steps we can take to get things improved for our next release:
* Leverage the wiki lots more for a glossary, quick start user
  guides, and FAQs (revamping the wiki structure and renaming the
  top-level URL would go a long way toward encouraging its use, as
  would learning from Solr's great lead).
* Tighten up our API docs, specifically on the contrib pieces.
* Tidy up and generalize the demo application, and ship Luke too (if
  possible, licensing-wise).
Erik