ssary, but the book never loses sight of its
goal of providing a practical introduction. In that way,
it’s like the Manning "in Action" series.
About the author: Manu Konchady has a home page/blog on Amazon:
http://www.amazon.com/gp/blog/A2TWRNMTU6T9TW/ref=cm_blog_dp_artist_blog
- Bob Carpenter
Gathering more data like that
from Amazon, CNET, etc. should be easy. That's what everyone's
doing for evaluations.
But these are all at the review level, not at the sentence
level. We've actually had customers annotating at the sentence
level.
Both LingPipe and Kea are able to find significant
phrases, which is useful for query refinement or
summarizing sets of search results, but not so
useful for individual documents. It can be a huge
help to add part-of-speech information to these
kinds of approaches.
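For instance, here's a minimal sketch of the kind of POS filtering meant here (the tag names and the pre-tagged candidates are made up; any tagger's output would do):

    import java.util.*;

    // Keep only candidate phrases whose POS pattern is phrase-like,
    // e.g. adjective-noun or noun-noun bigrams.
    public class PosFilter {
        static boolean phraseLike(String[] tags) {
            if (tags.length != 2) return false;
            return (tags[0].equals("ADJ") || tags[0].equals("NOUN"))
                && tags[1].equals("NOUN");
        }

        public static void main(String[] args) {
            Map<String,String[]> candidates = new LinkedHashMap<String,String[]>();
            candidates.put("query refinement", new String[] { "NOUN", "NOUN" });
            candidates.put("is useful", new String[] { "VERB", "ADJ" });
            for (Map.Entry<String,String[]> e : candidates.entrySet())
                if (phraseLike(e.getValue()))
                    System.out.println(e.getKey());  // prints: query refinement
        }
    }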
-
A lot of research in this area is
coming out of Kathy McKeown's group at Columbia,
not to mention the horde of students she's graduated
over the last ten years, such as Drago Radev, the
author of the second tutorial and software above.
- Bob Carpenter
Alias-i
---
Here's a blog entry comparing our hypothesis-testing
approach to a standard mutual-information-based
method (discussed by Matthew Hurst, when he was
at Nielsen BuzzMetrics):
http://www.alias-i.com/blog/?p=14
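The rough contrast: pointwise mutual information scores a bigram by log(p(xy) / (p(x)p(y))), while a hypothesis test such as a binomial z-score also rewards the amount of evidence behind the estimate. A toy sketch (the counts in main() are invented, and this is only the general shape of such a test, not the exact formula from the blog entry):

    // PMI vs. a binomial z-score for a bigram "x y", given counts
    // over n bigram tokens.
    public class BigramScores {
        static double pmi(long xy, long x, long y, long n) {
            return Math.log(((double) xy / n)
                / (((double) x / n) * ((double) y / n)));
        }

        static double zScore(long xy, long x, long y, long n) {
            double p0 = ((double) x / n) * ((double) y / n); // independence hypothesis
            double pHat = (double) xy / n;                   // observed rate
            return (pHat - p0) / Math.sqrt(p0 * (1.0 - p0) / n);
        }

        public static void main(String[] args) {
            long n = 1000000L;
            System.out.println(pmi(500, 2000, 1500, n));     // ignores data volume
            System.out.println(zScore(500, 2000, 1500, n));  // grows with evidence
        }
    }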
- Bob Carpenter
Alias-i
-
input length (chars)   accuracy
          1             22.59%
          2             34.82%
          4             58.55%
          8             81.17%
         16             92.45%
         32             97.33%
         64             98.99%
        128             99.67%
The end of the tutorial has references to other
popular language ID packages online (e.g. TextCat,
which is Gertjan van Noord's Perl package). And it
also has
and maximum n-gram length. You
might want to put them in different fields
if you want weighting between them to be
easy.
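For example, with Lucene's pre-3.0 Document/Field API you can index each n-gram length into its own field and set a per-field boost, so the relative weighting is just a boost value (the field names, texts, and the 2.0f boost are made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class NGramFields {
        public static Document build(String unigramText, String bigramText) {
            Document doc = new Document();
            Field unigrams = new Field("unigrams", unigramText,
                                       Field.Store.NO, Field.Index.TOKENIZED);
            Field bigrams = new Field("bigrams", bigramText,
                                      Field.Store.NO, Field.Index.TOKENIZED);
            bigrams.setBoost(2.0f); // weight bigram matches twice as heavily
            doc.add(unigrams);
            doc.add(bigrams);
            return doc;
        }
    }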
- Bob Carpenter
Alias-i
-
For those kinds of problems, you
might want to check out Weka.
- Bob Carpenter
Alias-i
-
Check out our tutorial at:
http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
Accuracy depends on the pair of languages (some are
more confusable than others), as well as the length of
the input (it's very hard with only one or two words,
especially if it's a name).
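The character n-gram idea behind the tutorial fits in a few lines; this toy version uses add-one-smoothed character bigrams rather than LingPipe's actual models:

    import java.util.*;

    // Toy character-bigram language ID: score text against per-language
    // bigram counts with add-one smoothing; highest log-probability wins.
    public class LangId {
        private final Map<String, Map<String, Integer>> counts
            = new HashMap<String, Map<String, Integer>>();

        public void train(String lang, String text) {
            Map<String, Integer> m = counts.get(lang);
            if (m == null) counts.put(lang, m = new HashMap<String, Integer>());
            for (int i = 0; i + 2 <= text.length(); ++i) {
                String gram = text.substring(i, i + 2);
                Integer c = m.get(gram);
                m.put(gram, c == null ? 1 : c + 1);
            }
        }

        public String classify(String text) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                long total = 0;
                for (int c : e.getValue().values()) total += c;
                double score = 0.0;
                for (int i = 0; i + 2 <= text.length(); ++i) {
                    Integer c = e.getValue().get(text.substring(i, i + 2));
                    // add-one smoothing over a nominal space of 10000 bigrams
                    score += Math.log(((c == null ? 0 : c) + 1.0) / (total + 10000.0));
                }
                if (score > bestScore) { bestScore = score; best = e.getKey(); }
            }
            return best;
        }

        public static void main(String[] args) {
            LangId id = new LangId();
            id.train("en", "the quick brown fox jumps over the lazy dog");
            id.train("nl", "de snelle bruine vos springt over de luie hond");
            System.out.println(id.classify("the quick dog"));  // -> en with these toy counts
        }
    }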
recall is much easier than fine-grained
linguistic morphology.
Often the best solution is to combine a
best guess based on linguistic rules, statistical
models, or heuristics with weaker substring
measures, as in the sketch below.
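A minimal sketch of that combination (the lexicon, the threshold of 3, and the match policy are all invented): try the exact best-guess lookup first, then fall back to a weaker edit-distance measure.

    import java.util.*;

    // Exact lookup first; fall back to a fuzzy edit-distance match.
    public class FuzzyLookup {
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); ++i) d[i][0] = i;
            for (int j = 0; j <= b.length(); ++j) d[0][j] = j;
            for (int i = 1; i <= a.length(); ++i)
                for (int j = 1; j <= b.length(); ++j)
                    d[i][j] = Math.min(
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1),
                        Math.min(d[i - 1][j], d[i][j - 1]) + 1);
            return d[a.length()][b.length()];
        }

        public static String match(String token, Set<String> lexicon) {
            if (lexicon.contains(token)) return token;  // best guess: exact hit
            String best = null;
            int bestDist = 3;                           // fuzzy fallback, distance < 3
            for (String entry : lexicon) {
                int dist = editDistance(token, entry);
                if (dist < bestDist) { bestDist = dist; best = entry; }
            }
            return best;
        }

        public static void main(String[] args) {
            Set<String> lex = new HashSet<String>(Arrays.asList("morphology", "recall"));
            System.out.println(match("morfology", lex));  // -> morphology
        }
    }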
For better solutions that would cover fuzzy errors, contact Bob Carpenter at
Alias-i.
because the doc vectors
remain stable as new docs are added.
Then, in general:
score(doc', doc) > score(doc, doc)
where doc' is doc with each term weight scaled by 1/IDF. That is,
the inversely IDF-scaled query matches a document better than
the document itself does.
- Bob Carpenter
Alias-i
-
iness is a testament to how hard
this problem is in general.
- Bob Carpenter
Alias-i
I'm looking at a problem and I can't figure out how to "easily" solve it...
Basically, I'm trying to figure out if there's a way to use Lucene/Nutch
with some form of pattern matching
nt("t") / collectionSize
collectionCount("t") = count of term "t" in the collection
collectionSize = number of term instances (not types) in the collection
- Bob Carpenter
Alias-i
Andrzej Bialecki wrote:
Nader Akhnoukh wrote:
Yes, Chris is correct, the goal is to determine
"t1")*probFG("t2")
to both find things that are new and that are
phrase-like.
I'm going to be writing this all up in a bit longer
form in a case study for the revised Lucene in Action,
with explanations of how to find the significant
terms relative to a query, like Scirus.com does.
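In the same notation, a hedged sketch of that scoring (the smoothing constant and toy counts are invented): the ratio probFG("t1 t2") / (probFG("t1") * probFG("t2")) flags phrase-like bigrams, and probFG("t1 t2") / probBG("t1 t2") flags new ones.

    // Score a bigram two ways, following the probabilities above.
    public class SignificantTerms {
        static double prob(long count, long size) {
            return (count + 0.5) / (size + 1.0);  // small smoothing constant (assumed)
        }

        // Phrase-likeness: observed bigram rate in the foreground vs.
        // the rate predicted by unigram independence.
        static double phraseScore(long fgT1T2, long fgT1, long fgT2, long fgSize) {
            return prob(fgT1T2, fgSize)
                / (prob(fgT1, fgSize) * prob(fgT2, fgSize));
        }

        // Newness: foreground rate (e.g. a query's result set) vs.
        // background rate (the whole collection).
        static double newnessScore(long fgT1T2, long fgSize, long bgT1T2, long bgSize) {
            return prob(fgT1T2, fgSize) / prob(bgT1T2, bgSize);
        }

        public static void main(String[] args) {
            // Toy counts: a bigram frequent in the foreground but rare
            // in the background.
            System.out.println(phraseScore(40, 120, 90, 10000));
            System.out.println(newnessScore(40, 10000, 60, 5000000));
        }
    }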
I used to get seg-faults in the JVM until I replaced my memory with ECC
memory a couple of years ago, and haven't seg-faulted since.
- Bob Carpenter
Ross Rankin wrote:
We keep getting JVM crashes on 1.4.3. I found in the archive that setting a
JVM parameter solved the problem for a few users. We've tried that a
et al. have a lot of problems with false positives (correcting things
that were right) and false negatives (missing corrections). This is
especially obvious once you drop into a specialized domain that's
not computer science (which is proportionally over-represented
on the web), or a language that