Hi Matt,

On Fri, Feb 8, 2019 at 4:31 PM Matt Mahoney <[email protected]> wrote:
> On Tue, Feb 5, 2019, 5:23 PM Linas Vepstas <[email protected]> wrote:
>
>>> if there were an experimental results section that told us
>>> which ones were worth pursuing.
>>
>> There's this:
>>
>> https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/connector-sets-revised.pdf
>>
>> https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/learn-lang-diary.pdf
>
> Yes, that is what I was looking for. I haven't read all of it but so
> far I learned:
>
> 1. It is possible to learn parts of speech and a grammar from
> unlabeled text.

Well, I think that is the major claim, as it is effectively something
that has never been done before (and many people have doubts that it is
even possible without NNs). It's been the cause of much teeth-gnashing.

> 2. It is possible to learn word boundaries in continuous speech by
> finding boundaries with low mutual information. (I did similar
> experiments to restore deleted spaces in text using only n-gram
> statistics. It is how infants learn to segment speech at 7-10 months
> old, before learning any words.)

This claim is ... well, I think it's well-explored by academia, with
various "well-known" workable solutions dating back a decade or two.
I've not had a chance to explore it and try to wedge it into a grand
unified theory.

> 3. Word pairs have a Zipf distribution just like single words.

Uhh, yes, but no. It depends on what you are graphing. Here's an
unpublished 2009 draft:

https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/word-pairs-2009/word-pairs.pdf

When I recently repeated a variant of this, with different techniques
on a different dataset, I got a very clean (logarithmic) Bell curve
(a Gaussian). Who knew?
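One quick way to see why a pair plot cannot be a straight Zipf line:
under a toy model where the two words of a pair are drawn independently
from Zipf distributions over ranks i and j, the pair probability is
proportional to 1/(i*j), and the log-log rank-frequency slope drifts
from shallow at the head to steep at the tail -- a parabola-like curve,
which is roughly how a log-normal looks on those axes. A sketch,
assuming this independence model (the vocabulary size V is an arbitrary
choice, and this is not the corpus measurement in the draft above):

```python
# Toy model: both words of a pair drawn independently from Zipf(1)
# over a vocabulary of V ranks, so pair probability ~ 1/(i*j).
# A pure Zipf rank-frequency plot would have a constant log-log
# slope of -1; the pair plot starts shallower and steepens.
import math

V = 1000  # arbitrary vocabulary size for the sketch
pair_p = sorted((1.0 / (i * j)
                 for i in range(1, V + 1)
                 for j in range(1, V + 1)), reverse=True)

def loglog_slope(r1, r2):
    """Slope of log(frequency) vs log(rank) between ranks r1 and r2."""
    return ((math.log(pair_p[r2 - 1]) - math.log(pair_p[r1 - 1]))
            / (math.log(r2) - math.log(r1)))

for r1, r2 in [(10, 100), (1_000, 10_000), (100_000, 1_000_000)]:
    print(f"slope between ranks {r1} and {r2}: {loglog_slope(r1, r2):.2f}")
```

The slope keeps getting more negative with rank, i.e. the curve bends,
rather than holding Zipf's constant -1.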
When I brought this up on a linguistics mailing list (back in 2009), a
general consensus emerged that "if you combine two Zipfs together, you
get a logarithmic Gaussian", and that "this is probably easy to prove",
but neither I nor anyone else took the effort to actually prove it.
It's probably still a sufficiently novel result that a proof, together
with data, is publishable. So much to do, so little time.

> (I suspect it is also true of word triples representing grammar
> rules. It suggests there are around 10^8 rules.)

Well, that one also gets a "yes, but no." Here:

* De facto, link-grammar manages to be a pretty decent model of
  English, with approx 2.2K rules (or 11K = 2.2K + 8.8K rules,
  depending on what is counted as a "rule"):
  (`cat 4.0.dict | grep ";" | wc` vs. `cat 4.0.dict | grep -o " or" | wc`)

* If you brute-force accumulate statistics for disjuncts, you get
  approx 10^8 of them. (I have such datasets and can share them, if
  you have the RAM.)

* But those datasets contain rules connecting words, not word-classes.
  The grammar needed to parse "I see a dog, I see a rock, I see a
  tree" does not require 3 distinct rules for dog, rock, tree; it just
  needs one: "I see a <common-noun>". If you are picky, you can have
  several different kinds of common nouns, but however picky you are,
  you need fewer classes than there are words. A typical class will
  contain many words. The rules connect word-classes, not individual
  words.

* Automatically discovering word-classes is not terribly difficult,
  but it is sufficiently confusing that it has tripped up others who
  have tried. That's why I spent a lot of time talking about "matrix
  factorization" in one of the papers. It turns out that "matrix
  factorization" is a form of "clustering", in disguise.

* There are several styles of matrix factorization, one of which is
  more-or-less the same thing as "deep learning" (!), which is perhaps
  the most important thing I have to say on this topic.
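(An aside, to make the "factorization is clustering in disguise" remark
concrete: factoring a word-by-disjunct count matrix M into A*C, where A
is a hard one-hot assignment of words to classes and C holds the class
rows, is exactly k-means clustering on the rows of M. The words, the
fake disjunct counts, and the two-class split below are all invented
for this sketch.)

```python
# Toy word-by-disjunct count matrix; rows = words, columns = counts of
# four hypothetical disjuncts. Nouns share one usage pattern, verbs
# another.
words = ["dog", "rock", "tree", "see", "eat"]
M = [
    [9, 7, 0, 1],   # dog  -- noun-like usage
    [8, 6, 1, 0],   # rock
    [9, 8, 0, 1],   # tree
    [0, 1, 9, 7],   # see  -- verb-like usage
    [1, 0, 8, 9],   # eat
]

def two_means(rows, iters=10):
    """Hard-assignment factorization M ~ A*C with two classes (k-means)."""
    # Deliberately easy deterministic start: first and last rows.
    centers = [list(rows[0]), list(rows[-1])]
    assign = []
    for _ in range(iters):
        # Assign each word-row to its nearest class center (the A matrix).
        assign = [min((0, 1), key=lambda c: sum((x - y) ** 2
                      for x, y in zip(row, centers[c]))) for row in rows]
        # Recompute class centers (the C matrix) as member means.
        for c in (0, 1):
            members = [row for row, a in zip(rows, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign

classes = {}
for word, a in zip(words, two_means(M)):
    classes.setdefault(a, []).append(word)
print(sorted(classes.values()))  # -> [['dog', 'rock', 'tree'], ['see', 'eat']]
```

Relaxing the one-hot A to non-negative weights gives soft clustering
(NMF-style factorization); relaxing further toward stacked, learned
factors is, roughly, where the deep-learning connection comes in.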
* I cannot quote, off the top of my head, how many non-zero weights
  are in the current "state-of-the-art" natural-language deep-learning
  networks. I think under a million(?); it depends on what exactly it
  is that you are trying to do with the NN model.

* So, for the syntax of English, together with a shallow amount of
  semantics (e.g. at the level of FrameNet or of WordNet), I think
  this can be done with 10K to 1M rules (with LG providing a hard
  lower limit for a "realistic" language model, and "state-of-the-art"
  natural-language deep learning providing a soft upper limit).

Clearly, syntax plus shallow semantics is not at all AGI. But the point
is that *if* we can create an automated system that can reproducibly,
easily and regularly obtain syntax plus typical ontology-type structure
(again, at WordNet/FrameNet/SUMO/whatever levels of sophistication)
***AND*** the resulting structure is NOT a black box of floating-point
values, but is queriable in a natural way ("a bird is an animal", "an
animal is a living thing", etc. can be posed as questions with
affirmative/negative answers, or even probabilities) ***THEN*** you
have a platform on which to deploy research about reasoning, inference,
common-sense knowledge, etc.

To repeat:

* I believe that purely-automated extraction of syntax plus shallow
  semantics, for arbitrary written human language, is well within
  reach, and requires about 10K to 1M rules (depending on how deep
  you drill/simplify/abstract).

* The above ruleset can have a natural knowledge-query interface,
  making it accessible to inferencing and other high-level algos.

* All of the above has already been demonstrated in various distinct
  prototypes and proofs-of-concept (in various typical NAACL-HLT
  journal articles), of which I've personally reproduced some selected
  handful.

* So ... let's do it. Integrate & test. It's maybe rocket science, but
  it's not science fiction.

> I hope this work continues.
> It would be interesting if it advances the
> state of the art on my large text benchmark or the Hutter prize.

Thanks! I think we're at the state of the art now, although I haven't
been able to convince Ben. Who knew that he was a natural-born skeptic?

I'm totally ignoring benchmarks and competitions; I simply do not have
the hours in the day. (Actually, zero hours of the day, right now: as
of about 8-9 months ago, the language-learning project has been handed
over to a team. They are still getting up to speed; they don't have any
background in linguistics or machine learning, and so have been trying
to learn both at the same time, and getting tangled up and stalled as a
result. Currently they are stumbling at the "clustering" step, and have
not yet begun the "ontology" step. One step forward, two steps back.
I'd like to get back to it, as it's clear as a bell to me, but ... time
constraints.)

-- Linas

> -- Matt Mahoney, [email protected]
> Artificial General Intelligence List: AGI
> Permalink:
> https://agi.topicbox.com/groups/agi/Ta6fce6a7b640886a-M2868c79e3d7fbcd113691a5b

--
cassette tapes - analog TV - film cameras - you

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Ta6fce6a7b640886a-Mc48a258d09b45721f33f4376
Delivery options: https://agi.topicbox.com/groups/agi/subscription
