Hi Matt,

On Fri, Feb 8, 2019 at 4:31 PM Matt Mahoney <[email protected]> wrote:

>
>
> On Tue, Feb 5, 2019, 5:23 PM Linas Vepstas <[email protected]> wrote:
>
>>
>> if there were an experimental results section that told us
>>> which ones were worth pursuing.
>>
>>
>> There's this:
>>
>>
>> https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/connector-sets-revised.pdf
>>
>>
>> https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/learn-lang-diary.pdf
>>
>
> Yes, that is what I was looking for. I haven't read all of it but so far I
> learned:
>
> 1. It is possible to learn parts of speech and a grammar from unlabeled
> text.
>

Well, I think that is the major claim, as it is effectively something
that has never been done before (and many people doubt that it is even
possible without NNs).  It's been the cause of much teeth-gnashing.


>
> 2. It is possible to learn word boundaries in continuous speech by finding
> boundaries with low mutual information. (I did similar experiments to
> restore deleted spaces in text using only n-gram statistics. It is how
> infants learn to segment speech at 7-10 months old, before learning any
> words.)
>

This claim is ... well, I think it's well-explored by academia, with various
"well-known" workable solutions dating back a decade or two. I've not had
a chance to explore it myself, or to wedge it into a grand unified theory.
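For what it's worth, the core of the low-MI boundary trick fits in a few
lines. The sketch below is a minimal stand-in, not anyone's published method:
gather character-bigram statistics from a corpus, then cut wherever the
pointwise MI of an adjacent character pair drops below a threshold. The
corpus, threshold, and smoothing scheme are all invented for illustration.

```python
# Minimal sketch of word-boundary detection via low mutual information.
# Corpus, threshold, and smoothing are invented for illustration only.
from collections import Counter
from math import log2

def train(corpus):
    """Character unigram/bigram counts from text with spaces stripped."""
    s = corpus.replace(" ", "")
    uni = Counter(s)
    bi = Counter(s[i:i + 2] for i in range(len(s) - 1))
    return uni, bi

def segment(text, uni, bi, threshold=0.0):
    """Cut wherever adjacent-character pointwise MI falls below threshold."""
    n_uni, n_bi, v = sum(uni.values()), sum(bi.values()), len(uni)

    def pmi(a, b):
        p_ab = (bi.get(a + b, 0) + 1) / (n_bi + v * v)  # add-one smoothing
        p_a = (uni.get(a, 0) + 1) / (n_uni + v)
        p_b = (uni.get(b, 0) + 1) / (n_uni + v)
        return log2(p_ab / (p_a * p_b))

    words, start = [], 0
    for i in range(len(text) - 1):
        if pmi(text[i], text[i + 1]) < threshold:  # low MI: likely boundary
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words
```

On a real corpus the threshold has to be tuned; with toy statistics the
absolute PMI values mean little, only the relative dips at boundaries do.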

>
> 3. Word pairs have a Zipf distribution just like single words.
>

Uhh, yes, but no. It depends on what you are graphing. Here's an unpublished
2009 draft:
https://github.com/opencog/opencog/raw/master/opencog/nlp/learn/learn-lang-diary/word-pairs-2009/word-pairs.pdf
When I recently repeated a variant of this, with different techniques on a
different dataset, I got a very clean (logarithmic) Bell curve (a Gaussian
in log coordinates). Who knew?  When I brought this up on a linguistics
mailing list (back in 2009), a general consensus emerged that "if you
combine two Zipfs together, you get a logarithmic Gaussian", and that "this
is probably easy to prove", but neither I nor anyone else took the effort to
actually prove it.  It's probably still a sufficiently novel result that a
proof, together with data, would be publishable. So much to do, so little
time.
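The hand-waving argument behind that consensus is easy to state: under an
independence assumption, the log-frequency of a pair is the sum of two
log-Zipf terms, and sums of independent terms drift toward a Gaussian. The
snippet below proves nothing about the shape; it just checks the exact
additivity of mean and variance that the argument rests on, for a made-up
Zipf vocabulary of 500 ranks.

```python
# Check: if p(w1,w2) = p(w1)p(w2) with Zipfian p, then the mean and
# variance of log p(pair) are exactly twice those of log p(word).
# Vocabulary size is made up; nothing here depends on real data.
import math

N = 500
H = sum(1.0 / r for r in range(1, N + 1))      # harmonic normalizer
p = [1.0 / (r * H) for r in range(1, N + 1)]   # Zipf: p(rank r) proportional to 1/r

# Moments of log p for a single word.
m1 = sum(q * math.log(q) for q in p)
m2 = sum(q * math.log(q) ** 2 for q in p)
var_word = m2 - m1 ** 2

# Moments of log p for an independent pair.
pm1 = sum(qi * qj * math.log(qi * qj) for qi in p for qj in p)
pm2 = sum(qi * qj * math.log(qi * qj) ** 2 for qi in p for qj in p)
var_pair = pm2 - pm1 ** 2
```

Whether the summed distribution actually looks Gaussian, rather than merely
more symmetric, is exactly the part nobody bothered to prove.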


> (I suspect it is also true of word triples representing grammar rules. It
> suggests there are around 10^8 rules).
>

Well, that one also gets a "yes, but no."  Here:

* De facto, link-grammar manages to be a pretty decent model of English,
with approximately 2.2K rules (or 11K = 2.2K + 8.8K rules, depending on
what is counted as a "rule"; compare `grep ";" 4.0.dict | wc` vs.
`grep -o " or" 4.0.dict | wc`).

* If you brute-force accumulate statistics for disjuncts, you get approx
10^8 of them. (I have such datasets and can share them, if you have the
RAM)

* But those datasets contain rules connecting words, not word-classes.
The grammar needed to parse "I see a dog", "I see a rock", "I see a tree"
does not require three distinct rules for dog, rock, tree; it needs just
one: "I see a <common-noun>". If you are picky, you can have several
different kinds of common nouns, but however picky you are, you need fewer
classes than there are words; a typical class will contain many words. The
rules connect word-classes, not individual words.
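The arithmetic is trivially illustrated: map words to classes before
counting rules, and three word-level rules collapse into one. The sentences
and the class map below are invented for this example.

```python
# Toy illustration: class-level rules collapse word-level rules.
# Sentences and the class assignments are invented for this example.
sentences = [
    ("I", "see", "a", "dog"),
    ("I", "see", "a", "rock"),
    ("I", "see", "a", "tree"),
]
classes = {"dog": "<common-noun>", "rock": "<common-noun>", "tree": "<common-noun>"}

word_rules = set(sentences)                                    # one rule per sentence
class_rules = {tuple(classes.get(w, w) for w in s) for s in sentences}
```

Here len(word_rules) is 3 but len(class_rules) is 1; the ratio is the
average class size, which is what knocks a raw 10^8 disjunct count down by
orders of magnitude.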

* Automatically discovering word-classes is not terribly difficult, but is
sufficiently confusing that it has tripped up others who have tried. That's
why I spent a lot of time talking about "matrix factorization" in one of
the papers. It turns out that "matrix factorization" is a form of
"clustering", in disguise.

* There are several styles of matrix factorization, one of which is
more-or-less the same thing as "deep learning" (!), which is perhaps the
most important thing I have to say on this topic.

* I cannot quote, off the top of my head, how many non-zero weights are in
the current "state-of-the-art" natural-language deep-learning networks. I
think under a million(?); it depends on what exactly you are trying to do
with the NN model.

* So, for the syntax of English, together with a shallow amount of
semantics (e.g. at the level of framenet or of wordnet), I think this can
be done with 10K to 1M rules (with LG providing a hard lower limit for a
"realistic" language model, and "state-of-the-art" natural-language
deep-learning networks providing a soft upper limit).

Clearly, syntax plus shallow semantics is not at all AGI. But the point is
that *if* we can create an automated system that can reproducibly, easily
and regularly obtain syntax plus typical ontology-type structure (again,
wordnet/framenet/SUMO/whatever levels-of-sophistication) ***AND*** the
resulting structure is NOT a black box of floating-point values, but is
queriable in a natural way ("a bird is an animal" "an animal is a living
thing", etc. can be posed as questions with affirmative/negative answers,
or even probabilities) ***THEN*** you have a platform on which to deploy
research about reasoning, inference, common-sense knowledge, etc.
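"Queriable in a natural way" can be as simple as a transitive is-a lookup.
The toy store below is invented data; the point is only that the answer to
"is a bird an animal?" comes back as a yes/no, not as a black box of
floating-point values.

```python
# Toy is-a hierarchy with transitive yes/no queries (invented data).
ISA = {
    "robin": "bird",
    "bird": "animal",
    "animal": "living thing",
    "rock": "object",
}

def is_a(x, y):
    """True if x is transitively a kind of y."""
    while x in ISA:
        x = ISA[x]
        if x == y:
            return True
    return False
```

Attaching a probability to each edge turns the same walk into the
probabilistic answers mentioned above.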

To repeat:
* I believe that purely-automated extraction of syntax plus shallow
semantics, for arbitrary human written language, is well within reach, and
requires about 10K to 1M rules (depending on how deep you
drill/simplify/abstract)
* That the above ruleset can have a natural knowledge-query interface to
it, making it accessible to inferencing and other high-level algos
* That all of the above has already been demonstrated in various distinct
prototypes and proofs-of-concept (in typical NAACL-HLT journal articles),
of which I've personally reproduced a selected handful
* So ... let's do it. Integrate & test.  It's maybe rocket science, but
it's not science fiction.


> I hope this work continues. It would be interesting if it advances the
> state of the art on my large text benchmark or the Hutter prize.
>

Thanks!

I think we're at the state of the art now, although I haven't been able to
convince Ben. Who knew that he was a natural-born skeptic?  I'm totally
ignoring benchmarks and competitions; I simply do not have the hours in the
day.  (Actually, zero hours of the day, right now; as of about 8-9 months
ago, the language-learning project has been handed over to a team. They are
still getting up to speed; they don't have any background in linguistics or
machine learning, and so have been trying to learn both at the same time,
and getting tangled up and stalled, as a result. Currently they are
stumbling at the "clustering" step; have not yet begun the "ontology" step.
One step forward, two steps back.  I'd like to get back on it, as it's
clear as a bell to me, but ... time constraints.)

-- Linas


>
> -- Matt Mahoney, [email protected]
> *Artificial General Intelligence List <https://agi.topicbox.com/latest>*
> / AGI / see discussions <https://agi.topicbox.com/groups/agi> +
> participants <https://agi.topicbox.com/groups/agi/members> + delivery
> options <https://agi.topicbox.com/groups/agi/subscription> Permalink
> <https://agi.topicbox.com/groups/agi/Ta6fce6a7b640886a-M2868c79e3d7fbcd113691a5b>
>


-- 
cassette tapes - analog TV - film cameras - you

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Ta6fce6a7b640886a-Mc48a258d09b45721f33f4376
Delivery options: https://agi.topicbox.com/groups/agi/subscription
