I just encountered the Stanford L(ittle) Language Model called Alpaca:
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apparently fine-tuned from Meta's LLaMA model.
In the writeup, the authors describe fine-tuning on a 52K set of
instruction-following examples, bootstrapped (by prompting a stronger
"teacher" model) from 175 human-written "seed tasks" (if I am reading
this correctly). It seems to suggest that there is something of a
"kernel" L(ittle)LM within the L(arge)LM?
[Figure from the writeup: the Alpaca training pipeline]
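For concreteness, here is a minimal sketch of that bootstrapping loop as I
understand it from the writeup: seed tasks prompt a teacher model for new
instruction/response pairs, near-duplicates are filtered out, and once the
pool reaches ~52K examples the smaller base model is fine-tuned on it. This
is not the authors' code; query_teacher_model and fine_tune are hypothetical
placeholders standing in for the real teacher-model API and the supervised
fine-tuning step.

    import random

    # Hypothetical placeholders -- stand-ins for the teacher LLM and for the
    # supervised fine-tuning of the base model. Not real APIs.
    def query_teacher_model(prompt: str, i: int) -> dict:
        """Pretend to ask a stronger 'teacher' model for a new instruction/response pair."""
        return {"instruction": f"generated task {i} (prompted by: {prompt[:30]}...)",
                "output": "generated answer"}

    def fine_tune(base_model: str, dataset: list) -> str:
        """Stand-in for supervised fine-tuning on the instruction/response pairs."""
        return f"{base_model} fine-tuned on {len(dataset)} examples"

    # 175 human-written seed tasks bootstrap the whole process.
    seed_tasks = [{"instruction": f"seed task #{i}", "output": "reference answer"}
                  for i in range(175)]

    TARGET_SIZE = 52_000          # the scale reported in the Alpaca writeup
    pool = list(seed_tasks)
    seen = {ex["instruction"] for ex in pool}

    while len(pool) < TARGET_SIZE:
        # Prompt the teacher with a few randomly sampled in-context examples so
        # generated tasks stay on-topic while the pool gradually diversifies.
        demos = random.sample(pool, k=3)
        prompt = "\n".join(d["instruction"] for d in demos)
        candidate = query_teacher_model(prompt, len(pool))
        if candidate["instruction"] not in seen:   # crude de-duplication filter
            seen.add(candidate["instruction"])
            pool.append(candidate)

    print(fine_tune("LLaMA-7B", pool))

The point of the sketch is just the shape of the thing: a small human-curated
"kernel" of 175 tasks is amplified to ~52K examples, which is the number my
speculation below hangs on.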
I was triggered to speculate about the ?coincidental? scale of a 52K
training set and the scale of the traditional Chinese character set
(estimated at 50-60K characters?), which is reputed to be composed from
on the order of 214 semantic elements (radicals) and roughly 1,200
phonetic elements, etc. Does this rich logographic character set
represent, in some sense, a basis set for the expression and even
conceptualization of human linguistic practice? Is it a coincidence that
Alpaca's 52K examples might represent a similar complexity? And might
the full training set of the GPT-style source be on the order of *all*
of the utterances ever written in Chinese characters, given that the
original character set was normalized by use rather than by edict
(unlike the formal efforts to reduce the character set that began in the
early 1900s and culminated during Mao's reign)?
Folks here with much more formal linguistic training than I, as well as
those who know more about the Chinese language and logo/ideographic
writing systems, may see huge holes in this speculation. I'd be
interested in any reflections on it.
It also has some implications for our "FFing the ineFFable" threads perhaps?
-. --- - / ...- .- .-.. .. -.. / -- --- .-. ... . / -.-. --- -.. .