Let me elaborate on what I mean by "ordering of *rows* rather than columns"
in the present context:

In my exploratory data analysis for autism using the Laboratory of the
States data, there were 51 "rows":  The 50 States plus Washington, D.C.
There is no "time" apparent in these rows unless one posits Markov
processes *latent in the data*.  By that I mean (and this is something I
discussed with you some months ago in this very forum) some "States" may be
in a different "state" of time evolution of the underlying process: they
may be "behind the times" or "ahead of the times" so to speak.  This spread
of evolutionary states across States results in a scatter between two
variables.  One may *model* this process by some sort of regression fit
between the States.  Let's say, for instance, there is a time progression
of urbanization of the population.  One might posit all manner of causal
laws involving an individual experiencing high population density
environments vs those experiencing low population density environments.
Indeed, Steve Sailer
<https://www.amazon.com/Noticing-Essential-1973-2023-Steve-Sailer/dp/1959403028>
(who used my old, now-defunct laboratoryofthestates.org website quite a bit)
got himself into some serious Crimestop trouble by "noticing"
<https://web.archive.org/web/20050829182413/http://www.vdare.com/sailer/041212_secret.htm>:

> Overall, Bush carried the top 25 states ranked on years married for white
> women. The correlation coefficient with Bush's share of the vote is 0.91,
> or 83 percent of the variation "explained." That's extremely high. Years
> married also correlates with the 2000 election results at the 0.89 level
> (80 percent). So it's no fluke.


> The r-squared when years married and fertility are combined in a multiple
> regression model improves to 88 percent. (Small-sounding change, perhaps,
> but actually an important (30%) reduction in the unaccounted variation -
> from 17 percent to 12 percent.)


That r^2 is among the highest if not *the* highest ever found in the social
sciences between two variables that are *both* salient *and* non-trivially
related. You will also notice that website no longer exists due to lawfare
against it by the same person that prosecuted Trump.  Sailer started down
the path toward that discovery by first looking at what he called
"affordable family formation" and then noticing the white TFR was a strong
predictor of electoral outcome.
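
The arithmetic in the figures quoted above is easy to check. A minimal
sketch in Python, using only the numbers that appear in the quote (the
variable names are illustrative, not from the original article):

```python
# Correlation coefficients as quoted: Bush vote share vs. years married.
r_2004 = 0.91
r_2000 = 0.89

# Share of variance "explained" is the square of the correlation.
print(round(r_2004 ** 2, 3))
print(round(r_2000 ** 2, 3))

# Relative reduction in unexplained variance when fertility is added:
# the quote gives 17 percent unexplained before and 12 percent after.
unexplained_before = 0.17
unexplained_after = 0.12
reduction = (unexplained_before - unexplained_after) / unexplained_before
print(round(reduction, 3))
```

Squaring 0.89 gives roughly 0.79 rather than 0.80, so the quoted 80
percent presumably comes from a coefficient that was rounded after
squaring; that is a guess, not something the quote states.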

The accusation that Sailer is a "white supremacist" followed on such
Crimethink about TFR -- which boils down to an indictment of his noticing
such a huge r^2 as resulting merely from "reasoning motivated by a desire
to kill all the nonwhites" or some such transparent bullshit -- bullshit
that is nevertheless highly politically motivating to places like Harvard's
sociology department.
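
For concreteness, the column-ranking procedure described in the quoted
message below (all pairs of columns, all combinations of identity, sqrt
and log, trimming the extremes, keeping the best r^2, then averaging)
could be sketched roughly as follows. This is a reconstruction under
assumptions, not the original code: the trimming rule (drop the rows that
hold the minimum or maximum of either column in the pair), the handling of
values where sqrt or log is undefined (that combination is skipped), and
all function names are guesses made for illustration. The toy table is
made-up data.

```python
import math
from itertools import product

# The three transforms named in the message.
TRANSFORMS = (lambda v: v, math.sqrt, math.log)

def r_squared(xs, ys):
    """Squared Pearson correlation; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

def trim_extremes(xs, ys):
    """Assumed rule: drop rows holding the min or max of either column."""
    drop = {xs.index(min(xs)), xs.index(max(xs)),
            ys.index(min(ys)), ys.index(max(ys))}
    keep = [i for i in range(len(xs)) if i not in drop]
    return [xs[i] for i in keep], [ys[i] for i in keep]

def best_r_squared(xs, ys):
    """Highest r^2 over all 9 transform combinations for one pair."""
    xs, ys = trim_extremes(xs, ys)
    best = 0.0
    for fx, fy in product(TRANSFORMS, repeat=2):
        try:
            tx = [fx(x) for x in xs]
            ty = [fy(y) for y in ys]
        except ValueError:  # sqrt or log outside its domain
            continue
        best = max(best, r_squared(tx, ty))
    return best

def rank_columns(table):
    """Rank columns by their mean best r^2 against every other column."""
    scores = {}
    for a in table:
        others = [b for b in table if b != a]
        total = sum(best_r_squared(table[a], table[b]) for b in others)
        scores[a] = total / len(others)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up toy table: "b" is an exponential function of "a", "c" is unrelated.
base = [1, 2, 3, 4, 5, 6, 7, 8]
toy = {"a": base, "b": [math.exp(v) for v in base], "c": [3, 1, 4, 1, 5, 9, 2, 6]}
for name, score in rank_columns(toy):
    print(name, round(score, 3))
```

Picking the best of nine transform combinations per pair inflates r^2
somewhat relative to a single pre-specified fit, which is part of why a
more principled model selection criterion is attractive.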

On Fri, May 2, 2025 at 12:20 PM James Bowery <jabow...@gmail.com> wrote:

> I did something along these lines with the LaboratoryOfTheStates by taking
> all pairs of columns, applying all combinations of (identity, sqrt, log)
> between them, dropping the high and low datapoints (standard practice to
> inhibit outlier effects), picking the combination that results in the
> highest r^2 (coefficient of determination) for that pair (under that
> combination), and then averaging across all other variables to rank order
> variables (columns) according to how much they tell you about the rest of
> the dataset. That's when I got myself into some *serious* trouble
> (Crimestop-wise) which resulted in me thinking about something more
> principled that could get me *out* of trouble by eliminating any
> suspicion that I was engaged in motivated reasoning:  Hence the C-Prize
> idea.
>
> So, no it doesn't "make sense" in the sense I mean by seeking algorithmic
> information approximation as the information criterion for model
> selection.
>
> However, I think the algorithmic Markov paper is saying something
> different:
>
> Any Markov process has a notion of "state" which, in turn, has a notion of
> ordering of *rows* rather than columns.
>
>
>
> On Thu, May 1, 2025 at 6:21 PM Matt Mahoney <mattmahone...@gmail.com>
> wrote:
>
>> On Thu, May 1, 2025 at 2:50 PM Matt Mahoney <mattmahone...@gmail.com>
>> wrote:
>> >
>> > I think I understand. We say that X causes Y if we can describe Y as a
>> > function of X. If the simplest description of X and Y has the form Y =
>> > f(X), then we are using algorithmic information to find causality. For
>> > example,
>> >
>> > X Y
>> > - -
>> > 1 1
>> > 2 2
>> > 3 2
>> >
>> > then I can write Y as a function of X, but not X as a function of Y.
>> > Thus, the DAG X -> Y is more plausible than Y -> X.
>> >
>> > To make this practical, the paper postulates a noise signal, as Y =
>> > f(X, N), where N can be 0 in the first case but not in the second.
>> > Thus, less algorithmic information is needed to encode the first case.
>> 
>> But is this a reasonable definition of causality? To test if Y is a
>> simple function of X, we would use compression to approximate K(Y|X),
>> and say that X causes Y if this value is smaller than K(X|Y). To
>> measure K(Y|X) you would compress X (a column of numbers in a table),
>> and subtract that from the compressed size of X concatenated with Y.
>> To test the causal relationships between n variables, you need to
>> compress n^2 pairs of columns. But observe that
>> 
>> K(Y|X) + K(X) = K(X, Y) = K(X|Y) + K(Y)
>> 
>> So you really only need to compare K(X) and K(Y). Whichever is larger
>> causes the other. You compress n columns and sort them by size from
>> largest to smallest. That is your DAG.
>> 
>> This would be easy to do with thousands of rows and columns, for
>> example, LaboratoryOfTheCounties, where the rows are counties and the
>> columns are things like the percent of population under age 5 or the
>> number of farms between 50 and 100 acres, to see which causes the
>> other.
>> 
>> But does that make sense?
>> 
>> --
>> -- Matt Mahoney, mattmahone...@gmail.com
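
The compression test described in the quoted message above (estimate
K(Y|X) as the compressed size of X concatenated with Y, minus the
compressed size of X) can be tried in a few lines. This is a minimal
sketch, assuming zlib as the compressor and a newline-separated decimal
text encoding of each column; both choices, and the function names, are
illustrative assumptions rather than anything specified in the thread.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Compressed length in bytes, used as a crude stand-in for K(.)."""
    return len(zlib.compress(data, 9))

def encode_column(values) -> bytes:
    """Assumed encoding: one decimal value per line."""
    return "\n".join(str(v) for v in values).encode("utf-8")

def conditional_estimate(x: bytes, y: bytes) -> int:
    """Rough stand-in for K(Y|X): C(X followed by Y) minus C(X)."""
    return compressed_size(x + y) - compressed_size(x)

# Made-up columns: y is a many-to-one function of x.
x = encode_column(range(1000))
y = encode_column(v // 2 for v in range(1000))

print("C(X)   =", compressed_size(x))
print("C(Y)   =", compressed_size(y))
print("C(Y|X) ~", conditional_estimate(x, y))
print("C(X|Y) ~", conditional_estimate(y, x))
```

Two caveats apply. The chain rule K(X, Y) = K(X) + K(Y|X) holds only up to
logarithmic terms, and a real compressor is not symmetric: the compressed
size of X followed by Y generally differs from that of Y followed by X, so
the two conditional estimates need not be consistent with a simple
comparison of C(X) and C(Y).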

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T0f47884dae19d52d-Maeb840e5daa657196781341e
