Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

glen Tue, 10 Jan 2023 10:24:38 -0800

One tangential solution I've seen work well enough in synthetic health data is to treat 
the longitudinal data as a sequence in the same way the LLMs treat text. Rather than 
focus on the 2nd problem EricC mentioned (clustering based on *similarity*), focus more 
on the 1st ("around 10 different types of changes that could happen").

I suggest this because "we've" made lots of progress in such sequence prediction, and
less progress in *object* detection. By asking for "types of career", you're conflating
the two problems, lowering the efficacy of the sequential task and perhaps raising the efficacy of
the object detection task.
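
A minimal sketch of the sequence framing, assuming each career is stored as an ordered list of event codes. The event names and the three toy careers are hypothetical, and a first-order transition count is only a very simple stand-in for the sequence models mentioned above, not what an LLM actually does:

```python
from collections import Counter, defaultdict

# Hypothetical event codes; the thread only says there are "around 10" types.
careers = {
    "p1": ["hired", "promoted", "left", "returned", "promoted"],
    "p2": ["hired", "promoted", "promoted", "left"],
    "p3": ["hired", "left", "returned", "left", "returned"],
}

def transition_counts(sequences):
    """Count how often each event type is directly followed by each other type."""
    counts = defaultdict(Counter)
    for events in sequences:
        for current, nxt in zip(events, events[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, event):
    """Return the most frequent successor of `event`, or None if it was never seen."""
    followers = counts.get(event)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

counts = transition_counts(careers.values())
print(predict_next(counts, "hired"))
```

The point of the framing is that the question becomes "what tends to happen next?" rather than "which pile does this person belong in?".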

But a second idea might come from Russ' comment. Obviously, these are fairly well
integrated humans. (My rants against narrativity aside.) So the causes of one seemingly
independent feature *are* intertwined with the causes of some other independent feature.
You can imagine a graph that includes all the features as nodes, but that also includes
something like a Markov blanket just inside those measured features, and an internal
causal kernel inside that blanket. This is akin to structural equation modeling et al.
What you're looking for is a reconstructed state space of any human in the database ...
what the cool kids are calling a "digital twin", these days.

And, given your (Eric's) rants about bureaucracy and stupidity being at least
in part due to the system these agents navigate, you *might* be able to keep
the model simple by modeling the *options* any given agent might have ... as
defined by the space. I.e. model the environment, the dual of the agent(s). We
did that in a model for optimizing building and urban evacuation using
so-called stupid agents.

If you're looking for the One True Algorithm, my guess is you'll get lost. But
if you do find it, write an Excel macro for the rest of us. 8^D

On 1/10/23 09:16, Russ Abbott wrote:

Interesting problem.

Eric, as you said earlier, K-means requires a way to measure the distance between objects 
-- so that those with smaller distances can be grouped together. A problem is that there 
are a number of features, which may not be correlated. For example, there is an income 
trajectory, a change of company trajectory, a change of level-of-responsibility 
trajectory, a change of subject-matter-focus trajectory, and probably more.  You might 
build separate trajectories for each person and then see if you can group the 
trajectories. For example, a "company man" may or may not have an increasing 
responsibility trajectory. You would then have a multi-dimensional space into which to 
put people.
-- Russ
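
One way to read the suggestion above, as a rough sketch: summarize each feature's trajectory separately, then combine the summaries into one point per person. The yearly records, the three features, and the summary statistics chosen here are all assumptions made for illustration:

```python
# Hypothetical yearly records: (income, company_id, responsibility_level).
people = {
    "company_man": [(50, "A", 1), (52, "A", 1), (54, "A", 1), (56, "A", 1)],
    "climber":     [(50, "A", 1), (60, "A", 2), (72, "A", 3), (85, "A", 4)],
    "rotator":     [(50, "A", 1), (55, "B", 1), (58, "A", 2), (63, "B", 2)],
}

def mean_step(values):
    """Average year-over-year change of a numeric trajectory."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return sum(steps) / len(steps) if steps else 0.0

def changes(values):
    """Number of times a categorical trajectory switches value."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)

def to_point(records):
    """Summarize each feature's trajectory on its own, then combine the summaries."""
    income, company, level = zip(*records)
    return (mean_step(income), changes(company), mean_step(level))

# One point per person in a three-dimensional space.
points = {name: to_point(recs) for name, recs in people.items()}
for name, point in points.items():
    print(name, point)
```

In this toy data the "company_man" and the "climber" both have zero company changes but differ on the responsibility axis, which is the kind of uncorrelated feature the message describes.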


On Mon, Jan 9, 2023 at 10:11 AM Nicholas Thompson <[email protected]> wrote:

    To my uneducated eye, this seemed like one of Jon’s problems.

    Sent from my Dumb Phone

    On Jan 7, 2023, at 6:23 AM, Frank Wimberly <[email protected]> wrote:

    
    This answer seems reasonable to me.  I worked on Project Talent during 1967,
    which had some similar goals and data.  See

    https://en.m.wikipedia.org/wiki/Project_Talent

    Our data was for thousands of high school students and our software was all
    written in Fortran.

    ---
    Frank C. Wimberly
    140 Calle Ojo Feliz,
    Santa Fe, NM 87505

    505 670-9918
    Santa Fe, NM

    On Fri, Jan 6, 2023, 11:32 PM Pieter Steenekamp <[email protected]> wrote:

        I asked https://chat.openai.com/chat and here is the conversation:

        *Pieter Steenekamp*
        can you suggest a solution for the following problem "I'm hoping someone here could
        help out. Let's imagine I had some data where each row was a person's career. We could
        list major events every year. For example: 2004 they were hired, 2007 they get a
        promotion, 2010 they leave for a different company, 2012 they come back at a higher
        level, 2015 get a promotion, then no change until 2022.  Let's say I had data like
        this for roughly 2 million people, and that there are around 10 different types of
        changes that could happen during any time period (could be yearly, quarterly, monthly,
        I can make it how I want).  I was hoping we could ask a computer to tell us if there
        were "types of careers" that people had. We could say "put all these careers into 4
        buckets" or "7 buckets" based on similarity. Then we could look at the piles the
        computer made and try to make sense of them. One type might be "company man" for
        people who tend to stay in place for 20 or more years, another type could be a
        "rotator", who leaves and returns every 3 years or so. Etc. The point is, I want a
        computer to make the piles for me, rather than trying to come up with potential piles
        a priori. Are there methods for doing this? I know it's a problem we've *talked* about
        a lot, but I don't know if there are solutions."
        *https://chat.openai.com/chat*
        There are several approaches that you could use to group the careers of these 2
        million people into clusters based on similarity. One approach is to use unsupervised
        machine learning techniques, specifically clustering algorithms, to automatically
        group the data into clusters. Some popular clustering algorithms that you could use
        for this problem include k-means, hierarchical clustering, and DBSCAN (Density-Based
        Spatial Clustering of Applications with Noise). You could also use dimensionality
        reduction techniques, such as PCA (Principal Component Analysis) or t-SNE
        (t-Distributed Stochastic Neighbor Embedding), to reduce the dimensionality of the
        data and then use a clustering algorithm to group the data into clusters. Another
        approach is to use supervised machine learning techniques, where you would need to
        manually label a subset of the data with the cluster labels, and then use this labeled
        data to train a classifier to predict the cluster labels for the remaining data. It's
        worth noting that these approaches will require some feature engineering, where you
        will need to extract relevant features from the data that can be used to represent the
        careers of the people in the dataset. Once you have identified and extracted these
        features, you can then use them as input to the machine learning algorithms to group
        the data into clusters.
        *Pieter Steenekamp*
        would it then be possible to get for example if there were "types of careers" that
        people had?
        *https://chat.openai.com/chat*
        Yes, once you have grouped the careers of the 2 million people into clusters using one of the 
approaches I mentioned above, you can then analyze the characteristics of each cluster to identify common 
patterns and trends within each cluster. This will allow you to identify "types of careers" that 
people have, based on the similarity of their career events. For example, if one of the clusters is 
characterized by people who have had long tenures at a single company, you could label this cluster as 
"company man" careers. Similarly, if another cluster is characterized by people who frequently 
switch between companies, you could label this cluster as "rotator" careers. You can then use this 
information to gain insights into the types of careers that are represented in the dataset, and potentially 
draw conclusions about the factors that may influence career paths and success.
        *Pieter Steenekamp*
        can you propose python code to do something like this please
        *https://chat.openai.com/chat*
        Sure, here is an example of how you could use the scikit-learn library 
in Python to group the careers of the 2 million people into clusters using the 
k-means algorithm:
        import numpy as np
        from sklearn.cluster import KMeans
        # Load the data into a NumPy array (one row per person, numeric features only)
        data = np.loadtxt('career_data.txt')
        # Create the k-means model
        model = KMeans(n_clusters=4)
        # Fit the model to the data
        model.fit(data)
        # Predict the cluster labels for each data point
        labels = model.predict(data)
        # Print the cluster labels
        print(labels)
        This code will group the data into 4 clusters, as specified by the 
n_clusters parameter. You can adjust this parameter to change the number of 
clusters that the data is grouped into.
        To extract features from the data that can be used as input to the 
k-means model, you will need to preprocess the data and select relevant 
features. This will likely involve some feature engineering, which will depend 
on the specific characteristics of the data and the goals of your analysis.
        I hope this helps! Let me know if you have any questions.
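
The conversation above leaves the feature-engineering step open. One possible sketch, assuming each career is a list of (year, event type) pairs: count each event type and record the longest gap between events, giving a fixed-length numeric row per person. The event vocabulary and the choice of features here are hypothetical, not something the original data or the conversation specifies:

```python
from collections import Counter

# Hypothetical vocabulary; the real data is said to have around 10 event types.
EVENT_TYPES = ["hired", "promoted", "left", "returned"]

def career_to_vector(events):
    """Turn a list of (year, event_type) pairs into a fixed-length feature list."""
    counts = Counter(kind for _, kind in events)
    years = sorted(year for year, _ in events)
    gaps = [b - a for a, b in zip(years, years[1:])]
    longest_gap = max(gaps) if gaps else 0
    # One count per event type, followed by the longest gap in years.
    return [counts[kind] for kind in EVENT_TYPES] + [longest_gap]

# The example career from the original question.
career = [(2004, "hired"), (2007, "promoted"), (2010, "left"),
          (2012, "returned"), (2015, "promoted")]
row = career_to_vector(career)
print(row)
```

A list of such rows, one per person, has the numeric shape that the k-means example expects in place of `data`. Counts like these discard the order of events, so they would likely need to be combined with ordering or timing features to separate a "rotator" from other careers.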

        On Fri, 6 Jan 2023 at 19:34, Eric Charles <[email protected]> wrote:

            Greetings all,
            I'm hoping someone here could help out. Let's imagine I had some 
data where each row was a person's career. We could list major events every 
year.

            For example: 2004 they were hired, 2007 they get a promotion,
2010 they leave for a different company, 2012 they come back at a higher level, 
2015 get a promotion, then no change until 2022.

            Let's say I had data like this for roughly 2 million people, and 
that there are around 10 different types of changes that could happen during 
any time period (could be yearly, quarterly, monthly, I can make it how I want).

            I was hoping we could ask a computer to tell us if there were "types of careers" that 
people had. We could say "put all these careers into 4 buckets" or "7 buckets" based on 
similarity. Then we could look at the piles the computer made and try to make sense of them.

            One type might be "company man" for people who tend to stay in place for 20 
or more years, another type could be a "rotator", who leaves and returns every 3 years or 
so. Etc. The point is, I want a computer to make the piles for me, rather than trying to come up 
with potential piles a priori.

            Are there methods for doing this? I know it's a problem we've 
*talked* about a lot, but I don't know if there are solutions.

            Any help would be appreciated.


--
ꙮ Mɥǝu ǝlǝdɥɐuʇs ɟᴉƃɥʇ' ʇɥǝ ƃɹɐss snɟɟǝɹs˙ ꙮ
-. --- - / ...- .- .-.. .. -.. / -- --- .-. ... . / -.-. --- -.. .
FRIAM Applied Complexity Group listserv
Fridays 9a-12p Friday St. Johns Cafe   /   Thursdays 9a-12p Zoom 
https://bit.ly/virtualfriam
to (un)subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
FRIAM-COMIC http://friam-comic.blogspot.com/
archives:  5/2017 thru present https://redfish.com/pipermail/friam_redfish.com/
 1/2003 thru 6/2021  http://friam.383.s1.nabble.com/
