Heikki Linnakangas wrote:
However, the problem is how to represent and store the
cross-correlation. For fields with low cardinality, like "gender" and
boolean "breast-cancer-or-not" you can count the prevalence of all the
different combinations, but that doesn't scale. Another often-cited
example is zip code + street address. There's clearly a strong
correlation between them, but how do you represent that?
For scalar values we currently store a histogram. I suppose we could
create a 2D histogram for two columns, but that doesn't actually help
with the zip code + street address problem.
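
To make the low-cardinality case concrete, here is a minimal sketch that
counts every observed combination of two columns and compares the result
with an estimate that assumes the columns are independent. The sample
rows and the function names are made up for illustration; this is not
how the planner stores or uses its statistics.

```python
from collections import Counter

# Hypothetical sample: (gender, has_breast_cancer) pairs, invented for illustration.
rows = [("F", True)] * 12 + [("F", False)] * 488 + [("M", False)] * 500
n = len(rows)

# Per-column frequencies, i.e. what single-column statistics can tell us.
gender = Counter(g for g, _ in rows)
cancer = Counter(c for _, c in rows)

# Joint frequencies: one counter entry per observed combination.
joint = Counter(rows)


def independent_estimate(g, c):
    """Selectivity of (gender = g AND cancer = c) assuming independence."""
    return (gender[g] / n) * (cancer[c] / n)


def joint_estimate(g, c):
    """Selectivity of the same predicate from the joint counts."""
    return joint[(g, c)] / n


for combo in [("F", True), ("M", True)]:
    print(combo, independent_estimate(*combo), joint_estimate(*combo))
```

In this made-up sample the independence assumption gives the same
selectivity for both genders, while the joint counts show that all the
positive rows fall in one group. The cost is one counter entry per
distinct combination, which is why the approach does not carry over to
high-cardinality pairs such as zip code + street address.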
In my head the neuron for 'principal component analysis' went on while
reading this. Back in college it was used to prepare input data before
feeding it into a neural network. Maybe ideas from PCA could be helpful?
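
As a rough illustration of the PCA idea, here is a minimal sketch for two
numeric columns, using the closed-form eigen-decomposition of the 2x2
covariance matrix. The data is synthetic and pca_2d is a hypothetical
helper, not anything that exists in PostgreSQL. Note the assumptions:
PCA only captures linear correlation between numeric values, so it would
not apply directly to categorical data such as street addresses.

```python
import math
import random


def pca_2d(xs, ys):
    """Return ((lam1, lam2), axis) for the covariance matrix of two columns.

    lam1 >= lam2 are the eigenvalues; axis is a unit vector along the
    first principal component.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    mean = (sxx + syy) / 2
    half_gap = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = mean + half_gap, mean - half_gap

    # Orientation of the major axis: tan(2 * angle) = 2 * sxy / (sxx - syy).
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (lam1, lam2), (math.cos(angle), math.sin(angle))


# Synthetic, strongly correlated columns (an assumption for the demo).
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [2 * x + random.gauss(0, 0.1) for x in xs]

(lam1, lam2), axis = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)  # share of variance along the first component
print(axis, explained)
```

When one component explains nearly all of the variance, the two columns
behave almost like a single column along that axis, and in principle a
one-dimensional histogram along it could summarize both. Whether that
could be turned into something usable for selectivity estimation is an
open question.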
regards,
Yeb Havinga
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)