Is it necessary to convert categorical data into integers? Any tips would be greatly appreciated!
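To make the question concrete, here is a minimal plain-Python sketch of my concern. It is not MLlib code. The city names and the helpers `integer_codes`, `one_hot`, and `euclidean` are made up for illustration. Integer codes seem to impose an artificial order on a nominal attribute, while one-hot vectors keep every pair of distinct categories equally far apart.

```python
import math


def integer_codes(values):
    """Map each distinct category to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes


def one_hot(values):
    """Map each distinct category to a 0/1 indicator vector."""
    codes = integer_codes(values)
    n = len(codes)
    return {v: [1.0 if i == c else 0.0 for i in range(n)] for v, c in codes.items()}


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical nominal attribute: there is no natural order among cities.
cities = ["Austin", "Boston", "Chicago"]
codes = integer_codes(cities)
vectors = one_hot(cities)

# With integer codes, Chicago ends up "twice as far" from Austin as Boston is.
int_dist_ab = abs(codes["Austin"] - codes["Boston"])
int_dist_ac = abs(codes["Austin"] - codes["Chicago"])

# With one-hot vectors, all distinct cities are the same distance apart.
hot_dist_ab = euclidean(vectors["Austin"], vectors["Boston"])
hot_dist_ac = euclidean(vectors["Austin"], vectors["Chicago"])
```

If that reading is right, plain integer codes would only be safe for the ordinal attribute (satisfaction_level), not for city or zip.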
-Rex

On Sun, Jun 14, 2015 at 10:05 AM, Rex X <dnsr...@gmail.com> wrote:

> For clustering analysis, we need a way to measure distances.
>
> When the data contains different levels of measurement -
> *binary / categorical (nominal), counts (ordinal), and ratio (scale)*
>
> To be concrete, for example, working with attributes of
> *city, zip, satisfaction_level, price*
>
> In the meanwhile, the real data usually also contains string attributes,
> for example, book titles. The distance between two strings can be measured
> by minimum-edit-distance.
>
> In SPSS, it provides Two-Step Cluster, which can handle both ratio scale
> and ordinal numbers.
>
> What is right algorithm to do hierarchical clustering analysis with all
> these four-kind attributes above with *MLlib*?
>
> If we cannot find a right metric to measure the distance, an alternative
> solution is to do a topological data analysis (e.g. linkage, and etc).
> Can we do such kind of analysis with *GraphX*?
>
> -Rex
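
One possible way to combine the attribute kinds mentioned in the quoted message is a Gower-style dissimilarity: a 0/1 mismatch for nominal fields, a range-normalized absolute difference for ordinal and ratio fields, and a length-normalized minimum edit distance for strings, all averaged with equal weight. The sketch below is plain Python, not an MLlib API. The record layout, the `title` field, the `ranges` values, the equal weighting, and the helper names are assumptions made for illustration.

```python
def edit_distance(s, t):
    """Minimum edit (Levenshtein) distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def mixed_distance(a, b, ranges):
    """Gower-style dissimilarity in [0, 1] for one pair of records.

    Assumes each numeric difference is no larger than its entry in `ranges`.
    """
    parts = []
    # Nominal attributes: 0 if equal, 1 otherwise.
    for key in ("city", "zip"):
        parts.append(0.0 if a[key] == b[key] else 1.0)
    # Ordinal and ratio attributes: absolute difference scaled by the range.
    for key in ("satisfaction_level", "price"):
        parts.append(abs(a[key] - b[key]) / ranges[key])
    # String attribute: edit distance scaled by the longer string's length.
    longest = max(len(a["title"]), len(b["title"]), 1)
    parts.append(edit_distance(a["title"], b["title"]) / longest)
    # Equal weights for every attribute (an assumption, not a recommendation).
    return sum(parts) / len(parts)


# Hypothetical records and value ranges, for illustration only.
a = {"city": "Austin", "zip": "78701", "satisfaction_level": 4,
     "price": 20.0, "title": "kitten"}
b = {"city": "Boston", "zip": "02108", "satisfaction_level": 2,
     "price": 30.0, "title": "sitting"}
ranges = {"satisfaction_level": 4, "price": 100.0}

d = mixed_distance(a, b, ranges)
```

A pairwise matrix of such distances could then be passed to a hierarchical clustering routine that accepts precomputed distances, for example SciPy's `scipy.cluster.hierarchy.linkage` on a single machine. Whether this scales to a distributed setting is a separate question.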