Hello,
Please advise on how to encode data for the following clustering problem.
I have a dataset with car usage info. The dataset has the following fields:
1. Car model (Toyota Celica, BMW, Nissan X-Trail, Mazda Cosmo, etc.)
2. Year built 
3. Country where the car runs 
4. Distance run by car before major repairs 

Important: The above dataset is sparse.
For most cars, "Distance" is known for only some of the countries.

Problem: 
For a given car, predict the "Distance" it will run before major repairs in a
country for which "Distance" is unknown.

My approach:
I want to represent each record in the dataset as a sparse vector with the 
following components:
1. Binary (1/0) car model components. The number of these components equals
the number of distinct models in the dataset.
2. Binary (1/0) country components. The number of these components equals
the number of distinct countries in the dataset.
3. Distance. A single integer component equal to the distance run by the car.
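
To make the encoding above concrete, here is a minimal sketch in Python. The
sample records, the distances, and the helper names (one_hot, encode) are
made up for illustration and are not from my real dataset. Year built is
carried in the records but left out of the vector, as in the list above.

```python
# Hypothetical sample records: (model, year_built, country, distance_km).
records = [
    ("Toyota Celica", 1999, "Japan", 180000),
    ("BMW", 2005, "Germany", 220000),
    ("Nissan X-Trail", 2008, "Russia", 150000),
    ("Toyota Celica", 1999, "Russia", 140000),
]

def one_hot(value, vocabulary):
    """Return a 1/0 list with a single 1 at the position of `value`."""
    return [1 if v == value else 0 for v in vocabulary]

# One binary component per distinct model and per distinct country.
models = sorted({r[0] for r in records})
countries = sorted({r[2] for r in records})

def encode(record):
    """Concatenate model bits, country bits, and the raw distance."""
    model, _year, country, distance = record
    return one_hot(model, models) + one_hot(country, countries) + [distance]

vectors = [encode(r) for r in records]
for v in vectors:
    print(v)
```

With three models and three countries in the sample, each vector has seven
components: six binary ones followed by the distance.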

Next, I want to cluster these vectors with k-means and analyze the resulting groups.

Questions:
1) My vectors mix components of different kinds: binary (model, country)
and continuous (distance). How should I calculate the distance between two
such vectors? Would cosine similarity work?
2) Are there other ways to encode components with a finite set of values
(model, country) so that they work well with continuous components (such as
distance)?
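
To illustrate the concern behind question 1, here is a minimal sketch in
Python with made-up vectors. With raw distances, the Euclidean distance is
driven almost entirely by the last component. Min-max rescaling of that
component is shown only as one possible adjustment for comparison, not as a
recommended answer, and rescale_last is a hypothetical helper name.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical layout: [model_A, model_B, country_X, country_Y, distance_km]
a = [1, 0, 1, 0, 180000]  # model A in country X
b = [1, 0, 0, 1, 140000]  # model A in country Y
c = [0, 1, 0, 1, 141000]  # model B in country Y

# Raw values: the binary components barely affect the result.
raw_ab = euclidean(a, b)
raw_bc = euclidean(b, c)

def rescale_last(vectors):
    """Min-max rescale the last component to [0, 1].

    Assumes the last components are not all equal.
    """
    values = [v[-1] for v in vectors]
    lo, hi = min(values), max(values)
    return [v[:-1] + [(v[-1] - lo) / (hi - lo)] for v in vectors]

sa, sb, sc = rescale_last([a, b, c])
scaled_ab = euclidean(sa, sb)
scaled_bc = euclidean(sb, sc)

print(raw_ab, raw_bc)
print(scaled_ab, scaled_bc)
```

After rescaling, the binary and continuous components contribute on a
comparable scale, but whether that weighting is appropriate is exactly what
I am unsure about.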

Thanks!
Anton

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
