Hi,

This *is* off-topic, but with Python being a language with a somewhat scientific audience, I might get lucky ;) I have a set of documents (helpdesk tickets, in fact) and I would like to automatically collect them into bundles so I can visualise some statistics depending on content.
A while ago I wrote a very simple clustering library which can cluster just about anything for which you can calculate some form of distance. Meaning: you supply a function that returns a numeric value given two objects (helpdesk request text bodies in this case). The more closely related the two objects are, the smaller the returned value, with 0.0 meaning the two objects are identical.

Is it possible to calculate a distance between two chunks of text? I suppose one could simply do a word count on the chunks (removing common noise words, of course) and go from there, maybe even assigning different weightings to words. But maybe there is a well-tested and useful algorithm already available? Text processing is a very blurry area for me.

I don't expect any solutions to the problem right away. Maybe just some pointers as to *what* I can google for. I'll pick the rest up from there.

Eventually I would like to be able to say: "This set of texts contains 20 requests dealing with email, 30 requests dealing with Office applications and 210 requests dealing with databases." I am aware that labelling the different text bundles will have to be done manually, I suppose, but I will aim for no more than 10 bundles anyway, so that's OK.
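To make the word-count idea concrete, here is roughly what I had in mind as the kind of function my library would be fed: a minimal, untested sketch that turns each text into a bag of word counts (the stop-word list is just a placeholder) and returns the cosine distance between the two count vectors, so identical texts give 0.0 and unrelated texts approach 1.0.

import math
import re
from collections import Counter

# Placeholder noise-word list; a real one would be much larger and
# tuned to the language of the tickets.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "in", "for"}

def word_counts(text):
    """Lower-case, split into words and drop noise words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def distance(text_a, text_b):
    """Cosine distance between the word-count vectors of two texts.

    Returns 0.0 for identical texts; values near 1.0 mean the texts
    share essentially no vocabulary.
    """
    a, b = word_counts(text_a), word_counts(text_b)
    if not a or not b:
        return 1.0
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return 1.0 - dot / norm

With something like that, distance("Outlook will not send mail", "cannot send email from Outlook") should come out noticeably smaller than either text measured against a database ticket, which is all the clustering needs to work with.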