Hi, I am experimenting with Cassandra 0.7.0 and Hadoop, developing simple MapReduce tasks. While developing a really simple MR task, I found that the combination of a Hadoop optimization and the queue in Cassandra's ColumnFamilyRecordWriter produces wrong keys in the calls to batch_mutate(). The problem is in the reduce part: the storage behind the key parameter is reused. For example, when storing URLs I get:
http://119.cz/index.php/vypalovaci-mechaniky-a-vypalovani-disk/120-jak-zjistit-verzi-firmwaru-vypalovaky-ve-windows-vista (1)
http://11superstars.xf.cz/index.php?page=12y-a-vypalovani-disk/120-jak-zjistit-verzi-firmwaru-vypalovaky-ve-windows-vista (2)
http://12kmenu.unas.cz/18-6-2011-(Isachar).htmlvypalovani-disk/120-jak-zjistit-verzi-firmwaru-vypalovaky-ve-windows-vista (3)

You can see that the tail of URL (1) is repeated at the end of URL (2) and URL (3).

I have changed my reduce method to clone the key before calling context.write(), but I think the key should be cloned inside Cassandra's ColumnFamilyRecordWriter. As a user, I don't care how it is implemented internally; I just write values to it. With FileOutputFormat, for example, I don't need to clone the key before writing.

I'd like to know your opinion.

Best regards,
Patrik
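
P.S. Here is a minimal sketch of what I believe is happening, using only the JDK so it runs without Hadoop or Cassandra. It is not the actual ColumnFamilyRecordWriter code. The class KeyReuseDemo, its methods, and the example URLs are made up for illustration. The assumption is that the writer queues a reference to the key's backing storage and reads it only later, after the same storage has been overwritten with the next key.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical demo class: simulates a reused key buffer and a deferred-write queue.
class KeyReuseDemo {
    // One backing array shared by every key, standing in for the reused key object.
    // Demo keys must be ASCII and at most 64 bytes long.
    static final byte[] backing = new byte[64];

    // Overwrite the shared array with the next key and return a view of it (no copy).
    static ByteBuffer load(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(b, 0, backing, 0, b.length);
        return ByteBuffer.wrap(backing, 0, b.length);
    }

    // Copy the key bytes into a fresh array, i.e. "clone the key".
    static ByteBuffer copyOf(ByteBuffer src) {
        ByteBuffer dup = src.duplicate();
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return ByteBuffer.wrap(out);
    }

    static String decode(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    // Queue all keys first, then read them back, as a batching writer would.
    static List<String> run(boolean cloneKeys, String... keys) {
        Queue<ByteBuffer> queue = new ArrayDeque<>();
        for (String k : keys) {
            ByteBuffer key = load(k);
            queue.add(cloneKeys ? copyOf(key) : key);
        }
        List<String> seen = new ArrayList<>();
        for (ByteBuffer b : queue) {
            seen.add(decode(b));
        }
        return seen;
    }

    public static void main(String[] args) {
        // Without cloning, the first queued key is corrupted by the second one:
        // [http://b.example/xong-path, http://b.example/x]
        System.out.println(run(false, "http://a.example/long-path", "http://b.example/x"));
        // With cloning, both keys survive:
        // [http://a.example/long-path, http://b.example/x]
        System.out.println(run(true, "http://a.example/long-path", "http://b.example/x"));
    }
}
```

The corrupted key has the same shape as my URLs above: the beginning of the newer key followed by the leftover tail of the older, longer one. Copying the bytes at the moment the key is queued avoids this, and that copy is what I think the record writer should do itself.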
