Hi all, I have read the docs and lots of posts on this forum and it feels like things are starting to click into place, but I wanted to make sure. Bear with me please :)
Basically we have the following model:

Client { projects: Collection<Project>, users: Collection<User> }
Project { owner: User (in a firm), cost: number, startDate: Date, endDate: Date }

A project might last for 6 months and we snapshot it every day:

ProjectSnapshot { snapshotDate: Date, cost: number -- estimated cost of the project, project: Project, predictedEndDate: Date -- estimated end date }

So, quite simple: Client 1..* User, User 1..* Project, Project 1..* ProjectSnapshot. Actually, not so simple - a Project might be decomposed into smaller projects (Project 1..* Project) - but let's not worry about that for now.

In terms of reporting we need to be able to answer the following questions:
- for a client, how many projects were closed each month
- for a client, how many projects were on-going each month
- for a client, what is the sum of the costs of all on-going projects (where cost(project) == avg(cost(snapshots(project))))
- for *all* clients, how many projects were closed each month
- etc.

Note: the choice of month is arbitrary - it could be 36 seconds :) so I cannot have an explicit grouping by that.

In terms of numbers, there might be 100 clients, each with 10 users, each with 100 projects. Each project lives for 6 months and is snapshotted each day while it is 'alive', so every year there are 100 * 10 * 100 * (365/2) = 18,250,000 rows. Of course some can be archived after a year or so.

I was thinking about the following structure:

KS:Client<clientId>.CF:Project[<projectId>]: { owner:<text>, cost:<number>, startDate:<date>, endDate:<date> }
KS:Client<clientId>.CF:ProjectSnapshots[<snapshotDate_inNanos>]: { projectId:<UUID>, cost:<number>, predictedEndDate:<date> }

If I understand everything (big if!) this means that:
- I can look up all snapshots for all projects (within a client) using a key range across the keys in Client<clientId>.ProjectSnapshots.
From there I can shovel the results into a map/reduce to group by projectId and aggregate the snapshots for each project.
- The keys in CF:Project and CF:ProjectSnapshots will spread evenly over the cluster, so that each node has a chunk of contiguous projects and/or a chunk of contiguous snapshots.
- Adding a new snapshot should be really quick.

Some questions (assuming the above statements are true):

- The cluster nodes with the most projectIds will become hot spots. I really do need to look up a project by its ID, so I cannot have a random key for CF:Project. I don't know how to handle this.
- If I wanted to delete all the snapshots for a project, do I need a map/reduce to find the snapshots for that project and then bulk-delete them, or can I filter on CF:ProjectSnapshots[*]{projectId:X}? I realise I could maintain an index CF instead: CF:SnapshotsForProject[<projectId>]: { snapshotDate1:<snapshotUUID>, snapshotDate2:<snapshotUUID>, ... }.
- What about removing a project? I think I understand that updates on a CF are atomic, but here I need to delete entries from two CFs - each delete will be atomic, but they are two separate operations, right?

I realise that any reports across all clients will need to be manually aggregated (as operations are scoped within a keyspace, right?) - that is fine.

Finally, I think I could do all of this within a single SCF: KS:Client<clientId>: SCF:Project[<projectId>]: { CF:Projects, CF:SnapshotsForProject }. But this isn't ideal, for a few reasons (I think):
- how can I use the keys to search for all snapshots across all projects? Would I need to do this in map/reduce?
- adding a new snapshot is now quite expensive in terms of I/O

It does make all operations on a project atomic, though, which is excellent.

Ok - many, many thanks for reading this - all advice/thoughts/hints are welcome. After looking at this (and document DBs) for a few days and just not getting it, thinking 'what is all the hype about?', I think it is slowly starting to sink in and I am *very* excited about it.
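To make the reporting side concrete, here is roughly the aggregation I have in mind, sketched in plain Python over in-memory rows. The dicts below just stand in for whatever a CF key-range slice would return - all names (`snapshots`, `projects`, the field names) are my own illustration, not anything Cassandra gives you:

```python
from collections import defaultdict
from datetime import date

# Hypothetical snapshot rows for one client, as a key-range scan might
# return them: each row carries the projectId and the estimated cost.
snapshots = [
    {"projectId": "p1", "snapshotDate": date(2011, 1, 1), "cost": 100.0},
    {"projectId": "p1", "snapshotDate": date(2011, 1, 2), "cost": 110.0},
    {"projectId": "p2", "snapshotDate": date(2011, 1, 1), "cost": 50.0},
]

# "map" step: group snapshot costs by projectId.
by_project = defaultdict(list)
for snap in snapshots:
    by_project[snap["projectId"]].append(snap["cost"])

# "reduce" step: cost(project) == avg(cost(snapshots(project))).
avg_cost = {pid: sum(costs) / len(costs) for pid, costs in by_project.items()}

# Sum of costs over all on-going projects for the client.
total = sum(avg_cost.values())
print(avg_cost)  # {'p1': 105.0, 'p2': 50.0}
print(total)     # 155.0

# Closed-projects-per-bucket; the bucket here is a calendar month, but any
# arbitrary interval (even 36 seconds) works the same way, since the
# grouping happens client-side rather than in the schema.
projects = [
    {"projectId": "p1", "endDate": date(2011, 1, 15)},
    {"projectId": "p2", "endDate": date(2011, 1, 20)},
    {"projectId": "p3", "endDate": date(2011, 2, 3)},
]
closed_per_bucket = defaultdict(int)
for p in projects:
    closed_per_bucket[(p["endDate"].year, p["endDate"].month)] += 1
print(dict(closed_per_bucket))  # {(2011, 1): 2, (2011, 2): 1}
```

The point being: none of the grouping intervals need to live in the key layout, only the raw snapshots do.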
Assuming I haven't completely missed the point :)

Col
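P.S. In case it clarifies my second question, here is the difference between the two lookup approaches as I understand them, again with plain dicts standing in for the CFs (everything here is hypothetical illustration):

```python
# CF:ProjectSnapshots, keyed by snapshot timestamp (nanos); projectId lives
# only inside the value, so finding one project's snapshots means scanning
# every row - this is the "filter on CF:ProjectSnapshots[*]" approach.
project_snapshots = {
    1000: {"projectId": "p1", "cost": 100.0},
    2000: {"projectId": "p2", "cost": 50.0},
    3000: {"projectId": "p1", "cost": 110.0},
}

def snapshots_by_scan(project_id):
    # O(total snapshots across all projects).
    return sorted(k for k, v in project_snapshots.items()
                  if v["projectId"] == project_id)

# CF:SnapshotsForProject: a manually maintained index row per project,
# listing that project's snapshot keys. It must be written alongside every
# snapshot insert - two writes to two CFs, so not atomic as a pair.
snapshots_for_project = {"p1": [1000, 3000], "p2": [2000]}

def snapshots_by_index(project_id):
    # O(snapshots for this one project): one index lookup, then direct gets.
    return [project_snapshots[k] for k in snapshots_for_project[project_id]]

print(snapshots_by_scan("p1"))                           # [1000, 3000]
print([s["cost"] for s in snapshots_by_index("p1")])     # [100.0, 110.0]
```

Which is why I suspect the index CF is worth the extra write, even though deleting a project then touches both structures.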