If anyone is still interested, here's what I've since found out. Ron's suggestion of querying the repository for the existing document does impact performance, but not by as much as I thought it would. With a batch of 50,000 documents, my rate of migration dropped from about 29 docs/second to about 22 docs/second, a drop of roughly a quarter.
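To make the search-before-insert idea concrete, here is a minimal Java sketch, not the actual migration code. It assumes the legacy ID is used as the document name. `TargetRepository`, `InMemoryRepository`, `DuplicateDocumentException` and `SearchBeforeInsertWriter` are hypothetical names, and the in-memory map stands in for a real CMIS query so the example runs without a server.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical exception; a batch step could be configured to skip items that throw it. */
class DuplicateDocumentException extends RuntimeException {
    DuplicateDocumentException(String legacyId) {
        super("Document already migrated: " + legacyId);
    }
}

/** Hypothetical abstraction over the target repository (not an OpenCMIS type). */
interface TargetRepository {
    /** Returns the object ID of the document with this name, if one exists. */
    Optional<String> findObjectIdByName(String name);

    /** Creates a document and returns the server-assigned object ID. */
    String createDocument(String name, byte[] content);
}

/** In-memory stand-in for the repository, for illustration only. */
class InMemoryRepository implements TargetRepository {
    private final Map<String, String> idsByName = new ConcurrentHashMap<>();
    private final AtomicLong sequence = new AtomicLong();

    @Override
    public Optional<String> findObjectIdByName(String name) {
        return Optional.ofNullable(idsByName.get(name));
    }

    @Override
    public String createDocument(String name, byte[] content) {
        // The object ID is assigned by the "server"; the caller cannot choose it.
        String objectId = "obj-" + sequence.incrementAndGet();
        idsByName.put(name, objectId);
        return objectId;
    }
}

/** Looks the legacy ID up first and only inserts when nothing is found. */
class SearchBeforeInsertWriter {
    private final TargetRepository repository;

    SearchBeforeInsertWriter(TargetRepository repository) {
        this.repository = repository;
    }

    String migrate(String legacyId, byte[] content) {
        if (repository.findObjectIdByName(legacyId).isPresent()) {
            // Signal the duplicate so the caller can skip this item.
            throw new DuplicateDocumentException(legacyId);
        }
        return repository.createDocument(legacyId, content);
    }
}
```

In a real job the lookup would be a query against the target repository, and the step would be configured to skip on the duplicate exception. Note that the check and the insert are two separate calls, so with several concurrent threads two writers handling the same legacy ID could both pass the check.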
You cannot set the objectId of an object (it's read-only), so instead I used the cmis:name property for the test: I set it to the ID from the old FileNet (source) system, which serves as a unique identifier. If a document with that name already exists, I just throw an exception and tell Spring Batch to skip it.

Funnily enough, I wanted this mainly to guard against catastrophic failures, but under load I found that FileNet can add the document and then fail to send a correct response back (CmisConnectionException, XML parser errors, etc.), and this led to attempts to migrate duplicates. I am hammering it pretty hard though (15 concurrent threads), so I'm not expecting it to behave perfectly.

I still don't know how things will play out once we get into the hundreds of thousands range, but at least I know this is a viable approach.

Anyway, thought you'd like to know - thanks again for the help...:-)

Tim

On Mon, Jun 23, 2014 at 7:26 PM, Tim Webster <tim.webs...@gmail.com> wrote:

> I have to confess I haven't. I was making an assumption that it would slow
> it down too much, and it would really be a last resort.
>
> I should at least try it out and see how it impacts performance before I
> dismiss it though.
>
> On Mon, Jun 23, 2014 at 6:23 PM, Ron DiFrango <
> rdifra...@captechconsulting.com> wrote:
>
>> Tim,
>>
>> Just curious, have you tried the search-before-insert method to see
>> what impact it has on performance?
>>
>> Thanks,
>>
>> Ron DiFrango
>> Director / Architect | CapTech
>> (804) 855-9196 | rdifra...@captechconsulting.com
>>
>> On 6/22/14, 2:42 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:
>>
>> >Hi,
>> >
>> >Thanks for the advice guys...:-)
>> >
>> >Unfortunately the target CMIS repository isn't my own implementation -
>> >it's FileNet P8.
>> >The 'source' system is FileNet Content Services (not sure of the
>> >version - but it's non-CMIS compliant and about to become unsupported by
>> >IBM - hence the migration).
>> >
>> >So...what that means is I can't really do anything server-side about
>> >this.
>> >
>> >Sascha raises an interesting option - I didn't realize I could set the
>> >ObjectId myself. If I did that, and multiple documents had the same
>> >ObjectId, surely the server would throw an exception, meaning I wouldn't
>> >need to check if it already existed in the server?
>> >
>> >I could maybe throw some other bits into the hash computation (like
>> >creation date or something) to ensure uniqueness...?
>> >
>> >The 'search before insert' is of course an option, but it would slow
>> >everything down so much. For single-version documents I'm able to add 30
>> >documents/second, which is the minimum requirement.
>> >
>> >On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
>> >rdifra...@captechconsulting.com> wrote:
>> >
>> >> Tim,
>> >>
>> >> The suggestion below from Sascha is a good one. The other approach I've
>> >> taken before is to perform a search in the repo for a given document and
>> >> only insert it if it does not exist; otherwise perform an update
>> >> or just log it as an "error".
>> >>
>> >> Thanks,
>> >>
>> >> Ron DiFrango
>> >> Director / Architect | CapTech
>> >>
>> >> On 6/22/14, 5:37 AM, "Sascha Homeier" <shome...@meyle-mueller.de> wrote:
>> >>
>> >> >Hi Tim,
>> >> >
>> >> >you said you need to migrate the documents from FileNet to a CMIS
>> >> >compliant server.
>> >> >Is the CMIS compliant server your implementation?
>> >> >If so you could calculate a hash like MD5 over the content stream and
>> >> >set it as the object ID.
>> >> >According to the CMIS spec this object ID needs to be unique.
>> >> >So it must be ensured that no two objects with the same object ID
>> >> >exist in the same CMIS repository, which here is equivalent to no two
>> >> >objects having the same content stream.
>> >> >This approach would also prevent identical documents from being added
>> >> >in the future, after the migration is done.
>> >> >Nevertheless, here you also need to find a performant way of determining
>> >> >if an object with a given ID already exists (and find a solution if the
>> >> >hash is changed only by a timestamp inside the content stream etc.)
>> >> >With about two million objects you may need to extend the RAM on the
>> >> >migration machine to keep that many objects in memory and compare them
>> >> >using HashMaps and Hashtables with your own implementations of equals()
>> >> >and hashCode() ;)
>> >> >
>> >> >Anyway, a stimulating task. I'm curious about the ideas of others here
>> >> >to solve it in a performant way ;)
>> >> >
>> >> >Cheers
>> >> >Sascha
>> >> >
>> >> >-----Original Message-----
>> >> >From: Tim Webster [mailto:tim.webs...@gmail.com]
>> >> >Sent: Saturday, 21 June 2014 17:55
>> >> >To: dev@chemistry.apache.org
>> >> >Subject: Re: document 'uniqueness'
>> >> >
>> >> >Hello,
>> >> >
>> >> >yes thanks for the suggestion - it sort of does that already with the
>> >> >Spring Batch progress tracking, but it still won't prevent another
>> >> >document being added to the repository that is identical to a previous
>> >> >one if it somehow failed - like a JVM crash or power failure. Because
>> >> >there is no transaction management for the CMIS part, you can't really
>> >> >ensure this, except for a constraint in the repository itself.
>> >> >
>> >> >Anyway, yeah I think you're right and I need to look at FileNet
>> >> >specifically. I just wasn't sure if I missed something and there was
>> >> >something in the CMIS spec that I could use (e.g. some property or
>> >> >something).
>> >> >
>> >> >Thanks,
>> >> >
>> >> >Tim
>> >> >
>> >> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <mike.lu...@gwl.ca> wrote:
>> >> >
>> >> >> I'm sure you've already thought of this, but couldn't your migration
>> >> >> process just persist the legacy ids in a separate location (e.g.
>> >> >> database table, possibly cached in memory for performance)? Then you
>> >> >> would just need to check that for each document being migrated, to
>> >> >> make sure that the same doc hasn't been seen previously.
>> >> >>
>> >> >> Not a CMIS related solution, but seems like it would work fine...
>> >> >>
>> >> >> The other option, as you suggest, is to see if FileNet supports a
>> >> >> 'uniqueness' constraint for custom metadata properties. I believe
>> >> >> SharePoint does but not sure about FileNet.
>> >> >>
>> >> >> Thanks
>> >> >> michael lucas | Senior Software Developer | Great-West Life
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Tim Webster [mailto:tim.webs...@gmail.com]
>> >> >> Sent: June 20, 2014 8:15 AM
>> >> >> To: dev@chemistry.apache.org
>> >> >> Subject: document 'uniqueness'
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I am developing a migration process (using Spring Batch) to migrate
>> >> >> documents from a legacy CMS into a CMIS-compliant system, and I need
>> >> >> to ensure that duplicate documents are not created accidentally.
>> >> >>
>> >> >> However, our CMIS system (IBM FileNet) allows the addition of
>> >> >> documents with the same name. Documents with identical values for
>> >> >> cmis:name or cmis:contentStreamFilename are allowed. Even if this
>> >> >> could be disabled (I don't know if it can or cannot), it is a
>> >> >> business requirement and I wouldn't be able to.
>> >> >>
>> >> >> The only thing I can think of to prevent this is to save the 'legacy'
>> >> >> ID of the document in a new CMIS property and somehow check that it
>> >> >> doesn't already exist when adding a new document.
>> >> >> However this will be very inefficient and slow down the migration
>> >> >> (we're talking about up to 2 million documents).
>> >> >>
>> >> >> Ideally the 'uniqueness constraint' would be checked on the server
>> >> >> and would throw an exception, which I could then deal with.
>> >> >>
>> >> >> Does anyone know of an easier way to do this, or is there anything I
>> >> >> can make use of in the CMIS spec to help?
>> >> >>
>> >> >> Thanks,