Tim,

Just curious: have you tried the search-before-insert method to see what impact it has on performance?

Thanks,

Ron DiFrango
Director / Architect | CapTech
(804) 855-9196 | rdifra...@captechconsulting.com

On 6/22/14, 2:42 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:

>Hi,
>
>Thanks for the advice guys...:-)
>
>Unfortunately the target CMIS repository isn't my own implementation -
>it's FileNet P8. The 'source' system is FileNet Content Services (not sure
>of the version - but it's non-CMIS compliant and about to become
>unsupported by IBM - hence the migration).
>
>So...what that means is I can't really do anything server-side about this.
>
>Sascha raises an interesting option - I didn't realize I could set the
>ObjectId myself. If I did that, and multiple documents had the same
>ObjectId, surely the server would throw an exception, meaning I wouldn't
>need to check if it already existed in the server?
>
>I could maybe throw some other bits into the hash computation (like
>creation date or something) to ensure uniqueness...?
>
>The 'search before insert' is of course an option, but it would slow
>everything down so much. For single-version documents I'm able to add 30
>documents/second, which is the minimum requirement.
>
>
>On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
>rdifra...@captechconsulting.com> wrote:
>
>> Tim,
>>
>> The suggestion below from Sascha is a good one. The other approach I've
>> taken before is to perform a search in the repo for a given document and
>> only if it does not exist would I insert it, otherwise perform an update
>> or just log it as an "error".
>>
>> Thanks,
>>
>> Ron DiFrango
>> Director / Architect | CapTech
>>
>>
>> On 6/22/14, 5:37 AM, "Sascha Homeier" <shome...@meyle-mueller.de> wrote:
>>
>> >Hi Tim,
>> >
>> >you said you need to migrate the documents from FileNet to a CMIS
>> >compliant server.
>> >Is the CMIS compliant server your implementation?
>> >If so you could calculate a Hash like MD5 over the content stream and
>> >set it as the object ID.
>> >Due to the CMIS spec this object ID needs to be unique. So it must be
>> >ensured that no two objects with the same object ID exist in the same
>> >CMIS repository, which is equivalent to having no two objects with the
>> >same content stream.
>> >This approach would also ensure that equal documents are not added in
>> >the future after migration is done.
>> >Nevertheless here you also need to find a performant way of determining
>> >if an object with an ID already exists (and find a solution if the hash
>> >is changed only by a timestamp inside the content stream etc.)
>> >With about two million objects you maybe need to extend the RAM on the
>> >migration machine to keep so many objects in memory and compare them by
>> >using HashMaps and Hashtables with own implementations of equals() and
>> >hashCode() ;)
>> >
>> >Anyway a stimulating task. I'm curious about the ideas of others here
>> >to solve it in a performant way ;)
>> >
>> >Cheers
>> >Sascha
>> >
>> >-----Original Message-----
>> >From: Tim Webster [mailto:tim.webs...@gmail.com]
>> >Sent: Saturday, June 21, 2014 5:55 PM
>> >To: dev@chemistry.apache.org
>> >Subject: Re: document 'uniqueness'
>> >
>> >Hello,
>> >
>> >yes thanks for the suggestion - it sort of does that already with the
>> >Spring Batch progress tracking, but it still won't prevent another
>> >document being added to the repository that is identical to a previous
>> >one if it somehow failed - like a JVM crash or power failure. Because
>> >there is no transaction management for the CMIS part, you can't really
>> >ensure this, except for a constraint in the repository itself.
>> >
>> >Anyway, yeah I think you're right and I need to look at FileNet
>> >specifically. I just wasn't sure if I missed something and there was
>> >something in the CMIS spec that I could use (e.g. some property or
>> >something).
>> >
>> >Thanks,
>> >
>> >Tim
>> >
>> >
>> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <mike.lu...@gwl.ca> wrote:
>> >
>> >> I'm sure you've already thought of this, but couldn't your migration
>> >> process just persist the legacy ids in a separate location (e.g.
>> >> database table, possibly cached in memory for performance)? Then you
>> >> would just need to check that for each document being migrated, to
>> >> make sure that the same doc hasn't been seen previously.
>> >>
>> >> Not a CMIS related solution, but seems like it would work fine...
>> >>
>> >> The other option, as you suggest, is to see if FileNet supports a
>> >> 'uniqueness' constraint for custom metadata properties. I believe
>> >> Sharepoint does but not sure about FileNet.
>> >>
>> >> Thanks
>> >> michael lucas | Senior Software Developer | Great-West Life
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Tim Webster [mailto:tim.webs...@gmail.com]
>> >> Sent: June 20, 2014 8:15 AM
>> >> To: dev@chemistry.apache.org
>> >> Subject: document 'uniqueness'
>> >>
>> >> Hi,
>> >>
>> >> I am developing a migration process (using Spring Batch) to migrate
>> >> documents from a legacy CMS into a CMIS-compliant system, and I need
>> >> to ensure that duplicate documents are not created accidentally.
>> >>
>> >> However, our CMIS system (IBM FileNet) allows the addition of
>> >> documents with the same name. Documents with identical values for
>> >> cmis:name or cmis:contentStreamFilename are allowed. Even if this
>> >> could be disabled (I don't know if it can or cannot), it is a business
>> >> requirement and I wouldn't be able to.
>> >>
>> >> The only thing I can think of to prevent this is to save the 'legacy'
>> >> ID of the document in a new CMIS property and somehow check that it
>> >> doesn't already exist when adding a new document. However this will be
>> >> very inefficient and slow down the migration (we're talking about up
>> >> to 2 million documents).
>> >>
>> >> Ideally the 'uniqueness constraint' would be checked on the server and
>> >> would throw an exception, which I could then deal with.
>> >>
>> >> Does anyone know of an easier way to do this, or is there anything I
>> >> can make use of in the CMIS spec to help?
>> >>
>> >> Thanks,
>> >>
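
For anyone reading this thread later, here is a minimal sketch in Java of two ideas raised above: hashing the content stream with MD5 (Sascha's suggestion) and keeping the keys of already-migrated documents in memory so each document is checked locally rather than by a repository search (Mike's suggestion). The class and method names (DedupSketch, md5Hex, firstTime) are hypothetical and are not part of OpenCMIS, Spring Batch, or FileNet. The set lives only in memory, so on its own it would not survive the JVM crash Tim mentions; it would have to be backed by something persistent, such as the database table Mike describes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not an OpenCMIS or FileNet API.
class DedupSketch {

    // Keys (content hashes or legacy IDs) already seen during this run.
    // A concurrent set is used on the assumption that batch steps may run in parallel.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Streams the content through MD5 and returns the digest as lowercase hex.
    static String md5Hex(InputStream content) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
        byte[] buf = new byte[8192];
        try {
            int n;
            while ((n = content.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true the first time a key is offered and false for any repeat.
    boolean firstTime(String key) {
        return seen.add(key);
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        byte[] doc = "same content".getBytes(StandardCharsets.UTF_8);

        String first = md5Hex(new ByteArrayInputStream(doc));
        String second = md5Hex(new ByteArrayInputStream(doc));

        System.out.println(dedup.firstTime(first));  // true: migrate this document
        System.out.println(dedup.firstTime(second)); // false: duplicate, skip or log it
    }
}
```

The check is a single set lookup per document, so it should cost far less than a repository query. The trade-off is the one Sascha points out: roughly two million keys have to fit in memory on the migration machine, and two documents that differ only by an embedded timestamp will hash differently.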