Tim,

Thanks for sharing!

Ron DiFrango
Director / Architect | CapTech
(804) 855-9196 | rdifra...@captechconsulting.com

On 6/24/14, 5:03 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:

>If anyone is still interested here's what I've since found out.
>
>Ron's suggestion of querying the repository for the existing document does
>impact performance, but not by as much as I thought it would. With a batch
>of 50,000 documents, my rate of migration dropped from about 29 docs/second
>to about 22 docs/second.
>
>You cannot set the objectId of an object, it's read-only, so instead I used
>the cmis:name property as a test. I simply used the ID from the old
>FileNet (source) system as a unique identifier. If the document already
>exists then I just throw an exception and tell Spring Batch to skip it.
>Funnily enough, I wanted to do this more for catastrophic failures, but
>under load I found that FileNet can add the document and not manage to send
>a correct response back (CmisConnectionException, XML parser errors, etc)
>and this led to attempts to migrate duplicates. I am hammering it pretty
>hard though (15 concurrent threads) so I'm not expecting it to behave
>perfectly.
>
>I still don't know how things will play out once we get into the hundreds
>of thousands range, but at least I know this is a viable approach.
>
>Anyway, thought you'd like to know - thanks again for the help...:-)
>
>Tim
>
>On Mon, Jun 23, 2014 at 7:26 PM, Tim Webster <tim.webs...@gmail.com> wrote:
>
>> I have to confess I haven't, I was making an assumption that it would slow
>> it down too much, and it would really be a last resort.
>>
>> I should at least try it out and see how it impacts performance before I
>> dismiss it though.
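
A rough Java sketch of the cmis:name check described above (query for the legacy ID, throw if it is already there). This is only an illustration under stated assumptions: `DuplicateCheck`, `DuplicateDocumentException` and the `hasResults` hook are hypothetical names, and the hook stands in for whatever actually runs the query against the repository.

```java
import java.util.function.Predicate;

/** Hypothetical exception: a document with this legacy ID is already in the target repository. */
class DuplicateDocumentException extends RuntimeException {
    DuplicateDocumentException(String legacyId) {
        super("Document already migrated: " + legacyId);
    }
}

/** Hypothetical helper for a search-before-insert check keyed on cmis:name. */
class DuplicateCheck {

    /** Builds a CMIS query for documents whose cmis:name equals the legacy ID. */
    static String buildQuery(String legacyId) {
        // CMIS query string literals escape backslash and single quote with a backslash.
        String escaped = legacyId.replace("\\", "\\\\").replace("'", "\\'");
        return "SELECT cmis:objectId FROM cmis:document WHERE cmis:name = '" + escaped + "'";
    }

    /**
     * Throws if the repository already holds the document.
     * hasResults is an assumed hook: it receives the query text and
     * reports whether the repository returned at least one row.
     */
    static void assertNotMigrated(String legacyId, Predicate<String> hasResults) {
        if (hasResults.test(buildQuery(legacyId))) {
            throw new DuplicateDocumentException(legacyId);
        }
    }
}
```

With OpenCMIS the hook could wrap `Session.query(statement, false)` and test whether the result iterator has any element, and the exception type could be registered as skippable on a fault-tolerant Spring Batch step. Both are assumptions about the setup, not something the thread spells out.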
>>
>> On Mon, Jun 23, 2014 at 6:23 PM, Ron DiFrango <rdifra...@captechconsulting.com> wrote:
>>
>>> Tim,
>>>
>>> Just curious have you tried the search before you insert method to see
>>> what impact it has on performance?
>>>
>>> Thanks,
>>>
>>> Ron DiFrango
>>> Director / Architect | CapTech
>>> (804) 855-9196 | rdifra...@captechconsulting.com
>>>
>>> On 6/22/14, 2:42 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:
>>>
>>> >Hi,
>>> >
>>> >Thanks for the advice guys...:-)
>>> >
>>> >Unfortunately the target CMIS repository isn't my own implementation -
>>> >it's FileNet P8. The 'source' system is FileNet Content Services (not sure
>>> >the version - but it's non-CMIS compliant and about to become unsupported
>>> >by IBM - hence the migration).
>>> >
>>> >So...what that means is I can't really do anything server-side about
>>> >this.
>>> >
>>> >Sascha raises an interesting option - I didn't realize I could set the
>>> >ObjectId myself. If I did that, and multiple documents had the same
>>> >ObjectId, surely the server would throw an exception, meaning I wouldn't
>>> >need to check if it already existed in the server?
>>> >
>>> >I could maybe throw some other bits into the hash computation (like
>>> >creation date or something) to ensure uniqueness...?
>>> >
>>> >The 'search before insert' is of course an option, but it would slow
>>> >everything down so much. For single-version documents I'm able to add 30
>>> >documents/second, which is the minimum requirement.
>>> >
>>> >On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <rdifra...@captechconsulting.com> wrote:
>>> >
>>> >> Tim,
>>> >>
>>> >> The suggestion below from Sascha is a good one.
>>> >> The other approach I've taken before is to perform a search in the repo
>>> >> for a given document and only if it does not exist would I insert it,
>>> >> otherwise perform an update or just log it as an "error".
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Ron DiFrango
>>> >> Director / Architect | CapTech
>>> >>
>>> >> On 6/22/14, 5:37 AM, "Sascha Homeier" <shome...@meyle-mueller.de> wrote:
>>> >>
>>> >> >Hi Tim,
>>> >> >
>>> >> >you said you need to migrate the documents from FileNet to a CMIS
>>> >> >compliant server.
>>> >> >Is the CMIS compliant server your implementation?
>>> >> >If so you could calculate a hash like MD5 over the content stream and
>>> >> >set it as the object ID.
>>> >> >According to the CMIS spec this object ID needs to be unique. So it must
>>> >> >be ensured that no two objects with the same object ID exist in the same
>>> >> >CMIS repository, which is equivalent to having two objects with the same
>>> >> >content stream.
>>> >> >This approach would also ensure that equal documents are not added in
>>> >> >the future after migration is done.
>>> >> >Nevertheless here you also need to find a performant way of determining
>>> >> >if an object with an ID already exists (and find a solution if the hash
>>> >> >is changed only by a timestamp inside the content stream etc.)
>>> >> >With about two million objects you may need to extend the RAM on the
>>> >> >migration machine to keep that many objects in memory and compare them
>>> >> >using HashMaps and Hashtables with your own implementations of equals()
>>> >> >and hashCode() ;)
>>> >> >
>>> >> >Anyway a stimulating task. I'm curious about the ideas of others here
>>> >> >to solve it in a performant way ;)
>>> >> >
>>> >> >Cheers
>>> >> >Sascha
>>> >> >
>>> >> >-----Original Message-----
>>> >> >From: Tim Webster [mailto:tim.webs...@gmail.com]
>>> >> >Sent: Saturday, 21 June 2014 17:55
>>> >> >To: dev@chemistry.apache.org
>>> >> >Subject: Re: document 'uniqueness'
>>> >> >
>>> >> >Hello,
>>> >> >
>>> >> >yes thanks for the suggestion - it sort of does that already with the
>>> >> >Spring Batch progress tracking, but it still won't prevent another
>>> >> >document being added to the repository that is identical to a previous
>>> >> >one if it somehow failed - like a JVM crash or power failure. Because
>>> >> >there is no transaction management for the CMIS part, you can't really
>>> >> >ensure this, except for a constraint in the repository itself.
>>> >> >
>>> >> >Anyway, yeah I think you're right and I need to look at FileNet
>>> >> >specifically. I just wasn't sure if I missed something and there was
>>> >> >something in the CMIS spec that I could use (e.g. some property or
>>> >> >something).
>>> >> >
>>> >> >Thanks,
>>> >> >
>>> >> >Tim
>>> >> >
>>> >> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <mike.lu...@gwl.ca> wrote:
>>> >> >
>>> >> >> I'm sure you've already thought of this, but couldn't your migration
>>> >> >> process just persist the legacy ids in a separate location (e.g.
>>> >> >> database table, possibly cached in memory for performance)? Then you
>>> >> >> would just need to check that for each document being migrated, to
>>> >> >> make sure that the same doc hasn't been seen previously.
>>> >> >>
>>> >> >> Not a CMIS related solution, but seems like it would work fine...
>>> >> >>
>>> >> >> The other option, as you suggest, is to see if FileNet supports a
>>> >> >> 'uniqueness' constraint for custom metadata properties. I believe
>>> >> >> Sharepoint does but not sure about FileNet.
>>> >> >>
>>> >> >> Thanks
>>> >> >> michael lucas | Senior Software Developer | Great-West Life
>>> >> >>
>>> >> >> -----Original Message-----
>>> >> >> From: Tim Webster [mailto:tim.webs...@gmail.com]
>>> >> >> Sent: June 20, 2014 8:15 AM
>>> >> >> To: dev@chemistry.apache.org
>>> >> >> Subject: document 'uniqueness'
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> I am developing a migration process (using Spring Batch) to migrate
>>> >> >> documents from a legacy CMS into a CMIS-compliant system, and I need
>>> >> >> to ensure that duplicate documents are not created accidentally.
>>> >> >>
>>> >> >> However, our CMIS system (IBM FileNet) allows the addition of
>>> >> >> documents with the same name. Documents with identical values for
>>> >> >> cmis:name or cmis:contentStreamFilename are allowed. Even if this
>>> >> >> could be disabled (I don't know if it can or cannot), it is a business
>>> >> >> requirement and I wouldn't be able to.
>>> >> >>
>>> >> >> The only thing I can think of to prevent this is to save the 'legacy'
>>> >> >> ID of the document in a new CMIS property and somehow check that it
>>> >> >> doesn't already exist when adding a new document. However this will be
>>> >> >> very inefficient and slow down the migration (we're talking about up
>>> >> >> to 2 million documents).
>>> >> >>
>>> >> >> Ideally the 'uniqueness constraint' would be checked on the server and
>>> >> >> would throw an exception, which I could then deal with.
>>> >> >>
>>> >> >> Does anyone know of an easier way to do this, or is there anything I
>>> >> >> can make use of in the CMIS spec to help?
>>> >> >>
>>> >> >> Thanks,
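
For the idea raised in this thread of recording legacy IDs outside the repository and caching them in memory, here is a minimal Java sketch. It assumes the legacy ID is available as a string; the class name `LegacyIdRegistry` and its methods are hypothetical and not part of CMIS, OpenCMIS or Spring Batch. It covers only the in-memory part, so surviving a JVM crash or power failure would still need a persistent store such as a database table.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory record of legacy IDs already handled in this migration run. */
class LegacyIdRegistry {

    // Concurrent set, so several writer threads can claim IDs without extra locking.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller to claim this ID; later callers get false. */
    boolean claim(String legacyId) {
        return seen.add(legacyId);
    }

    /** Number of distinct legacy IDs claimed so far. */
    int size() {
        return seen.size();
    }
}
```

A writer thread would call `claim(legacyId)` before creating the document and skip the item when it returns false. Because the check and the insert into the set are a single atomic step, two threads handed the same legacy ID cannot both proceed.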