I have to confess I haven't. I was assuming that it would slow things down too much and would really be a last resort.
I should at least try it out and see how it impacts performance before I dismiss it, though.

On Mon, Jun 23, 2014 at 6:23 PM, Ron DiFrango <rdifra...@captechconsulting.com> wrote:

> Tim,
>
> Just curious, have you tried the search-before-insert method to see
> what impact it has on performance?
>
> Thanks,
>
> Ron DiFrango
> Director / Architect | CapTech
> (804) 855-9196 | rdifra...@captechconsulting.com
>
> On 6/22/14, 2:42 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:
>
> >Hi,
> >
> >Thanks for the advice guys...:-)
> >
> >Unfortunately the target CMIS repository isn't my own implementation -
> >it's FileNet P8. The 'source' system is FileNet Content Services (not
> >sure the version - but it's non-CMIS compliant and about to become
> >unsupported by IBM - hence the migration).
> >
> >So...what that means is I can't really do anything server-side about this.
> >
> >Sascha raises an interesting option - I didn't realize I could set the
> >ObjectId myself. If I did that, and multiple documents had the same
> >ObjectId, surely the server would throw an exception, meaning I wouldn't
> >need to check if it already existed in the server?
> >
> >I could maybe throw some other bits into the hash computation (like
> >creation date or something) to ensure uniqueness...?
> >
> >The 'search before insert' is of course an option, but it would slow
> >everything down so much. For single-version documents I'm able to add 30
> >documents/second, which is the minimum requirement.
> >
> >On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
> >rdifra...@captechconsulting.com> wrote:
> >
> >> Tim,
> >>
> >> The suggestion below from Sascha is a good one. The other approach I've
> >> taken before is to perform a search in the repo for a given document and
> >> only if it does not exist would I insert it, otherwise perform an update
> >> or just log it as an "error".
> >>
> >> Thanks,
> >>
> >> Ron DiFrango
> >> Director / Architect | CapTech
> >>
> >> On 6/22/14, 5:37 AM, "Sascha Homeier" <shome...@meyle-mueller.de> wrote:
> >>
> >> >Hi Tim,
> >> >
> >> >you said you need to migrate the documents from FileNet to a CMIS
> >> >compliant server.
> >> >Is the CMIS compliant server your implementation?
> >> >If so you could calculate a hash like MD5 over the content stream and
> >> >set it as the object ID.
> >> >Due to the CMIS spec this object ID needs to be unique. So it must be
> >> >ensured that no two objects with the same object ID exist in the same
> >> >CMIS repository, which is equivalent to having two objects with the
> >> >same content stream.
> >> >This approach would also ensure that equal documents are not added in
> >> >the future after migration is done.
> >> >Nevertheless here you also need to find a performant way of determining
> >> >if an object with an ID already exists (and find a solution if the hash
> >> >is changed only by a timestamp inside the content stream etc.)
> >> >With about two million objects you maybe need to extend the RAM on the
> >> >migration machine to keep that many objects in memory and compare them
> >> >by using Hashmaps and Hashtables with own implementations of equals()
> >> >and hashCode() ;)
> >> >
> >> >Anyway a stimulating task. I'm curious about the ideas of others here
> >> >to solve it in a performant way ;)
> >> >
> >> >Cheers
> >> >Sascha
> >> >
> >> >-----Original Message-----
> >> >From: Tim Webster [mailto:tim.webs...@gmail.com]
> >> >Sent: Saturday, June 21, 2014 17:55
> >> >To: dev@chemistry.apache.org
> >> >Subject: Re: document 'uniqueness'
> >> >
> >> >Hello,
> >> >
> >> >yes thanks for the suggestion - it sort of does that already with the
> >> >Spring Batch progress tracking, but it still won't prevent another
> >> >document being added to the repository that is identical to a previous
> >> >one if it somehow failed - like a JVM crash or power failure. Because
> >> >there is no transaction management for the CMIS part, you can't really
> >> >ensure this, except for a constraint in the repository itself.
> >> >
> >> >Anyway, yeah I think you're right and I need to look at FileNet
> >> >specifically. I just wasn't sure if I missed something and there was
> >> >something in the CMIS spec that I could use (e.g. some property or
> >> >something).
> >> >
> >> >Thanks,
> >> >
> >> >Tim
> >> >
> >> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <mike.lu...@gwl.ca> wrote:
> >> >
> >> >> I'm sure you've already thought of this, but couldn't your migration
> >> >> process just persist the legacy ids in a separate location (e.g.
> >> >> database table, possibly cached in memory for performance)? Then you
> >> >> would just need to check that for each document being migrated, to
> >> >> make sure that the same doc hasn't been seen previously.
> >> >>
> >> >> Not a CMIS related solution, but seems like it would work fine...
> >> >>
> >> >> The other option, as you suggest, is to see if FileNet supports a
> >> >> 'uniqueness' constraint for custom metadata properties. I believe
> >> >> SharePoint does but not sure about FileNet.
> >> >>
> >> >> Thanks
> >> >> michael lucas | Senior Software Developer | Great-West Life
> >> >>
> >> >> -----Original Message-----
> >> >> From: Tim Webster [mailto:tim.webs...@gmail.com]
> >> >> Sent: June 20, 2014 8:15 AM
> >> >> To: dev@chemistry.apache.org
> >> >> Subject: document 'uniqueness'
> >> >>
> >> >> Hi,
> >> >>
> >> >> I am developing a migration process (using Spring Batch) to migrate
> >> >> documents from a legacy CMS into a CMIS-compliant system, and I need
> >> >> to ensure that duplicate documents are not created accidentally.
> >> >>
> >> >> However, our CMIS system (IBM FileNet) allows the addition of
> >> >> documents with the same name. Documents with identical values for
> >> >> cmis:name or cmis:contentStreamFilename are allowed. Even if this
> >> >> could be disabled (I don't know if it can or cannot), it is a
> >> >> business requirement and I wouldn't be able to.
> >> >>
> >> >> The only thing I can think of to prevent this is to save the 'legacy'
> >> >> ID of the document in a new CMIS property and somehow check that it
> >> >> doesn't already exist when adding a new document. However this will
> >> >> be very inefficient and slow down the migration (we're talking about
> >> >> up to 2 million documents).
> >> >>
> >> >> Ideally the 'uniqueness constraint' would be checked on the server
> >> >> and would throw an exception, which I could then deal with.
> >> >>
> >> >> Does anyone know of an easier way to do this, or is there anything I
> >> >> can make use of in the CMIS spec to help?
> >> >>
> >> >> Thanks,
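
For reference, the client-side check discussed in this thread (an MD5 over the content stream, optionally mixed with "other bits" such as a creation date, and compared against keys already seen) could look roughly like the sketch below. It uses only the Java standard library and makes no CMIS or FileNet calls. The names `DuplicateGuard`, `dedupKey`, and `markIfNew` are made up for illustration, and the set lives in memory only, so it would have to be persisted somewhere (e.g. a database table) to survive a JVM crash.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of OpenCMIS or FileNet.
final class DuplicateGuard {

    // Keys of documents already handled in this run (in memory only).
    private final Set<String> seenKeys = new HashSet<>();

    // MD5 over the extra bits (e.g. a creation date) followed by the
    // content stream, returned as lowercase hex.
    static String dedupKey(String extraBits, InputStream content) throws IOException {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        md5.update(extraBits.getBytes(StandardCharsets.UTF_8));
        byte[] buffer = new byte[8192];
        int read;
        // Stream the content so large documents are never fully in memory.
        while ((read = content.read(buffer)) != -1) {
            md5.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true the first time a key is seen, false for a repeat.
    boolean markIfNew(String extraBits, InputStream content) throws IOException {
        return seenKeys.add(dedupKey(extraBits, content));
    }

    public static void main(String[] args) throws IOException {
        DuplicateGuard guard = new DuplicateGuard();
        byte[] doc = "same bytes".getBytes(StandardCharsets.UTF_8);
        // First sighting, then a repeat, then same content with different extra bits.
        System.out.println(guard.markIfNew("2014-06-20", new ByteArrayInputStream(doc)));
        System.out.println(guard.markIfNew("2014-06-20", new ByteArrayInputStream(doc)));
        System.out.println(guard.markIfNew("2014-06-21", new ByteArrayInputStream(doc)));
    }
}
```

Holding one 32-character key per document for about two million documents should fit in memory on most migration machines, though that is an estimate and worth measuring. Whether the resulting key can actually be used as the object ID depends on the repository and is not something this sketch assumes.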