If anyone is still interested, here's what I've since found out.

Ron's suggestion of querying the repository for the existing document does
impact performance, but not by as much as I thought it would.  With a batch
of 50,000 documents, my rate of migration dropped from about 29 docs/second
to about 22 docs/second.

You cannot set the objectId of an object (it's read-only), so instead I
used the cmis:name property as the test.  I simply used the ID from the old
FileNet (source) system as a unique identifier.  If the document already
exists then I just throw an exception and tell Spring Batch to skip it.
Funnily enough, I wanted to do this more for catastrophic failures, but
under load I found that FileNet can add the document and then fail to send
a correct response back (CmisConnectionException, XML parser errors, etc.),
and this led to attempts to migrate duplicates.  I am hammering it pretty
hard though (15 concurrent threads), so I'm not expecting it to behave
perfectly.
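
For anyone following along, the check-then-skip logic described above could
be sketched roughly like this.  It is only an illustration, not the actual
migration code: the TargetRepository interface, its existsByName and create
methods, DuplicateDocumentException and the "FN-0001" ID are all
hypothetical names, and an in-memory set stands in for FileNet.  In a real
job the lookup would presumably be a CMIS query on cmis:name, and the
exception type would be one the Spring Batch step is configured to skip.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: every type and method name below is hypothetical and is
// NOT part of the OpenCMIS or Spring Batch APIs.
final class DuplicateGuardSketch {

    /** Thrown when the document is already in the target; the batch job would skip on it. */
    static final class DuplicateDocumentException extends RuntimeException {
        DuplicateDocumentException(String legacyId) {
            super("Document already migrated: " + legacyId);
        }
    }

    /** Stand-in for the target repository (assumed interface). */
    interface TargetRepository {
        boolean existsByName(String name);        // would be a query on cmis:name
        void create(String name, byte[] content); // would create the document
    }

    /** In-memory stand-in so the sketch runs without a server. */
    static final class InMemoryRepository implements TargetRepository {
        private final Set<String> names = ConcurrentHashMap.newKeySet();

        @Override
        public boolean existsByName(String name) {
            return names.contains(name);
        }

        @Override
        public void create(String name, byte[] content) {
            names.add(name);
        }

        int size() {
            return names.size();
        }
    }

    /** Search before insert, using the legacy ID as the document name. */
    static void migrate(TargetRepository repo, String legacyId, byte[] content) {
        if (repo.existsByName(legacyId)) {
            throw new DuplicateDocumentException(legacyId);
        }
        repo.create(legacyId, content);
    }

    public static void main(String[] args) {
        InMemoryRepository repo = new InMemoryRepository();
        byte[] content = {1, 2, 3};

        migrate(repo, "FN-0001", content); // "FN-0001" is a made-up legacy ID
        try {
            // Simulates a retry after the server stored the document
            // but the response never made it back to the client.
            migrate(repo, "FN-0001", content);
        } catch (DuplicateDocumentException e) {
            System.out.println("skipped: " + e.getMessage());
        }
        System.out.println("documents in target: " + repo.size());
    }
}
```

The second call turns the retry into a skip instead of a duplicate.  Note
that the check and the create are two separate calls, so this guards
against retries of the same document, not against two threads submitting
the same ID at the same instant.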

I still don't know how things will play out once we get into the hundreds
of thousands range, but at least I know this is a viable approach.

Anyway, thought you'd like to know - thanks again for the help...:-)

Tim



On Mon, Jun 23, 2014 at 7:26 PM, Tim Webster <tim.webs...@gmail.com> wrote:

> I have to confess I haven't.  I was assuming that it would slow things
> down too much, so it would really be a last resort.
>
> I should at least try it out and see how it impacts performance before I
> dismiss it though.
>
>
>
>
> On Mon, Jun 23, 2014 at 6:23 PM, Ron DiFrango <
> rdifra...@captechconsulting.com> wrote:
>
>> Tim,
>>
>> Just curious: have you tried the search-before-insert method to see
>> what impact it has on performance?
>>
>> Thanks,
>>
>> Ron DiFrango
>> Director / Architect  |  CapTech
>> (804) 855-9196  |  rdifra...@captechconsulting.com
>>
>>
>>
>>
>>
>> On 6/22/14, 2:42 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:
>>
>> >Hi,
>> >
>> >Thanks for the advice guys...:-)
>> >
>> >Unfortunately the target CMIS repository isn't my own implementation -
>> >it's FileNet P8.  The 'source' system is FileNet Content Services (not
>> >sure of the version - but it's non-CMIS compliant and about to become
>> >unsupported by IBM - hence the migration).
>> >
>> >So...what that means is I can't really do anything server-side about
>> >this.
>> >
>> >Sascha raises an interesting option - I didn't realize I could set the
>> >ObjectId myself.  If I did that, and multiple documents had the same
>> >ObjectId, surely the server would throw an exception, meaning I wouldn't
>> >need to check if it already existed in the server?
>> >
>> >I could maybe throw some other bits into the hash computation (like
>> >creation date or something) to ensure uniqueness...?
>> >
>> >The 'search before insert' is of course an option, but it would slow
>> >everything down so much.  For single-version documents I'm able to add 30
>> >documents/second, which is the minimum requirement.
>> >
>> >
>> >
>> >
>> >
>> >
>> >On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
>> >rdifra...@captechconsulting.com> wrote:
>> >
>> >> Tim,
>> >>
>> >> The suggestion below from Sascha is a good one.  The other approach
>> >> I've taken before is to perform a search in the repo for a given
>> >> document and only if it does not exist would I insert it, otherwise
>> >> perform an update or just log it as an "error".
>> >>
>> >> Thanks,
>> >>
>> >> Ron DiFrango
>> >> Director / Architect  |  CapTech
>> >>
>> >>
>> >>
>> >> On 6/22/14, 5:37 AM, "Sascha Homeier" <shome...@meyle-mueller.de>
>> >> wrote:
>> >>
>> >> >Hi Tim,
>> >> >
>> >> >you said you need to migrate the documents from FileNet to a CMIS
>> >> >compliant server.
>> >> >Is the CMIS compliant server your implementation?
>> >> >If so you could calculate a hash like MD5 over the content stream and
>> >> >set it as the object ID.
>> >> >According to the CMIS spec this object ID needs to be unique.  So it
>> >> >must be ensured that no two objects with the same object ID exist in
>> >> >the same CMIS repository, which is equivalent to having two objects
>> >> >with the same content stream.
>> >> >This approach would also ensure that equal documents are not added in
>> >> >the future, after the migration is done.
>> >> >Nevertheless, here you also need to find a performant way of
>> >> >determining whether an object with a given ID already exists (and find
>> >> >a solution if the hash is changed only by a timestamp inside the
>> >> >content stream etc.).
>> >> >With about two million objects you may need to extend the RAM on the
>> >> >migration machine to keep that many objects in memory and compare
>> >> >them using HashMaps and Hashtables with your own implementations of
>> >> >equals() and hashCode() ;)
>> >> >
>> >> >Anyway a stimulating task. I'm curious about the ideas of others here
>> >> >to solve it in a performant way ;)
>> >> >
>> >> >Cheers
>> >> >Sascha
>> >> >
>> >> >-----Original Message-----
>> >> >From: Tim Webster [mailto:tim.webs...@gmail.com]
>> >> >Sent: Saturday, June 21, 2014 17:55
>> >> >To: dev@chemistry.apache.org
>> >> >Subject: Re: document 'uniqueness'
>> >> >
>> >> >Hello,
>> >> >
>> >> >yes thanks for the suggestion - it sort of does that already with the
>> >> >Spring Batch progress tracking, but it still won't prevent another
>> >> >document being added to the repository that is identical to a previous
>> >> >one if it somehow failed - like a JVM crash or power failure.  Because
>> >> >there is no transaction management for the CMIS part, you can't really
>> >> >ensure this, except for a constraint in the repository itself.
>> >> >
>> >> >Anyway, yeah I think you're right and I need to look at FileNet
>> >> >specifically.  I just wasn't sure if I missed something and there was
>> >> >something in the CMIS spec that I could use (e.g. some property or
>> >> >something).
>> >> >
>> >> >Thanks,
>> >> >
>> >> >Tim
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <mike.lu...@gwl.ca>
>> >> >wrote:
>> >> >
>> >> >> I'm sure you've already thought of this, but couldn't your migration
>> >> >> process just persist the legacy ids in a separate location (e.g.
>> >> >> database table, possibly cached in memory for performance)? Then you
>> >> >> would just need to check that for each document being migrated, to
>> >> >> make sure that the same doc hasn't been seen previously.
>> >> >>
>> >> >> Not a CMIS related solution, but seems like it would work fine...
>> >> >>
>> >> >> The other option, as you suggest, is to see if FileNet supports a
>> >> >> 'uniqueness' constraint for custom metadata properties. I believe
>> >> >> SharePoint does but not sure about FileNet.
>> >> >>
>> >> >> Thanks
>> >> >> michael lucas  |  Senior Software Developer  |  Great-West Life
>> >> >>
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Tim Webster [mailto:tim.webs...@gmail.com]
>> >> >> Sent: June 20, 2014 8:15 AM
>> >> >> To: dev@chemistry.apache.org
>> >> >> Subject: document 'uniqueness'
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I am developing a migration process (using Spring Batch) to migrate
>> >> >> documents from a legacy CMS into a CMIS-compliant system, and I need
>> >> >> to ensure that duplicate documents are not created accidentally.
>> >> >>
>> >> >> However, our CMIS system (IBM FileNet) allows the addition of
>> >> >> documents with the same name.  Documents with identical values for
>> >> >> cmis:name or cmis:contentStreamFileName are allowed.  Even if this
>> >> >> could be disabled (I don't know if it can or cannot), it is a
>> >> >> business requirement and I wouldn't be able to.
>> >> >>
>> >> >> The only thing I can think of to prevent this is to save the
>> >> >> 'legacy' ID of the document in a new CMIS property and somehow check
>> >> >> that it doesn't already exist when adding a new document. However
>> >> >> this will be very inefficient and slow down the migration (we're
>> >> >> talking about up to 2 million documents).
>> >> >>
>> >> >> Ideally the 'uniqueness constraint' would be checked on the server
>> >> >> and would throw an exception, which I could then deal with.
>> >> >>
>> >> >> Does anyone know of an easier way to do this, or is there anything I
>> >> >> can make use of in the CMIS spec to help?
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >>
>> >>
>>
>>
>
