Re: AW: document 'uniqueness'

Ron DiFrango Wed, 25 Jun 2014 04:44:24 -0700

Tim,

Thanks for sharing!


Ron DiFrango       
Director / Architect  |  CapTech
(804) 855-9196  |  rdifra...@captechconsulting.com

On 6/24/14, 5:03 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:

>If anyone is still interested here's what I've since found out.
>
>Ron's suggestion of querying the repository for the existing document does
>impact performance, but not by as much as I thought it would.  With a
>batch of 50,000 documents, my rate of migration dropped from about 29
>docs/second to about 22 docs/second.
>
>You cannot set the objectId of an object because it's read-only, so
>instead I used the cmis:name property as a test.  I simply used the ID
>from the old FileNet (source) system as a unique identifier.  If the
>document already exists then I just throw an exception and tell Spring
>Batch to skip it.  Funnily enough, I wanted to do this more for
>catastrophic failures, but under load I found that FileNet can add the
>document and not manage to send a correct response back
>(CmisConnectionException, XML parser errors, etc.), and this led to
>attempts to migrate duplicates.  I am hammering it pretty hard though
>(15 concurrent threads), so I'm not expecting it to behave perfectly.
>
>I still don't know how things will play out once we get into the hundreds
>of thousands range, but at least I know this is a viable approach.
>
>Anyway, thought you'd like to know - thanks again for the help...:-)
>
>Tim
>
>
>
>On Mon, Jun 23, 2014 at 7:26 PM, Tim Webster <tim.webs...@gmail.com>
>wrote:
>
>> I have to confess I haven't.  I was assuming that it would slow it
>> down too much, and that it would really be a last resort.
>>
>> I should at least try it out and see how it impacts performance before I
>> dismiss it though.
>>
>>
>>
>>
>> On Mon, Jun 23, 2014 at 6:23 PM, Ron DiFrango <
>> rdifra...@captechconsulting.com> wrote:
>>
>>> Tim,
>>>
>>> Just curious: have you tried the search-before-insert method to see
>>> what impact it has on performance?
>>>
>>> Thanks,
>>>
>>> Ron DiFrango
>>> Director / Architect  |  CapTech
>>> (804) 855-9196  |  rdifra...@captechconsulting.com
>>>
>>>
>>>
>>>
>>>
>>> On 6/22/14, 2:42 PM, "Tim Webster" <tim.webs...@gmail.com> wrote:
>>>
>>> >Hi,
>>> >
>>> >Thanks for the advice guys...:-)
>>> >
>>> >Unfortunately the target CMIS repository isn't my own implementation -
>>> >it's FileNet P8.  The 'source' system is FileNet Content Services (not
>>> >sure of the version - but it's non-CMIS compliant and about to become
>>> >unsupported by IBM - hence the migration).
>>> >
>>> >So...what that means is I can't really do anything server-side about
>>> >this.
>>> >
>>> >Sascha raises an interesting option - I didn't realize I could set the
>>> >ObjectId myself.  If I did that, and multiple documents had the same
>>> >ObjectId, surely the server would throw an exception, meaning I
>>> >wouldn't need to check if it already existed in the server?
>>> >
>>> >I could maybe throw some other bits into the hash computation (like
>>> >creation date or something) to ensure uniqueness...?
>>> >
>>> >The 'search before insert' is of course an option, but it would slow
>>> >everything down so much.  For single-version documents I'm able to
>>> >add 30 documents/second, which is the minimum requirement.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
>>> >rdifra...@captechconsulting.com> wrote:
>>> >
>>> >> Tim,
>>> >>
>>> >> The suggestion below from Sascha is a good one.  The other approach
>>> >> I've taken before is to search the repo for a given document and
>>> >> insert it only if it does not exist; otherwise, perform an update
>>> >> or just log it as an "error".
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Ron DiFrango
>>> >> Director / Architect  |  CapTech
>>> >>
>>> >>
>>> >>
>>> >> On 6/22/14, 5:37 AM, "Sascha Homeier" <shome...@meyle-mueller.de>
>>> >> wrote:
>>> >>
>>> >> >Hi Tim,
>>> >> >
>>> >> >You said you need to migrate the documents from FileNet to a
>>> >> >CMIS-compliant server.  Is the CMIS-compliant server your own
>>> >> >implementation?  If so, you could calculate a hash such as MD5 over
>>> >> >the content stream and set it as the object ID.
>>> >> >
>>> >> >According to the CMIS spec, this object ID needs to be unique.  It
>>> >> >must therefore be ensured that no two objects with the same object
>>> >> >ID exist in the same CMIS repository, which here is equivalent to
>>> >> >having no two objects with the same content stream.  This approach
>>> >> >would also prevent identical documents from being added after the
>>> >> >migration is done.
>>> >> >
>>> >> >Nevertheless, you still need a performant way of determining
>>> >> >whether an object with a given ID already exists (and a solution
>>> >> >for the case where the hash changes only because of a timestamp
>>> >> >inside the content stream, etc.).  With about two million objects
>>> >> >you may need to extend the RAM on the migration machine to keep
>>> >> >that many objects in memory and compare them using HashMaps and
>>> >> >Hashtables with your own implementations of equals() and
>>> >> >hashCode() ;)
>>> >> >
>>> >> >Anyway, a stimulating task.  I'm curious about the ideas others
>>> >> >here have for solving it in a performant way ;)
>>> >> >
>>> >> >Cheers
>>> >> >Sascha
>>> >> >
>>> >> >-----Original Message-----
>>> >> >From: Tim Webster [mailto:tim.webs...@gmail.com]
>>> >> >Sent: Saturday, 21 June 2014 17:55
>>> >> >To: dev@chemistry.apache.org
>>> >> >Subject: Re: document 'uniqueness'
>>> >> >
>>> >> >Hello,
>>> >> >
>>> >> >Yes, thanks for the suggestion - it sort of does that already with
>>> >> >the Spring Batch progress tracking, but it still won't prevent
>>> >> >another document being added to the repository that is identical to
>>> >> >a previous one if it somehow failed - like a JVM crash or power
>>> >> >failure.  Because there is no transaction management for the CMIS
>>> >> >part, you can't really ensure this, except for a constraint in the
>>> >> >repository itself.
>>> >> >
>>> >> >Anyway, yeah I think you're right and I need to look at FileNet
>>> >> >specifically.  I just wasn't sure if I missed something and there
>>> >> >was something in the CMIS spec that I could use (e.g. some property
>>> >> >or something).
>>> >> >
>>> >> >Thanks,
>>> >> >
>>> >> >Tim
>>> >> >
>>> >> >
>>> >> >
>>> >> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <mike.lu...@gwl.ca>
>>> >> >wrote:
>>> >> >
>>> >> >> I'm sure you've already thought of this, but couldn't your
>>> >> >> migration process just persist the legacy IDs in a separate
>>> >> >> location (e.g. database table, possibly cached in memory for
>>> >> >> performance)? Then you would just need to check that for each
>>> >> >> document being migrated, to make sure that the same doc hasn't
>>> >> >> been seen previously.
>>> >> >>
>>> >> >> Not a CMIS related solution, but seems like it would work fine...
>>> >> >>
>>> >> >> The other option, as you suggest, is to see if FileNet supports a
>>> >> >> 'uniqueness' constraint for custom metadata properties. I believe
>>> >> >> SharePoint does, but I'm not sure about FileNet.
>>> >> >>
>>> >> >> Thanks
>>> >> >> michael lucas  |  Senior Software Developer  |  Great-West Life
>>> >> >>
>>> >> >>
>>> >> >> -----Original Message-----
>>> >> >> From: Tim Webster [mailto:tim.webs...@gmail.com]
>>> >> >> Sent: June 20, 2014 8:15 AM
>>> >> >> To: dev@chemistry.apache.org
>>> >> >> Subject: document 'uniqueness'
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> I am developing a migration process (using Spring Batch) to
>>> >> >> migrate documents from a legacy CMS into a CMIS-compliant system,
>>> >> >> and I need to ensure that duplicate documents are not created
>>> >> >> accidentally.
>>> >> >>
>>> >> >> However, our CMIS system (IBM FileNet) allows the addition of
>>> >> >> documents with the same name.  Documents with identical values
>>> >> >> for cmis:name or cmis:contentStreamFileName are allowed.  Even if
>>> >> >> this could be disabled (I don't know if it can or cannot), it is
>>> >> >> a business requirement and I wouldn't be able to disable it.
>>> >> >>
>>> >> >> The only thing I can think of to prevent this is to save the
>>> >> >> 'legacy' ID of the document in a new CMIS property and somehow
>>> >> >> check that it doesn't already exist when adding a new document.
>>> >> >> However, this will be very inefficient and slow down the
>>> >> >> migration (we're talking about up to 2 million documents).
>>> >> >>
>>> >> >> Ideally the 'uniqueness constraint' would be checked on the
>>> >> >> server and would throw an exception, which I could then deal with.
>>> >> >>
>>> >> >> Does anyone know of an easier way to do this, or is there
>>> >> >> anything I can make use of in the CMIS spec to help?
>>> >> >>
>>> >> >> Thanks,
>>> >> >>
>>> >>
>>> >>
>>>
>>>
>>
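
The search-before-insert check that Tim settles on in the thread above can be sketched roughly as follows. This is a minimal, self-contained illustration and not his actual code: TargetRepository, InMemoryRepository, MigrationWriter and DuplicateDocumentException are hypothetical names, and the in-memory map only stands in for a lookup on cmis:name against the real repository.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Thrown when a document with the same legacy ID is already in the target. */
class DuplicateDocumentException extends RuntimeException {
    DuplicateDocumentException(String legacyId) {
        super("Document already migrated: " + legacyId);
    }
}

/** Hypothetical stand-in for the target repository; not an OpenCMIS type. */
interface TargetRepository {
    boolean existsByName(String name);        // assumed: a query on cmis:name
    void create(String name, byte[] content); // assumed: a document create call
}

/** In-memory implementation so the sketch runs without a CMIS server. */
class InMemoryRepository implements TargetRepository {
    private final Map<String, byte[]> docs = new ConcurrentHashMap<>();

    public boolean existsByName(String name) {
        return docs.containsKey(name);
    }

    public void create(String name, byte[] content) {
        docs.put(name, content);
    }

    int size() {
        return docs.size();
    }
}

class MigrationWriter {
    private final TargetRepository repo;

    MigrationWriter(TargetRepository repo) {
        this.repo = repo;
    }

    /** Stores the document under its legacy ID, or throws if it is already there. */
    void migrate(String legacyId, byte[] content) {
        if (repo.existsByName(legacyId)) {
            throw new DuplicateDocumentException(legacyId);
        }
        repo.create(legacyId, content);
    }
}

class MigrationDemo {
    public static void main(String[] args) {
        InMemoryRepository repo = new InMemoryRepository();
        MigrationWriter writer = new MigrationWriter(repo);
        byte[] content = {1, 2, 3};
        writer.migrate("FN-0001", content);
        try {
            // Simulates a retry after the server stored the document
            // but the response was lost.
            writer.migrate("FN-0001", content);
        } catch (DuplicateDocumentException e) {
            System.out.println("Skipped: " + e.getMessage());
        }
    }
}
```

In a Spring Batch step, an exception type like this would typically be registered as skippable on a fault-tolerant step so that the item is skipped instead of failing the job. Note that the existence check and the create are two separate calls, so the sketch is not atomic: it relies on each legacy ID being handled by only one worker thread at a time.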

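Sascha's suggestion of hashing the content stream could look something like the sketch below. It uses only the JDK's java.security.MessageDigest; the class and method names (ContentHash, md5Hex) are made up for illustration. As Tim notes in the thread, the object ID is read-only on FileNet, so the digest would have to be stored in some other property there.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentHash {
    /** Reads the stream to the end and returns its MD5 digest as lowercase hex. */
    static String md5Hex(InputStream in) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
        byte[] buf = new byte[8192];
        try {
            int n;
            // Feed the digest in chunks so large documents are not held in memory.
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

class ContentHashDemo {
    public static void main(String[] args) {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(ContentHash.md5Hex(new ByteArrayInputStream(content)));
    }
}
```

Two streams with the same bytes give the same digest, which is what makes it usable as a duplicate key. As Sascha points out, a timestamp embedded in the content would change the digest even when the documents are otherwise identical.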