New submission from Martin <ma...@dtu.dk>: difflib.SequenceMatcher fails to make a proper alignment between 2 sequences with only 3 single letter changes. Its performance is completely off with a similarity ratio of 0.16, in stead of the more accurate 0.99.
Here is a snippet to replicate the failure: >>> aa_ref = >>> 'MTLFTTLLVLIFERLFKLGEHWQLDHRLEAFFRRVKHFSLGRTLGMTIIAMGVTFLLLRALQGVLFNVPTLLVWLLIGLLCIGAGKVRLHYHAYLTAASRNDSHARATMAGELTMIHGVPAGCDEREYLRELQNALLWINFRFYLAPLFWLIVGGTWGPVTLMGYAFLRAWQYWLARYQTPHHRLQSGIDAVLHVLDWVPVRLAGVVYALIGHGEKALPAWFASLGDFHTSQYQVLTRLAQFSLAREPHVDKVETPKAAVSMAKKTSFVVVVVIALLTIYGALV' >>> aa_seq = >>> 'MTLFTTLLVLIFERLFKLGEHWQLDHRLEAFFRRVKHFSLGRTLCMTIIAMGVTFLLLRALQGVLFNVPTLLVWLLIGLLCIGAGKVRLHYHAYLTAASRNDSHAHATMAGELTMIHGVPAGCDEREYLRELQNALLWINFRFYLAPLFWLIVGGTWGPVTLMGYAFLRAWQYWLARYQTPHHRLQSGIDAVLHALDWVPVRLAGVVYALIGHGEKALPAWFASLGDFHTSQYQVLTRLAQFSLAREPHVDKVETPKAAVSMAKKTSFVVVVVIALLTIYGALV' >>> sum(a!=b for a, b in zip(aa_ref, aa_seq)) 3 >>> match = SequenceMatcher(a=aa_ref, b=aa_seq) >>> match.ratio() 0.1619718309859155 >>> match.get_opcodes() [('equal', 0, 43, 0, 43), ('delete', 43, 79, 43, 43), ('equal', 79, 81, 43, 45), ('replace', 81, 122, 45, 80), ('equal', 122, 123, 80, 81), ('replace', 123, 284, 81, 284)] ---------- messages: 314163 nosy: mcft priority: normal severity: normal status: open title: SequenceMatcher bug type: behavior versions: Python 3.5 _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue33112> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com