Hi Michael,

> Processing LDIF is one thing, doing LDAP operations another.
>
> LDIF itself is meant to be ASCII-clean. But each attribute value can carry any
> byte sequence (e.g. attribute 'jpegPhoto'). There's no further processing by
> module LDIF - it simply returns byte sequences.
>
> The access protocol LDAPv3 mandates UTF-8 encoding for Unicode strings on the
> wire if attribute syntax is DirectoryString, IA5String (mainly ASCII) or
> similar.
>
> So if you're LDIF input returns UTF-16 encoded attribute values for e.g.
> attribute 'cn' or 'o' or another attribute not being of OctetString or Binary
> syntax something's wrong with the producer of the LDIF data.

That could be. I am using Microsoft's ldifde.exe to dump a Domino and an AD
directory for comparative processing. The problem is I don't have much control
over the data in the directory, and I do know that DNs have non-ASCII
characters unique to the

> I wonder what the string really is. At least the base64-encoding you provided
> before decodes as UTF-8 but I'm not sure whether it's the right sequence of
> Unicode code points you're expecting.
>
> >>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
> u'det\\3310wbb\\pg'
>
> I still can't figure out what you're really doing though. I'd recommend to
> strip down your operations to a very simple test code snippet illustrating the
> issue and post that here.

So I have removed all my likely broken attempts at working with this data and
will soon have some simple code, but at this point I may have an indication of
what is awry with my data. After parsing the data for a user, I am simply
taking a value from the LDIF file and writing it back out to another file,
which fails. The value parsed is:

officestreetaddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==

and writing it back out gives:

  File "C:\Python27\lib\site-packages\ldif.py", line 202, in unparse
    self._unparseChangeRecord(record)
  File "C:\Python27\lib\site-packages\ldif.py", line 181, in _unparseChangeRecord
    self._unparseAttrTypeandValue(mod_type,mod_val)
  File "C:\Python27\lib\site-packages\ldif.py", line 142, in _unparseAttrTypeandValue
    self._unfoldLDIFLine(':: '.join([attr_type,base64.encodestring(attr_value).replace('\n','')]))
  File "C:\Python27\lib\base64.py", line 315, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 7: ordinal not in range(128)

> c:\python27\lib\base64.py(315)encodestring()
-> pieces.append(binascii.b2a_base64(chunk))
(Pdb) l
310     def encodestring(s):
311         """Encode a string into multiple lines of base-64 data."""
312         pieces = []
313         for i in range(0, len(s), MAXBINSIZE):
314             chunk = s[i : i + MAXBINSIZE]
315  ->         pieces.append(binascii.b2a_base64(chunk))
316         return "".join(pieces)
317
318
319     def decodestring(s):
320         """Decode a string."""
(Pdb) args
s = Otto-Meßmer-Straße 1

So moving up a frame or two and looking at the entry dict, I see a modlist
entry of:

('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1'])

which is correct:

In [2]: 'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='.decode('base64').decode('utf-8')
Out[2]: u'Otto-Me\xdfmer-Stra\xdfe 1'

Looking at the stack trace, I think I see the issue: the value reaches
base64.encodestring() as a unicode object, so it gets implicitly encoded as
ASCII, which fails on u'\xdf'. Encoding it to UTF-8 first works:

(Pdb) import base64
(Pdb) base64.encodestring(u'Otto-Me\xdfmer-Stra\xdfe 1'.encode('utf-8')).replace('\n','')
'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='

I now have exactly the value I started with. Two changes have eliminated all
the errors: making sure that wherever I handle the original values I return
UTF-8-decoded objects for use in a modlist to write out later, and subclassing
LDIFWriter and overriding _unparseAttrTypeandValue to do the encoding.

What finally remains is ldifde.exe's output of what looks like U+00BF, an
inverted question mark, for some values. Otherwise this issue looks solved.

Thanks for everything,
jlc
--
http://mail.python.org/mailman/listinfo/python-list
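
For reference, the encoding step discussed above can be sketched on its own, without ldif.py. This is only an illustration of "encode text to UTF-8 bytes before base64", not the actual LDIFWriter subclass from this thread. The helper names encode_ldif_value and decode_ldif_value are hypothetical and do not exist in ldif.py. The sketch uses base64.b64encode instead of encodestring so that it should run on both Python 2.7 and Python 3.

```python
import base64


def encode_ldif_value(value, encoding="utf-8"):
    """Return the base64 text for the right-hand side of an 'attr:: ' line.

    Hypothetical helper: text is encoded to bytes first, so base64 never
    sees a unicode object and never falls back to the ASCII codec.
    """
    if not isinstance(value, bytes):
        value = value.encode(encoding)
    # b64encode returns bytes without embedded newlines.
    return base64.b64encode(value).decode("ascii")


def decode_ldif_value(b64_text, encoding="utf-8"):
    """Inverse of encode_ldif_value: base64 text back to a text object."""
    return base64.b64decode(b64_text).decode(encoding)


# The value from the modlist entry quoted in this thread.
street = u"Otto-Me\xdfmer-Stra\xdfe 1"
encoded = encode_ldif_value(street)
print("streetAddress:: " + encoded)

# Round trip: decoding gives back the original text.
assert decode_ldif_value(encoded) == street
```

In an overridden _unparseAttrTypeandValue, the same isinstance check and encode call would presumably run before the value is handed to the base class, but that part depends on the installed python-ldap version and is left out here.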