This has turned out to be non-trivial. Current progress in in PR [1].

Easy

- Added a property to BaseNCodec to enable strict decoding

All other properties of the codec are set in the constructor and are final. So 
using a property is a change. I can change to overloaded constructors but there 
are already a lot of these so a property or builder type pattern is cleaner e.g:

// Returns a new codec with strict decoding, all other properties the same
Base64 withStrictDecoding();

I can remove the isStrictEncoding() method. However this would have to replaced 
by the package/protected level access to the property for Base32/Base64. Other 
properties in the class are protected. So maybe drop the isStrictEncoding() 
method and change the accessor level for the property. However there is another 
property in Base64 (isUrlSafe()) so properties are not excluded from the design 
pattern.

- Added tests to show Base64 strict mode works

This is fine. Base64 encoding cannot create a final 6-bit character. Previously 
this was ignored. It now optionally throws in strict mode.

All other validation of trailing 12 or 18 bits works. I added tests to show 
that a decoded byte[] is re-encoded to the same byte[] when in strict mode but 
not in lenient mode.


Issues

- Added tests to show Base32 strict mode works

It does. It also shows that the current decoder is capable of decoding 
non-sense.

Referring to the final section for the encode method shows encoding can create 
a count of 2, 4, 5, or 7 trailing characters. So a valid Base32 encoding will 
never have 1, 3, or 6 trailing characters.

Previously 1 trailing character was ignored. It now optionally throws. So far, 
so good.

But the previous behaviour for 3 or 6 trailing characters is to decode them 
into as many bytes as is possible. This does not make sense. They could only 
have come from an invalid encoding:

3 chars = 15-bits = 1 byte and 7 remainder. So this should have used 2 trailing 
characters and a pad.
6 chars = 30-bits = 3 bytes and 6 remainder. So this should have used 5 
trailing characters and a pad.

In the PR I have preserved this behaviour for lenient mode. This is backwards 
compatible. In strict mode this will now throw. So lenient mode is allowing 
non-pad characters to act as padding in certain cases.

As before I added tests to show that a decoded byte[] is re-encoded to the same 
byte[] when in strict mode but not in lenient mode.


Unresolved

Base64 is used by codec.net.BCodec. This also had tests for impossible 
encodings. These now fail as the default is to revert to lenient mode.

So this requires a decision on whether BCodec should:

- Regress to lenient mode
- Move to strict mode and throw more exceptions than before
- Also add a property to enable strict mode


Have a look at the PR and give some suggestions.

Alex



[1] https://github.com/apache/commons-codec/pull/35 
<https://github.com/apache/commons-codec/pull/35>

Reply via email to