Hello, Igniters!

Recently we encountered an unexpected issue. Let me start with its root cause before discussing potential fixes.
We noticed that certain benchmarks showed some inefficiencies when run on new MacBooks. They were related to low-level serialization code, and the cause was an unaligned read in GridUnsafe. "aarch64" allows unaligned access, but the architecture is not included in the "GridUnsafe#unaligned" check, which resulted in the execution of fall-back code that reads and writes everything byte by byte.

The fix seemed trivial, and we did it in [1] by adding "aarch64" to the list of architectures that support unaligned memory access.

After a while, when we enabled "ItCompatibilityTest#testCompatibility", we realized that compatibility on MacBooks was broken. The incompatibility was caused by [1], which was temporarily reverted in [2] as a hotfix.

How was that possible? When we finished the investigation, it turned out that "DirectByteBufferStreamImplV1#writeUuid" and "DirectByteBufferStreamImplV1#readUuid" have a particularly nasty bug in them. This is how these methods behave in 3.0:
- If we run on "i386", "x86", "amd64", or "x86_64", we write the parts of a UUID in Big Endian.
- If we run on other Little Endian architectures, we write these parts in Little Endian.
- If we run on a Big Endian architecture, we write these parts in Big Endian.

When we added "aarch64" to the list of "unaligned" architectures, we started treating its data as BE in "main", while Ignite 3.0 treats it as LE.

For clarification, this stream is used for:
- Network communication, runtime only.
- Serialization of raft commands; this data is written to the storage.

That's why fix [1] broke compatibility.

Such behavior is a problem, because the network protocol and raft serialization must be architecture-independent:
- It is possible that nodes in the same cluster run in different environments with different architectures.
- It is possible, and almost guaranteed, that raft command serialization happens on a different node, and thus must also be architecture-independent.
  (node A does the serialization, node B writes the resulting payload into the log storage)

That's issue number 1.

Issue number 2 was found when we inspected the code of "DirectByteBufferStreamImplV1". The parity of the "writeFixedInt"/"readFixedInt" methods (and their long counterparts) is violated on BE architectures: writes are always LE, but reads use the native byte order. In other words, Ignite 3.0 doesn't really work on Big Endian architectures. Fixing this place in particular is trivial, and we will do it in 3.1.

Fixing broken Little Endian architectures might not be as trivial. My proposal is the following:
- We fix the bug in UUID serialization and always use Big Endian for encoding there. This will make our protocols correct on all architectures at once. This fix will break backwards compatibility on Little Endian architectures that are NOT included in the following list: "i386", "x86", "amd64", and "x86_64". This means that an upgrade from 3.0 to 3.1 will be impossible*.
- We add "aarch64" to the list of architectures that support unaligned memory access.
- We explicitly disable "ItCompatibilityTest#testCompatibility" on a number of architectures.

* If it turns out that we have a user who uses one of those architectures and who must upgrade their cluster from 3.0, we will prepare and provide a log storage conversion tool that will convert all Little Endian UUIDs to Big Endian format. As far as I'm aware, only log storage is affected at the moment.

It's better to fix it in 3.1, because it will be more widely adopted than 3.0. I will do that in [3].

Please provide your feedback on the proposal. What are your thoughts?

Thank you!

[1] https://issues.apache.org/jira/browse/IGNITE-25564
[2] https://issues.apache.org/jira/browse/IGNITE-25796
[3] https://issues.apache.org/jira/browse/IGNITE-25797

--
Sincerely yours,
Ivan Bessonov
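
P.S. To make the proposed UUID fix more concrete, here is a minimal Java sketch of an encoding that always uses Big Endian. It is an illustration only, not the actual "DirectByteBufferStreamImplV1" code: the class name "UuidCodecSketch" and its methods are hypothetical, and it uses a plain heap "ByteBuffer" instead of GridUnsafe. The point is that both the writer and the reader set the byte order explicitly, so the result does not depend on the host architecture.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.UUID;

/** Hypothetical helper, not part of Ignite: encodes a UUID with a fixed byte order. */
public class UuidCodecSketch {
    /** Writes both 64-bit halves in Big Endian, regardless of the host architecture. */
    static byte[] writeUuid(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    /** Reads with the same fixed order, so write/read parity holds on every platform. */
    static UUID readUuid(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
        long msb = buf.getLong();
        long lsb = buf.getLong();
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        UUID decoded = readUuid(writeUuid(original));
        // Round trip must return an equal UUID.
        System.out.println(original.equals(decoded));
    }
}
```

The same idea applies to the "writeFixedInt"/"readFixedInt" pair: as long as both sides name one byte order explicitly instead of relying on the native one, parity holds on every architecture.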