the-other-tim-brown opened a new pull request, #13686:
URL: https://github.com/apache/hudi/pull/13686
### Change Logs
Currently, our payload-based paths use `HoodieAvroRecord` to transport data
between Spark executors. As we move away from payloads, we can start
relying on the other record objects directly. The `HoodieAvroIndexedRecord` can
fit our needs for transporting the Avro data but needs some changes to match
the existing performance.
This change introduces a new class, `SerializableIndexedRecord`, which
manages the serialization of the data in the `HoodieAvroIndexedRecord`.
Unlike payloads, the data is only written out to a byte array when it is
required. This keeps performance on par with the existing code path
when working with data that only resides within a single machine.
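
The lazy-serialization idea described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `LazyRecordSketch`, `LazyRecord`, its two fields, and the `encodeCount` counter are hypothetical names invented for illustration, and plain Java serialization stands in for the Avro and Kryo machinery. The sketch assumes a record that keeps its decoded fields in memory, encodes them to a byte array only when it is actually serialized, and decodes the bytes only on first access after arriving on another machine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class LazyRecordSketch {

  /** Hypothetical stand-in for a lazily serialized record; not a Hudi class. */
  public static final class LazyRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    // Decoded form, used directly while the record stays on one machine.
    private transient String key;
    private transient long value;
    // Encoded form, set only after deserialization and cleared on first access.
    private transient byte[] bytes;
    // Number of times the fields were encoded; for demonstration only.
    private transient int encodeCount;

    public LazyRecord(String key, long value) {
      this.key = key;
      this.value = value;
    }

    public String getKey() { decodeIfNeeded(); return key; }
    public long getValue() { decodeIfNeeded(); return value; }
    public int getEncodeCount() { return encodeCount; }

    private void decodeIfNeeded() {
      if (bytes == null) {
        return; // already decoded, or never serialized
      }
      try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
        key = in.readUTF();
        value = in.readLong();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      bytes = null;
    }

    private byte[] encode() throws IOException {
      encodeCount++;
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        out.writeUTF(key);
        out.writeLong(value);
      }
      return buffer.toByteArray();
    }

    // Called by Java serialization: the byte array is produced only here.
    private void writeObject(ObjectOutputStream out) throws IOException {
      out.defaultWriteObject();
      // A record that was received but never read forwards its bytes untouched.
      byte[] payload = bytes != null ? bytes : encode();
      out.writeInt(payload.length);
      out.write(payload);
    }

    // Called by Java deserialization: keep the bytes and defer decoding.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
      in.defaultReadObject();
      bytes = new byte[in.readInt()];
      in.readFully(bytes);
    }
  }

  /** Serializes and deserializes a record, simulating a transfer between machines. */
  public static LazyRecord roundTrip(LazyRecord record) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
        out.writeObject(record);
      }
      try (ObjectInputStream in =
               new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        return (LazyRecord) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    LazyRecord local = new LazyRecord("id-1", 42L);
    local.getKey(); // local reads never touch a byte array
    System.out.println("encodes before transfer: " + local.getEncodeCount());
    LazyRecord copy = roundTrip(local);
    System.out.println("encodes after transfer: " + local.getEncodeCount());
    System.out.println(copy.getKey() + " -> " + copy.getValue());
  }
}
```

In this sketch, reads on a record that never leaves the machine pay no encoding cost, and a record that is received and forwarded without being read is never decoded at all.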
For existing workflows that use `HoodieAvroIndexedRecord`, like compaction,
we expect to see the same performance. This is validated with a JMH
microbenchmark, which confirms that the call to `setSchema` does not change
throughput when working with the object.
The serialized size of the object when using Kryo is also about 2/3 the size
of the existing record, measured on a fairly basic object with 15 fields
holding mainly numeric or small string values.
### Impact
Allows us to move away from payloads without performance degradation in
serialization.
### Risk level (write none, low, medium or high below)
Low
### Documentation Update
_Describe any necessary documentation update if there is any new feature,
config, or user-facing change. If not, put "none"._
- _The config description must be updated if new configs are added or the
default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website.
Please create a Jira ticket, attach the ticket number here and follow the
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to
make changes to the website._
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed