vvivekiyer opened a new issue, #10127:
URL: https://github.com/apache/pinot/issues/10127
### Issue Description
This issue is only applicable when a **REALTIME** segment contains a column
with the following properties:
1. Multivalue (MV) column
2. VarByte datatype - String, Bytes, BigDecimal
3. Raw aka noDictionary
When a consuming segment has a column with the above properties, segment
building fails with the following error:
```
[pinot-server] [] Could not build segment
java.lang.IllegalArgumentException: integer overflow detected
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:145) ~[guava-31.1-jre.jar:?]
    at org.apache.pinot.segment.local.segment.creator.impl.fwd.MultiValueVarByteRawIndexCreator.getTotalRowStorageBytes(MultiValueVarByteRawIndexCreator.java:154) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.segment.local.segment.creator.impl.fwd.MultiValueVarByteRawIndexCreator.<init>(MultiValueVarByteRawIndexCreator.java:78) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.segment.local.segment.creator.impl.DefaultIndexCreatorProvider.getRawIndexCreatorForMVColumn(DefaultIndexCreatorProvider.java:263) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.segment.local.segment.creator.impl.DefaultIndexCreatorProvider.newForwardIndexCreator(DefaultIndexCreatorProvider.java:87) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.segment.spi.index.IndexingOverrides$Default.newForwardIndexCreator(IndexingOverrides.java:156) ~[pinot-segment-spi-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator.init(SegmentColumnarIndexCreator.java:228) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:211) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.segment.local.realtime.converter.RealtimeSegmentConverter.build(RealtimeSegmentConverter.java:110) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentInternal(LLRealtimeSegmentDataManager.java:895) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentForCommit(LLRealtimeSegmentDataManager.java:806) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:705) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
    at java.lang.Thread.run(Thread.java:834) [?:?]
```
### Root Cause Analysis
The mutable segment creates the column as dictEnabled. However, the
conversion to a completed (offline-format) segment attempts to create the
column as noDict, and the raw index writer does not have the
`maxRowLengthInBytes` metadata it needs to construct the forward index.
A longer version of the above RCA is below:
1. In a realtime table, if an MV column of dataType String, Bytes, or
BigDecimal is configured as noDictionary, we still end up creating a dictionary
for the MutableSegment. This limitation exists because we don't have an
implementation of `MutableForwardIndex` that handles noDict VarByte columns.
https://github.com/apache/pinot/blob/ca86efca006453d407475ba074af1d4d492b920f/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L427
2. When `RealtimeSegmentConverter` builds a completed segment in this case,
the segment build is done in two phases: (i) collect column statistics for
each column, and (ii) read each mutable segment row and index it for offline
segment creation. Note that gathering column stats for mutable segments
doesn't need to read each record. It is done through
`MutableColumnStatistics`.
3. When `SegmentColumnarIndexCreator` tries to create an index creator for
this column (mentioned in 1), it honors the table config and tries to create a
noDict column, so it uses the `MultiValueVarByteRawIndexCreator`. This creator
requires `maxRowLengthInBytes`, which is not available through
`MutableColumnStatistics`, and there is no way to compute it on the fly without
reading all the records in the mutable segment.
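To illustrate why a missing `maxRowLengthInBytes` surfaces as `integer overflow detected`, here is a minimal sketch of a size check of this kind. It is not Pinot's actual code: the class name, the row layout (a count header plus one length entry per element plus the payload), and the idea that a missing statistic shows up as a non-positive value such as `-1` are all assumptions made for illustration.

```java
final class RowStorageSketch {
    private RowStorageSketch() {
    }

    /**
     * Hypothetical stand-in for a per-row storage size computation.
     * Assumed layout: [element count][one length entry per element][payload bytes].
     */
    static int totalRowStorageBytes(int maxNumberOfElements, int maxRowLengthInBytes) {
        // Long arithmetic keeps an oversized total visible instead of wrapping around.
        long total = (long) Integer.BYTES
            + (long) maxNumberOfElements * Integer.BYTES
            + maxRowLengthInBytes;
        // A missing statistic (assumed to be non-positive) fails the same check as a real overflow.
        if (maxNumberOfElements <= 0 || maxRowLengthInBytes <= 0 || total > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("integer overflow detected");
        }
        return (int) total;
    }
}
```

Under these assumptions, `totalRowStorageBytes(3, 10)` succeeds, while `totalRowStorageBytes(3, -1)` throws even though nothing actually overflowed, which would make the error message misleading for this scenario.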
### Potential Solutions
1. (Ideal Solution) Implement a `MutableForwardIndex` version that supports
noDict VarByte columns. The mutable segment would then create the column as
noDict, and converting a mutable segment to a completed segment would work
without special handling because the column is noDict in both. Until this
solution is implemented, we can avoid the failure by automatically creating a
dictEnabled column in the offline segment as well during conversion.
2. (Hacky Solution) Detect that a column needs to change from Dict ->
noDict during realtime segment conversion. If this is the case, perform an
additional read of all the records in the mutable segment to construct
ColumnStatistics with maxRowLengthInBytes for this column. Use this to create
the `MultiValueVarByteRawIndexCreator`.
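The second option can be sketched roughly as follows. This is a self-contained illustration, not code from Pinot: `MaxRowLengthSketch`, `needsExtraPass`, and `maxRowLengthInBytes` are hypothetical names, rows are modeled as plain `String[]` values, and "row length" is assumed to mean the sum of the UTF-8 byte lengths of the values in one row.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

final class MaxRowLengthSketch {
    private MaxRowLengthSketch() {
    }

    /**
     * Hypothetical detection step: the extra pass is only needed when the mutable
     * segment used a dictionary but the table config asks for a raw MV VarByte column.
     */
    static boolean needsExtraPass(boolean multiValue, boolean varByteType,
                                  boolean noDictionaryInConfig, boolean mutableHasDictionary) {
        return multiValue && varByteType && noDictionaryInConfig && mutableHasDictionary;
    }

    /** Reads every row once and returns the largest per-row payload size in bytes. */
    static int maxRowLengthInBytes(List<String[]> rows) {
        int max = 0;
        for (String[] row : rows) {
            long rowBytes = 0;
            for (String value : row) {
                rowBytes += value.getBytes(StandardCharsets.UTF_8).length;
            }
            if (rowBytes > Integer.MAX_VALUE) {
                throw new IllegalStateException("row payload does not fit in an int");
            }
            max = Math.max(max, (int) rowBytes);
        }
        return max;
    }
}
```

The cost of this approach is one more full scan of the mutable segment for each affected column, which is why it is worth gating on a check like `needsExtraPass` rather than running it for every column.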
Opening this issue to get feedback from the community about which way to
proceed. Also wanted to check if there are other solutions to address this
problem.