gauravkhatri05 commented on PR #3272:
URL: https://github.com/apache/parquet-java/pull/3272#issuecomment-3249142635
Hi @gszadovszky,
Thanks for your follow-up! I want to clarify the key distinction between the
current approach and our patch:
### What’s happening in the current patch
With the `IdentityHashMap` check in `convertField(...)`:
```java
if (seenSchemas.containsKey(schema)) {
  throw new UnsupportedOperationException(
      "Recursive Avro schemas are not supported by parquet-avro: " + schema.getFullName());
}
```
As soon as the converter encounters a recursive reference (e.g., `Employee
-> Employee`), it **throws immediately**, preventing even common, benign
recursive structures from working.
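For concreteness, here is a minimal sketch of such a schema (the class and field names are illustrative; `Schema.Parser` and `AvroSchemaConverter.convert` are the standard Avro / parquet-avro entry points):
```java
import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;

public class RecursiveSchemaDemo {
  public static void main(String[] args) {
    // An Employee record whose "subordinates" array refers back to Employee.
    Schema employee = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"subordinates\",\"type\":"
            + "{\"type\":\"array\",\"items\":\"Employee\"}}]}");

    // With the IdentityHashMap check, this throws UnsupportedOperationException
    // as soon as the Employee -> Employee reference is revisited.
    new AvroSchemaConverter().convert(employee);
  }
}
```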
### What our patch achieves (max-depth)
We don’t block recursion outright. Instead, with a **configurable
max-depth** (e.g., default = 10), we allow traversal to continue until that
limit is reached. Only then do we prevent further descent. This allows
recursive schemas to function *within a bounded, safe depth*.
A typical use case is an `Employee` record with a `List<Employee>` for
subordinates—common in organizational hierarchies.
Here’s an example payload (max depth = 4) that works with our patch but
fails with the current implementation (IdentityHashMap):
```
A (CEO)
|
|---- B (CTO)
|     |
|     |---- B1 (Mgr 1)
|     |     |---- B1-1 (Architect)
|     |     |---- B1-2 (Team Lead)
|     |     |     |---- B1-2-1 (SDE)
|     |
|     |---- B2 (Product Owner)
|
|---- C (HR)
```
* **Current patch (`IdentityHashMap`):** Rejects the recursive reference at the first encounter.
* **Max-depth patch:** Supports the structure up to the configured safe limit, enabling real-world recursive models like trees or org charts (see the sketch below).
### For reference: a similar approach advocated elsewhere
This [Medium post](https://medium.com/hadoop-noob/recursive-avro-schema-for-parquet-a87c84b6aca1) explicitly suggests the same solution: stop schema generation once a specified `max_depth` is reached, so that recursive Avro schemas are handled gracefully for Parquet.
### Here’s the essential snippet from our max-depth patch:
```java
import static org.apache.avro.Schema.Type.ARRAY;
import static org.apache.avro.Schema.Type.MAP;
import static org.apache.avro.Schema.Type.NULL;
import static org.apache.avro.Schema.Type.RECORD;
import static org.apache.avro.Schema.Type.UNION;

// MAX-DEPTH PATCH START

// Configurable recursion limit; defaults to 10 levels.
private final int maxDepth = Integer.parseInt(
    getApplicationProps().getProperty("parquet.schema.parser.recursion.max.depth", "10"));

public MessageType convert(Schema avroSchema) {
  if (!avroSchema.getType().equals(Schema.Type.RECORD)) {
    throw new IllegalArgumentException("Avro schema must be a record.");
  }
  return new MessageType(
      avroSchema.getFullName(),
      convertFields(avroSchema.getFields(), "", maxDepth));
}

private List<Type> convertFields(List<Schema.Field> fields, String schemaPath, int depth) {
  var types = new ArrayList<Type>();
  --depth; // one level of the budget is consumed by this record
  for (Schema.Field field : fields) {
    if (field.schema().getType().equals(NULL)) {
      continue; // null fields have no Parquet representation
    }
    // Once the depth budget is exhausted, record-typed fields (the only ones
    // that can recurse) are dropped; non-record fields are always kept.
    if (depth > 0 || !isRecord(field)) {
      types.add(convertField(field, appendPath(schemaPath, field.name()), depth));
    }
  }
  return types;
}

// Does this field contain a record, directly or through a union/array/map?
private boolean isRecord(Schema.Field field) {
  var fieldSchema = field.schema();
  if (UNION == fieldSchema.getType()) {
    var optionalNonNullUnionElementSchema = fieldSchema.getTypes().stream()
        .filter(schema -> NULL != schema.getType())
        .findFirst();
    if (optionalNonNullUnionElementSchema.isPresent()) {
      var nonNullUnionElementSchema = optionalNonNullUnionElementSchema.get();
      if (ARRAY == nonNullUnionElementSchema.getType()) {
        return checkArrayForRecord(nonNullUnionElementSchema);
      } else if (MAP == nonNullUnionElementSchema.getType()) {
        return checkMapForRecord(nonNullUnionElementSchema);
      }
      return RECORD == nonNullUnionElementSchema.getType();
    }
    return false;
  } else if (ARRAY == fieldSchema.getType()) {
    return checkArrayForRecord(fieldSchema);
  } else if (MAP == fieldSchema.getType()) {
    return checkMapForRecord(fieldSchema);
  }
  return fieldSchema.getType() == RECORD;
}

private boolean checkMapForRecord(Schema mapSchema) {
  var valueSchema = mapSchema.getValueType();
  if (UNION == valueSchema.getType()) {
    return valueSchema.getTypes().stream().anyMatch(schema -> RECORD == schema.getType());
  }
  return RECORD == valueSchema.getType();
}

private boolean checkArrayForRecord(Schema arraySchema) {
  var elementSchema = arraySchema.getElementType();
  if (UNION == elementSchema.getType()) {
    return elementSchema.getTypes().stream().anyMatch(schema -> RECORD == schema.getType());
  }
  return RECORD == elementSchema.getType();
}

// ... rest of convertField, convertUnion, etc., updated similarly ...
// MAX-DEPTH PATCH END
```
This enhancement makes recursive schemas **usable and safe**, not just blocked. The depth is configurable so it can be tuned to workload and performance needs: the default of 10 has been stable, while values below 7 have shown issues in our tests.
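Tuning is just a property change. A hypothetical sketch (how `getApplicationProps()` is actually populated depends on the application):
```java
import java.util.Properties;

// Hypothetical wiring: whatever backs getApplicationProps() only needs this key.
Properties props = new Properties();
props.setProperty("parquet.schema.parser.recursion.max.depth", "12"); // deeper hierarchies
// Leaving the key unset falls back to the default of 10.
```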
Let me know if my explanation makes sense or if you’d like me to clarify
further.