ahshahid commented on code in PR #48252: URL: https://github.com/apache/spark/pull/48252#discussion_r1844719865
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala:
##########
@@ -148,34 +163,180 @@ object JavaTypeInference {
       // TODO: we should only collect properties that have getter and setter. However, some tests
       // pass in scala case class as java bean class which doesn't have getter and setter.
       val properties = getJavaBeanReadableProperties(c)
-      // add type variables from inheritance hierarchy of the class
-      val classTV = JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap ++
-        typeVariables
-      // Note that the fields are ordered by name.
-      val fields = properties.map { property =>
-        val readMethod = property.getReadMethod
-        val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, classTV)
-        // The existence of `javax.annotation.Nonnull`, means this field is not nullable.
-        val hasNonNull = readMethod.isAnnotationPresent(classOf[Nonnull])
-        EncoderField(
-          property.getName,
-          encoder,
-          encoder.nullable && !hasNonNull,
-          Metadata.empty,
-          Option(readMethod.getName),
-          Option(property.getWriteMethod).map(_.getName))
+
+      // if the properties is empty and this is not a top level enclosing class, then we
+      // should not consider class as bean, as otherwise it will be treated as empty schema
+      // and loose the data on deser.

Review Comment:
Let's say the top-level class for which the encoder is being created has a field x that is a POJO with no bean-style getters. The schema corresponding to field x is then empty. So when the Dataset of the top-level class is converted to a DataFrame, x has no representation in the Row object. When that DataFrame is converted back to a Dataset, field x is set to null and the data is lost, even though x was NOT NULL when we started; it became null only because its schema was empty. To handle that case, a POJO without getters should be represented as BinaryType, so that when the DataFrame is converted back, field x gets the deserialized POJO.

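To make the two halves of that argument concrete, here is a rough JDK-only sketch (no Spark involved). `Point`, `readableProperties`, `toBytes` and `fromBytes` are hypothetical names made up for this illustration. The property filter only approximates what `getJavaBeanReadableProperties` does, and plain Java serialization is an assumption standing in for whatever serializer a binary encoding would actually use.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class GetterlessPojoDemo {

    // Hypothetical POJO that carries state but exposes no bean-style getters or setters.
    public static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        public Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Names of the readable bean properties of c, skipping the implicit "class" property
    // that every object inherits from Object.getClass().
    public static List<String> readableProperties(Class<?> c) throws IntrospectionException {
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : Introspector.getBeanInfo(c).getPropertyDescriptors()) {
            if (pd.getReadMethod() != null && !"class".equals(pd.getName())) {
                names.add(pd.getName());
            }
        }
        return names;
    }

    // Serialize a value into an opaque byte array (assumption: plain Java serialization).
    public static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    // Rebuild the original object from the byte array produced by toBytes.
    public static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // No getters means no readable bean properties, so a bean-derived schema has no fields.
        System.out.println("readable properties: " + readableProperties(Point.class));

        // An opaque byte[] (what a binary column would hold) keeps the object's state.
        Point restored = (Point) fromBytes(toBytes(new Point(3, 4)));
        System.out.println("restored: x=" + restored.x + ", y=" + restored.y);
    }
}
```

The first call shows why a bean-derived schema for such a field has nothing to carry; the round trip shows that an opaque binary value preserves the state, at the cost of the column not being queryable.
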
The reason this is not done for the top-level class is that existing tests assert that if the top-level class has no getters, the schema should be empty, implying 0 rows and no schema. Whether that is desirable, or whether it should also be represented as a binary type, is debatable, since in any case no meaningful SQL operation can be done on binary data. So the distinction is made using the boolean: a top-level class with no getters needs to be treated differently from a field whose type has no getters.