nsivabalan commented on a change in pull request #4818:
URL: https://github.com/apache/hudi/pull/4818#discussion_r820281218



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileScanRDD.scala
##########
@@ -93,17 +74,8 @@ class HoodieFileScanRDD(
     // Register an on-task-completion callback to close the input stream.
     context.addTaskCompletionListener[Unit](_ => iterator.close())
 
-    // extract required columns from row
-    val iterAfterExtract = HoodieDataSourceHelper.extractRequiredSchema(

Review comment:
       ok, I see. got it. thanks!
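
For context on the removed extraction step: pruning an [[InternalRow]] iterator down to a required schema is typically done with a Catalyst projection. Below is a minimal sketch of that idea; `projectIterator` is an illustrative name and this is not Hudi's actual `HoodieDataSourceHelper.extractRequiredSchema` implementation:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, UnsafeProjection}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: prune rows read with `fullSchema` down to `requiredSchema`
def projectIterator(rows: Iterator[InternalRow],
                    fullSchema: StructType,
                    requiredSchema: StructType): Iterator[InternalRow] = {
  // Bind each required field to its ordinal position in the full schema
  val refs = requiredSchema.fields.toSeq.map { f =>
    BoundReference(fullSchema.fieldIndex(f.name), f.dataType, f.nullable)
  }
  val projection = UnsafeProjection.create(refs)
  // NOTE: the generated projection reuses a single mutable UnsafeRow buffer,
  // so consumers must copy a row if they retain it past next() -- the same
  // caveat HoodieUnsafeRDD's scaladoc calls out below
  rows.map(projection)
}
```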

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieUnsafeRDD.scala
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+
+/**
+ * !!! PLEASE READ CAREFULLY !!!
+ *
+ * Base class for all of the custom low-overhead RDD implementations for Hudi.
+ *
+ * To keep memory allocation footprint as low as possible, each inheritor of this RDD base class
+ *
+ * <pre>
+ *   1. Does NOT deserialize from [[InternalRow]] to [[Row]] (therefore only providing access to
+ *   Catalyst internal representations (often mutable) of the read row)
+ *
+ *   2. DOES NOT COPY UNDERLYING ROW OUT OF THE BOX, meaning that
+ *
+ *      a) access to this RDD is NOT thread-safe
+ *
+ *      b) while iterating over it, a reference to a _mutable_ underlying instance (of [[InternalRow]])
+ *      is returned, entailing that after [[Iterator#next()]] is invoked on the provided iterator,
+ *      the previous reference becomes **invalid**. Therefore, you will have to copy the underlying
+ *      mutable instance of [[InternalRow]] if you plan to access it after [[Iterator#next()]] is
+ *      invoked (which fills it with the next row's payload)
+ *
+ *      c) due to item b) above, no operation other than iteration will produce meaningful
+ *      results on it, and most will likely fail [1]
+ * </pre>
+ *
+ * [1] For example, the [[RDD#collect]] method on this implementation would not work correctly, as
+ * it simply uses Scala's default [[Iterator#toArray]] method, which would collect references to the
+ * same underlying mutable object into an [[Array]]. Instead, each individual [[InternalRow]] _has
+ * to be copied_ before being concatenated into the final output. Please refer to
+ * [[HoodieRDDUtils#collect]] for more details.

Review comment:
      is this addressed?
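
To make the footnote concrete, here is a minimal sketch of a copy-aware collect, in the spirit of the [[HoodieRDDUtils#collect]] the scaladoc points to (the actual Hudi implementation may differ):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

// Copy each row while still inside its partition: within a partition, the
// default Iterator#toArray used by RDD#collect would accumulate references
// to the same reused mutable buffer, yielding duplicates of the last row
def collectCopied(rdd: RDD[InternalRow]): Array[InternalRow] =
  rdd.map(_.copy()).collect()
```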

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala
##########
@@ -148,20 +140,20 @@ class MergeOnReadIncrementalRelation(sqlContext: SQLContext,
         hadoopConf = new Configuration(conf)
       )
 
-      val hoodieTableState = HoodieMergeOnReadTableState(fileIndex, HoodieRecord.RECORD_KEY_METADATA_FIELD, preCombineFieldOpt)
+      val hoodieTableState = HoodieTableState(HoodieRecord.RECORD_KEY_METADATA_FIELD, preCombineFieldOpt)
 
      // TODO implement incremental span record filtering w/in RDD to make sure the returned iterator is
      //      appropriately filtered, since the file-reader might not be capable of performing filtering
-      val rdd = new HoodieMergeOnReadRDD(
+      new HoodieMergeOnReadRDD(
         sqlContext.sparkContext,
         jobConf,
         fullSchemaParquetReader,
         requiredSchemaParquetReader,
         hoodieTableState,
         tableSchema,
-        requiredSchema
+        requiredSchema,
+        fileIndex

Review comment:
       ok.
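
On the TODO in the hunk above: a record-level incremental-span filter over the RDD could, roughly, compare the `_hoodie_commit_time` metadata column against the queried instant range. A hedged sketch follows; the function name and the ordinal parameter are assumptions for illustration, not the PR's implementation:

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Illustrative only: keep rows whose commit time falls in (beginInstant, endInstant].
// Assumes _hoodie_commit_time is the string column at `commitTimeOrdinal`
// (it is the first Hudi metadata field) and relies on Hudi instant times
// being lexicographically ordered timestamps.
def filterIncrementalSpan(rows: Iterator[InternalRow],
                          beginInstant: String,
                          endInstant: String,
                          commitTimeOrdinal: Int = 0): Iterator[InternalRow] =
  rows.filter { row =>
    val commitTime = row.getUTF8String(commitTimeOrdinal).toString
    commitTime > beginInstant && commitTime <= endInstant
  }
```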



