Hi Priyanka,

I've been exploring this part of Spark SQL and can help a little.
> but for some reason it never hit the breakpoints I placed in these
> classes.

Was this for local[*]? I ran

SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" ./bin/spark-shell

and attached IDEA to debug the code. I used Spark 2.4.1 (with Scala 2.12)
and it worked fine for the following queries:

spark.range(5).write.save("hello")
spark.read.parquet("hello").show

> has anyone tried returning all the data as ColumnarBatch? Is there any
> reading material you can point me to?

You may find some information in the internals book at
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnarBatch.html
It's a WIP. Let me know what part to explore in more detail. I'd do this
momentarily (as I'm exploring the parquet data source in more detail as we
speak).

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Apr 22, 2019 at 10:29 AM Priyanka Gomatam
<priyanka.goma...@microsoft.com.invalid> wrote:

> Hi,
>
> I am new to Spark and have been playing around with the Parquet reader
> code. I have two questions:
>
> 1. I saw the code that starts at the DataSourceScanExec class, moves on
> to the ParquetFileFormat class, and does a VectorizedParquetRecordReader.
> I tried doing a spark.read.parquet(…) and debugged through the code, but
> for some reason it never hit the breakpoints I placed in these classes.
> Perhaps I am doing something wrong, but is there a certain versioning for
> parquet readers that I am missing out on? How do I make the code take the
> DataSourceScanExec -> … -> ParquetReader … -> VectorizedParquetRecordReader …
> route?
>
> 2.
> If I do manage to make it take the above path, I see there is a
> point at which the data is filled into ColumnarBatch objects. Has anyone
> tried returning all the data as ColumnarBatch? Is there any reading
> material you can point me to?
>
> Thanks in advance, this will be super helpful for me!
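PS. On question 1: if the breakpoints still never fire with the setup above, one thing worth trying (an assumption on my side, not something I've verified in your environment) is changing the jdwp option from suspend=n to suspend=y, so the driver JVM waits for the debugger to attach before running anything:

```shell
# Same jdwp agent options as before, but suspend=y makes the driver JVM
# block at startup until the debugger attaches, so breakpoints set in
# classes like DataSourceScanExec or ParquetFileFormat cannot be missed
# by code that runs before you attach. Port 5005 is arbitrary.
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  ./bin/spark-shell
```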
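PPS. On question 2: a quick way to check whether a given scan will actually take the vectorized, ColumnarBatch-producing path is to look at the physical plan. This is only a sketch against the Spark 2.4.x APIs (paste it into spark-shell; it assumes a "hello" parquet directory like in my example above):

```scala
import org.apache.spark.sql.execution.FileSourceScanExec

val df = spark.read.parquet("hello")

// Find the file scan in the executed physical plan and ask whether it
// supports batch (ColumnarBatch) output; that is what routes execution
// through VectorizedParquetRecordReader rather than the row-based reader.
val scan = df.queryExecution.executedPlan.collect {
  case s: FileSourceScanExec => s
}.head

println(scan.supportsBatch)
// Expected to be true when spark.sql.parquet.enableVectorizedReader is
// enabled (the default) and the schema contains only atomic types.
```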