Sandeep Joshi wrote:

    So, as you see, I managed to create the required code to return a
    valid schema, and was also able to write unit tests for it.
    I copied "protected[spark]" from the CSV implementation, but I
    commented it out because it breaks compilation and does not seem
    to be required.
    And most importantly, I have no idea how to create a valid
    dataframe to be returned by buildScan so that the data stored on
    disk is not loaded into memory all at once (it may be very large,
    e.g. hundreds of millions of rows).



You are effectively building a datasource for Spark.

You can subclass the RDD class and create your own RDD, which buildScan above will return. That RDD subclass must implement a compute() method returning an Iterator; Spark then pulls rows by calling iterator.next() as the job executes, so the data is streamed rather than loaded into memory all at once.
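
A minimal sketch of what that could look like, assuming a one-partition-per-file split (MyFormatRDD, MyFormatPartition and openReader are illustrative names, not existing Spark API):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical partition: one slice of the on-disk data.
case class MyFormatPartition(index: Int, path: String) extends Partition

class MyFormatRDD(sc: SparkContext, paths: Seq[String])
  extends RDD[Row](sc, Nil) {

  // Illustrative split: one partition per file.
  override protected def getPartitions: Array[Partition] =
    paths.zipWithIndex
      .map { case (p, i) => MyFormatPartition(i, p): Partition }
      .toArray

  // compute() hands Spark an Iterator; rows are read lazily as Spark
  // calls next(), so the file is never materialized in memory at once.
  override def compute(split: Partition, context: TaskContext): Iterator[Row] = {
    val part = split.asInstanceOf[MyFormatPartition]
    // A real reader should also be closed, e.g. via
    // context.addTaskCompletionListener.
    val records = openReader(part.path)
    records.map(fields => Row.fromSeq(fields))
  }

  // Placeholder for your format's streaming record reader.
  private def openReader(path: String): Iterator[Seq[Any]] = ???
}

Your relation's buildScan() would then simply return new MyFormatRDD(sqlContext.sparkContext, paths).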

Look at how the Cassandra connector does it
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraTableScanRDD.scala#L354
Ah yes, that makes sense. Somehow, I was fixating on creating an RDD[Row] instance instead of deriving my own class from RDD[Row].

    I read the documentation here:
    https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/sources/BaseRelation.html
    It says "Concrete implementation should inherit from one of the
    descendant Scan classes", but I could not find any of those
    descendants in the documentation or in the source code.


The scan classes referred to there are these, in addition to CatalystScan at the bottom of the same file:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L245-L277
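
Concretely, the stable scan traits in that file are:

// org.apache.spark.sql.sources (Spark 2.x)
trait TableScan {
  def buildScan(): RDD[Row]
}

trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

Your BaseRelation mixes in whichever of these it can support; PrunedScan and PrunedFilteredScan additionally let Spark push column pruning and filters down into your reader.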
Great, that clarifies the situation.

With that in mind, I was able to create the complete set of classes and work with my custom format.
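
For reference, the overall shape ended up roughly like the following (class names and the "path" option are my own, simplified from the real code):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

// The relation ties the schema and the scan together.
class MyFormatRelation(override val sqlContext: SQLContext, path: String)
  extends BaseRelation with TableScan {

  // Schema derived from the file's own metadata.
  override def schema: StructType = ???

  override def buildScan(): RDD[Row] =
    new MyFormatRDD(sqlContext.sparkContext, Seq(path))
}

// The provider class that spark.read.format(...) resolves.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new MyFormatRelation(sqlContext, parameters("path"))
}

With that in place, loading is just spark.read.format("com.example.myformat").load("/path/to/data"), where the package name resolves to the DefaultSource above.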
Thanks for your help.
