Sandeep Joshi wrote:

    So, as you see, I managed to create the required code to return a
    valid schema, and was also able to write unit tests for it.
    I copied "protected[spark]" from the CSV implementation, but I
    commented it out because it breaks compilation and does not seem
    to be required.
    And most importantly, I have no idea how to create a valid
    dataframe to be returned by buildScan so that the data stored on
    disk is not loaded into memory all at once (it may be very large,
    e.g. hundreds of millions of rows).



You are effectively building a datasource for Spark.

You can subclass the RDD class and create your own RDD, which buildScan above will return. That RDD subclass must implement a compute() method returning an Iterator; Spark then pulls rows by calling iterator.next() as the job executes, so the data is streamed rather than loaded into memory all at once.
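
A minimal sketch of what that could look like, assuming a one-partition-per-file split (MyFormatRDD, MyFormatPartition and openReader are illustrative names, not existing Spark API):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical partition: one slice of the on-disk data.
case class MyFormatPartition(index: Int, path: String) extends Partition

class MyFormatRDD(sc: SparkContext, paths: Seq[String])
  extends RDD[Row](sc, Nil) {

  // Illustrative split: one partition per file.
  override protected def getPartitions: Array[Partition] =
    paths.zipWithIndex
      .map { case (p, i) => MyFormatPartition(i, p): Partition }
      .toArray

  // compute() hands Spark an Iterator; rows are read lazily as Spark
  // calls next(), so the file is never materialized in memory at once.
  override def compute(split: Partition, context: TaskContext): Iterator[Row] = {
    val part = split.asInstanceOf[MyFormatPartition]
    // A real reader should also be closed, e.g. via
    // context.addTaskCompletionListener.
    val records = openReader(part.path)
    records.map(fields => Row.fromSeq(fields))
  }

  // Placeholder for your format's streaming record reader.
  private def openReader(path: String): Iterator[Seq[Any]] = ???
}

Your relation's buildScan() would then simply return new MyFormatRDD(sqlContext.sparkContext, paths).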

Look at how the Cassandra connector does it
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraTableScanRDD.scala#L354
Ah yes, that makes sense. Somehow, I was fixating on creating an RDD[Row] instance instead of deriving my own class from RDD[Row].

    I read the documentation here:
    https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/sources/BaseRelation.html
    It says "Concrete implementation should inherit from one of the
    descendant Scan classes", but I could not find any of those
    descendants in the documentation or in the source code.


The scan classes referred to there are these, in addition to CatalystScan at the bottom of the same file:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L245-L277
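
Concretely, the stable scan traits in that file are:

// org.apache.spark.sql.sources (Spark 2.x)
trait TableScan {
  def buildScan(): RDD[Row]
}

trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

Your BaseRelation mixes in whichever of these it can support; PrunedScan and PrunedFilteredScan additionally let Spark push column pruning and filters down into your reader.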
Great, that clarifies the situation.

With that in mind, I was able to create the complete set of classes and work with my custom format.
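
For reference, the overall shape ended up roughly like the following (class names and the "path" option are my own, simplified from the real code):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

// The relation ties the schema and the scan together.
class MyFormatRelation(override val sqlContext: SQLContext, path: String)
  extends BaseRelation with TableScan {

  // Schema derived from the file's own metadata.
  override def schema: StructType = ???

  override def buildScan(): RDD[Row] =
    new MyFormatRDD(sqlContext.sparkContext, Seq(path))
}

// The provider class that spark.read.format(...) resolves.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new MyFormatRelation(sqlContext, parameters("path"))
}

With that in place, loading is just spark.read.format("com.example.myformat").load("/path/to/data"), where the package name resolves to the DefaultSource above.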
Thanks for your help.
