Sandeep Joshi wrote:
So, as you can see, I managed to write the code that returns a
valid schema, and I was also able to write unit tests for it.
I copied "protected[spark]" from the CSV implementation, but I
commented it out because it prevents compilation from being
successful and it does not seem to be required.
Most importantly, I have no idea how to create a valid DataFrame
to return from buildScan so that the data stored on disk is not
loaded into memory all at once (it may be huge, hundreds of
millions of rows).
You are effectively building a data source for Spark.
You can subclass the RDD class and create your own RDD, which is what
buildScan above should return.
This RDD class must implement a compute() method that returns an
Iterator.
Spark will then invoke iterator.next() as it executes.
Look at how the Cassandra connector does it:
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraTableScanRDD.scala#L354
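A rough sketch of what I mean follows. The partition layout and the
MyFormatReader type are made up for illustration (your format decides
how to split the file and how to read one record at a time), but the
two overrides are the ones Spark actually calls:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical description of one slice of the on-disk data; how you
// split the data into partitions is entirely up to your format.
case class MyFormatPartition(index: Int, path: String, start: Long, length: Long)
  extends Partition

class MyFormatRDD(sc: SparkContext, path: String, numSlices: Int)
  extends RDD[Row](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    // Real offsets would come from the file's layout or index.
    Array.tabulate[Partition](numSlices) { i =>
      MyFormatPartition(i, path, start = 0L, length = 0L)
    }

  override def compute(split: Partition, context: TaskContext): Iterator[Row] = {
    val part = split.asInstanceOf[MyFormatPartition]
    val reader = openReader(part)  // assumed helper, see below
    // Rows are produced lazily: each next() reads a single record from
    // disk instead of materialising the whole file in memory.
    new Iterator[Row] {
      override def hasNext: Boolean = reader.hasNext
      override def next(): Row = Row.fromSeq(reader.nextRecord())
    }
  }

  // Assumed helper: opens a streaming reader over one slice of the file.
  private def openReader(part: MyFormatPartition): MyFormatReader = ???
}

// Assumed interface for the format-specific reader; not part of Spark.
trait MyFormatReader {
  def hasNext: Boolean
  def nextRecord(): Seq[Any]
}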
Ah yes, that makes sense. Somehow, I was fixating on creating an RDD[Row]
instance instead of deriving my own class from RDD[Row].
I read the documentation here:
https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/sources/BaseRelation.html
It says "Concrete implementation should inherit from one of the
descendant Scan classes" but I could not find those any of those
descendant in the documentation nor in the source code.
The scan classes referred to there are these, in addition to the
CatalystScan at the bottom of the same file:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L245-L277
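For example, a relation that mixes in TableScan only has to provide a
schema and a buildScan() that returns your RDD. A minimal sketch, tying
it to the RDD from the earlier snippet (the schema and path are
placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The relation just hands back the schema and the lazily evaluated RDD.
class MyFormatRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Placeholder schema; yours would come from the format's metadata.
  override def schema: StructType =
    StructType(Seq(StructField("value", StringType, nullable = true)))

  override def buildScan(): RDD[Row] =
    new MyFormatRDD(sqlContext.sparkContext, path, numSlices = 4)
}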
Great, that clarifies the situation.
With that in mind, I was able to create the complete set of classes and
work with my custom format.
Thanks for your help.