[How-To][SQL] Create a dataframe inside the TableScan.buildScan method of a relation

2017-06-22 Thread OBones
Hello, I'm trying to extend Spark so that it can use our own binary format as a read-only source for pipeline-based computations. I already have a Java class that gives me enough elements to build a complete StructType with enough metadata (NominalAttribute, for instance). It also gives me the r…
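A minimal sketch of what such a relation's schema method could look like, assuming a hypothetical reader class for the binary format; the column names and nominal values are placeholders, not the original poster's actual metadata:

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical relation over the proprietary binary format.
class MyBinaryRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation {

  // Attach NominalAttribute metadata to a column so that downstream
  // ML pipeline stages treat it as categorical.
  override def schema: StructType = {
    val labelMeta = NominalAttribute.defaultAttr
      .withName("label")
      .withValues("yes", "no")
      .toMetadata()
    StructType(Seq(
      StructField("feature1", DoubleType, nullable = false),
      StructField("label", DoubleType, nullable = false, metadata = labelMeta)
    ))
  }
}
```

The metadata travels with the StructField, so a Pipeline fit on the resulting DataFrame can recover the categorical information without a separate indexing pass.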

Re: [How-To][SQL] Create a dataframe inside the TableScan.buildScan method of a relation

2017-06-27 Thread OBones
Sandeep Joshi wrote: So, as you see, I managed to create the required code to return a valid schema, and was also able to write unit tests for it. I copied "protected[spark]" from the CSV implementation, but I commented it out because it prevents compilation from being success…
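The protected[spark] modifier only compiles for code living inside the org.apache.spark package tree, which is why the CSV source can use it and an external source cannot. A sketch of the alternative, with plain public visibility and a buildScan that returns an RDD[Row] (the reader call is a placeholder):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Public class: outside org.apache.spark, protected[spark] does not
// compile, so the relation is simply left public.
class MyBinaryRelation(path: String, override val schema: StructType)
                      (@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // TableScan.buildScan must return an RDD[Row]; here the rows would
  // come from the binary reader, parallelized for illustration only.
  override def buildScan(): RDD[Row] = {
    val rows: Seq[Row] = Seq.empty // replace with the binary reader's output
    sqlContext.sparkContext.parallelize(rows)
  }
}
```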

More efficient RDD.count() implementation

2017-07-12 Thread OBones
Hello, As I have written my own data source, I also wrote a custom RDD[Row] implementation to provide getPartitions and compute overrides. This works very well, but doing some performance analysis I see that, for any given pipeline fit operation, a fair amount of time is spent in RDD.count…
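If the binary format records how many rows each partition holds (for example in a file header), the full scan that the default RDD.count() performs can be skipped by overriding count() to sum that metadata. A sketch under that assumption; BinaryPartition and the counts array are hypothetical:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical partition carrying a row count read from the file header.
case class BinaryPartition(index: Int, numRows: Long) extends Partition

class MyBinaryRDD(sc: SparkContext, counts: Array[Long])
    extends RDD[Row](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    counts.zipWithIndex.map { case (n, i) => BinaryPartition(i, n) }

  override def compute(split: Partition, context: TaskContext): Iterator[Row] =
    Iterator.empty // replace with the actual binary reader for this partition

  // Avoid materializing every row: the per-partition sizes are already known.
  override def count(): Long = counts.sum
}
```

This only helps for calls that go through the RDD directly; a DataFrame-level count() is planned by Catalyst and takes a different path.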