Hello,
I'm trying to extend Spark so that it can use our own binary format as a
read-only source for pipeline-based computations.
I already have a Java class that gives me enough information to build a
complete StructType with the necessary metadata (a NominalAttribute, for
instance).
It also gives me the r…
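
For context, here is a minimal sketch of how a StructType can carry
NominalAttribute metadata. The field names and values are made up for
illustration, not taken from our actual format:

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.types._

object SchemaBuilder {
  def buildSchema(): StructType = {
    // Attach the categorical levels as ML attribute metadata so that
    // downstream pipeline stages can discover them.
    val labelMeta = NominalAttribute.defaultAttr
      .withName("label")
      .withValues("spam", "ham")
      .toMetadata()

    StructType(Seq(
      StructField("label", DoubleType, nullable = false, labelMeta),
      StructField("feature1", DoubleType, nullable = false)
    ))
  }
}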
Sandeep Joshi wrote:
So, as you see, I managed to create the required code to return a
valid schema, and was also able to write unit tests for it.
I copied "protected[spark]" from the CSV implementation, but I
commented it out because it prevents compilation from being
successful.
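
That is expected: private[spark]/protected[spark] qualifiers only compile
for code that lives inside the org.apache.spark package hierarchy, so an
external data source has to drop them. A rough sketch of what the relation
can look like without the modifier; the class and parameter names are
illustrative, and the schema reuses the hypothetical SchemaBuilder from the
earlier sketch:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new MyBinaryRelation(parameters("path"))(sqlContext)
}

class MyBinaryRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // Schema built from the existing Java reader class (assumed helper).
  override def schema: StructType = SchemaBuilder.buildSchema()

  // A real implementation would decode the binary file here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}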
Hello,
As I have written my own data source, I also wrote a custom RDD[Row]
implementation to provide the getPartitions and compute overrides.
This works very well, but doing some performance analysis I see that,
for any given pipeline fit operation, a fair amount of time is spent in
RDD.count.
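
For reference, the overrides look roughly like this. The partitioning
scheme, class names, and split sizes are simplified placeholders, not the
actual reader:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

case class BinaryPartition(index: Int, offset: Long, length: Long)
  extends Partition

class BinaryFileRDD(sc: SparkContext, path: String, numSplits: Int)
  extends RDD[Row](sc, Nil) {

  // Split the file into fixed-size byte ranges; real code would use the
  // actual file size and record boundaries.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSplits)(i => BinaryPartition(i, i * 1024L, 1024L))

  // Decode the rows for one byte range; stubbed out here. The offset and
  // length of the partition would drive the binary reader.
  override def compute(split: Partition, context: TaskContext): Iterator[Row] = {
    val part = split.asInstanceOf[BinaryPartition]
    Iterator.empty
  }
}

If the count is triggering a full rescan of the source on every fit,
persisting the data before fitting (e.g. calling cache() on the DataFrame)
is a common first mitigation.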