Hello,

At CERN we are developing a Big Data system called NXCALS that uses Spark as
its extraction API.
We have implemented a custom datasource that wraps two existing ones
(Parquet and HBase) in order to hide the implementation details (location of
the Parquet files, HBase tables, etc.) and to provide an abstraction layer
for our users.
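
To give an idea, the current wrapper is essentially a V1 RelationProvider
that resolves the storage location and then delegates to the built-in
Parquet reader. This is a heavily simplified sketch with made-up names, not
our production code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.StructType

    // Registered via format("nxcals"); hides where the data actually lives.
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation = {
        // In reality the location comes from our metadata service.
        val path = parameters("path")
        new NxcalsRelation(sqlContext, path)
      }
    }

    class NxcalsRelation(val sqlContext: SQLContext, path: String)
        extends BaseRelation with TableScan {

      private def parquetDf = sqlContext.sparkSession.read.parquet(path)

      override def schema: StructType = parquetDf.schema

      // Delegates to the built-in Parquet reader but hands back an RDD[Row];
      // this is where we suspect the Catalyst/Parquet optimisations get lost.
      override def buildScan(): RDD[Row] = parquetDf.rdd
    }
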
We have now reached the stage of running performance tests on our data, and
we have noticed that this approach does not deliver the performance we
observe with pure Spark. In other words, reading a Parquet file with some
simple predicates is about 15 times slower when the same code is executed
from within a custom datasource (one that simply uses Spark to read the
Parquet files).
After some investigation we have learnt that Spark does not apply the same
optimisations in both cases.
We have seen that Spark 2.3.0 introduces a new DataSource V2 API that
abstracts away from the SparkSession and focuses on a low-level Row API.
Could you give us some suggestions on how to correctly implement our
datasource using the V2 API?
Is this the right approach at all?
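
For reference, what we have in mind (as far as we understand the 2.3.0
interfaces) is something along these lines. This is only a toy skeleton with
invented names that serves a single hard-coded row, just to show where the
pushed filters and pruned columns would arrive:

    import java.util.{List => JList}
    import scala.collection.JavaConverters._

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader._
    import org.apache.spark.sql.types.StructType

    class NxcalsV2Source extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader =
        new NxcalsDataSourceReader()
    }

    class NxcalsDataSourceReader extends DataSourceReader
        with SupportsPushDownFilters with SupportsPushDownRequiredColumns {

      // Toy schema; in reality this would come from our metadata service.
      private var requiredSchema: StructType =
        new StructType().add("device", "string").add("value", "double")
      private var filters: Array[Filter] = Array.empty

      override def readSchema(): StructType = requiredSchema

      // Catalyst hands us the predicates; we would forward them to the
      // underlying Parquet/HBase reads and return the ones we cannot handle.
      override def pushFilters(pushed: Array[Filter]): Array[Filter] = {
        filters = pushed
        Array.empty // pretend we can handle everything
      }
      override def pushedFilters(): Array[Filter] = filters

      override def pruneColumns(required: StructType): Unit = {
        requiredSchema = required
      }

      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
        List[DataReaderFactory[Row]](new DummyFactory(requiredSchema)).asJava
    }

    // Toy partition that emits a single row of nulls, just to be runnable.
    class DummyFactory(schema: StructType) extends DataReaderFactory[Row] {
      override def createDataReader(): DataReader[Row] = new DataReader[Row] {
        private var consumed = false
        override def next(): Boolean = { val hasNext = !consumed; consumed = true; hasNext }
        override def get(): Row = Row.fromSeq(Seq.fill(schema.size)(null))
        override def close(): Unit = ()
      }
    }

which would then be used as
spark.read.format(classOf[NxcalsV2Source].getName).load().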

What we want to achieve is to join existing datasources while adding some
level of abstraction on top.
At the same time we want to benefit from all the Catalyst and Parquet
optimisations that already exist for the original sources.
We also do not want to reimplement access to the Parquet files or HBase at a
low level (e.g. Row by Row); we would like to simply profit from the Dataset
API.
We could have achieved the same by providing an external library on top of
Spark, but the datasource approach looked like a more elegant solution; only
its performance is still far from what we need.
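
For completeness, by "external library" we mean just a facade of this kind
(again purely illustrative; the names and path layout are invented):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // The user calls this instead of spark.read.format("nxcals");
    // the returned DataFrame comes straight from the built-in Parquet reader,
    // so all of its optimisations are preserved.
    object NxcalsExtraction {
      def entityData(spark: SparkSession, system: String, device: String): DataFrame = {
        val path = resolveParquetPath(system, device) // placeholder for our metadata lookup
        spark.read.parquet(path).filter(s"device = '$device'")
      }

      private def resolveParquetPath(system: String, device: String): String =
        s"/project/nxcals/$system/$device" // illustrative only
    }
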

Any help or direction in this matter would be greatly appreciated, as we
have only just started to build up our Spark expertise.

Best regards,
Jakub Wozniak
Software Engineer
CERN


