Hi experts, I'd be very grateful if someone could help.
Thanks,
Abhishek

On Fri, May 24, 2019 at 7:06 PM Abhishek Somani <abhisheksoman...@gmail.com> wrote:

> Hi experts,
>
> I am trying to create a custom Spark DataSource (v1) to read from a
> transactional data endpoint, and I need to acquire a lock with the
> endpoint before fetching data and release the lock after reading. Note
> that the lock acquisition and release need to happen in the Driver JVM.
>
> I have created a custom RDD for this purpose, and tried acquiring the
> lock in MyRDD.getPartitions() and releasing it at the end of the job by
> registering a QueryExecutionListener.
>
> As I have since learnt, this is not the right approach, because the RDD
> can be reused by further actions WITHOUT getPartitions() being called
> again (an RDD caches its partitions). For example, if someone calls
> Dataset.collect() twice, MyRDD.getPartitions() is invoked on the first
> call, so I acquire the lock and release it at the end. On the second
> collect(), however, getPartitions() is NOT called again, as the RDD is
> reused and its partitions have already been cached.
>
> Can someone advise me on the right places to acquire and release a lock
> with my data endpoint in this scenario?
>
> Thanks a lot,
> Abhishek Somani