Hi experts,

I am trying to create a custom Spark DataSource (v1) to read from a transactional data endpoint, and I need to acquire a lock with the endpoint before fetching data and release the lock after reading. Note that the lock acquisition and release need to happen in the driver JVM.
I have created a custom RDD for this purpose, tried acquiring the lock in MyRDD.getPartitions(), and released it at the end of the job by registering a QueryExecutionListener. As I have since learnt, this is not the right approach, because the RDD can get reused on further actions WITHOUT getPartitions() being called again (the partitions of an RDD get cached). For example, if someone calls Dataset.collect() twice, MyRDD.getPartitions() is invoked the first time, so I acquire the lock and release it at the end. The second time collect() is called, however, getPartitions() is NOT called again, because the RDD is reused and its partitions have already been cached in the RDD. The sketch below shows my current setup.

Can someone advise me on the right places to acquire and release a lock with my data endpoint in this scenario?
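For reference, here is a minimal, self-contained sketch of what I am doing. LockClient, LockDemo, and the stubbed lock calls are simplified placeholders standing in for my endpoint's actual locking API, not the real code:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical stand-in for the endpoint's locking API.
object LockClient {
  def acquire(): Unit = println("lock acquired")  // real endpoint call goes here
  def release(): Unit = println("lock released")  // real endpoint call goes here
}

class MyRDD(sc: SparkContext) extends RDD[String](sc, Nil) {

  // Runs on the driver, but only the FIRST time the RDD is evaluated:
  // Spark caches the result in RDD.partitions, so a second action on
  // the same RDD never re-enters this method.
  override def getPartitions: Array[Partition] = {
    LockClient.acquire()
    Array(new Partition { override def index: Int = 0 })
  }

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty  // fetch rows from the endpoint here
}

object LockDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Release the lock on the driver when the query finishes, success or failure.
    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        LockClient.release()
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
        LockClient.release()
    })

    import spark.implicits._
    val ds = new MyRDD(spark.sparkContext).toDS()
    ds.collect()  // getPartitions runs -> lock acquired, then released by the listener
    ds.collect()  // partitions already cached -> getPartitions is NOT called again
    spark.stop()
  }
}
```

Thanks a lot,
Abhishek Somani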