Hi Theo, We had a very similar problem with one of our Spark Streaming jobs. The best solution was to create a custom source that keeps all external records in a cache, periodically re-reads the external data, and compares it to the cache. All changed records were then broadcast to the task managers. We also tried to implement background loading in a separate thread, but that solution was more complicated: we needed to build a shadow copy of the cache and then quickly switch the two. And Spark Streaming added further problems of its own.
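To make the cache-comparison idea concrete, here is a minimal sketch in plain Python. It is not our actual code: the class name `ChangeDetector`, the `poll` method, and the `fetch_all` loader are all made up for illustration, and it assumes the external data fits in memory as a key/value mapping.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Change = Tuple[Hashable, Optional[object]]


class ChangeDetector:
    """Keeps a cache of external records and reports only what changed."""

    def __init__(self, fetch_all: Callable[[], Dict[Hashable, object]]):
        # fetch_all is a hypothetical loader that returns the full external data set.
        self._fetch_all = fetch_all
        self._cache: Dict[Hashable, object] = {}

    def poll(self) -> List[Change]:
        """Re-read the external data and return the differences to the cache.

        New or updated records come back as (key, value);
        records that disappeared come back as (key, None).
        """
        fresh = self._fetch_all()
        changes: List[Change] = [
            (key, value)
            for key, value in fresh.items()
            if key not in self._cache or self._cache[key] != value
        ]
        changes += [(key, None) for key in self._cache if key not in fresh]
        # Replace the cache with the snapshot just read.
        self._cache = dict(fresh)
        return changes
```

The custom source would call `poll()` on a timer and emit each returned change downstream, so that only the differences are broadcast rather than the whole data set.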
Hope this helps, Maxim.