Hi,

We have a Kafka cluster running in production, and there are two Spark Streaming jobs (J1 and J2) that fetch data from the same topic.
We noticed that if one of the two jobs (say J1) starts reading from an old offset (J1 had failed for two hours, and when we restarted it after fixing the failure, its offset was two hours behind), that old data is read from disk instead of from the OS page cache. When this happens, the other job's (J2) throughput drops even though J2's offset is recent. Since the recent data is most likely still in memory, we are not sure why J2's throughput is reduced.

Has anyone come across such an issue in production? If so, how did you fix it?

-Mayur