Hi! We are developing a scoring system for recruitment. A recruiter enters vacancy requirements, we score tens of thousands of CVs against those requirements, and return, e.g., the top 10 matches. We do not use full-text search, and sometimes we don't even filter the input CVs prior to scoring (some vacancies have no mandatory requirements that could serve as an effective filter).
So we have a scoring function F(CV, VACANCY) that is currently implemented in SQL and runs on a PostgreSQL cluster. In the worst case, F is executed once for every CV in the database. The VACANCY part is fixed within one query but changes between queries, so there is very little we can precompute. We expect to have about 100,000,000 CVs next year, and we do not expect our current implementation to deliver the desired low-latency response (<1 s) on 100M CVs. So we are looking for a horizontally scalable, fault-tolerant in-memory solution. Would Spark be useful for our task? All the tutorials I could find describe stream processing or ML applications. Which Spark extensions/backends could be useful? With best regards, Segey Melekhin
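For illustration, here is a toy stand-in for the pattern I mean: score every CV against one fixed vacancy, then keep only the top k. The `score` function and the field names are made up for this sketch (the real F is our SQL function); only the score-then-top-k shape matters. In a Spark job this would presumably become a broadcast of the VACANCY plus a per-partition top-k merge over cached CVs.

```python
import heapq

def score(cv, vacancy):
    """Toy stand-in for F(CV, VACANCY): count matching required skills.
    The real F is our SQL scoring function; this only mimics its shape."""
    return len(set(cv["skills"]) & set(vacancy["required_skills"]))

def top_k(cvs, vacancy, k=10):
    """Score every CV against one fixed vacancy and keep the k best.

    heapq.nlargest keeps only k candidates in memory at a time, so the
    full list of scores is never materialized -- the same pattern a
    distributed job would apply per partition before merging the
    partial top-k lists from each worker.
    """
    return heapq.nlargest(k, cvs, key=lambda cv: score(cv, vacancy))

# Hypothetical sample data, not our real schema.
cvs = [
    {"id": 1, "skills": ["sql", "python"]},
    {"id": 2, "skills": ["java"]},
    {"id": 3, "skills": ["sql", "python", "spark"]},
]
vacancy = {"required_skills": ["sql", "python", "spark"]}

best = top_k(cvs, vacancy, k=2)
print([cv["id"] for cv in best])  # -> [3, 1]
```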