Yes, Spark will be useful for the following areas of your application: 1. Running the same scoring function on every CV in parallel. 2. Improving the scoring function through better access to classification and clustering algorithms, within and beyond MLlib.
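To make point 1 concrete, here is a minimal sketch of the score-everything-then-take-top-k pattern in plain Python. The scoring function F, the skill-overlap logic, and the sample data are all hypothetical placeholders standing in for your real SQL-based F(CV, VACANCY); the comment shows the equivalent Spark RDD call that would run the same map in parallel across the cluster.

```python
# Sketch (assumed, simplified) of the pattern: score every CV against one
# fixed vacancy, then keep the top matches. F here is a toy placeholder.
import heapq

def F(cv, vacancy):
    """Toy score: fraction of the vacancy's required skills the CV covers."""
    required = vacancy["skills"]
    if not required:
        return 0.0
    return len(required & cv["skills"]) / len(required)

vacancy = {"skills": {"python", "sql", "spark"}}  # fixed per query
cvs = [
    {"id": 1, "skills": {"python", "sql"}},
    {"id": 2, "skills": {"java"}},
    {"id": 3, "skills": {"python", "sql", "spark"}},
]

# Score all CVs and keep the best k -- the step Spark distributes across
# partitions, roughly:
#   sc.parallelize(cvs).map(lambda cv: (F(cv, vacancy), cv["id"])).top(10)
top = heapq.nlargest(2, ((F(cv, vacancy), cv["id"]) for cv in cvs))
print(top)
```

In Spark the vacancy would typically be shipped to executors as a broadcast variable, since it is fixed for the duration of one query, and only the per-CV map plus the top-k reduction run on each partition.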
These are the first benefits you can start with; you can then think of further improvements specific to your use cases, like pre-annotated CVs etc.

Best
Ayan

On 25 May 2015 15:59, "Сергей Мелехин" <cpro...@gmail.com> wrote:

> Hi!
> We are developing a scoring system for recruitment. A recruiter enters
> vacancy requirements, and we score tens of thousands of CVs against these
> requirements and return e.g. the top 10 matches.
> We do not use full-text search, and sometimes we do not even filter input
> CVs prior to scoring (some vacancies have no mandatory requirements that
> could be used effectively as a filter).
>
> So we have a scoring function F(CV, VACANCY) that is currently implemented
> in SQL and runs on a PostgreSQL cluster. In the worst case, F is executed
> once on every CV in the database. The VACANCY part is fixed for one query
> but changes between queries, and there is very little we can process in
> advance.
>
> We expect to have about 100 000 000 CVs within the next year, and we do
> not expect our current implementation to offer the desired low-latency
> response (<1 s) on 100M CVs. So we are looking for a horizontally
> scalable, fault-tolerant in-memory solution.
>
> Will Spark be useful for our task? All the tutorials I could find describe
> stream processing or ML applications. Which Spark extensions/backends
> could be useful?
>
> With best regards, Sergey Melekhin