Hi, ankur! Thanks for your reply! CVs are just a bunch of IDs; each ID represents an object of some class (e.g. class=JOB, object=SW Developer). We have already processed the texts and extracted all the facts, so we don't need to do any text processing in Spark. We just need to run a scoring function on many, many CVs and return the top 10 matches.
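
To illustrate, here is roughly what we have in mind (just a sketch; the fact layout, the HDFS path and the weights below are made up, not our real schema): keep all CVs cached in memory as an RDD, broadcast the vacancy per query, score every CV and take the top 10 with takeOrdered:

    import org.apache.spark.{SparkConf, SparkContext}

    object CvScoring {
      // A CV is just a set of fact IDs; a vacancy maps fact ID -> weight.
      // (Hypothetical layout; our real facts are class/object pairs.)
      type CV = (Long, Set[Int])
      type Vacancy = Map[Int, Double]

      // Stand-in scoring function: sum of the weights of matched facts.
      def score(facts: Set[Int], vacancy: Vacancy): Double =
        facts.iterator.map(id => vacancy.getOrElse(id, 0.0)).sum

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cv-scoring"))

        // Load once (saved earlier with saveAsObjectFile) and cache in
        // memory; this is the part that must fit in RAM across the cluster.
        val cvs = sc.objectFile[CV]("hdfs:///cvs").cache()

        // The VACANCY part changes per query, so broadcast it per request.
        val vacancy: Vacancy = Map(1 -> 1.0, 2 -> 0.5) // stand-in query
        val bVacancy = sc.broadcast(vacancy)

        // Distributed top-k: only the 10 best records per partition are
        // shipped back to the driver.
        val top10 = cvs
          .map { case (id, facts) => (id, score(facts, bVacancy.value)) }
          .takeOrdered(10)(Ordering.by[(Long, Double), Double](_._2).reverse)

        top10.foreach { case (id, s) => println(s"CV $id -> $s") }
        sc.stop()
      }
    }

Of course the SparkContext and the cached RDD would have to stay alive between requests (some long-running driver or job-server setup), otherwise job submission alone would eat most of the <1 s budget.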
Best regards, Сергей Мелехин.

2015-05-25 16:28 GMT+10:00 ankur chauhan <an...@malloc64.com>:

> Hi,
>
> I am sure you can use Spark for this, but it seems like a problem that
> should be delegated to a text-based indexing technology like Elasticsearch
> or something based on Lucene to serve the requests. Spark can be used to
> prepare the data that is fed to the indexing service.
>
> Using Spark directly seems like it would involve a lot of repeated
> computation between requests, which can be avoided.
>
> There are a bunch of Spark-Elasticsearch bindings that can be used to make
> the process easier.
>
> Again, Spark SQL can help you convert most of the logic directly to Spark
> jobs, but I would suggest exploring text indexing technologies too.
>
> -- ankur
> ------------------------------
> From: Сергей Мелехин <cpro...@gmail.com>
> Sent: 5/24/2015 10:59 PM
> To: user@spark.apache.org
> Subject: Using Spark like a search engine
>
> Hi!
> We are developing a scoring system for recruitment. A recruiter enters
> vacancy requirements, and we score tens of thousands of CVs against these
> requirements and return e.g. the top 10 matches.
> We do not use full-text search and sometimes don't even filter the input
> CVs prior to scoring (some vacancies have no mandatory requirements that
> can be used as a filter effectively).
>
> So we have a scoring function F(CV, VACANCY) that is currently implemented
> in SQL and runs on a PostgreSQL cluster. In the worst case F is executed
> once on every CV in the database. The VACANCY part is fixed for one query
> but changes between queries, so there is very little we can precompute.
>
> We expect to have about 100 000 000 CVs in the next year, and do not
> expect our current implementation to offer the desired low-latency
> response (<1 s) on 100M CVs. So we are looking for a horizontally scalable
> and fault-tolerant in-memory solution.
>
> Will Spark be useful for our task? All the tutorials I could find describe
> stream processing or ML applications. Which Spark extensions/backends
> could be useful?
>
>
> With best regards, Sergey Melekhin