Thanks, I'll give it a try! Best regards, Сергей Мелехин.
2015-05-26 12:56 GMT+10:00 Alex Chavez <alexkcha...@gmail.com>:

> Сергей,
>
> A simple implementation would be to create a DataFrame of CVs by issuing a
> Spark SQL query against your Postgres database, persist it in memory, and
> then map F over it at query time and return the top N
> <https://spark.apache.org/docs/1.3.1/api/scala/org/apache/spark/rdd/RDD.html#top(num:Int)(implicitord:Ordering[T]):Array[T]>
> on the mapped data structure. However, this might not meet your latency
> needs depending on how expensive your scoring function F is (I imagine it's
> something like computing the overlap or Jaccard similarity between the
> vacancy IDs and the set of IDs for each CV). It might be worth trying.
>
> For example, following a similar strategy on a cluster with ~100 GB of RAM
> and ~160 cores, I get a sorted list of the top 10,000 documents from a set
> of 50 million documents in less than ten seconds per query. In my case, the
> cost of scoring each query-document pair is dominated by computing ~50 dot
> products of 100-dimensional vectors.
>
> Best,
> Alex
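A minimal sketch of the strategy Alex describes above, written against the Spark 1.3-era API he links to. The JDBC URL, table and column layout, and the Jaccard-style overlap standing in for F(CV, VACANCY) are illustrative assumptions rather than details from the thread, and the Postgres JDBC driver is assumed to be on the classpath.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TopCvs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cv-scoring"))
    val sqlContext = new SQLContext(sc)

    // Load the CV table from Postgres over JDBC and keep it in cluster memory.
    // The URL, table name and column layout are made up for the example.
    val cvs = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:postgresql://db-host:5432/hr?user=spark&password=secret",
      "dbtable" -> "cvs"))
    cvs.cache()
    cvs.count()  // force materialization so query-time scans hit memory

    // Per query: score every CV against the vacancy and return the top 10.
    // Assumes column 0 is the CV id and column 1 holds the extracted fact ids
    // as a comma-separated string; Jaccard overlap stands in for the real F.
    val vacancyIds = Set(101L, 205L, 306L)  // hypothetical requirement ids

    val top10 = cvs.rdd.map { row =>
      val cvId  = row.getLong(0)
      val cvIds = row.getString(1).split(',').map(_.toLong).toSet
      val union = (cvIds ++ vacancyIds).size
      val score = if (union == 0) 0.0 else (cvIds & vacancyIds).size.toDouble / union
      (score, cvId)
    }.top(10)  // RDD.top with the default tuple ordering sorts by score first

    top10.foreach(println)
  }
}

Whether something like this meets the <1 s target at 100M CVs depends mostly on how expensive F is per CV and on how many cores the cached scan is spread across, which is the caveat Alex raises above.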
> On Mon, May 25, 2015 at 2:59 AM, Сергей Мелехин <cpro...@gmail.com> wrote:
>
>> Hi, ankur!
>> Thanks for your reply!
>> CVs are just a bunch of IDs; each ID represents some object of some class
>> (e.g. class=JOB, object=SW Developer). We have already processed the texts
>> and extracted all the facts, so we don't need to do any text processing in
>> Spark, just run the scoring function on many, many CVs and return the top
>> 10 matches.
>>
>> Best regards, Сергей Мелехин.
>>
>> 2015-05-25 16:28 GMT+10:00 ankur chauhan <an...@malloc64.com>:
>>
>>> Hi,
>>>
>>> I am sure you can use Spark for this, but it seems like a problem that
>>> should be delegated to a text-based indexing technology like Elasticsearch
>>> or something based on Lucene to serve the requests. Spark can be used to
>>> prepare the data that is fed to the indexing service.
>>>
>>> Using Spark directly seems like there would be a lot of repeated
>>> computation between requests that could be avoided.
>>>
>>> There are a bunch of Spark-Elasticsearch bindings that can be used to
>>> make the process easier.
>>>
>>> Again, Spark SQL can help you convert most of the logic directly to Spark
>>> jobs, but I would suggest exploring text-indexing technologies too.
>>>
>>> -- ankur
>>> ------------------------------
>>> From: Сергей Мелехин <cpro...@gmail.com>
>>> Sent: 5/24/2015 10:59 PM
>>> To: user@spark.apache.org
>>> Subject: Using Spark like a search engine
>>>
>>> Hi!
>>> We are developing a scoring system for recruitment. A recruiter enters
>>> vacancy requirements, and we score tens of thousands of CVs against these
>>> requirements and return, e.g., the top 10 matches.
>>> We do not use full-text search and sometimes don't even filter the input
>>> CVs prior to scoring (some vacancies have no mandatory requirements that
>>> can be used effectively as a filter).
>>>
>>> So we have a scoring function F(CV, VACANCY) that is currently implemented
>>> in SQL and runs on a PostgreSQL cluster. In the worst case, F is executed
>>> once on every CV in the database. The VACANCY part is fixed for one query
>>> but changes between queries, and there is very little we can process in
>>> advance.
>>>
>>> We expect to have about 100 000 000 CVs in the next year, and we do not
>>> expect our current implementation to offer the desired low-latency
>>> response (<1 s) on 100M CVs. So we are looking for a horizontally scalable
>>> and fault-tolerant in-memory solution.
>>>
>>> Will Spark be useful for our task? All the tutorials I could find describe
>>> stream processing or ML applications. What Spark extensions/backends can
>>> be useful?
>>>
>>> With best regards, Sergey Melekhin
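On ankur's suggestion of preparing the data in Spark and serving queries from a text-indexing backend, a minimal sketch using the elasticsearch-spark (elasticsearch-hadoop) connector might look like this. The Elasticsearch host, index name and document layout are illustrative assumptions, and the connector jar is assumed to be on the classpath.

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs via implicit conversion

object IndexCvs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cv-indexing")
      .set("es.nodes", "es-host:9200")  // Elasticsearch cluster to write to
    val sc = new SparkContext(conf)

    // Hypothetical documents, one per CV, carrying the extracted fact ids.
    // In practice these would come from the same Postgres extract as above.
    val docs = sc.parallelize(Seq(
      Map("cv_id" -> 1, "fact_ids" -> Seq(101, 205, 306)),
      Map("cv_id" -> 2, "fact_ids" -> Seq(101, 417))))

    // Bulk-index into the "cvs/cv" index/type; per-vacancy queries are then
    // answered by Elasticsearch instead of rescanning all CVs in Spark.
    docs.saveToEs("cvs/cv")
  }
}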