Thanks, I'll give it a try!

Best regards, Сергей Мелехин.

2015-05-26 12:56 GMT+10:00 Alex Chavez <alexkcha...@gmail.com>:

> Сергей,
> A simple implementation would be to create a DataFrame of CVs by issuing a
> Spark SQL query against your Postgres database, persist it in memory, and
> then map F over it at query time and return the top N
> <https://spark.apache.org/docs/1.3.1/api/scala/org/apache/spark/rdd/RDD.html#top(num:Int)(implicitord:Ordering[T]):Array[T]>
> on the mapped data structure. However, this might not meet your latency
> needs depending on how expensive your scoring function F is (I imagine it's
> something like computing the overlap or Jaccard similarity between the
> vacancy IDs and the set of IDs for each CV). It might be worth trying.
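>
> A rough sketch of that flow in Scala against the 1.3 APIs (the table name,
> column positions, and the Jaccard-style scoring are placeholders for your
> schema and your real F; it assumes a spark-shell SparkContext `sc` and the
> Postgres JDBC driver on the classpath):
>
>   import org.apache.spark.sql.SQLContext
>   import org.apache.spark.storage.StorageLevel
>
>   val sqlContext = new SQLContext(sc)
>
>   // One row per (cv_id, fact_id) pair, read straight from Postgres.
>   val cvFacts = sqlContext.load("jdbc", Map(
>     "url"     -> "jdbc:postgresql://dbhost:5432/recruiting",
>     "dbtable" -> "cv_facts"))
>
>   // Collapse to one ID set per CV and keep it in memory across queries.
>   val cvs = cvFacts
>     .map(r => (r.getLong(0), r.getLong(1)))
>     .groupByKey()
>     .mapValues(_.toSet)
>     .persist(StorageLevel.MEMORY_ONLY)
>   cvs.count()  // force materialization of the cache
>
>   // At query time: broadcast the vacancy IDs, score every CV, return top N.
>   def topMatches(vacancyIds: Set[Long], n: Int): Array[(Long, Double)] = {
>     val vacancy = sc.broadcast(vacancyIds)
>     cvs.map { case (cvId, ids) =>
>       val overlap = ids.intersect(vacancy.value).size.toDouble
>       val score   = overlap / (ids.size + vacancy.value.size - overlap)  // Jaccard
>       (cvId, score)
>     }.top(n)(Ordering.by[(Long, Double), Double](_._2))
>   }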
>
> For example, following a similar strategy on a cluster with ~100GB RAM
> and ~160 cores, I get a sorted list of the top 10,000 documents from a set of
> 50 million documents in less than ten seconds per query. In my case, the
> cost of scoring each query-document pair is dominated by computing ~50 dot
> products of 100-dimensional vectors.
>
> Best,
> Alex
>
> On Mon, May 25, 2015 at 2:59 AM, Сергей Мелехин <cpro...@gmail.com> wrote:
>
>> Hi, ankur!
>> Thanks for your reply!
>> CVs are just a bunch of IDs; each ID represents some object of some class
>> (e.g. class=JOB, object=SW Developer). We have already processed the texts
>> and extracted all the facts. So we don't need to do any text processing in
>> Spark, just to run a scoring function over many, many CVs and return the top
>> 10 matches.
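>>
>> To make that concrete, roughly what such an F could look like over ID sets
>> (the Fact/CV shapes and the per-class weights below are simplified stand-ins,
>> not our real schema):
>>
>>   // Hypothetical data shapes.
>>   case class Fact(clazz: String, objectId: Long)   // e.g. clazz = "JOB"
>>   case class CV(id: Long, facts: Set[Fact])
>>
>>   // One possible F: weighted overlap between vacancy facts and CV facts.
>>   def score(cv: CV, vacancy: Set[Fact], weights: Map[String, Double]): Double =
>>     cv.facts.intersect(vacancy).toSeq
>>       .map(f => weights.getOrElse(f.clazz, 1.0))
>>       .sum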
>>
>> Best regards, Сергей Мелехин.
>>
>> 2015-05-25 16:28 GMT+10:00 ankur chauhan <an...@malloc64.com>:
>>
>>> Hi,
>>>
>>> I am sure you can use Spark for this, but it seems like a problem that
>>> should be delegated to a text-based indexing technology like Elasticsearch
>>> or something else based on Lucene to serve the requests. Spark can be used
>>> to prepare the data that is fed to the indexing service.
>>>
>>> Using Spark directly seems like it would involve a lot of repeated
>>> computation between requests, which could be avoided.
>>>
>>> There are a bunch of spark-elasticsearch bindings that can be used to
>>> make the process easier.
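>>>
>>> For illustration, pushing an RDD into Elasticsearch with the
>>> elasticsearch-hadoop Spark integration looks roughly like this (the node
>>> address and the "cvs/cv" index are made up; it assumes the
>>> elasticsearch-spark artifact is on the classpath):
>>>
>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>   import org.elasticsearch.spark._   // adds saveToEs to RDDs
>>>
>>>   val conf = new SparkConf()
>>>     .setAppName("index-cvs")
>>>     .set("es.nodes", "es-host:9200")
>>>   val sc = new SparkContext(conf)
>>>
>>>   // Prepare the documents in Spark (trivially here) and index them;
>>>   // the per-vacancy querying would then be done on the Elasticsearch side.
>>>   val docs = sc.parallelize(Seq(
>>>     Map("cv_id" -> 1L, "fact_ids" -> Seq(101L, 202L)),
>>>     Map("cv_id" -> 2L, "fact_ids" -> Seq(101L, 303L))))
>>>   docs.saveToEs("cvs/cv")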
>>>
>>> Again, Spark SQL can help you convert most of the logic directly to Spark
>>> jobs, but I would suggest exploring text-indexing technologies too.
>>>
>>> -- ankur
>>> ------------------------------
>>> From: Сергей Мелехин <cpro...@gmail.com>
>>> Sent: 5/24/2015 10:59 PM
>>> To: user@spark.apache.org
>>> Subject: Using Spark like a search engine
>>>
>>> Hi!
>>> We are developing a scoring system for recruitment. A recruiter enters
>>> vacancy requirements, and we score tens of thousands of CVs against these
>>> requirements and return, e.g., the top 10 matches.
>>> We do not use full-text search, and sometimes we don't even filter the input
>>> CVs prior to scoring (some vacancies have no mandatory requirements that
>>> could be used as an effective filter).
>>>
>>> So we have a scoring function F(CV, VACANCY) that is currently implemented
>>> in SQL and runs on a PostgreSQL cluster. In the worst case, F is executed
>>> once for every CV in the database. The VACANCY part is fixed within one
>>> query, but changes between queries, so there's very little we can precompute
>>> in advance.
>>>
>>> We expect to have about 100,000,000 CVs within the next year, and we do not
>>> expect our current implementation to offer the desired low-latency response
>>> (<1 s) on 100M CVs. So we are looking for a horizontally scalable and
>>> fault-tolerant in-memory solution.
>>>
>>> Will Spark be useful for our task? All the tutorials I could find describe
>>> stream processing or ML applications. Which Spark extensions/backends could
>>> be useful?
>>>
>>>
>>> With best regards, Sergey Melekhin
>>>
>>
>>
>
