Re: solr in spark

Costin Leau Wed, 29 Apr 2015 08:15:44 -0700

On 4/29/15 6:02 PM, Jeetendra Gangele wrote:

Thanks for the detailed explanation. My only worry is that searching all the
combinations of company names through ES looks hard.

I'm not sure what makes you think "ES looks hard". Have you tried browsing the
Elasticsearch reference [1] or the definitive guide [2]?


[1] http://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
[2] http://www.elastic.co/guide/en/elasticsearch/guide/current/index.html

In Solr we define everything in XML files, like all the attributes of
WordDocumentFilterFactory and the shingles factory. How do we do this in
Elasticsearch?

See the links above, the IRC channel or the mailing list. I don't want to
derail this thread any longer, so I'll wrap up by pointing to two of the many
resources that pop up on Google: a blog post on shingles and a post from
Found.no on text analysis and shingles:

https://www.elastic.co/blog/searching-with-shingles
https://www.found.no/foundation/text-analysis-part-1/#optimizing-phrase-searches-with-shingles
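
To make the XML-versus-JSON point a bit more concrete, here is a minimal
sketch (not taken from either post) of how a shingle filter can be declared
in index settings. It is written as a Python dict that serializes to the JSON
body used when creating an index. The names company_shingles and
company_name_analyzer are made up for illustration and the sizes are
arbitrary. The shingles() helper is only a rough stand-in for what the filter
emits, so the idea can be tried without a cluster.

```python
import json

# Index settings declaring a shingle token filter and a custom analyzer.
# "company_shingles" and "company_name_analyzer" are hypothetical names.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "company_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True,
                }
            },
            "analyzer": {
                "company_name_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "company_shingles"],
                }
            },
        }
    }
}


def shingles(text, min_size=2, max_size=3, output_unigrams=True):
    """Rough stand-in for the shingle filter: lowercase, split on
    whitespace, and emit word n-grams of min_size..max_size words."""
    tokens = text.lower().split()
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    # JSON body for index creation, followed by the tokens for one name.
    print(json.dumps(INDEX_SETTINGS, indent=2))
    print(shingles("Acme Holdings Ltd"))
```

Indexing the word pairs and triples alongside the single words is what lets
a multi-word company name match as a unit instead of as unrelated terms.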

If you need more help, do reach out to the Elasticsearch mailing list:
https://www.elastic.co/community

Cheers,

On 29 April 2015 at 20:03, Costin Leau <costin.l...@gmail.com> wrote:

# Disclaimer: I'm an employee of Elastic (the company behind Elasticsearch)
and lead of the Elasticsearch Hadoop integration.

Some things to clarify on the Elasticsearch side:

1. Elasticsearch is a distributed, real-time search and analytics engine.
Search is just one aspect of it, and it can work with any type of data
(text, image encoding, etc.). GitHub, Wikipedia and Stack Overflow are
popular examples of well-known websites powered by Elasticsearch. You can
find plenty of use cases and more information on the website [1].

2. Elasticsearch is stand-alone and can run on the same machines as other
services or on separate ones. In fact, on the _same_ machine, one can run
_multiple_ Elasticsearch nodes (and thus clusters). For best performance,
dedicated hardware (as Nick suggested) is the way to go.

3. The Elasticsearch Spark integration has been available for over a year
through Map/Reduce, and through the native (Scala and Java) API since Q3 last
year. There are plenty of features available, all fully documented here [2].
Better yet, there's a talk by yours truly from Spark Summit East [3] that
focuses on exactly this topic.

4. elasticsearch-hadoop is certified by Databricks, Cloudera, Hortonworks and
MapR, and supports both Spark core and Spark SQL 1.0-1.3. There are binaries
for Scala 2.10 and 2.11. For what it's worth, it provided one of the first
(if not the first) implementations of the DataSource API outside Databricks,
which means not only using Elasticsearch in a declarative fashion but also
having push-down support for operators.
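
As a rough illustration of the declarative DataSource usage from point 4,
here is a sketch in Python. Several things in it are assumptions: it uses the
DataFrameReader API (spark.read), which belongs to Spark releases later than
the 1.0-1.3 range mentioned above; the companies/company resource is
hypothetical; and the node is assumed to be on localhost. The es.nodes,
es.port and pushdown settings are the ones described in the
elasticsearch-hadoop reference [2].

```python
def es_options(nodes="localhost", port=9200, pushdown=True):
    """Build the option map handed to the Elasticsearch data source."""
    return {
        "es.nodes": nodes,
        "es.port": str(port),
        # Ask the connector to translate Spark SQL filters into ES queries.
        "pushdown": "true" if pushdown else "false",
    }


def load_companies(spark, resource="companies/company", **kwargs):
    """Return a DataFrame backed by an Elasticsearch index.

    `spark` is expected to be a SparkSession or SQLContext with the
    elasticsearch-hadoop jar on its classpath. Nothing is imported here,
    so this module itself runs without Spark installed.
    """
    reader = spark.read.format("org.elasticsearch.spark.sql")
    return reader.options(**es_options(**kwargs)).load(resource)
```

Given df = load_companies(spark), a filter on a (hypothetical) name column is
the kind of operator that push-down lets Elasticsearch evaluate, so that only
matching documents travel back to Spark.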

Hopefully these materials will get you started with Spark and Elasticsearch
and also clarify some of the misconceptions about Elasticsearch.

Cheers,

[1] https://www.elastic.co/products/elasticsearch
[2] http://www.elastic.co/guide/en/elasticsearch/hadoop/master/reference.html
[3] http://spark-summit.org/east/2015/talk/using-spark-and-elasticsearch-for-real-time-data-analysis

On 4/28/15 8:16 PM, Nick Pentreath wrote:

Depends on your use case and search volume. Typically you'd have a dedicated
ES cluster if your app is doing a lot of real-time indexing and search.

If it's only for Spark integration, then you could colocate ES and Spark.

On Tue, Apr 28, 2015 at 6:41 PM, Jeetendra Gangele <gangele...@gmail.com> wrote:

Thanks for the reply.

Will the Elasticsearch index live within my cluster, or do I need a separate
host for Elasticsearch?

On 28 April 2015 at 22:03, Nick Pentreath <nick.pentre...@gmail.com> wrote:

I haven't used Solr for a long time, and haven't used Solr in Spark.

However, why do you say "Elasticsearch is not a good option ..."? ES
absolutely supports full-text search and not just filtering and grouping (in
fact its original purpose was and still is text search, though filtering,
grouping and aggregation are heavily used).

http://www.elastic.co/guide/en/elasticsearch/guide/master/full-text-search.html

On Tue, Apr 28, 2015 at 6:27 PM, Jeetendra Gangele <gangele...@gmail.com> wrote:

Has anyone tried using Solr inside Spark? Below is the project describing it:
https://github.com/LucidWorks/spark-solr

I have a requirement in which I want to index 20 million company names and
then search as and when new data comes in. The output should be a list of
companies matching the query.

Spark has built-in Elasticsearch support, but for this purpose Elasticsearch
is not a good option, since this is purely a text search problem?

Elasticsearch is good for filtering and grouping.

Has anybody used Solr inside Spark?

Regards
jeetendra

--
Costin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


--
Costin
