Re: [PR] #620 Add support for shards - SolrSpout [incubator-stormcrawler]

via GitHub Sun, 27 Oct 2024 12:50:35 -0700


jnioche commented on code in PR #1343:
URL: https://github.com/apache/incubator-stormcrawler/pull/1343#discussion_r1818169400



##########
external/solr/README.md:
##########
@@ -1,117 +1,30 @@
-stormcrawler-solr
-==================
+# stormcrawler-solr
 
-Set of Solr resources for StormCrawler that allows you to create topologies 
that consume from a Solr collection and store metrics, status or parsed content 
into Solr.
+Set of [Solr](https://solr.apache.org/) resources for StormCrawler that allows 
you to create topologies that consume from a Solr collection and store metrics, 
status or parsed content into Solr.
 
-## How to use
+## Getting started
 
-In your project you can use this by adding the following dependency:
+The easiest way is currently to use the archetype for Solr with:
 
-```xml
-<dependency>
-    <groupId>org.apache.stormcrawler</groupId>
-    <artifactId>stormcrawler-solr</artifactId>
-    <version>${stormcrawler.version}</version>
-</dependency>
-```
+`mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler 
-DarchetypeArtifactId=stormcrawler-solr-archetype 
-DarchetypeVersion=3.1.1-SNAPSHOT`
 
-## Available resources
-
-* `IndexerBolt`: Implementation of `AbstractIndexerBolt` that allows to index 
the parsed data and metadata into a specified Solr collection.
-
-* `MetricsConsumer`: Class that allows to store Storm metrics in Solr.
-
-* `SolrSpout`: Spout that allows to get URLs from a specified Solr collection.
-
-* `StatusUpdaterBolt`: Implementation of `AbstractStatusUpdaterBolt` that 
allows to store the status of each URL along with the serialized metadata in 
Solr.
-
-* `SolrCrawlTopology`: Example implementation of a topology that use the 
provided classes, this is intended as an example or a guide on how to use this 
resources.
-
-* `SeedInjector`: Topology that allow to read URLs from a specified file and 
store the URLs in a Solr collection using the `StatusUpdaterBolt`. This can be 
used as a starting point to inject URLs into Solr.
-
-## Configuration options
-
-The available configuration options can be found in the 
[`solr-conf.yaml`](solr-conf.yaml) file.
-
-For configuring the connection with the Solr server, the following parameters 
are available: `solr.TYPE.url`, `solr.TYPE.zkhost`, `solr.TYPE.collection`.
-
-> In the previous example `TYPE` can be one of the following values:
-
-> * `indexer`: To reference the configuration parameters of the `IndexerBolt` 
class.
-> * `status`: To reference the configuration parameters of the `SolrSpout` and 
`StatusUpdaterBolt` classes.
-> * `metrics`: To reference the configuration parameters of the 
`MetricsConsumer` class.
-
-> *Note: Some of this classes provide additional parameter configurations.*
-
-### General parameters
-
-* `solr.TYPE.url`: The URL of the Solr server including the name of the 
collection that you want to use.
-
-## Additional configuration options
-
-#### MetricsConsumer
-
-In the case of the `MetricsConsumer` class a couple of additional 
configuration parameters are provided to use the [Document 
Expiration](https://lucidworks.com/blog/document-expiration/) feature available 
in Solr since version 4.8.
+You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artefactId 
(e.g. stormcrawler), a version, a package name and details about the user agent 
to use.
 
-* `solr.metrics.ttl`: [Date 
expression](https://cwiki.apache.org/confluence/display/solr/Working+with+Dates)
 to specify when the document should expire.
-* `solr.metrics.ttl.field`: Field to be used to specify the [date 
expression](https://cwiki.apache.org/confluence/display/solr/Working+with+Dates)
 that defines when the document should expire.
+This will not only create a fully formed project containing a POM with the 
dependency above but also a set of resources, configuration files and sample 
topology classes. Enter the directory you just created (should be the same as 
the artefactId you specified earlier) and follow the instructions on the README 
file.
 
-*Note: The date expression specified in the `solr.metrics.ttl` parameter is 
not validated. To use this feature some changes in the Solr configuration must 
be done.*
+You will of course need to have both Storm and Solr installed.

Review Comment:
   Call them by their full names (Apache Storm and Apache Solr), and maybe
   indicate which version of Apache Solr is needed? That is currently 9.7.0.



