jnioche commented on code in PR #1343:
URL: https://github.com/apache/incubator-stormcrawler/pull/1343#discussion_r1818169400
##########
external/solr/README.md:
##########
@@ -1,117 +1,30 @@
-stormcrawler-solr
-==================
+# stormcrawler-solr
-Set of Solr resources for StormCrawler that allows you to create topologies that consume from a Solr collection and store metrics, status or parsed content into Solr.
+Set of [Solr](https://solr.apache.org/) resources for StormCrawler that allows you to create topologies that consume from a Solr collection and store metrics, status or parsed content into Solr.
-## How to use
+## Getting started
-In your project you can use this by adding the following dependency:
+The easiest way is currently to use the archetype for Solr with:
-```xml
-<dependency>
-    <groupId>org.apache.stormcrawler</groupId>
-    <artifactId>stormcrawler-solr</artifactId>
-    <version>${stormcrawler.version}</version>
-</dependency>
-```
+`mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT`
-## Available resources
-
-* `IndexerBolt`: Implementation of `AbstractIndexerBolt` that allows to index the parsed data and metadata into a specified Solr collection.
-
-* `MetricsConsumer`: Class that allows to store Storm metrics in Solr.
-
-* `SolrSpout`: Spout that allows to get URLs from a specified Solr collection.
-
-* `StatusUpdaterBolt`: Implementation of `AbstractStatusUpdaterBolt` that allows to store the status of each URL along with the serialized metadata in Solr.
-
-* `SolrCrawlTopology`: Example implementation of a topology that use the provided classes, this is intended as an example or a guide on how to use this resources.
-
-* `SeedInjector`: Topology that allow to read URLs from a specified file and store the URLs in a Solr collection using the `StatusUpdaterBolt`. This can be used as a starting point to inject URLs into Solr.
-
-## Configuration options
-
-The available configuration options can be found in the [`solr-conf.yaml`](solr-conf.yaml) file.
-
-For configuring the connection with the Solr server, the following parameters are available: `solr.TYPE.url`, `solr.TYPE.zkhost`, `solr.TYPE.collection`.
-
-> In the previous example `TYPE` can be one of the following values:
-
-> * `indexer`: To reference the configuration parameters of the `IndexerBolt` class.
-> * `status`: To reference the configuration parameters of the `SolrSpout` and `StatusUpdaterBolt` classes.
-> * `metrics`: To reference the configuration parameters of the `MetricsConsumer` class.
-
-> *Note: Some of this classes provide additional parameter configurations.*
-
-### General parameters
-
-* `solr.TYPE.url`: The URL of the Solr server including the name of the collection that you want to use.
-
-## Additional configuration options
-
-#### MetricsConsumer
-
-In the case of the `MetricsConsumer` class a couple of additional configuration parameters are provided to use the [Document Expiration](https://lucidworks.com/blog/document-expiration/) feature available in Solr since version 4.8.
+You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artefactId (e.g. stormcrawler), a version, a package name and details about the user agent to use.
-* `solr.metrics.ttl`: [Date expression](https://cwiki.apache.org/confluence/display/solr/Working+with+Dates) to specify when the document should expire.
-* `solr.metrics.ttl.field`: Field to be used to specify the [date expression](https://cwiki.apache.org/confluence/display/solr/Working+with+Dates) that defines when the document should expire.
+This will not only create a fully formed project containing a POM with the dependency above but also a set of resources, configuration files and sample topology classes. Enter the directory you just created (should be the same as the artefactId you specified earlier) and follow the instructions on the README file.
-*Note: The date expression specified in the `solr.metrics.ttl` parameter is not validated.
To use this feature some changes in the Solr configuration must be done.*
+You will of course need to have both Storm and Solr installed.

Review Comment:
call them by their full names and maybe indicate which version of Apache Solr is needed? 9.7.0 currently

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@stormcrawler.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org