Re: [I] Inquiry About StormCrawler Features and Capabilities [incubator-stormcrawler]

via GitHub Sun, 28 Jul 2024 04:20:13 -0700


jnioche commented on issue #1253:
URL: 
https://github.com/apache/incubator-stormcrawler/issues/1253#issuecomment-2254480835


   Hi @alikaz3mi 
   
   > Storage of Textual Information: When using StormCrawler with Elasticsearch 
as outlined in your documentation, will all the textual information from the 
crawled websites be stored directly in Elasticsearch?
   
   The Elasticsearch module has been removed from StormCrawler due to licensing 
issues. You can use OpenSearch as an alternative. As explained in the 
documentation, the textual content of the pages is stored in the `content` 
index. A number of fields are indexed by default, and that set can be extended.
   
   > Handling Multimedia Content: How does StormCrawler manage images and other 
multimedia content found on websites? Are these types of content also stored in 
Elasticsearch, or do they require a different approach or storage solution?
   
   By default StormCrawler does not crawl or index multimedia files but it can 
be done (in fact several organisations do that with StormCrawler on a large 
scale). You will have to use a custom bolt to store the content - you could put 
it in OpenSearch but other forms of storage are probably more appropriate 
depending on your use case.
   
   > Crawling Authenticated Websites: Is StormCrawler capable of crawling 
websites that require user authentication? If so, how can I provide 
authentication details (e.g., usernames and passwords) to enable StormCrawler 
to access and crawl these sites?
   
   See https://github.com/apache/incubator-stormcrawler/wiki/Protocols
   Basic authentication is currently supported; see 
https://github.com/apache/incubator-stormcrawler/blob/701999eb56c5ebe5632b012a2f0771d6538425aa/core/src/main/java/com/digitalpebble/stormcrawler/protocol/okhttp/HttpProtocol.java#L157
   
   Please note that it is not currently possible to handle authentication per 
hostname or domain - only a single pair of username / password can be set. 
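   
   To illustrate what a single username / password pair amounts to on the wire, here is a minimal standalone sketch of how a Basic authentication header is built (RFC 7617). It does not use StormCrawler itself. The class `BasicAuthSketch` and its methods are hypothetical, and the configuration key names `http.basicauth.user` and `http.basicauth.password` are assumptions that should be checked against the `HttpProtocol` source linked above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;

public class BasicAuthSketch {

    // Assumed configuration key names; verify them against the linked HttpProtocol source.
    static final String USER_KEY = "http.basicauth.user";
    static final String PASSWORD_KEY = "http.basicauth.password";

    /** Builds the value of an HTTP "Authorization" header for Basic authentication. */
    static String headerValue(String user, String password) {
        String pair = user + ":" + password;
        String encoded =
                Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    /**
     * Returns the header value for the single configured credential pair,
     * or null when no user is configured (no authentication is sent).
     */
    static String fromConf(Map<String, String> conf) {
        String user = conf.get(USER_KEY);
        if (user == null || user.isEmpty()) {
            return null;
        }
        String password = conf.getOrDefault(PASSWORD_KEY, "");
        return headerValue(user, password);
    }

    public static void main(String[] args) {
        // One global pair: the same header would be sent to every host being crawled.
        Map<String, String> conf = Map.of(USER_KEY, "user", PASSWORD_KEY, "pass");
        System.out.println("Authorization: " + fromConf(conf));
    }
}
```

   Because the pair is global rather than per host, the same credentials would be offered to every site in the crawl, so a crawl using them is best restricted to the site they belong to.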
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@stormcrawler.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
