Hi Ashkar,
Yes, you can do all these things, but not with Solr on its own, which
doesn't come with a built-in website crawler. You'll need to look at some
other projects for that, such as:
Heritrix: http://crawler.archive.org/index.html
Nutch (created by Doug Cutting, who also created Lucene):
http://lucene.apache.org/nutch/
There's a Nutch tutorial that includes Solr:
https://cwiki.apache.org/confluence/display/nutch/NutchTutorial
ManifoldCF: https://manifoldcf.apache.org/en_US/index.html
There are a few other options on this (slightly old) page:
https://cwiki.apache.org/confluence/display/SOLR/SolrEcosystem
There are probably hundreds of other options too, including writing your own.
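If you do go down the "writing your own" route, the rough shape is: fetch
a page, pull out the text, and POST it to Solr's JSON update handler. The
following is only a minimal sketch in Python using the standard library.
The collection name "pages", the field names id/title/content, and the
Solr address http://localhost:8983 are assumptions for illustration and
need to match your own setup and schema. It does not follow links, handle
logins, or respect robots.txt, which a real crawler such as the ones above
would do for you.

```python
import json
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_to_doc(url, html):
    """Turn one fetched page into a Solr document (field names are assumed)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,
        "title": parser.title.strip(),
        "content": " ".join(parser.chunks),
    }


def build_update_request(solr_base, collection, docs):
    """Build (but do not send) a POST to Solr's JSON update handler."""
    endpoint = f"{solr_base.rstrip('/')}/solr/{collection}/update?commit=true"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def crawl_and_index(urls, solr_base="http://localhost:8983", collection="pages"):
    """Fetch each URL, convert it to a document, and send the batch to Solr."""
    docs = []
    for url in urls:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        docs.append(page_to_doc(url, html))
    request = build_update_request(solr_base, collection, docs)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status
```

Calling crawl_and_index(["https://example.com/"]) would fetch that one page
and send it to the assumed "pages" collection. For a site behind a login
you would also need to send the right cookies or credentials with each
fetch, which is not shown here.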
Best
Charlie
On 04/12/2023 08:28, Ashkar wrote:
Hi Solr Users,
I have a few questions.
1. Can I crawl OneDrive and index the documents?
2. Are we able to crawl a website that has a login?
3. Can we crawl documents from an HTTP/HTTPS-based portal and do the
indexing?
Regards,
Ashkar
System Analyst
M: +91 9605043094
E: ash...@chimeratechnologies.com
W: www.chimeratechnologies.com
Solutions for : FinTech | InsurTech | HRTech | Monitoring | Governance
Offered as : Product Development | Application Management | QA and Testing
--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
Founding member of The Search Network and co-author of Searching the Enterprise
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II