Hello,

In short:

1) When a collection is stored on the local filesystem, its shard folders
are in solr.home (example: collectionnamelocalfs, 1 shard, 1 replica):
/var/lib/solr/collectionnamelocalfs_shard1_replica_n1/core.properties
/var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data
/var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data/index
/var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data/tlog
/var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data/snapshot_metadata

Your solrconfig.xml for the local filesystem will look like this:
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
<lockType>${solr.lock.type:native}</lockType>

The cluster state from the Solr Collections API will not show any data
folder:
            "replicas":{"core_node2":{
                "core":"collectionnamelocalfs_shard1_replica_n1",
                "node_name":"node2:8995_solr",
                "base_url":"https://node2:8995/solr",
                "state":"active",
                "type":"NRT",
                "force_set_state":"false",
                "leader":"true"}},

2) When using HDFS for indexes, everything that would be in the data folder
is stored under solr.hdfs.home instead, in a folder named after the
core_nodeX name rather than the shard name
(example: collectionnamehdfs, 1 shard, 1 replica):
hdfs://solr/collectionnamehdfs/core_node2/data/index
hdfs://solr/collectionnamehdfs/core_node2/data/tlog
hdfs://solr/collectionnamehdfs/core_node2/data/snapshot_metadata
and the local filesystem holds only core.properties:
/var/lib/solr/collectionnamehdfs_shard1_replica_n1/core.properties
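
The two layouts above can be sketched as a small Python helper. This is only
an illustration of the naming pattern shown in the listings; the function
names are hypothetical and are not part of Solr.

```python
from pathlib import PurePosixPath


def local_core_dir(solr_home: str, collection: str, shard: str, replica: str) -> PurePosixPath:
    """Local core folder, named <collection>_<shard>_<replica> under solr.home."""
    return PurePosixPath(solr_home) / f"{collection}_{shard}_{replica}"


def hdfs_data_dirs(hdfs_home: str, collection: str, core_node: str) -> dict:
    """HDFS folders for one replica, named after core_nodeX under solr.hdfs.home."""
    # Plain string joins: pathlib would collapse the "//" in "hdfs://".
    data = f"{hdfs_home.rstrip('/')}/{collection}/{core_node}/data"
    return {
        "data": data,
        "index": f"{data}/index",
        "tlog": f"{data}/tlog",
        "snapshot_metadata": f"{data}/snapshot_metadata",
    }


if __name__ == "__main__":
    core = local_core_dir("/var/lib/solr", "collectionnamehdfs", "shard1", "replica_n1")
    print(core / "core.properties")
    for name, path in hdfs_data_dirs("hdfs://solr", "collectionnamehdfs", "core_node2").items():
        print(name, path)
```

With HDFS, only the core.properties path is local; the index, tlog and
snapshot_metadata folders all resolve to hdfs:// locations.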

Your solrconfig.xml for HDFS will look like this:
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:org.apache.solr.core.HdfsDirectoryFactory}">
<lockType>${solr.lock.type:hdfs}</lockType>

The cluster state from the Solr Collections API will show dataDir and
ulogDir with hdfs:// folders:

            "replicas":{"core_node2":{
                "dataDir":"hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/",
                "node_name":"node2:8995_solr",
                "base_url":"https://node2:8995/solr",
                "type":"NRT",
                "force_set_state":"false",
                "ulogDir":"hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/tlog",
                "core":"collectionnamehdfs_shard1_replica_n1",
                "shared_storage":"true",
                "state":"active",
                "leader":"true"}},

On Solr 4, an HDFS index will not show ulogDir/dataDir in the cluster state.
From Solr 5 onward, an HDFS index will show ulogDir/dataDir.
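
To check where each replica keeps its data, you could read those replica
entries and look for dataDir. Below is a minimal Python sketch that uses
trimmed copies of the two cluster-state excerpts above. Grouping the replicas
by collection name is my own simplification (the real cluster state nests
them under shards), and the function names are hypothetical.

```python
import json

# Replica entries trimmed from the two cluster-state excerpts above.
REPLICAS = json.loads("""
{
  "collectionnamelocalfs": {
    "core_node2": {
      "core": "collectionnamelocalfs_shard1_replica_n1",
      "node_name": "node2:8995_solr",
      "state": "active"
    }
  },
  "collectionnamehdfs": {
    "core_node2": {
      "core": "collectionnamehdfs_shard1_replica_n1",
      "node_name": "node2:8995_solr",
      "dataDir": "hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/",
      "ulogDir": "hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/tlog",
      "shared_storage": "true",
      "state": "active"
    }
  }
}
""")


def storage_of(replica: dict) -> str:
    """Return the HDFS dataDir if the replica reports one, else a local-fs note."""
    data_dir = replica.get("dataDir", "")
    if data_dir.startswith("hdfs://"):
        return data_dir
    return f"local filesystem on {replica['node_name']} (no dataDir in cluster state)"


def report(replicas_by_collection: dict) -> dict:
    """Map each core name to where its index data lives."""
    return {
        replica["core"]: storage_of(replica)
        for replicas in replicas_by_collection.values()
        for replica in replicas.values()
    }


if __name__ == "__main__":
    for core, where in report(REPLICAS).items():
        print(f"{core}: {where}")
```

A replica without dataDir is simply assumed to be on the local filesystem of
the node named in node_name, which matches the local fs example in 1).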

3) I think this answers your initial question of where the data is stored:
- If you have more than one shard, the shard that receives a document
depends on the document routing (by default, a hash of the document ID).
This is the same for a local filesystem index and an HDFS index:
https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#document-routing
- As for how it is stored in HDFS: if a datanode (DN) runs on the same host
as the Solr node, the first copy of each block is written to the local DN,
and the other 2 copies (HDFS defaults to 3 copies) are then copied to other
DNs as soon as possible, depending on your config (rack awareness, HDFS
replication factor, etc.). If there is no DN on the Solr node, the first
copy of the block is written to one of the DNs, chosen according to your
config.
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
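
As a rough illustration of the routing point, here is a simplified Python
sketch of hash-range routing. It is not Solr's implementation: it uses CRC32
as a stand-in for the MurmurHash3 hash that Solr's default router uses, and
it splits the hash space evenly in a way that may not match Solr's actual
shard ranges. The function names are hypothetical.

```python
import zlib


def shard_ranges(num_shards: int) -> list:
    """Split the 32-bit hash space into num_shards contiguous ranges."""
    space = 2 ** 32
    step = space // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs the remainder so the ranges cover the whole space.
        end = space - 1 if i == num_shards - 1 else start + step - 1
        ranges.append((f"shard{i + 1}", start, end))
    return ranges


def route(doc_id: str, ranges: list) -> str:
    """Pick the shard whose range contains the hash of the document ID."""
    # Stand-in hash: Solr's default router uses MurmurHash3, not CRC32.
    h = zlib.crc32(doc_id.encode("utf-8"))
    for name, start, end in ranges:
        if start <= h <= end:
            return name
    raise ValueError("hash outside all ranges")


if __name__ == "__main__":
    ranges = shard_ranges(3)
    for doc_id in ("doc-1", "doc-2", "doc-3", "doc-4"):
        print(doc_id, "->", route(doc_id, ranges))
```

The same document ID always maps to the same shard, and that choice is made
before the index is written, so it does not depend on whether the shard's
files end up on the local filesystem or in HDFS.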

Hope this helps.

Kind Regards,
Alejandro Arrieta

On Fri, Aug 30, 2024 at 3:41 AM Roberto Maggi @ Debian <debian...@gmail.com>
wrote:

>
> Hi Eric,
> thanks for your interest.
> I'm using Solr 9, this doc to set up the cluster
>
> https://apache.github.io/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html
> and the one you quoted for the "collaboration" with hadoop.
>
> The collections and their indexes are correctly present in the
> solr database and in hadoop fs, but I can't understand how to check where
> they are and how it actually works.
>
> Any idea?
>
> Rob
>
> On 8/29/24 5:23 PM, David Eric Pugh wrote:
> > Roberto, I'm hoping the community shares some knowledge, as it's not an
> > area I am familiar with, and I'd love to see more content added to the Ref
> > Guide.
> > You are using Solr 9 I think?  Is this with using the
> > https://solr.apache.org/guide/solr/latest/deployment-guide/solr-on-hdfs.html
> > approach?
> >     On Thursday, August 29, 2024 at 09:30:26 AM EDT, Roberto Maggi @
> > Debian <debian...@gmail.com> wrote:
> >
> >   Hi you all,
> > I'm still new to solr and hadoop and I can't find an answer to this
> > question that came to mind.
> >
> > In a multi cluster setup with 3 solr9 hosts and 3 hadoop datanodes
> > I'm wondering where and how the data is stored.
> >
> > If I instruct the creation of a collection split into 3 shards, 1/3
> > on each solr node, but the indexes are actually written onto the hadoop
> > nodes: how would it work?
> >
> > Thanks in advance.
> >
> >
>
