Hello, in short
1) When you have a collection on the local filesystem, you have your shard folders in solr.home (collectionnamelocalfs, 1 shard, 1 replica example):

    /var/lib/solr/collectionnamelocalfs_shard1_replica_n1/core.properties
    /var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data
    /var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data/index
    /var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data/tlog
    /var/lib/solr/collectionnamelocalfs_shard1_replica_n1/data/snapshot_metadata

Your solrconfig.xml for local fs will look like:

    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
    <lockType>${solr.lock.type:native}</lockType>

The Solr Collections API clusterstate will not show any folder:

    "replicas":{"core_node2":{
      "core":"collectionnamelocalfs_shard1_replica_n1",
      "node_name":"node2:8995_solr",
      "base_url":"https://node2:8995/solr",
      "state":"active",
      "type":"NRT",
      "force_set_state":"false",
      "leader":"true"}},

2) When using HDFS for indexes, everything that is in the data folder will be under solr.hdfs.home, with the core_nodeX name instead of the shard name (collectionnamehdfs, 1 shard, 1 replica example):

    hdfs://solr/collectionnamehdfs/core_node2/data/index
    hdfs://solr/collectionnamehdfs/core_node2/data/tlog
    hdfs://solr/collectionnamehdfs/core_node2/data/snapshot_metadata

and on the local filesystem there is only core.properties:

    /var/lib/solr/collectionnamehdfs_shard1_replica_n1/core.properties

Your solrconfig.xml for HDFS will look like:

    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:org.apache.solr.core.HdfsDirectoryFactory}">
    <lockType>${solr.lock.type:hdfs}</lockType>

The Solr Collections API clusterstate will show dataDir and ulogDir with hdfs:// folders:

    "replicas":{"core_node2":{
      "dataDir":"hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/",
      "node_name":"node2:8995_solr",
      "base_url":"https://node2:8995/solr",
      "type":"NRT",
      "force_set_state":"false",
      "ulogDir":"hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/tlog",
      "core":"collectionnamehdfs_shard1_replica_n1",
      "shared_storage":"true",
      "state":"active",
      "leader":"true"}},

On Solr 4 an HDFS index will not show ulogDir/dataDir. From Solr 5 onwards an HDFS index will show ulogDir/dataDir.

3) I think this is the answer to your initial question of where the data is stored:

- If you have more than one shard, the shard where a document is stored depends on the document routing (hash-based or otherwise). This is the same for a local fs or an HDFS index:
  https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#document-routing

- How it is stored in HDFS: if you have a datanode (DN) on the same node as Solr, the first copy of each block will be written to the local DN, and then the other 2 copies (HDFS default of 3 copies) will be copied to other DNs as soon as possible, depending on your config (rack awareness, HDFS replication factor, etc.). If you don't have a DN on the Solr node, the first copy of the block will be written to one of the DNs, depending on your config.
  https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Hope this helps.

Kind Regards,
Alejandro Arrieta

On Fri, Aug 30, 2024 at 3:41 AM Roberto Maggi @ Debian <debian...@gmail.com> wrote:

> Hi Eric,
> thanks for your interest.
> I'm using Solr 9, with this doc to set up the cluster
> https://apache.github.io/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html
> and the one you quoted for the "collaboration" with hadoop.
>
> The collections and the related indexes are correctly present in the
> solr database and in hadoop fs, but I can't understand how to check where
> they are and how it actually works.
>
> Any idea?
>
> Rob
>
> On 8/29/24 5:23 PM, David Eric Pugh wrote:
> > Roberto, I'm hoping the community shares some knowledge, as it's not an
> > area I am familiar with, and I'd love to see more content added to the Ref
> > Guide.
> > You are using Solr 9 I think?
> > Is this using the
> > https://solr.apache.org/guide/solr/latest/deployment-guide/solr-on-hdfs.html
> > approach?
> >
> > On Thursday, August 29, 2024 at 09:30:26 AM EDT, Roberto Maggi @ Debian <debian...@gmail.com> wrote:
> > >
> > > Hi you all,
> > > I'm still new to solr and hadoop and I can't find an answer to this
> > > question that came up.
> > >
> > > In a multi cluster setup with 3 solr9 hosts and 3 hadoop datanodes
> > > I'm wondering where and how the data is stored.
> > >
> > > If I instruct the creation of a collection split into 3 shards, 1/3
> > > on each solr node, but the indexes are actually written onto the hadoop
> > > nodes: how would it work?
> > >
> > > Thanks in advance.
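
P.S. On "how to check where they are": the difference between the two clusterstate outputs in points 1 and 2 (dataDir/ulogDir present for HDFS, absent for local fs) can be turned into a small check. The following is only a rough Python sketch, not something shipped with Solr. The function name describe_replica_storage is made up for illustration, the sample input is the "replicas" object copied from the examples in this mail, and the local path assumes solr.home is /var/lib/solr as in point 1.

```python
import json

# Hypothetical helper (not part of Solr): given the "replicas" object from a
# clusterstate response, report where each replica keeps its index data.
def describe_replica_storage(replicas, solr_home="/var/lib/solr"):
    """Return {core_nodeX: data location} for every replica."""
    locations = {}
    for node_name, props in replicas.items():
        data_dir = props.get("dataDir")
        if data_dir and data_dir.startswith("hdfs://"):
            # HDFS-backed index: clusterstate reports the hdfs:// dataDir.
            locations[node_name] = data_dir
        else:
            # Local fs index: no dataDir is shown, so assume the data folder
            # sits under solr.home/<core name>/data as in point 1.
            locations[node_name] = f"{solr_home}/{props['core']}/data"
    return locations

# Sample "replicas" objects copied from the two examples in this mail.
LOCAL_REPLICAS = json.loads("""
{"core_node2": {
  "core": "collectionnamelocalfs_shard1_replica_n1",
  "node_name": "node2:8995_solr",
  "base_url": "https://node2:8995/solr",
  "state": "active",
  "type": "NRT",
  "force_set_state": "false",
  "leader": "true"}}
""")

HDFS_REPLICAS = json.loads("""
{"core_node2": {
  "dataDir": "hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/",
  "node_name": "node2:8995_solr",
  "base_url": "https://node2:8995/solr",
  "type": "NRT",
  "force_set_state": "false",
  "ulogDir": "hdfs://node2:8020/solr-infra/collectionnamehdfs/core_node2/data/tlog",
  "core": "collectionnamehdfs_shard1_replica_n1",
  "shared_storage": "true",
  "state": "active",
  "leader": "true"}}
""")

if __name__ == "__main__":
    print(describe_replica_storage(LOCAL_REPLICAS))
    print(describe_replica_storage(HDFS_REPLICAS))
```

In a real cluster you would feed it the "replicas" object of each shard from your own clusterstate output instead of the pasted samples.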