Description
In several Solr clusters running in k8s, I have now seen a pod experience complete disk (PVC) loss due to underlying volume provisioning issues. The pod eventually comes back online, but with an empty disk / volume.
In such a case, all the replicas that were on that Solr node (as recorded in collection state) fail recovery and end up in a permanent DOWN state. The solution is to manually call DELETEREPLICA on each of them and then ADDREPLICA to create a new replica. This process has even been scripted: https://gist.github.com/relwell/51aecaf7a435c68a1651872f0febbb5b
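For illustration, the manual DELETEREPLICA / ADDREPLICA procedure could be sketched roughly as below. This is not the linked gist and not an existing Solr or Solr Operator feature, just a minimal sketch under some assumptions: the function names `plan_replacements` and `call_collections_api` are hypothetical, the input is assumed to be the JSON returned by the Collections API `CLUSTERSTATUS` action, and the node name in the usage note is made up. It also does not check that another healthy replica of the shard exists before deleting, which a real implementation would probably want to do.

```python
import json
import urllib.parse
import urllib.request


def plan_replacements(cluster_status, node_name):
    """Build Collections API calls to replace every DOWN replica on node_name.

    cluster_status is assumed to be the parsed JSON of a CLUSTERSTATUS call.
    Returns a list of parameter dicts: one DELETEREPLICA followed by one
    ADDREPLICA per affected replica.
    """
    plan = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                # Only touch replicas recorded on the node that lost its disk
                if replica.get("node_name") != node_name:
                    continue
                # Only touch replicas that are reported as down
                if replica.get("state") != "down":
                    continue
                plan.append({
                    "action": "DELETEREPLICA",
                    "collection": coll_name,
                    "shard": shard_name,
                    "replica": replica_name,
                })
                plan.append({
                    "action": "ADDREPLICA",
                    "collection": coll_name,
                    "shard": shard_name,
                    "node": node_name,
                    # Recreate the same replica type; assume NRT if not reported
                    "type": replica.get("type", "NRT").lower(),
                })
    return plan


def call_collections_api(base_url, params):
    """Send one Collections API request; base_url is e.g. http://host:8983/solr."""
    query = urllib.parse.urlencode({**params, "wt": "json"})
    with urllib.request.urlopen(f"{base_url}/admin/collections?{query}") as resp:
        return json.load(resp)
```

A caller would fetch the state with `call_collections_api(base_url, {"action": "CLUSTERSTATUS"})`, pass the result to `plan_replacements` together with the affected node name (for example the made-up `solr-1:8983_solr`), and then send each planned request in order. Keeping the planning step separate from the HTTP calls makes it possible to log or review the plan before anything is deleted.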
There may of course be reasons other than an empty disk for a replica to be in DOWN state, and some of those may also be solved by deleting the replica and adding a new one.
The question is whether we want either Solr itself or the Solr Operator to be able to auto-recover from this situation. It need not be the default behavior, but could be enabled by configuration. Thoughts?