Prerequisites
/etc/hosts configured on every node for name resolution (see below).
The nodes must have their date and time synchronized (see NTP).
Check
date
Example with clush (cluster_shell_parallele)
echo date |clush -B -w node-[1-2]
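A quick clock comparison across the nodes; the chronyc line is only a sketch and assumes chronyd is the NTP client in use:
# Compare the clocks and the chrony sync state on both nodes (assumes chronyd)
clush -B -w node-[1-2] 'date +%s; chronyc tracking | grep "Leap status"'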
Set SELinux to permissive mode
setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
Check
sestatus
Disable NetworkManager
systemctl stop NetworkManager
systemctl disable NetworkManager
If the firewall is enabled
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --add-service=high-availability
Or
Disabling the firewall
systemctl stop firewalld
systemctl disable firewalld
#rpm -e firewalld
Check
iptables -L -n -v
Each node must be able to ping the others by name. It is recommended to use /etc/hosts rather than DNS (a quick check loop is sketched after the hosts file below).
/etc/hosts
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1       node-1.localdomain
192.168.97.221  node-1.localdomain node-1
192.168.97.222  node-2.localdomain node-2
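A minimal name-resolution check matching the hosts file above; run it on every node:
# Each name must resolve and answer a ping
for n in node-1 node-2; do ping -c1 -W1 $n >/dev/null && echo "$n OK" || echo "$n KO"; done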
Install the packages
yum install -y pacemaker pcs psmisc policycoreutils-python
echo "P@ssw0rd" | passwd hacluster --stdin systemctl start pcsd.service systemctl enable pcsd.service #unset http_proxy #export NO_PROXY=localhost,127.0.0.1,node-1,node-2 pcs cluster auth node-1 node-2 #-u hacluster -p passwd #pcs cluster setup --start --name my_cluster node-1 node-2 pcs cluster setup --name my_cluster node-1 node-2 pcs cluster start --all pcs cluster enable --all
The corosync.conf file is created automatically.
/etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: my_cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node-1
        nodeid: 1
    }
    node {
        ring0_addr: node-2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
Check the corosync configuration (1)
corosync-cfgtool -s
Must report "no faults" and must not show the address 127.0.0.1.
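A possible scripted version of the same check (assumes the default English output of corosync-cfgtool):
# Warn if a ring fault is reported or if a ring is bound to the loopback address
corosync-cfgtool -s | grep -q "no faults" || echo "RING FAULT"
corosync-cfgtool -s | grep -q "127.0.0.1" && echo "ERROR: ring bound to 127.0.0.1"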
Check the corosync configuration (2)
corosync-cmapctl | grep members
pcs status corosync
Prevent Resources from Moving after Recovery
pcs resource defaults resource-stickiness=100
No-quorum policy
#pcs property set no-quorum-policy=ignore
pcs property set no-quorum-policy=freeze
See https://www.devops.zone/tricks/connecting-ssh-drac-reboot-server/
Testing fencing manually
/usr/sbin/fence_drac5 --ip=192.168.96.221 --username=root --password=calvin --ssh -c 'admin1->'
Test with OpenManage (/opt/dell/srvadmin/sbin/racadm)
racadm -r 192.168.96.221 -u root -p calvin get iDRAC.Info
Test via SSH on the iDRAC: reboot the server by connecting to the iDRAC over SSH
ssh root@192.168.96.221 racadm serveraction powercycle
If there is no STONITH/fencing device, disable stonith, otherwise the VIP will refuse to start.
# If there is no stonith / fence device
pcs property set stonith-enabled=false
crm_verify -LVVV
# pcs stonith create fence_node-1 fence_drac5 ipaddr=192.168.96.221 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-1 stonith-action=poweroff
pcs stonith create fence_node-1 fence_drac5 ipaddr=192.168.96.221 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-1 op monitor interval="60s"
pcs stonith create fence_node-2 fence_drac5 ipaddr=192.168.96.222 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-2 op monitor interval="60s"
pcs stonith level add 1 node-1 fence_node-1
pcs stonith level add 1 node-2 fence_node-2
Forbid suicide (a node fencing itself)
pcs constraint location fence_node-1 avoids node-1
pcs constraint location fence_node-2 avoids node-2
Test the fencing
#stonith_admin --reboot node-1
pcs stonith fence node-1
Adding the VIP resource (virtual IP address)
pcs resource create myvip IPaddr2 ip=192.168.97.230 cidr_netmask=24 nic=bond0 op monitor interval=30s on-fail=fence
#pcs constraint location myvip prefers node-1=INFINITY
pcs constraint location myvip prefers node-1=100
pcs constraint location myvip prefers node-2=50
#pcs resource meta myvip resource-stickiness=100
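To see where the VIP currently runs, a small check re-using commands that appear later in these notes:
# Which node holds the VIP, and is the address really plumbed on bond0 there?
crm_resource --resource myvip --locate
ip -4 addr show dev bond0 | grep 192.168.97.230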
Add the ping resource
pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=192.168.97.250 --clone
pcs constraint location myvip rule score=-INFINITY pingd lt 1 or not_defined pingd
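To confirm that the connectivity attribute is being updated on every node (a quick check; pingd is the agent's default attribute name):
# pingd should be 1000 on every node while 192.168.97.250 answers
crm_mon -1A | grep pingd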
Adding the Apache resource
First, configure http://localhost/server-status (a possible mod_status snippet is sketched below) and stop the Apache service on every node.
curl http://localhost/server-status
systemctl stop httpd.service
systemctl disable httpd.service
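A possible mod_status snippet; the path /etc/httpd/conf.d/status.conf is an assumption, any file included by httpd.conf will do:
# /etc/httpd/conf.d/status.conf (assumed path): expose server-status to localhost only
<Location /server-status>
    SetHandler server-status
    Require local
</Location>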
pcs resource create srvweb apache configfile="/etc/httpd/conf/httpd.conf" statusurl="http://127.0.0.1/server-status" op monitor interval=1min #--clone
# The web server must always run on the node holding the VIP
pcs constraint colocation add srvweb with myvip
# Start the VIP first, then the web server
pcs constraint order myvip then srvweb
Move the VIP
pcs resource move myvip node-1
pcs resource move myvip node-2
Rolling back the move of the VIP (remove the constraint created by the move)
#pcs constraint --full |grep prefer
pcs constraint remove cli-prefer-myvip
pcs resource relocate run
Reset the error counters
#pcs resource failcount reset res1
#crm_resource -P
pcs resource cleanup
Move all resources back to the primary node (ignoring resource stickiness)
#pcs resource relocate show
pcs resource relocate run
Maintenance on a single resource
#pcs resource update fence_node-1 meta target-role=stopped
#pcs resource update fence_node-1 meta is-managed=false
#pcs resource update fence_node-1 op monitor enabled=false
#pcs resource disable fence_node-1
pcs resource unmanage fence_node-1
Cluster-wide maintenance
pcs property set maintenance-mode=true
End of maintenance
pcs property set maintenance-mode=false
Stopping the cluster
pcs cluster stop --all
pcs cluster disable --all
# Check the corosync configuration syntax
corosync -t
# Check cluster communication
corosync-cfgtool -s
# Check the nodes' network membership
corosync-cmapctl | grep members
Checks
pcs cluster pcsd-status
pcs cluster verify
pcs status corosync
crm_mon -1 --failcounts
crm_mon -1Af
journalctl --since yesterday -p err
journalctl -u pacemaker.service --since "2017-02-24 16:00" -p warning
Monitoring script (these commands must return no output); a possible wrapper is sketched after the commands.
LANG=C pcs status | egrep "Stopped|standby|OFFLINE|UNCLEAN|Failed|error"
crm_verify -LVVV
LANG=C pcs resource relocate show | sed -ne '/Transition Summary:/,$p' | grep -v '^Transition Summary:'
crm_mon -1f | grep -q fail-count
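A minimal wrapper around these checks for a monitoring system; the script path and messages are assumptions, not part of the original setup:
#!/bin/sh
# Hypothetical /usr/local/bin/check_cluster.sh: exit 1 if anything looks unhealthy
BAD=$(LANG=C pcs status | egrep "Stopped|standby|OFFLINE|UNCLEAN|Failed|error")
FAIL=$(crm_mon -1f | grep fail-count)
if [ -n "$BAD" ] || [ -n "$FAIL" ]; then
    echo "CLUSTER WARNING: $BAD $FAIL"
    exit 1
fi
echo "CLUSTER OK"
exit 0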
If the /usr/local/bin/crm_logger.sh script is in place (see below):
tailf /var/log/messages |grep "ClusterMon-External:"
Monitoring script: which node is active?
LANG=C crm_resource --resource myvip --locate |cut -d':' -f2 |tr -d ' '
Does the web server answer on the VIP address? (The return code must be 0.)
#curl -4 -m 1 --connect-timeout 1 http://192.168.97.230/ > /dev/null 2>&1
curl -4 -m 1 --connect-timeout 1 http://192.168.97.230/cl.html > /dev/null 2>&1
#echo $?
Read-only account allowed to run crm_mon
Warning: this account can retrieve the iDRAC/iLO password:
pcs stonith --full |grep passwd
Setup
#adduser rouser
#usermod -a -G haclient rouser
usermod -a -G haclient process
pcs property set enable-acl=true
pcs acl role create read-only description="Read access to cluster" read xpath /cib
#pcs acl user create rouser read-only
pcs acl user create process read-only
#crm_mon --daemonize --as-html /var/www/html/cl.html
/usr/local/bin/crm_logger.sh
#!/bin/sh
# https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/High_Availability_Add-On_Reference/Red_Hat_Enterprise_Linux-7-High_Availability_Add-On_Reference-en-US.pdf
logger -t "ClusterMon-External" "${CRM_notify_node:-x} ${CRM_notify_rsc:-x} \
${CRM_notify_task:-x} ${CRM_notify_desc:-x} ${CRM_notify_rc:-x} \
${CRM_notify_target_rc:-x} ${CRM_notify_status:-x} ${CRM_notify_recipient:-x}";
exit

chmod 755 /usr/local/bin/crm_logger.sh
chown root.root /usr/local/bin/crm_logger.sh
pcs resource create ClusterMon-External ClusterMon update=10000 user=process extra_options="-E /usr/local/bin/crm_logger.sh --watch-fencing" htmlfile=/var/www/html/cl.html pidfile=/tmp/crm_mon-external.pid op monitor on-fail="restart" interval="60" clone
Colocation: keep the monitoring page on the node holding the VIP. Only needed if the resource is not cloned.
pcs constraint colocation add ClusterMon-External with myvip
Test
curl 192.168.97.230/cl.html
In case of problems
pcs resource debug-start resource_id
Redundant Ring Protocol (RRP)
rrp_mode: if set to active, Corosync uses both interfaces actively; if set to passive, Corosync sends messages alternately over the available networks.
Before changing the configuration, put the cluster into maintenance mode:
pcs property set maintenance-mode=true
/etc/hosts
192.168.21.10   node1
192.168.22.10   node1b
192.168.21.11   node2
192.168.22.11   node2b
Add rrp_mode and ring1_addr to /etc/corosync/corosync.conf
totem {
    rrp_mode: active
}
nodelist {
    node {
        ring0_addr: node1
        ring1_addr: node1b
        nodeid: 1
    }
    node {
        ring0_addr: node2
        ring1_addr: node2b
        nodeid: 2
    }
}
pcs cluster reload corosync
pcs status corosync
corosync-cfgtool -s
pcs property unset maintenance-mode
#crm_resource -P
pcs resource cleanup
pcs resource relocate run
#pcs cluster start --all
Test 1: hard crash (kernel panic)
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
Test 2: power failure (unplug the power cable)
Test 3: network failure
ifdown bond0
Test 4: loss of the gateway ping on one of the nodes
iptables -A OUTPUT -d 192.168.97.250/32 -p icmp -j REJECT
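To restore normal behaviour after the test (simply deletes the rule added above):
iptables -D OUTPUT -d 192.168.97.250/32 -p icmp -j REJECT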
Test 5: fork bomb; the node stops responding to everything except ping
Fork bomb
:(){ :|:& };:
Test 6: loss of the iDRAC connection (unplug the cable)
Complete removal of the cluster
pcs cluster stop --force #--all
pcs cluster destroy --force #--all
systemctl stop pcsd
systemctl stop corosync
systemctl stop pacemaker
yum remove -y pcsd corosync pacemaker
userdel hacluster
rm -rf /dev/shm/qb-*-data /dev/shm/qb-*-header
rm -rf /etc/corosync
rm -rf /var/lib/corosync
rm -rf /var/lib/pcsd
rm -rf /var/lib/pacemaker
rm -rf /var/log/cluster/
rm -rf /var/log/pcsd/
rm -f /var/log/pacemaker.log*
Error messages encountered:
UEFI0081: Memory size has changed from the last time the system was started. No action is required if memory was added or removed.
error: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
#pcs property set symmetric-cluster=true
pcs property
Listing resource standards and providers
pcs resource standards
ocf lsb service systemd stonith
pcs resource providers
heartbeat openstack pacemaker
List the agents (examples)
pcs resource agents systemd
pcs resource agents ocf:heartbeat
Default timeout for resources
pcs resource op defaults timeout=240s
Stop all resources
pcs property set stop-all-resources=true
pcs property unset stop-all-resources
Agent scripts on disk:
ocf:pacemaker:ping   -> /usr/lib/ocf/resource.d/pacemaker/ping
ocf:heartbeat:apache -> /usr/lib/ocf/resource.d/heartbeat/apache
egrep '^#.*OCF_RESKEY' /usr/lib/ocf/resource.d/heartbeat/apache
export OCF_ROOT=/usr/lib/ocf/
/usr/lib/ocf/resource.d/heartbeat/apache meta-data
Other: list all resources
crm_resource --list
Dump CIB (Cluster Information Base)
pcs cluster cib
pcs cluster cib cib-dump.xml
Adding a service resource
pcs resource create CRON systemd:crond
#pcs resource op add CRON start interval=0s timeout=1800s
UPDATE
pcs resource update ClusterMon-External htmlfile='/tmp/cl.html'
UNSET
pcs resource update ClusterMon-External htmlfile=
pcs property list --all |grep stonith
Confirm that the node really is powered off. Warning: if it is not actually off, this can cause serious problems.
pcs stonith confirm node2
crm_mon --failcounts
pcs resource failcount show resource_id
pcs resource failcount reset resource_id
Refresh the state and reset the failcount
pcs resource cleanup resource_id
echo "P@ssw0rd" |passwd hacluster --stdin systemctl start pcsd.service systemctl enable pcsd.service pcs cluster auth -u hacluster -p P@ssw0rd 8si-pms-pps-srv-1 8si-pms-pps-srv-2 pcs cluster setup --name my_cluster 8si-pms-pps-srv-1 8si-pms-pps-srv-2 pcs cluster start --all pcs cluster enable --all pcs resource defaults resource-stickiness=100 pcs property set no-quorum-policy=freeze pcs stonith create fence_8si-pms-pps-srv-1 fence_drac5 ipaddr=172.18.202.230 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=8si-pms-pps-srv-1 op monitor interval="60s" pcs stonith create fence_8si-pms-pps-srv-2 fence_drac5 ipaddr=172.18.202.231 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=8si-pms-pps-srv-2 op monitor interval="60s" pcs stonith level add 1 8si-pms-pps-srv-1 fence_8si-pms-pps-srv-1 pcs stonith level add 1 8si-pms-pps-srv-2 fence_8si-pms-pps-srv-2 pcs constraint location fence_8si-pms-pps-srv-1 avoids 8si-pms-pps-srv-1 pcs constraint location fence_8si-pms-pps-srv-2 avoids 8si-pms-pps-srv-2 pcs resource create myvip IPaddr2 ip=172.18.202.226 cidr_netmask=24 nic=bond0 op monitor interval=30s #on-fail=fence pcs constraint location myvip prefers 8si-pms-pps-srv-1=100 pcs constraint location myvip prefers 8si-pms-pps-srv-2=50 #pcs resource meta myvip resource-stickiness=60 # l'utilisateur process doit appartenir au groupe haclient #usermod -a -G haclient process pcs property set enable-acl=true pcs acl role create read-only description="Read access to cluster" read xpath /cib pcs acl user create process read-only pcs resource create ClusterMon-External ClusterMon update=10000 user=process extra_options="-E /usr/local/bin/crm_logger.sh --watch-fencing" htmlfile=/var/www/html/cl.html pidfile=/tmp/crm_mon-external.pid op monitor on-fail="restart" interval="60" clone pcs resource create appmgr systemd:appmgr pcs constraint colocation add appmgr with myvip
See also:
Fencing
Cluster