Prerequisites
/etc/hosts configured on every node for name resolution (see below).
The nodes must have their date and time synchronized (see NTP).
Check
date
Example with clush (cluster_shell_parallele)
echo date |clush -B -w node-[1-2]
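A quick clock comparison across the nodes; the chronyc line is only a sketch and assumes chronyd is the NTP client in use:
# Compare the clocks and the chrony sync state on both nodes (assumes chronyd)
clush -B -w node-[1-2] 'date +%s; chronyc tracking | grep "Leap status"'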
Set SELinux to permissive mode
setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
Check
sestatus
Disable NetworkManager
systemctl stop NetworkManager
systemctl disable NetworkManager
If the firewall is enabled
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --add-service=high-availability
Or
Disabling the firewall
systemctl stop firewalld
systemctl disable firewalld
#rpm -e firewalld
Check
iptables -L -n -v
Each node must be able to ping the others by name. It is recommended to use /etc/hosts rather than DNS (a quick check loop is sketched after the hosts file below).
/etc/hosts
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1       node-1.localdomain
192.168.97.221  node-1.localdomain node-1
192.168.97.222  node-2.localdomain node-2
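A minimal name-resolution check matching the hosts file above; run it on every node:
# Each name must resolve and answer a ping
for n in node-1 node-2; do ping -c1 -W1 $n >/dev/null && echo "$n OK" || echo "$n KO"; done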
Install the packages
yum install -y pacemaker pcs psmisc policycoreutils-python
echo "P@ssw0rd" | passwd hacluster --stdin systemctl start pcsd.service systemctl enable pcsd.service #unset http_proxy #export NO_PROXY=localhost,127.0.0.1,node-1,node-2 pcs cluster auth node-1 node-2 #-u hacluster -p passwd #pcs cluster setup --start --name my_cluster node-1 node-2 pcs cluster setup --name my_cluster node-1 node-2 pcs cluster start --all pcs cluster enable --all
The corosync.conf file is created automatically.
/etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: my_cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node-1
        nodeid: 1
    }
    node {
        ring0_addr: node-2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
Check the corosync configuration (1)
corosync-cfgtool -s
Must report "no faults" and must not show the address 127.0.0.1.
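A possible scripted version of the same check (assumes the default English output of corosync-cfgtool):
# Warn if a ring fault is reported or if a ring is bound to the loopback address
corosync-cfgtool -s | grep -q "no faults" || echo "RING FAULT"
corosync-cfgtool -s | grep -q "127.0.0.1" && echo "ERROR: ring bound to 127.0.0.1"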
Check the corosync configuration (2)
corosync-cmapctl | grep members
pcs status corosync
Prevent Resources from Moving after Recovery
pcs resource defaults resource-stickiness=100
No-quorum policy
#pcs property set no-quorum-policy=ignore
pcs property set no-quorum-policy=freeze
See https://www.devops.zone/tricks/connecting-ssh-drac-reboot-server/
Testing fencing manually
/usr/sbin/fence_drac5 --ip=192.168.96.221 --username=root --password=calvin --ssh -c 'admin1->'
Test with OpenManage (/opt/dell/srvadmin/sbin/racadm)
racadm -r 192.168.96.221 -u root -p calvin get iDRAC.Info
Test via SSH on the iDRAC: reboot the server by connecting to the iDRAC over SSH
ssh root@192.168.96.221 racadm serveraction powercycle
If there is no STONITH/fencing device, disable stonith, otherwise the VIP will refuse to start.
# If there is no stonith / fence device
pcs property set stonith-enabled=false
crm_verify -LVVV
# pcs stonith create fence_node-1 fence_drac5 ipaddr=192.168.96.221 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-1 stonith-action=poweroff
pcs stonith create fence_node-1 fence_drac5 ipaddr=192.168.96.221 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-1 op monitor interval="60s"
pcs stonith create fence_node-2 fence_drac5 ipaddr=192.168.96.222 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-2 op monitor interval="60s"
pcs stonith level add 1 node-1 fence_node-1
pcs stonith level add 1 node-2 fence_node-2
Forbid suicide (a node fencing itself)
pcs constraint location fence_node-1 avoids node-1
pcs constraint location fence_node-2 avoids node-2
Test the fencing
#stonith_admin --reboot node-1
pcs stonith fence node-1
Adding the VIP resource (virtual IP address)
pcs resource create myvip IPaddr2 ip=192.168.97.230 cidr_netmask=24 nic=bond0 op monitor interval=30s on-fail=fence
#pcs constraint location myvip prefers node-1=INFINITY
pcs constraint location myvip prefers node-1=100
pcs constraint location myvip prefers node-2=50
#pcs resource meta myvip resource-stickiness=100
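To see where the VIP currently runs, a small check re-using commands that appear later in these notes:
# Which node holds the VIP, and is the address really plumbed on bond0 there?
crm_resource --resource myvip --locate
ip -4 addr show dev bond0 | grep 192.168.97.230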
Add the ping resource
pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=192.168.97.250 --clone
pcs constraint location myvip rule score=-INFINITY pingd lt 1 or not_defined pingd
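To confirm that the connectivity attribute is being updated on every node (a quick check; pingd is the agent's default attribute name):
# pingd should be 1000 on every node while 192.168.97.250 answers
crm_mon -1A | grep pingd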
Adding the Apache resource
First, configure http://localhost/server-status (a possible mod_status snippet is sketched below) and stop the Apache service on every node.
curl http://localhost/server-status
systemctl stop httpd.service
systemctl disable httpd.service
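A possible mod_status snippet; the path /etc/httpd/conf.d/status.conf is an assumption, any file included by httpd.conf will do:
# /etc/httpd/conf.d/status.conf (assumed path): expose server-status to localhost only
<Location /server-status>
    SetHandler server-status
    Require local
</Location>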
pcs resource create srvweb apache configfile="/etc/httpd/conf/httpd.conf" statusurl="http://127.0.0.1/server-status" op monitor interval=1min #--clone
# The web server must always run on the node holding the VIP
pcs constraint colocation add srvweb with myvip
# Start the VIP first, then the web server
pcs constraint order myvip then srvweb
Move the VIP
pcs resource move myvip node-1
pcs resource move myvip node-2
Rolling back the move of the VIP (remove the constraint created by the move)
#pcs constraint --full |grep prefer
pcs constraint remove cli-prefer-myvip
pcs resource relocate run
Reset the error counters
#pcs resource failcount reset res1
#crm_resource -P
pcs resource cleanup
Move all resources back to the primary node (ignoring resource stickiness)
#pcs resource relocate show
pcs resource relocate run
Maintenance on a single resource
#pcs resource update fence_node-1 meta target-role=stopped
#pcs resource update fence_node-1 meta is-managed=false
#pcs resource update fence_node-1 op monitor enabled=false
#pcs resource disable fence_node-1
pcs resource unmanage fence_node-1
Cluster-wide maintenance
pcs property set maintenance-mode=true
End of maintenance
pcs property set maintenance-mode=false
Stopping the cluster
pcs cluster stop --all
pcs cluster disable --all
# Check the corosync configuration syntax
corosync -t
# Check cluster communication
corosync-cfgtool -s
# Check the nodes' network membership
corosync-cmapctl | grep members
Checks
pcs cluster pcsd-status
pcs cluster verify
pcs status corosync
crm_mon -1 --failcounts
crm_mon -1Af
journalctl --since yesterday -p err
journalctl -u pacemaker.service --since "2017-02-24 16:00" -p warning
Monitoring script (these commands must return no output); a possible wrapper is sketched after the commands.
LANG=C pcs status | egrep "Stopped|standby|OFFLINE|UNCLEAN|Failed|error"
crm_verify -LVVV
LANG=C pcs resource relocate show | sed -ne '/Transition Summary:/,$p' | grep -v '^Transition Summary:'
crm_mon -1f | grep -q fail-count
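A minimal wrapper around these checks for a monitoring system; the script path and messages are assumptions, not part of the original setup:
#!/bin/sh
# Hypothetical /usr/local/bin/check_cluster.sh: exit 1 if anything looks unhealthy
BAD=$(LANG=C pcs status | egrep "Stopped|standby|OFFLINE|UNCLEAN|Failed|error")
FAIL=$(crm_mon -1f | grep fail-count)
if [ -n "$BAD" ] || [ -n "$FAIL" ]; then
    echo "CLUSTER WARNING: $BAD $FAIL"
    exit 1
fi
echo "CLUSTER OK"
exit 0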
If the /usr/local/bin/crm_logger.sh script is in place (see below):
tailf /var/log/messages |grep "ClusterMon-External:"
Monitoring script: which node is active?
LANG=C crm_resource --resource myvip --locate |cut -d':' -f2 |tr -d ' '
Does the web server answer on the VIP address? (The return code must be 0.)
#curl -4 -m 1 --connect-timeout 1 http://192.168.97.230/ > /dev/null 2>&1
curl -4 -m 1 --connect-timeout 1 http://192.168.97.230/cl.html > /dev/null 2>&1
#echo $?
Read-only account allowed to run crm_mon
Warning: this account can retrieve the iDRAC/iLO password:
pcs stonith --full |grep passwd
Setup
#adduser rouser
#usermod -a -G haclient rouser
usermod -a -G haclient process
pcs property set enable-acl=true
pcs acl role create read-only description="Read access to cluster" read xpath /cib
#pcs acl user create rouser read-only
pcs acl user create process read-only
#crm_mon --daemonize --as-html /var/www/html/cl.html
/usr/local/bin/crm_logger.sh
#!/bin/sh
# https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/High_Availability_Add-On_Reference/Red_Hat_Enterprise_Linux-7-High_Availability_Add-On_Reference-en-US.pdf
logger -t "ClusterMon-External" "${CRM_notify_node:-x} ${CRM_notify_rsc:-x} \
${CRM_notify_task:-x} ${CRM_notify_desc:-x} ${CRM_notify_rc:-x} \
${CRM_notify_target_rc:-x} ${CRM_notify_status:-x} ${CRM_notify_recipient:-x}";
exit

chmod 755 /usr/local/bin/crm_logger.sh
chown root.root /usr/local/bin/crm_logger.sh
pcs resource create ClusterMon-External ClusterMon update=10000 user=process extra_options="-E /usr/local/bin/crm_logger.sh --watch-fencing" htmlfile=/var/www/html/cl.html pidfile=/tmp/crm_mon-external.pid op monitor on-fail="restart" interval="60" clone
Colocation: keep the monitoring page on the node holding the VIP. Only needed if the resource is not cloned.
pcs constraint colocation add ClusterMon-External with myvip
Test
curl 192.168.97.230/cl.html
In case of problems
pcs resource debug-start resource_id
Redundant Ring Protocol (RRP)
rrp_mode: if set to active, Corosync uses both interfaces actively; if set to passive, Corosync sends messages alternately over the available networks.
Before changing the configuration, put the cluster into maintenance mode:
pcs property set maintenance-mode=true
/etc/hosts
192.168.21.10   node1
192.168.22.10   node1b
192.168.21.11   node2
192.168.22.11   node2b
Add rrp_mode and ring1_addr to /etc/corosync/corosync.conf
totem {
    rrp_mode: active
}
nodelist {
    node {
        ring0_addr: node1
        ring1_addr: node1b
        nodeid: 1
    }
    node {
        ring0_addr: node2
        ring1_addr: node2b
        nodeid: 2
    }
}
pcs cluster reload corosync
pcs status corosync
corosync-cfgtool -s
pcs property unset maintenance-mode
#crm_resource -P
pcs resource cleanup
pcs resource relocate run
#pcs cluster start --all
Test 1: hard crash (kernel panic)
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
Test 2: power failure (unplug the power cable)
Test 3: network failure
ifdown bond0
Test 4: loss of the gateway ping on one of the nodes
iptables -A OUTPUT -d 192.168.97.250/32 -p icmp -j REJECT
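To restore normal behaviour after the test (simply deletes the rule added above):
iptables -D OUTPUT -d 192.168.97.250/32 -p icmp -j REJECT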
Test 5: fork bomb; the node stops responding to everything except ping
Fork bomb
:(){ :|:& };:
Test 6: loss of the iDRAC connection (unplug the cable)
Complete removal of the cluster
pcs cluster stop --force #--all
pcs cluster destroy --force #--all
systemctl stop pcsd
systemctl stop corosync
systemctl stop pacemaker
yum remove -y pcsd corosync pacemaker
userdel hacluster
rm -rf /dev/shm/qb-*-data /dev/shm/qb-*-header
rm -rf /etc/corosync
rm -rf /var/lib/corosync
rm -rf /var/lib/pcsd
rm -rf /var/lib/pacemaker
rm -rf /var/log/cluster/
rm -rf /var/log/pcsd/
rm -f /var/log/pacemaker.log*
Error messages encountered:
UEFI0081: Memory size has changed from the last time the system was started. No action is required if memory was added or removed.
error: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
#pcs property set symmetric-cluster=true
pcs property
Listing resource standards and providers
pcs resource standards
ocf lsb service systemd stonith
pcs resource providers
heartbeat openstack pacemaker
List the agents (examples)
pcs resource agents systemd
pcs resource agents ocf:heartbeat
Default timeout for resources
pcs resource op defaults timeout=240s
Stop all resources
pcs property set stop-all-resources=true
pcs property unset stop-all-resources
Agent scripts on disk:
ocf:pacemaker:ping   -> /usr/lib/ocf/resource.d/pacemaker/ping
ocf:heartbeat:apache -> /usr/lib/ocf/resource.d/heartbeat/apache
egrep '^#.*OCF_RESKEY' /usr/lib/ocf/resource.d/heartbeat/apache
export OCF_ROOT=/usr/lib/ocf/
/usr/lib/ocf/resource.d/heartbeat/apache meta-data
Other: list all resources
crm_resource --list
Dump CIB (Cluster Information Base)
pcs cluster cib
pcs cluster cib cib-dump.xml
Adding a service resource
pcs resource create CRON systemd:crond
#pcs resource op add CRON start interval=0s timeout=1800s
UPDATE
pcs resource update ClusterMon-External htmlfile='/tmp/cl.html'
UNSET
pcs resource update ClusterMon-External htmlfile=
pcs property list --all |grep stonith
Confirm that the node really is powered off. Warning: if it is not actually off, this can cause serious problems.
pcs stonith confirm node2
crm_mon --failcounts
pcs resource failcount show resource_id
pcs resource failcount reset resource_id
Refresh the state and reset the failcount
pcs resource cleanup resource_id
echo "P@ssw0rd" |passwd hacluster --stdin systemctl start pcsd.service systemctl enable pcsd.service pcs cluster auth -u hacluster -p P@ssw0rd 8si-pms-pps-srv-1 8si-pms-pps-srv-2 pcs cluster setup --name my_cluster 8si-pms-pps-srv-1 8si-pms-pps-srv-2 pcs cluster start --all pcs cluster enable --all pcs resource defaults resource-stickiness=100 pcs property set no-quorum-policy=freeze pcs stonith create fence_8si-pms-pps-srv-1 fence_drac5 ipaddr=172.18.202.230 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=8si-pms-pps-srv-1 op monitor interval="60s" pcs stonith create fence_8si-pms-pps-srv-2 fence_drac5 ipaddr=172.18.202.231 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=8si-pms-pps-srv-2 op monitor interval="60s" pcs stonith level add 1 8si-pms-pps-srv-1 fence_8si-pms-pps-srv-1 pcs stonith level add 1 8si-pms-pps-srv-2 fence_8si-pms-pps-srv-2 pcs constraint location fence_8si-pms-pps-srv-1 avoids 8si-pms-pps-srv-1 pcs constraint location fence_8si-pms-pps-srv-2 avoids 8si-pms-pps-srv-2 pcs resource create myvip IPaddr2 ip=172.18.202.226 cidr_netmask=24 nic=bond0 op monitor interval=30s #on-fail=fence pcs constraint location myvip prefers 8si-pms-pps-srv-1=100 pcs constraint location myvip prefers 8si-pms-pps-srv-2=50 #pcs resource meta myvip resource-stickiness=60 # l'utilisateur process doit appartenir au groupe haclient #usermod -a -G haclient process pcs property set enable-acl=true pcs acl role create read-only description="Read access to cluster" read xpath /cib pcs acl user create process read-only pcs resource create ClusterMon-External ClusterMon update=10000 user=process extra_options="-E /usr/local/bin/crm_logger.sh --watch-fencing" htmlfile=/var/www/html/cl.html pidfile=/tmp/crm_mon-external.pid op monitor on-fail="restart" interval="60" clone pcs resource create appmgr systemd:appmgr pcs constraint colocation add appmgr with myvip
See also:
Fencing
Cluster