{{tag>Brouillon Cluster Grid Ressource}}

= Notes on cluster / grid batch schedulers: Slurm

== Slurm

Links:

  * http://cascisdi.inra.fr/sites/cascisdi.inra.fr/files/slurm_0.txt
  * https://wiki.fysik.dtu.dk/niflheim/SLURM
  * https://www.glue.umd.edu/hpcc/help/slurm-vs-moab.html
  * https://www.crc.rice.edu/wp-content/uploads/2014/08/Torque-to-SLURM-cheatsheet.pdf
  * http://slurm.schedmd.com/rosetta.pdf
  * http://www.accre.vanderbilt.edu/wp-content/uploads/2012/04/Slurm.pdf
  * https://github.com/accre/SLURM
  * http://slurm.schedmd.com/quickstart.html
  * http://slurm.schedmd.com/slurm_ug_2011/Basic_Configuration_Usage.pdf
  * https://www.unila.edu.br/sites/default/files/files/user_guide_slurm.pdf
  * https://computing.llnl.gov/tutorials/slurm/slurm.pdf
  * https://computing.llnl.gov/tutorials/bgq/
  * https://computing.llnl.gov/linux/slurm/quickstart.html
  * https://computing.llnl.gov/linux/slurm/faq.html
  * https://rc.fas.harvard.edu/resources/running-jobs/
  * http://bap-alap.blogspot.fr/2012_09_01_archive.html
  * https://fortylines.com/blog/startingWithSLURM.blog.html
  * https://github.com/ciemat-tic/codec/wiki/Slurm-cluster
  * http://manx.classiccmp.org/mirror/techpubs.sgi.com/library/manuals/5000/007-5814-001/pdf/007-5814-001.pdf
  * http://wildflower.diablonet.net/~scaron/slurmsetup.html
  * http://wiki.sc3.uis.edu.co/index.php/Slurm_Installation
  * http://eniac.cyi.ac.cy/display/UserDoc/Copy+of+Slurm+notes
  * http://www.ibm.com/developerworks/library/l-slurm-utility/index.html
  * https://www.lrz.de/services/compute/linux-cluster/batch_parallel/
  * http://www.gmpcs.lumat.u-psud.fr/spip.php?rubrique35
  * https://services-numeriques.unistra.fr/hpc/applications-disponibles/systeme-de-files-dattente-slurm.html
  * http://www.brightcomputing.com/Blog/bid/174099/Slurm-101-Basic-Slurm-Usage-for-Linux-Clusters
  * https://dashboard.hpc.unimelb.edu.au/started/

API:

  * http://slurm.schedmd.com/slurm_ug_2012/pyslurm.pdf

See also:

  * https://aws.amazon.com/fr/batch/use-cases

== To do

MPI with Slurm:

  * http://slurm.schedmd.com/mpi_guide.html
  * openmpi
    * https://www.hpc2n.umu.se/batchsystem/slurm_info
  * hwloc-nox (Portable Linux Processor Affinity (PLPA))
  * https://computing.llnl.gov/linux/slurm/mpi_guide.html
  * https://computing.llnl.gov/tutorials/openMP/ProcessThreadAffinity.pdf
  * https://www.open-mpi.org/faq/?category=slurm
  * http://stackoverflow.com/questions/31848608/slurms-srun-slower-than-mpirun
  * https://www.rc.colorado.edu/support/examples-and-tutorials/parallel-mpi-jobs.html
  * http://www.brightcomputing.com/Blog/bid/149455/How-to-run-an-OpenMPI-job-in-Bright-Cluster-Manager-through-Slurm
  * http://www.hpc2n.umu.se/node/875

== Install

By default Slurm uses **munge** to tie the machine accounts together.

**All machines must have their clocks synchronized.**

Manager:

  apt-get install slurm-wlm

Nodes:

  apt-get install -y slurmd slurm-wlm-basic-plugins

Manager and nodes:

  systemctl enable munge.service
  zcat /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz > /etc/slurm-llnl/slurm.conf

slurm.conf has to be adapted; it can be generated from:

  * /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html
  * /usr/share/doc/slurmctld/slurm-wlm-configurator.html
  * https://computing.llnl.gov/linux/slurm/configurator.html

Copy the same configuration file onto the nodes (the same file on the manager as on the nodes), along with the munge key:

  scp -3 vmdeb1:/etc/slurm-llnl/slurm.conf vmdeb2:/etc/slurm-llnl/slurm.conf
  scp -3 vmdeb1:/etc/munge/munge.key vmdeb2:/etc/munge/munge.key
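With more than one node, repeating these copies by hand gets tedious. Below is a minimal helper sketch (not part of the original notes): it pushes the manager's slurm.conf and munge key to every compute node and restarts the services, assuming it is run as root on the manager, with root SSH access to the nodes, and using the node names vmdeb2/vmdeb3 from the example configuration.

  #!/bin/bash
  # Sketch: distribute slurm.conf and the munge key from the manager to the
  # compute nodes, then restart munge and slurmd so they pick up the changes.
  # Assumes root SSH access and the hypothetical node list below.
  set -e

  NODES="vmdeb2 vmdeb3"

  for node in $NODES; do
      scp /etc/slurm-llnl/slurm.conf "root@${node}:/etc/slurm-llnl/slurm.conf"
      scp /etc/munge/munge.key "root@${node}:/etc/munge/munge.key"
      # munged refuses to start if the key is readable by other users
      ssh "root@${node}" "chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
      ssh "root@${node}" "systemctl restart munge slurmd"
  done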
List the running daemons:

  scontrol show daemons

On the master (ControlMachine): slurmctld, slurmd \\
On the nodes: slurmd

''/etc/slurm-llnl/slurm.conf''

  # slurm.conf file generated by configurator easy.html.
  # Put this file on all nodes of your cluster.
  # See the slurm.conf man page for more information.
  #
  ControlMachine=vmdeb1
  #ControlAddr=127.0.0.1
  #
  #MailProg=/bin/mail
  #MpiDefault=none
  MpiDefault=openmpi
  MpiParams=ports=12000-12999
  #MpiParams=ports=#-#
  #ProctrackType=proctrack/pgid
  ProctrackType=proctrack/linuxproc
  SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
  #SlurmctldPort=6817
  SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
  #SlurmdPort=6818
  SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
  SlurmUser=slurm
  #SlurmdUser=root
  #UsePAM=1
  DisableRootJobs=YES
  EnforcePartLimits=YES
  JobRequeue=0
  ReturnToService=1
  #TopologyPlugin=topology/tree
  # Must be writable by user SlurmUser. The file must be accessible by the primary and backup control machines.
  # On NFS share !? See http://manx.classiccmp.org/mirror/techpubs.sgi.com/library/manuals/5000/007-5814-001/pdf/007-5814-001.pdf
  StateSaveLocation=/var/lib/slurm-llnl/slurmctld
  SwitchType=switch/none
  #TaskPlugin=task/none
  #TaskPlugin=task/cgroup
  TaskPlugin=task/affinity
  #
  #
  # TIMERS
  #KillWait=30
  #MinJobAge=300
  #SlurmctldTimeout=120
  #SlurmdTimeout=300
  Waittime=0
  #
  #
  # SCHEDULING
  FastSchedule=1
  SchedulerType=sched/backfill
  #SchedulerPort=7321
  SelectType=select/linear
  #
  #
  # LOGGING AND ACCOUNTING
  ClusterName=cluster1
  #JobAcctGatherFrequency=30
  JobAcctGatherType=jobacct_gather/linux
  #SlurmctldDebug=3
  SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
  #SlurmdDebug=3
  SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
  SlurmSchedLogFile=/var/log/slurm-llnl/slurmSched.log
  #JobCompType=jobcomp/filetxt
  #JobCompType=jobcomp/mysql
  JobCompType=jobcomp/none
  JobCompLoc=/var/log/slurm-llnl/jobcomp
  #JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
  #AccountingStorageType=jobacct_gather/linux
  #AccountingStorageType=accounting_storage/filetxt
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStoreJobComment=YES
  DefaultStorageType=accounting_storage/slurmdbd
  #AccountingStorageLoc=/var/log/slurm-llnl/accounting
  AccountingStoragePort=6819
  AccountingStorageEnforce=associations
  #
  #
  NodeName=vmdeb1
  # COMPUTE NODES
  NodeName=DEFAULT
  PartitionName=DEFAULT MaxTime=INFINITE State=UP
  NodeName=vmdeb2 CPUs=1 RealMemory=494 State=UNKNOWN
  NodeName=vmdeb3 CPUs=2 RealMemory=494 TmpDisk=8000 State=UNKNOWN
  PartitionName=debug Nodes=vmdeb[2-3] Default=YES MaxTime=INFINITE Shared=YES State=UP

=== Installing slurmdbd

Using MySQL is recommended (not all features are available with PostgreSQL, which is a pity).

Here we assume that a MySQL database, account and privileges have already been created.

  apt-get install slurmdbd
  zcat /usr/share/doc/slurmdbd/examples/slurmdbd.conf.simple.gz > /etc/slurm-llnl/slurmdbd.conf

Adapt the slurmdbd.conf file, then:

  service slurmdbd restart

Test:

  sacct
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
  ------------ ---------- ---------- ---------- ---------- ---------- --------
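With accounting in place, a quick way to exercise the whole chain (slurmctld, slurmd, slurmdbd) is to submit a trivial batch job and check that it shows up in ''sacct''. A minimal sketch; the script name, job name and output path are arbitrary, and ''debug'' is the partition declared in the slurm.conf above.

  #!/bin/bash
  #SBATCH --job-name=hello         # arbitrary job name
  #SBATCH --partition=debug        # partition defined in slurm.conf
  #SBATCH --ntasks=1               # one task is enough for a smoke test
  #SBATCH --output=hello_%j.out    # %j is replaced by the job ID

  # Print the node the job actually ran on.
  srun hostname

Submit it and check the queue and the accounting database:

  sbatch hello.sh
  squeue
  sacct --format=JobID,JobName,Partition,State,ExitCode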
== Problems

  munge -n | ssh vmdeb1 unmunge
  STATUS: Expired credential (15)

Solution:

  ntpdate -u pool.ntp.org

Running the daemons in the foreground for debugging:

  sudo -u slurm -- /usr/sbin/slurmctld -Dcvvvv
  /usr/sbin/slurmd -Dcvvvv

  * -c : Clear: wipe the previous state, purge the jobs, ...
  * -D : run in the foreground; logs go to STDOUT
  * -v : Verbose: verbose mode; add several "v" to be very verbose

Display the configuration of the current host:

  slurmd -C

Help: the **man** pages, plus

  command --help
  command --usage

Variables:

  SQUEUE_STATES=all

for the squeue command to display jobs in any state (including jobs in COMPLETED and CANCELLED state).

Commands:

  sbatch
  salloc
  srun
  sattach

  srun -l --ntasks-per-core=1 --exclusive -n 2 hostname
  sinfo --Node
  scontrol show partition
  scancel --user=test --state=pending
  scontrol show config
  scontrol show job
  scancel -i --user=test

  # The Slurm -d singleton argument tells Slurm not to dispatch this job until all previous jobs with the same name have completed.
  sbatch -d singleton simple.sh

  scontrol ping
  sinfo -R

  # Also show finished jobs
  squeue -t all

  # A/I/O/T = "active(in use)/idle/other/total"
  sinfo -l
  # sinfo -Nle -o '%n %C %t'

=== Tips

==== Run an **srun** command without waiting

Normally:

  $ srun -N2 -l hostname
  srun: job 219 queued and waiting for resources

Solution (root account or the "SlurmUser"):

  # sinfo --noheader -o %N
  vmdeb[2-3]
  # srun -N2 --no-allocate -w vmdeb[2-3] hostname

----

Cancel / terminate a job stuck in "CG" (completing) state:

  scontrol update nodename=node4-blender state=down reason=hung
  scontrol update nodename=node4-blender state=idle

You will also have to kill the 'slurmstepd' process on the nodes.

Network flow issue: the nodes must be able to reach the manager (Node => Manager:TCP6817).

Problem: "srun: error: Application launch failed: User not found on host"

Solution: the same user must have the same UID on the nodes as on the manager. Apparently this is related to **munge**. Using LDAP here can be worthwhile (a quick consistency check is sketched below).
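A minimal sketch (not from the original notes) for checking that UID consistency: it compares a user's UID on the manager with the UID reported by each compute node, assuming SSH access to the nodes and the node names used on this page (the node list and script name are only examples).

  #!/bin/bash
  # Sketch: verify that a user exists with the same UID on the manager and on
  # every compute node; a missing user or a different UID typically leads to
  # the "User not found on host" error mentioned above.
  user="${1:?usage: $0 username}"
  NODES="vmdeb2 vmdeb3"

  local_uid=$(id -u "$user")
  echo "manager: $user uid=$local_uid"

  for node in $NODES; do
      remote_uid=$(ssh "$node" id -u "$user" 2>/dev/null)
      if [ "$remote_uid" = "$local_uid" ]; then
          echo "$node: OK (uid=$remote_uid)"
      else
          echo "$node: MISMATCH or missing user (got '$remote_uid')"
      fi
  done

If the UIDs drift apart, centralizing accounts (LDAP, or at least synchronized /etc/passwd entries) avoids the problem.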