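The previous article covered a basic slurm cluster setup. This one goes a step further and configures the RESTful API of the slurm service, so that slurm can be integrated into a desktop or web front end and managed from there.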
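1. SLURM cluster configuration
For the cluster itself, see the earlier article: Installing and configuring the SLURM HPC cluster resource management service, based on slurm 22.05.9 and CentOS 9 Stream, with slurmdbd as the accounting storage service. That article walks through the usual steps of configuring a slurm cluster, focusing on the conf files; the official documentation is complete but makes it hard to tell which settings are required, so the article sticks to the commonly used ones. Note the slurm version stated there and try to use the same version everywhere, since mixing versions tends to cause puzzling errors. https://blog.csdn.net/zrc_xiaoguo/article/details/134634440?spm=1001.2014.3001.5502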
2. Installing the slurmrestd service
### List the available packages
yum list | grep slurmr
slurm-slurmrestd.x86_64 22.05.9-1.el9 epel
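## The package again comes from the epel repository; if the slurm service was set up as described earlier, this should just work.
## If slurmrestd was already installed while setting up the slurm service, there is no need to install it again.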
rpm -qa | grep slurmrestd
## Install slurmrestd
yum install slurm-slurmrestd -y
3. Configuring the slurmrestd service (important)
### First look at the slurmrestd service as configured by systemd
systemctl status slurmrestd
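Do not start the slurmrestd service yet. With the packaged unit file it fails with an error saying that slurmrestd must not be started as root.
By default the service listens on every IP of the node it is installed on, that is 0.0.0.0, on port 6820. This port can easily collide with the ports used by a ceph cluster, so changing it is recommended.
Because of the root error, the slurmrestd.service file has to be edited to set the account the service runs as. Here we simply reuse the existing slurm account. The comment in the unit file itself advises against both root and the slurm user, so creating a dedicated account such as slurmrestd or slurmapi is the more cautious choice. The directory that holds the slurmrestd.socket file must also be owned by the account that runs slurmrestd.
### Edit the service file directly; back it up before editing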
vim /usr/lib/systemd/system/slurmrestd.service
[Unit]
Description=Slurm REST daemon
After=network-online.target slurmctld.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmrestd
EnvironmentFile=-/etc/default/slurmrestd
# slurmrestd should not run as root or the slurm user.
# Please either use the -u and -g options in /etc/sysconfig/slurmrestd or
# /etc/default/slurmrestd, or explicitly set the User and Group in this file
# an unpriviledged user to run as.
# Key setting: the account and group that slurmrestd runs as.
# Comments go on their own lines; systemd does not treat a trailing # as a comment.
User=slurm
Group=slurm
# Default to listen on both socket and slurmrestd port
# Also important: slurmrestd.socket is placed under /var/run/slurm, a directory owned by slurm,
# and the service port is changed to 46820.
ExecStart=/usr/sbin/slurmrestd $SLURMRESTD_OPTIONS unix:/var/run/slurm/slurmrestd.socket 0.0.0.0:46820
# Enable auth/jwt be default, comment out the line to disable it for slurmrestd
Environment="SLURM_JWT=daemon"
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
######################################################
## After editing the service file, run daemon-reload
systemctl daemon-reload
#### The service still cannot be started at this point, because the directory ownership has not been changed yet
chown slurm:slurm /var/run/slurm
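As an alternative to editing the packaged unit file, the same three settings can be kept in a systemd drop-in, which survives package updates. The following is only a sketch under the assumptions of this article (account slurm, socket under /var/run/slurm, port 46820); the function name write_slurmrestd_dropin is made up for illustration, and the target directory is passed in so the function can be tried on a scratch directory first.

```shell
#!/bin/sh
# Sketch: write a drop-in that overrides User, Group and ExecStart of slurmrestd.
# Assumptions: the "slurm" account, socket path and port 46820 used in this article.

write_slurmrestd_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    # The quoted 'EOF' keeps $SLURMRESTD_OPTIONS literal so systemd expands it later.
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
User=slurm
Group=slurm
# An empty ExecStart= clears the packaged command before the new one is set.
ExecStart=
ExecStart=/usr/sbin/slurmrestd $SLURMRESTD_OPTIONS unix:/var/run/slurm/slurmrestd.socket 0.0.0.0:46820
EOF
}

# Usage (as root):
#   write_slurmrestd_dropin /etc/systemd/system/slurmrestd.service.d
#   systemctl daemon-reload
```

The directory ownership change on /var/run/slurm is still needed with this approach.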
4. Configuring JWT Authentication (important)
### Generate the JWT key; choose the directory to suit your own setup
dd if=/dev/random of=/var/spool/slurm/statesave/jwt_hs256.key bs=32 count=1
chown slurm:slurm /var/spool/slurm/statesave/jwt_hs256.key
chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
### Security note (from the official documentation)
The key does not have to be in the StateSaveLocation, but that is a convenient location if you have multiple controllers since it is shared between them. The key should not be placed in a directory where non-admin users might be able to access it. The key file should be owned by SlurmUser or root, with recommended permissions of 0400. The file must not be accessible by 'other'.
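The requirement that the key file must not be accessible by 'other' can be checked with a few lines of shell. This is a minimal sketch, assuming GNU stat (as shipped on CentOS); the function name jwt_key_not_world_accessible is hypothetical and not part of slurm.

```shell
#!/bin/sh
# Sketch: return 0 when the key file grants no permissions to "other", 1 otherwise.
# Assumption: GNU coreutils stat, where -c '%a' prints the octal mode (e.g. 600).

jwt_key_not_world_accessible() {
    key_path="$1"
    mode=$(stat -c '%a' "$key_path") || return 2
    # The last octal digit holds the permissions for "other"; it must be 0.
    case "$mode" in
        *0) return 0 ;;
        *)  return 1 ;;
    esac
}

# Usage:
#   jwt_key_not_world_accessible /var/spool/slurm/statesave/jwt_hs256.key \
#       || echo "key file is accessible by other users"
```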
### Edit slurm.conf and add or change the following two parameters
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
## Sync slurm.conf and the JWT key file to all nodes, otherwise slurm may warn that the node configurations differ
scp host1:/var/spool/slurm/statesave/jwt_hs256.key /var/spool/slurm/statesave/jwt_hs256.key
scp host1:/etc/slurm/slurm.conf /etc/slurm
chown slurm:slurm /var/spool/slurm/statesave/jwt_hs256.key
### Restart the slurmctld service
systemctl restart slurmctld
Getting a token
### Request a token directly; the default lifetime is 1800 seconds and can be set explicitly
scontrol token username=slurmuser1
SLURM_JWT=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjzAsImlhdCI6MTcwMTE0MDkzMCwic3VuIjoidHpoeCJ9.vUz2V02dFpXmAr8eAJyRGNwcMe0xdqm7UgDvuM
### Set the token lifetime to 600 seconds
scontrol token username=slurmuser1 lifespan=600
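Because a token expires, a new one has to be generated when access is needed, so real applications and terminal tests need some way to obtain the user's token automatically.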
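One way to automate this in a script is to wrap scontrol token and strip the SLURM_JWT= prefix from its output. The sketch below assumes the output format shown above (a single SLURM_JWT=... line); the function names extract_slurm_jwt and fetch_slurm_jwt are made up for illustration.

```shell
#!/bin/sh
# Sketch: obtain a fresh token from `scontrol token` for use in API calls.
# Assumption: scontrol prints one line of the form SLURM_JWT=<token>.

# Read scontrol output on stdin and print only the token value.
extract_slurm_jwt() {
    sed -n 's/^SLURM_JWT=//p' | head -n 1
}

# Request a token for the given user; the lifespan defaults to 600 seconds here.
fetch_slurm_jwt() {
    jwt_user="$1"
    jwt_lifespan="${2:-600}"
    scontrol token username="$jwt_user" lifespan="$jwt_lifespan" | extract_slurm_jwt
}

# Usage (on a node with a working scontrol):
#   SLURM_JWT=$(fetch_slurm_jwt slurmuser1 600)
#   export SLURM_JWT
```

Calling fetch_slurm_jwt right before each batch of requests avoids having to track the expiry time separately.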
The request parameters are not covered in detail here; a successful request returns the corresponding data.
5. Starting the slurmrestd system service
#### With the ownership fixed, start the slurmrestd system service and enable it at boot
systemctl enable slurmrestd --now
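6. Testing and using the slurm API
Once the slurmrestd system service is running, the API can be tested with an API testing tool or with your own scripts, for example in nodejs.
By default the service gives no hints about the API: every wrong path and every authorization error produces the same "Authentication failure" response, and opening ip:port directly also ends in an authorization error. The most important step is therefore configuring the API authentication method, auth/jwt.
The official site documents the JWT setup. Using JWT means AuthAltTypes has to be set in slurm.conf, so slurm.conf must be updated and copied to all nodes again.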
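7. slurmrestd API list
The official documentation is clear on this; if you can open it, read it directly:
Slurm Workload Manager (schedmd.com) https://slurm.schedmd.com/rest_api.html
The documentation covers access: X-SLURM-USER-NAME and X-SLURM-USER-TOKEN have to be added to the request headers. The list below is copied from the online documentation; the version segment in the paths (v0.0.40 here) depends on the slurm release, so check GET /openapi/v3 on your own server for the paths it actually serves.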
Access
- APIKey KeyParamName:X-SLURM-USER-NAME KeyInQuery:false KeyInHeader:true
- APIKey KeyParamName:X-SLURM-USER-TOKEN KeyInQuery:false KeyInHeader:true
- HTTP Basic Authentication
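The two header names above are all a request needs once a token is available. The following curl wrapper is a sketch only: the function name slurm_api_get, the base URL and the user name in the usage lines are placeholders, and the port 46820 is the one chosen earlier in this article.

```shell
#!/bin/sh
# Sketch: send a GET request to slurmrestd with the two authentication headers.
# Assumptions: curl is installed; base URL, user and token are supplied by the caller.

slurm_api_get() {
    api_base_url="$1"   # e.g. http://127.0.0.1:46820
    api_path="$2"       # e.g. /openapi/v3
    api_user="$3"
    api_token="$4"
    curl -sS \
        -H "X-SLURM-USER-NAME: ${api_user}" \
        -H "X-SLURM-USER-TOKEN: ${api_token}" \
        "${api_base_url}${api_path}"
}

# Usage (token taken from the SLURM_JWT=... line printed by scontrol token):
#   slurm_api_get http://127.0.0.1:46820 /openapi/v3 slurmuser1 "$SLURM_JWT"
```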
Methods
Openapi
- GET /openapi
- GET /openapi/v3
- GET /openapi.json
- GET /openapi.yaml
Slurm
- DELETE /slurm/v0.0.40/job/{job_id}
- DELETE /slurm/v0.0.40/node/{node_name}
- GET /slurm/v0.0.40/diag
- GET /slurm/v0.0.40/job/{job_id}
- GET /slurm/v0.0.40/jobs
- GET /slurm/v0.0.40/licenses
- GET /slurm/v0.0.40/node/{node_name}
- GET /slurm/v0.0.40/nodes
- GET /slurm/v0.0.40/partition/{partition_name}
- GET /slurm/v0.0.40/partitions
- GET /slurm/v0.0.40/ping
- GET /slurm/v0.0.40/reconfigure
- GET /slurm/v0.0.40/reservation/{reservation_name}
- GET /slurm/v0.0.40/reservations
- GET /slurm/v0.0.40/shares
- POST /slurm/v0.0.40/job/{job_id}
- POST /slurm/v0.0.40/job/submit
- POST /slurm/v0.0.40/node/{node_name}
Slurmdb
- DELETE /slurmdb/v0.0.40/account/{account_name}
- DELETE /slurmdb/v0.0.40/association
- DELETE /slurmdb/v0.0.40/associations
- DELETE /slurmdb/v0.0.40/cluster/{cluster_name}
- DELETE /slurmdb/v0.0.40/qos/{qos}
- DELETE /slurmdb/v0.0.40/user/{name}
- DELETE /slurmdb/v0.0.40/wckey/{id}
- GET /slurmdb/v0.0.40/account/{account_name}
- GET /slurmdb/v0.0.40/accounts
- GET /slurmdb/v0.0.40/association
- GET /slurmdb/v0.0.40/associations
- GET /slurmdb/v0.0.40/cluster/{cluster_name}
- GET /slurmdb/v0.0.40/clusters
- GET /slurmdb/v0.0.40/config
- GET /slurmdb/v0.0.40/diag
- GET /slurmdb/v0.0.40/instance
- GET /slurmdb/v0.0.40/instances
- GET /slurmdb/v0.0.40/job/{job_id}
- GET /slurmdb/v0.0.40/jobs
- GET /slurmdb/v0.0.40/qos
- GET /slurmdb/v0.0.40/qos/{qos}
- GET /slurmdb/v0.0.40/tres
- GET /slurmdb/v0.0.40/user/{name}
- GET /slurmdb/v0.0.40/users
- GET /slurmdb/v0.0.40/wckey/{id}
- GET /slurmdb/v0.0.40/wckeys
- POST /slurmdb/v0.0.40/accounts
- POST /slurmdb/v0.0.40/accounts_association
- POST /slurmdb/v0.0.40/associations
- POST /slurmdb/v0.0.40/clusters
- POST /slurmdb/v0.0.40/config
- POST /slurmdb/v0.0.40/qos
- POST /slurmdb/v0.0.40/tres
- POST /slurmdb/v0.0.40/users
- POST /slurmdb/v0.0.40/users_association
- POST /slurmdb/v0.0.40/wckeys
8. Example request results
GET /openapi/v3
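The response to GET /openapi/v3 is an OpenAPI description of every path the server offers, which makes it a convenient way to find out which API versions a given slurmrestd actually serves. The filter below is a rough sketch: it assumes the paths appear in the JSON as /slurm/vX.Y.Z/... or /slurmdb/vX.Y.Z/... (with or without escaped slashes) and simply greps for them rather than parsing the JSON; list_api_versions is a made-up helper name.

```shell
#!/bin/sh
# Sketch: list the distinct API versions mentioned in an /openapi/v3 response.
# Assumption: paths look like /slurm/v0.0.NN/... or /slurmdb/v0.0.NN/...,
# possibly with slashes escaped as \/ by the JSON serializer.

list_api_versions() {
    grep -oE 'slurm(db)?\\?/v[0-9]+\.[0-9]+\.[0-9]+' \
        | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+$' \
        | sort -u
}

# Usage (host, port and user are placeholders; the token comes from scontrol token):
#   curl -sS -H "X-SLURM-USER-NAME: slurmuser1" \
#        -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
#        http://127.0.0.1:46820/openapi/v3 | list_api_versions
```

Whatever versions this prints are the ones to substitute for v0.0.40 in the paths listed in the previous section.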