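I. Concepts
1. Introduction
Gluster is a scale-out distributed file system that aggregates disk storage from multiple servers into a single global namespace, so additional storage can be provisioned quickly as consumption grows. Automatic failover is one of its core features.
- A distributed storage system; clustered NAS storage.
- No centralized metadata service; files are located with a hash algorithm.
- Consistent hashing through a DHT (distributed hash table).
- A file is stored on whichever brick owns the hash range that its hash value falls into.
- Elastic volume management.
- RAID-like redundancy is handled automatically (through replicated volumes).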
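The hash-range lookup described above can be sketched in a few lines of shell. This is an illustration only, not the real GlusterFS algorithm: the CRC printed by cksum stands in for the DHT hash, the 32-bit hash space is assumed to be split evenly among the bricks, and the function name pick_brick is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: cksum's CRC stands in for the real DHT hash function.
# pick_brick NAME BRICK_COUNT
# Prints the 0-based index of the brick whose hash range contains hash(NAME).
pick_brick() {
    local name=$1 bricks=$2
    local hash range
    # cksum prints "<crc> <byte count>"; keep only the 32-bit CRC.
    hash=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    # Split the 32-bit hash space into equal ranges, one per brick (rounded up).
    range=$(( (4294967296 + bricks - 1) / bricks ))
    echo $(( hash / range ))
}

# The same name always maps to the same brick; no metadata server is consulted.
for f in a.log b.log c.log; do
    echo "$f -> brick $(pick_brick "$f" 3)"
done
```

Because the placement is a pure function of the file name, any client can compute it locally, which is why no central metadata service is needed.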
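2. Advantages
- Scales to several petabytes; handles thousands of clients; open source.
- POSIX compatible.
- Works on top of any on-disk file system that supports extended attributes; accessible through industry-standard protocols such as NFS and SMB.
- Provides replication, quotas, geo-replication, snapshots, and bitrot detection.
- Can be tuned for different workloads.
3. Disadvantages
- Not suited to workloads with large numbers of small files. GlusterFS was designed for large data and is poorly optimized for small files; it is best used where individual files are at least 1 MB. For many small files, consider FastDFS, MFS, or similar.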
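4. Volumes
- Distributed volume (the default), i.e. DHT: each file is placed, by hash, on a single server node.
- Replicated volume, i.e. AFR: created with replica x; each file is copied to x nodes.
- Striped volume, i.e. Striped: created with stripe x; each file is cut into chunks stored across x nodes (similar to RAID 0). Striped volumes are deprecated in newer GlusterFS releases.
- Distributed striped volume: needs at least 4 servers. Created with stripe 2 over 4 bricks; a combination of DHT and Striped.
- Distributed replicated volume: needs at least 4 servers. Created with replica 2 over 4 bricks; a combination of DHT and AFR.
- Striped replicated volume: needs at least 4 servers. Created with stripe 2 replica 2 over 4 bricks; a combination of Striped and AFR.
- All three combined: needs at least 8 servers. Created with stripe 2 replica 2; every 4 nodes form one group.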
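The brick arithmetic behind these volume types can be checked with a small helper. This is a simplified sketch: it treats replica x stripe as one group size and prints "groups x group size = bricks", which is not necessarily the exact format the gluster CLI prints for striped volumes. The function name volume_layout is hypothetical.

```shell
#!/usr/bin/env bash
# volume_layout BRICKS [REPLICA] [STRIPE]
# Prints "<groups> x <group size> = <bricks>", where the group size is
# replica * stripe and the groups are what DHT distributes files across.
volume_layout() {
    local bricks=$1 replica=${2:-1} stripe=${3:-1}
    local group=$(( replica * stripe ))
    if (( bricks < group || bricks % group != 0 )); then
        echo "invalid: $bricks bricks is not a multiple of $group" >&2
        return 1
    fi
    echo "$(( bricks / group )) x $group = $bricks"
}

volume_layout 3 3      # pure replicated volume:      1 x 3 = 3
volume_layout 4 2      # distributed replicated:      2 x 2 = 4
volume_layout 8 2 2    # all three modes combined:    2 x 4 = 8
```

A first number of 1 means there is nothing to distribute across, so the volume is purely replicated or striped; anything larger adds the distributed layer.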
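II. Deployment
1. Configuration
- Some bricks form one replicated set and other bricks form further replicated sets. A single file is kept as copies inside one replicated set, while different files are hash-distributed across the sets; in other words, the distributed volume spans the replicated sets.
- For a distributed replicated volume, the number of brick servers must be a multiple of the replica count and at least twice as large, so replica 2 needs at least 4 brick servers. Bricks that form one replicated set should have equal capacity.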
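The multiple-of-replica rule can be enforced before calling the CLI. The sketch below only prints a gluster volume create command and never runs it; the function name build_create_cmd and the fourth host g4 are hypothetical. Bricks are listed in order, and consecutive bricks form one replicated set.

```shell
#!/usr/bin/env bash
# build_create_cmd VOLUME REPLICA BRICK_PATH HOST...
# Prints (does not execute) a "gluster volume create" command, refusing
# host lists whose length is not a multiple of the replica count.
build_create_cmd() {
    local vol=$1 replica=$2 path=$3
    shift 3
    local hosts=("$@")
    if (( ${#hosts[@]} % replica != 0 )); then
        echo "brick count ${#hosts[@]} is not a multiple of replica $replica" >&2
        return 1
    fi
    local cmd="gluster volume create $vol replica $replica"
    local h
    for h in "${hosts[@]}"; do
        cmd+=" $h:$path"
    done
    echo "$cmd"
}

# Four bricks with replica 2: (g1, g2) and (g3, g4) become the two replicated sets.
build_create_cmd gv0 2 /data/brick1/gv0 g1 g2 g3 g4
```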
IP | hostname | Spec (OS, CPU/RAM) | Notes |
---|---|---|---|
192.168.100.155 | g1 | CentOS 7 1C2G | One additional disk |
192.168.100.156 | g2 | CentOS 7 1C2G | One additional disk |
192.168.100.157 | g3 | CentOS 7 1C2G | One additional disk |
192.168.100.154 | k8s | CentOS 7 2C4G | Runs Heketi; because of limited resources it also hosts a small k8s cluster |
192.168.100.158 | / | CentOS 7 2C4G | With enough resources, k8s could be deployed here instead |
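2. Deployment
- Run the following on all three nodes
# Disable the firewall and SELinux
systemctl disable --now firewalld
setenforce 0
# Add the cluster hosts
vim /etc/hosts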
192.168.100.155 g1
192.168.100.156 g2
192.168.100.157 g3
# repo
wget -O /etc/yum.repos.d/CentOS-Base.repo https://repo.huaweicloud.com/repository/conf/CentOS-7-reg.repo
yum -y install centos-release-gluster
# Install and start the service
yum -y install glusterfs-server
systemctl enable glusterd.service --now
# Prepare the disk (sdb is the additional, empty disk)
[root@g1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 10G 0 disk
└─sda1 8:1 0 10G 0 part /
sdb 8:16 0 20G 0 disk
sr0 11:0 1 4.3G 0 rom
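# Create the mount point on all three nodes
mkdir -p /data/brick1
# Partition the disk interactively (create a single partition, /dev/sdb1)
fdisk /dev/sdb
# Create the file system
mkfs.xfs -i size=512 /dev/sdb1
# Mount at boot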
echo '/dev/sdb1 /data/brick1 xfs defaults 0 0 ' >> /etc/fstab
mount -a
# Verify the mount
df -h
# Create the directory that will back the GlusterFS volume
mkdir /data/brick1/gv0
- Run the following on node g1
ssh-keygen
ssh-copy-id g1
ssh-copy-id g2
ssh-copy-id g3
# Build the trusted storage pool
gluster peer probe g2
gluster peer probe g3
# Peer status can be checked from any node
gluster peer status
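# Create the volume. Without a type the volume is distributed; a replicated volume needs an explicit replica count.
# The distributed part never has to be named: when the brick count is a larger multiple of the replica count,
# the result is a distributed replicated volume. Here 3 bricks with replica 3 give a purely replicated volume.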
gluster volume create gv0 replica 3 g1:/data/brick1/gv0 g2:/data/brick1/gv0 g3:/data/brick1/gv0
# Start the new volume
gluster volume start gv0
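## For reference: stop the volume (only when it is no longer needed)
gluster volume stop gv0
## For reference: delete the volume (it must be stopped first)
gluster volume delete gv0
# Show volume information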
[root@g1 ~]# gluster volume info
Volume Name: gv0
Type: Replicate
Volume ID: c5f0bbe3-afae-4a4a-9ab6-4cfa284897ed
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: g1:/data/brick1/gv0
Brick2: g2:/data/brick1/gv0
Brick3: g3:/data/brick1/gv0
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
# Show volume status
gluster volume status
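# Test: mount the volume on a client and copy some files into it
mkdir /seek
mount -t glusterfs g1:/gv0 /seek
for i in $(seq -w 1 10); do cp -rp /var/log/messages /seek/copy-test-$i; done
# Because the volume keeps three replicas, the brick on every node holds all 10 files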
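The replica check can be scripted instead of inspected by eye. The sketch below counts the test files in a directory and is demonstrated against a throwaway local directory; the function name count_copies is hypothetical, and the commented SSH loop assumes the passwordless SSH from g1 set up earlier.

```shell
#!/usr/bin/env bash
# count_copies DIR
# Prints the number of copy-test-* files directly inside DIR.
count_copies() {
    find "$1" -maxdepth 1 -type f -name 'copy-test-*' | wc -l | tr -d '[:space:]'
}

# Local demonstration with ten empty files in a temporary directory.
demo=$(mktemp -d)
for i in $(seq -w 1 10); do : > "$demo/copy-test-$i"; done
echo "files: $(count_copies "$demo")"   # files: 10
rm -rf "$demo"

# On the cluster, the same count should come back from every brick:
#   for h in g1 g2 g3; do ssh "$h" "ls /data/brick1/gv0 | grep -c copy-test-"; done
```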
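III. k8s and GlusterFS
1. Concepts
- Using GlusterFS as persistent storage in Kubernetes through a StorageClass requires the Heketi tool
- Heketi is a GlusterFS management service with a RESTful interface; it serves as the external provisioner for Kubernetes storage
- Through that RESTful interface it manages GlusterFS, making it easy to create clusters and manage GlusterFS nodes, devices, and volumes
- Combined with k8s it can create PVs dynamically, adding dynamic management to GlusterFS storage. It mainly manages the life cycle of GlusterFS volumes; raw (unformatted) disk devices must be handed to it at initialization
- Every Kubernetes node needs the GlusterFS client packages, such as glusterfs-cli and glusterfs-fuse, which are used to mount volumes on each node
- Run modprobe dm_thin_pool on every Kubernetes node to load the kernel module
- Add the --allow-privileged=true flag to kube-apiserver to enable privileged mode; clusters built with this version of kubeadm already have it enabled by default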
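Whether the kernel module is already loaded can be checked before running modprobe. This is a minimal sketch that reads the module list from /proc/modules (first column is the module name); the function name module_loaded is hypothetical, and the optional second argument exists only so the check can be tried against a sample file.

```shell
#!/usr/bin/env bash
# module_loaded NAME [MODULES_FILE]
# Succeeds when NAME appears in the first column of the module list
# (defaults to /proc/modules on Linux).
module_loaded() {
    local name=$1 file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

if module_loaded dm_thin_pool 2>/dev/null; then
    echo "dm_thin_pool is loaded"
else
    echo "dm_thin_pool is not loaded; run: modprobe dm_thin_pool"
fi
```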
2. Heketi
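- Heketi can be deployed on a server of its own; here it runs on the k8s master node
- Heketi only accepts raw partitions or raw (unformatted) disks as devices; devices that carry a file system are not supported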
# hosts
vim /etc/hosts
192.168.100.155 g1
192.168.100.156 g2
192.168.100.157 g3
yum -y install centos-release-gluster
yum -y install heketi heketi-client
# Configure heketi.json
cd /etc/heketi/
cp heketi.json heketi.json.bak
# After editing (the # notes are annotations for this write-up; JSON has no comment syntax, so leave them out of the real file)
[root@master01 heketi]# cat heketi.json
{
"_port_comment": "Heketi Server Port Number",
"port": "18080", # the default port is 8080
"_use_auth": "Enable JWT authorization. Please enable for deployment",
"use_auth": true, # the default is false; set to true to require authentication
"_jwt": "Private keys for access",
"jwt": {
"_admin": "Admin has access to all APIs",
"admin": {
"key": "admin" # change this
},
"_user": "User only has access to /volumes endpoint",
"user": {
"key": "admin" # change this
}
},
"_glusterfs_comment": "GlusterFS Configuration",
"glusterfs": {
"_executor_comment": [
"Execute plugin. Possible choices: mock, ssh",
"mock: This setting is used for testing and development.",
" It will not send commands to any node.",
"ssh: This setting will notify Heketi to ssh to the nodes.",
" It will need the values in sshexec to be configured.",
"kubernetes: Communicate with GlusterFS containers over",
" Kubernetes exec api."
],
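# Three executors:
# mock: for testing; volumes created this way cannot be mounted
# ssh: Heketi runs commands on the GlusterFS nodes over SSH
# kubernetes: used when GlusterFS itself is deployed by Kubernetes
"executor": "ssh", # use ssh or kubernetes in production; set to ssh here
"_sshexec_comment": "SSH username and private key file information",
"sshexec": {
"keyfile": "/etc/heketi/heketi_key", # path to the private key
"user": "root", # the SSH user is root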
"port": "22",
"fstab": "/etc/fstab"
},
"_kubeexec_comment": "Kubernetes configuration",
"kubeexec": {
"host" :"https://kubernetes.host:8443",
"cert" : "/path/to/crt.file",
"insecure": false,
"user": "kubernetes username",
"password": "password for kubernetes user",
"namespace": "OpenShift project or Kubernetes namespace",
"fstab": "Optional: Specify fstab file on node. Default is /etc/fstab"
},
"_db_comment": "Database file name",
"db": "/var/lib/heketi/heketi.db",
"_loglevel_comment": [
"Set log level. Choices are:",
" none, critical, error, warning, info, debug",
"Default is warning"
],
# The shipped file sets debug; when the key is absent the default is warning
# Log messages go to /var/log/messages
"loglevel" : "warning"
}
}
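Since the listing above carries # annotations, it cannot be used as heketi.json as shown. One way to turn an annotated copy back into plain JSON is sketched below. It assumes no # character occurs inside any JSON string value, which holds for the file above; the function name strip_annotations is hypothetical.

```shell
#!/usr/bin/env bash
# strip_annotations
# Reads annotated JSON on stdin, removes trailing "# ..." notes and
# comment-only lines, and writes the result to stdout.
# Assumption: "#" never appears inside a JSON string in the input.
strip_annotations() {
    sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d'
}

strip_annotations <<'EOF'
{
"port": "18080", # the default port is 8080
# a comment-only line
"use_auth": true
}
EOF
```

A typical use would be strip_annotations < annotated.json > /etc/heketi/heketi.json, where annotated.json is whatever name the annotated copy was saved under.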
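# The ssh executor needs a key pair for passwordless login to every GlusterFS node
# (run this in /etc/heketi so the file matches the keyfile path in heketi.json;
# if the heketi service runs as a non-root user, that user must be able to read heketi_key)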
ssh-keygen -f heketi_key -t rsa -N ''
ssh-copy-id -i heketi_key.pub g1
ssh-copy-id -i heketi_key.pub g2
ssh-copy-id -i heketi_key.pub g3
# Enable and start Heketi
systemctl enable heketi.service;systemctl start heketi.service
# Verify that the service answers
curl 192.168.100.154:18080/hello
Hello from Heketi
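# Create a cluster. The two admin values are the credentials from heketi.json above; replace them with your own.
# This step is optional: topology load below creates a cluster of its own, which is why two clusters show up in cluster list later.
# The command prints output like the following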
heketi-cli --user admin --server http://192.168.100.154:18080 --secret admin --json cluster create
{"id":"5ff98e26472e3e1db21742bf5cd3ce46","nodes":[],"volumes":[],"block":true,"file":true,"blockvolumes":[]}
# Create topology.json; /dev/sdb is the unformatted disk on each node
cd /etc/heketi
vim topology.json
{
"clusters": [
{
"nodes": [
{
"node": {
"hostnames": {
"manage": [
"g1"
],
"storage": [
"192.168.100.155"
]
},
"zone": 1
},
"devices": [
"/dev/sdb"
]
},
{
"node": {
"hostnames": {
"manage": [
"g2"
],
"storage": [
"192.168.100.156"
]
},
"zone": 1
},
"devices": [
"/dev/sdb"
]
},
{
"node": {
"hostnames": {
"manage": [
"g3"
],
"storage": [
"192.168.100.157"
]
},
"zone": 1
},
"devices": [
"/dev/sdb"
]
}
]
}
]
}
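Writing the topology by hand gets repetitive as nodes are added. The sketch below generates the same structure (one cluster, one device per node, zone 1) as compact JSON from host=IP pairs. The function name gen_topology and the host=IP argument convention are hypothetical, not part of heketi-cli.

```shell
#!/usr/bin/env bash
# gen_topology DEVICE HOST=IP...
# Prints a single-cluster Heketi topology as compact JSON: each HOST becomes
# the "manage" name, each IP the "storage" address, every node is in zone 1.
gen_topology() {
    local device=$1
    shift
    local pair host ip sep=""
    printf '{"clusters":[{"nodes":['
    for pair in "$@"; do
        host=${pair%%=*}   # text before the first "="
        ip=${pair#*=}      # text after the first "="
        printf '%s{"node":{"hostnames":{"manage":["%s"],"storage":["%s"]},"zone":1},"devices":["%s"]}' \
            "$sep" "$host" "$ip" "$device"
        sep=","
    done
    printf ']}]}\n'
}

gen_topology /dev/sdb g1=192.168.100.155 g2=192.168.100.156 g3=192.168.100.157
```

Redirecting the output to /etc/heketi/topology.json yields a file equivalent to the hand-written one above.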
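# Heketi needs raw devices, but the disk was formatted for the GlusterFS test above, so it has to be reset first
# (unmount /seek on the client first if it is still mounted; a started volume must be stopped before it can be deleted)
gluster volume stop gv0
gluster volume delete gv0
# On all three nodes (also remove the /dev/sdb1 line added to /etc/fstab earlier)
umount /data/brick1
# Reset the device on all three nodes with mklabel msdos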
[root@g1 ~]# parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel msdos
Warning: The existing disk label on /dev/sdb will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? yes
(parted) quit
Information: You may need to update /etc/fstab.
# Run on all three nodes
mkfs.xfs -f /dev/sdb
pvcreate -ff --metadatasize=128M --dataalignment=256K /dev/sdb
# Load the topology into Heketi
[root@k8s /etc/heketi]# heketi-cli --server http://192.168.100.154:18080 --user admin --secret admin topology load --json=/etc/heketi/topology.json
Found node g1 on cluster 2d8a5e1a487250410d393e8bbefd43a7
Adding device /dev/sdb ... OK
Found node g2 on cluster 2d8a5e1a487250410d393e8bbefd43a7
Adding device /dev/sdb ... OK
Found node g3 on cluster 2d8a5e1a487250410d393e8bbefd43a7
Adding device /dev/sdb ... OK
# List the clusters
heketi-cli --server http://192.168.100.154:18080 --user admin --secret admin cluster list
Clusters:
Id:2d8a5e1a487250410d393e8bbefd43a7 [file][block]
Id:5ff98e26472e3e1db21742bf5cd3ce46 [file][block]
# List the nodes
heketi-cli --server http://192.168.100.154:18080 --user admin --secret admin node list
Id:8d2f3e7fa8542db11f1e64f37fe94cac Cluster:2d8a5e1a487250410d393e8bbefd43a7
Id:ac85f20ff1dfd9e1c9f436f820ac193f Cluster:2d8a5e1a487250410d393e8bbefd43a7
Id:f7c789945e0afc0c51bff0b4944925c1 Cluster:2d8a5e1a487250410d393e8bbefd43a7
# Shows the Cluster Id, which k8s references in the next step
heketi-cli --user admin --secret admin topology info --server http://192.168.100.154:18080
Cluster Id: 2d8a5e1a487250410d393e8bbefd43a7
File: true
Block: true
Volumes:
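The Cluster Id can be pulled out of the topology info output with a one-line filter instead of being copied by hand. This sketch assumes the "Cluster Id: <id>" line format shown above; the function name cluster_id is hypothetical.

```shell
#!/usr/bin/env bash
# cluster_id
# Reads "heketi-cli topology info" output on stdin and prints the first Cluster Id.
cluster_id() {
    awk '/^Cluster Id:/ { print $3; exit }'
}

# Sample input copied from the output above. Against a live server:
#   heketi-cli --user admin --secret admin --server http://192.168.100.154:18080 topology info | cluster_id
cluster_id <<'EOF'
Cluster Id: 2d8a5e1a487250410d393e8bbefd43a7
File: true
Block: true
EOF
```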
3. Using it from k8s
- Every k8s node needs the GlusterFS client installed
yum -y install glusterfs-fuse
# Create the Secret and the StorageClass. Here Heketi and k8s share one node; separate hosts are preferable
[root@k8s /etc/heketi]# echo -n "admin"|base64
YWRtaW4=
- Secret for Heketi authentication
- vim heketi-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: heketi-secret
  namespace: default
data:
  # base64 encoded password. E.g.: echo -n "mypassword" | base64
  key: YWRtaW4=
type: kubernetes.io/glusterfs
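The base64 step and the manifest can be combined so the encoded key never has to be pasted by hand. A minimal sketch, with the hypothetical function name make_secret; it prints the manifest to stdout and leaves applying it to the reader.

```shell
#!/usr/bin/env bash
# make_secret NAME NAMESPACE PASSWORD
# Prints a Secret manifest of type kubernetes.io/glusterfs whose "key"
# field is the base64 encoding of PASSWORD (no trailing newline encoded).
make_secret() {
    local name=$1 ns=$2 key
    key=$(printf '%s' "$3" | base64)
    cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: $name
  namespace: $ns
data:
  key: $key
type: kubernetes.io/glusterfs
EOF
}

make_secret heketi-secret default admin
```

The output can be piped straight into kubectl apply -f - or saved as heketi-secret.yaml.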
- vim heketi-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gluster-heketi-storageclass
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete
parameters:
  resturl: "http://192.168.100.154:18080"
  restauthenabled: "true"
  restuser: "admin"
  secretNamespace: "default"
  secretName: "heketi-secret"
  volumetype: "replicate:3"
---
# Variant that pins volumes to one Heketi cluster through clusterid; use one of the two definitions, not both
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gluster-heketi-storageclass
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete
parameters:
  resturl: "http://192.168.100.154:18080"
  clusterid: "2d8a5e1a487250410d393e8bbefd43a7"
  restauthenabled: "true"
  restuser: "admin"
  secretNamespace: "default"
  secretName: "heketi-secret"
  volumetype: "replicate:3"
- Verification
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gluster-heketi-storageclass
      resources:
        requests:
          storage: 1G
- ZooKeeper cluster (resource-intensive)
apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  selector:
    matchLabels:
      app: zk
  maxUnavailable: 1
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  selector:
    matchLabels:
      app: zk
  serviceName: zk-hs
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: zk
    spec:
      tolerations:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - zk
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: IfNotPresent
        image: mirrorgooglecontainers/kubernetes-zookeeper:1.0-3.4.10
        resources:
          requests:
            memory: "1G"
            cpu: "0.5"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=3 \
          --data_dir=/var/lib/zookeeper/data \
          --data_log_dir=/var/lib/zookeeper/data/log \
          --conf_dir=/opt/zookeeper/conf \
          --client_port=2181 \
          --election_port=3888 \
          --server_port=2888 \
          --tick_time=2000 \
          --init_limit=10 \
          --sync_limit=5 \
          --heap=512M \
          --max_client_cnxns=60 \
          --snap_retain_count=3 \
          --purge_interval=12 \
          --max_session_timeout=40000 \
          --min_session_timeout=4000 \
          --log_level=INFO"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gluster-heketi-storageclass
      resources:
        requests:
          storage: 5G
# Taking the nginx example above
[root@k8s ~/glusterfs]# kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-0 1/1 Running 0 62s
nginx-1 1/1 Running 0 50s
nginx-2 1/1 Running 0 43s
[root@k8s ~/glusterfs]# kubectl exec -ti nginx-0 -- df -h
Filesystem Size Used Avail Use% Mounted on
overlay 20G 4.5G 16G 23% /
tmpfs 64M 0 64M 0% /dev
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/sda1 20G 4.5G 16G 23% /etc/hosts
shm 64M 0 64M 0% /dev/shm
192.168.100.157:vol_0a072c68764078d2389be28ee4598bb9 1014M 43M 972M 5% /usr/share/nginx/html
tmpfs 2.0G 12K 2.0G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 2.0G 0 2.0G 0% /proc/acpi
tmpfs 2.0G 0 2.0G 0% /proc/scsi
tmpfs 2.0G 0 2.0G 0% /sys/firmware
# Check from any GlusterFS node; the output shows where the bricks are stored on each node
[root@g1 ~]# gluster volume status vol_0a072c68764078d2389be28ee4598bb9
Status of volume: vol_0a072c68764078d2389be28ee4598bb9
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 192.168.100.155:/var/lib/heketi/mount
s/vg_6571d6295a5dfedffe38e8277715a0f0/brick
_13692d64dc8f7debbaaf1ca4692fc81d/brick 49156 0 Y 38860
Brick 192.168.100.157:/var/lib/heketi/mount
s/vg_16efa0044cb89c7a94af43632e4ad883/brick
_7468d1b2b30f56026a9f8352903c6998/brick 49156 0 Y 37814
Brick 192.168.100.156:/var/lib/heketi/mount
s/vg_f2c7e0d58b71681158503e0266892ef7/brick
_d8d641412d91d7059fc6ac539ce47362/brick 49156 0 Y 38692
Self-heal Daemon on localhost N/A N/A Y 38877
Self-heal Daemon on g2 N/A N/A Y 38717
Self-heal Daemon on g3 N/A N/A Y 37831
Task Status of Volume vol_0a072c68764078d2389be28ee4598bb9
------------------------------------------------------------------------------
There are no active volume tasks
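In the status output above, long brick paths are wrapped across several lines, which makes them awkward to copy. The sketch below joins the fragments back together. It assumes the column layout shown above (a wrapped path ends on the line that also carries the four port, online and pid columns); the function name brick_paths is hypothetical.

```shell
#!/usr/bin/env bash
# brick_paths
# Reads "gluster volume status <vol>" output on stdin and prints one
# complete "host:/path" per brick, rejoining paths wrapped across lines.
brick_paths() {
    awk '
        /^Brick / {
            path = $2
            collecting = (NF < 6)      # fewer than 6 fields: the path continues
            if (!collecting) print path
            next
        }
        collecting {
            path = path $1             # append the next fragment
            if (NF >= 5) {             # fragment plus the 4 status columns
                print path
                collecting = 0
            }
        }
    '
}

# Sample input copied from the output above. Against a live node:
#   gluster volume status vol_0a072c68764078d2389be28ee4598bb9 | brick_paths
brick_paths <<'EOF'
Brick 192.168.100.155:/var/lib/heketi/mount
s/vg_6571d6295a5dfedffe38e8277715a0f0/brick
_13692d64dc8f7debbaaf1ca4692fc81d/brick 49156 0 Y 38860
Self-heal Daemon on localhost N/A N/A Y 38877
EOF
```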