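Scheduling Tools: DolphinScheduler

Preface

As the number of programs we develop grows, scheduling tasks and managing the dependencies between them becomes a real headache. A small number of tasks can be run on a timer with crontab, which ships with Linux, but the drawbacks are obvious: it is not visual, and edits are tedious and error-prone. This is where a scheduling tool helps. I don't know which ones you have used; I have worked with Airflow, Oozie and Kyligence, but the one I want to recommend today is DolphinScheduler. Below is a brief introduction to the tool, starting with installation and deployment.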

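I. What is DolphinScheduler?

DolphinScheduler is a scheduling tool developed in China, and it fits the working habits of Chinese users well. It supports a wide range of task types, including the common Spark, Flink, SQL, Shell, Python, DataX, Sqoop, SeaTunnel and Dinky, so its coverage is fairly complete. Beyond task scheduling it also offers resource management, multi-tenancy and other features, which is enough for most small and medium-sized companies.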

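II. Installation and Deployment

1. Environment preparation

DolphinScheduler uses ZooKeeper as its registry center, so ZooKeeper must be installed before deploying DolphinScheduler. The installation environment also needs a JDK. The installation steps for both are covered in my earlier articles and are not repeated here.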
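
Before going further, it can help to confirm that both prerequisites are actually in place. The following is a minimal sketch, not part of DolphinScheduler: the function names are made up for illustration, the host names ds1, ds2, ds3 and port 2181 are taken from the sample configuration later in this article, and the port check assumes bash (for /dev/tcp) and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: confirm a JDK is on PATH and the ZooKeeper port answers.
# Function names are hypothetical; hosts and port come from the sample config.

# Return 0 if the given command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and the coreutils `timeout` command.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Check java first, then every ZooKeeper host passed as an argument.
preflight() {
  local host
  have_cmd java || { echo "java not found on PATH"; return 1; }
  for host in "$@"; do
    port_open "$host" 2181 || { echo "ZooKeeper not reachable on $host:2181"; return 1; }
  done
  echo "preflight ok"
}

# Usage (not run here): preflight ds1 ds2 ds3
```

An open port only shows that something is listening, so this is a quick sanity check rather than proof that ZooKeeper is healthy.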

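2. Download the package

Open the DolphinScheduler download page at https://dlcdn.apache.org/dolphinscheduler/, pick a version and click apache-dolphinscheduler-xxx-bin.tar.gz to go to the download page. The latest version at the time of writing is 3.2.0, but I still recommend 3.1.8, so everything below is based on 3.1.8.
After downloading the package, run the following commands to unpack and rename it: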

tar -zxvf apache-dolphinscheduler-3.1.8-bin.tar.gz
mv apache-dolphinscheduler-3.1.8-bin dolphinscheduler-3.1.8
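
If you prefer to script the download, the file name can be derived from the version number. This is a sketch only: the helper names are hypothetical, and the archive.apache.org location is my assumption about where older releases are kept, so check the download page before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: derive the package name and a download URL for a given release.
# The archive URL layout is an assumption, not taken from the article.

# Print the binary package name, e.g. apache-dolphinscheduler-3.1.8-bin.tar.gz
ds_tarball() {
  echo "apache-dolphinscheduler-$1-bin.tar.gz"
}

# Print the assumed archive URL for that package.
ds_archive_url() {
  echo "https://archive.apache.org/dist/dolphinscheduler/$1/$(ds_tarball "$1")"
}

# Usage (not run here):
#   version=3.1.8
#   wget "$(ds_archive_url "$version")"
#   tar -zxvf "$(ds_tarball "$version")"
#   mv "apache-dolphinscheduler-$version-bin" "dolphinscheduler-$version"
```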

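3. Modify the configuration

Go to the dolphinscheduler-3.1.8/bin/env directory of the unpacked package and run vim dolphinscheduler_env.sh to configure DolphinScheduler's data source, the ZooKeeper connection string, and the installation directories of Spark, Flink, DataX and SeaTunnel.
Note: adjust the values to match your own environment.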

export JAVA_HOME=${JAVA_HOME:-"/usr/java/jdk1.8.0_181-cloudera"}

# Database related configuration, set database type, username and password
export DATABASE=${DATABASE:-"mysql"}
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL=${SPRING_DATASOURCE_URL:-"jdbc:mysql://ds1:3306/dolphinscheduler?useSSL=false"}
export SPRING_DATASOURCE_USERNAME=${SPRING_DATASOURCE_USERNAME:-"root"}
export SPRING_DATASOURCE_PASSWORD=${SPRING_DATASOURCE_PASSWORD:-"*****"}

# DolphinScheduler server related configuration
export SPRING_CACHE_TYPE=${SPRING_CACHE_TYPE:-none}
export SPRING_JACKSON_TIME_ZONE=${SPRING_JACKSON_TIME_ZONE:-"Asia/Shanghai"}
export MASTER_FETCH_COMMAND_NUM=${MASTER_FETCH_COMMAND_NUM:-10}

# Registry center configuration, determines the type and link of the registry center
export REGISTRY_TYPE=${REGISTRY_TYPE:-zookeeper}
export REGISTRY_ZOOKEEPER_CONNECT_STRING=${REGISTRY_ZOOKEEPER_CONNECT_STRING:-ds1:2181,ds2:2181,ds3:2181}

# Tasks related configurations, need to change the configuration if you use the related tasks.
export HADOOP_HOME=${HADOOP_HOME:-"/application/hadoop"}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/application/hadoop/etc/hadoop"}
export SPARK_HOME2=${SPARK_HOME2:-"/application/spark"}
export PYTHON_HOME=${PYTHON_HOME:-"/usr/bin/python"}
export HIVE_HOME=${HIVE_HOME:-"/application/hive"}
export FLINK_HOME=${FLINK_HOME:-"/application/flink"}
export DATAX_HOME=${DATAX_HOME:-"/opt/soft/datax"}
export SEATUNNEL_HOME=${SEATUNNEL_HOME:-"/application/seatunnel"}
export CHUNJUN_HOME=${CHUNJUN_HOME:-/opt/soft/chunjun}

export PATH=$HADOOP_HOME/bin:$SPARK_HOME2/bin:$PYTHON_HOME/bin:$JAVA_HOME/bin:$HIVE_HOME/bin:$FLINK_HOME/bin:$DATAX_HOME/bin:$SEATUNNEL_HOME/bin:$CHUNJUN_HOME/bin:$PATH
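
Every line in the file above uses the shell form ${VAR:-default}. The value in the file is only a fallback: if the variable is already set in the environment when the script is sourced, the existing value wins. A small demonstration, using "postgresql" purely as an example of a preset value:

```shell
#!/usr/bin/env bash
# Sketch of how the ${VAR:-default} pattern in dolphinscheduler_env.sh behaves.

# Case 1: the variable is unset, so the default from the file is used.
unset DATABASE
export DATABASE=${DATABASE:-"mysql"}
first="$DATABASE"

# Case 2: the variable is already set, so the default is ignored.
DATABASE="postgresql"
export DATABASE=${DATABASE:-"mysql"}
second="$DATABASE"

echo "unset -> $first, preset -> $second"
```

This is worth keeping in mind when an edit to the file seems to have no effect: a value exported elsewhere, for example in the deploy user's profile, may be taking precedence.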

Then run vim install_env.sh to set the hosts for the master, worker, API server and alert server, the installation path on the servers, the deployment user, and the ZooKeeper registry root path.

ips=${ips:-"ds1,ds2,ds3"}

# Port of SSH protocol, default value is 22. For now we only support same port in all `ips` machine
# modify it if you use different ssh port
sshPort=${sshPort:-"22"}

# A comma separated list of machine hostname or IP would be installed Master server, it
# must be a subset of configuration `ips`.
# Example for hostnames: masters="ds1,ds2", Example for IPs: masters="192.168.8.1,192.168.8.2"
masters=${masters:-"ds1,ds2,ds3"}

# A comma separated list of machine <hostname>:<workerGroup> or <IP>:<workerGroup>.All hostname or IP must be a
# subset of configuration `ips`, And workerGroup have default value as `default`, but we recommend you declare behind the hosts
# Example for hostnames: workers="ds1:default,ds2:default,ds3:default", Example for IPs: workers="192.168.8.1:default,192.168.8.2:default,192.168.8.3:default"
workers=${workers:-"ds1:default,ds2:default,ds3:default"}

# A comma separated list of machine hostname or IP would be installed Alert server, it
# must be a subset of configuration `ips`.
# Example for hostname: alertServer="ds3", Example for IP: alertServer="192.168.8.3"
alertServer=${alertServer:-"ds3"}

# A comma separated list of machine hostname or IP would be installed API server, it
# must be a subset of configuration `ips`.
# Example for hostname: apiServers="ds1", Example for IP: apiServers="192.168.8.1"
apiServers=${apiServers:-"ds2"}

# The directory to install DolphinScheduler for all machine we config above. It will automatically be created by `install.sh` script if not exists.
# Do not set this configuration same as the current path (pwd). Do not add quotes to it if you using related path.
installPath=${installPath:-"/application/dolphinscheduler"}

# The user to deploy DolphinScheduler for all machine we config above. For now user must create by yourself before running `install.sh`
# script. The user needs to have sudo privileges and permissions to operate hdfs. If hdfs is enabled than the root directory needs
# to be created by this user
deployUser=${deployUser:-"root"}

# The root of zookeeper, for now DolphinScheduler default registry server is zookeeper.
zkRoot=${zkRoot:-"/dolphinscheduler"}
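
The comments in install_env.sh require every role list to be a subset of ips. A mistyped host name is easy to miss, so the rule can be checked before running the installer. This is a sketch with a hypothetical helper name; it assumes the lists contain no spaces.

```shell
#!/usr/bin/env bash
# Sketch: check that every host in a role list also appears in `ips`.
# Worker entries may carry a ":<workerGroup>" suffix, which is stripped first.

# hosts_in_ips "<comma list>" "<ips comma list>"
# Prints each host missing from ips and returns 1 if any are missing.
hosts_in_ips() {
  local list="$1" ips="$2" entry host missing=0
  local IFS=','
  for entry in $list; do
    host="${entry%%:*}"            # drop the optional :workerGroup suffix
    case ",$ips," in
      *",$host,"*) ;;              # host is present in ips
      *) echo "not in ips: $host"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Usage (not run here), after sourcing install_env.sh:
#   hosts_in_ips "$masters" "$ips" && hosts_in_ips "$workers" "$ips" \
#     && hosts_in_ips "$alertServer" "$ips" && hosts_in_ips "$apiServers" "$ips"
```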

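Go to dolphinscheduler-3.1.8/api-server/conf in the unpacked package and run vim common.properties to edit the resource storage settings. The common.properties under dolphinscheduler-3.1.8/worker-server/conf needs the same changes.
Note: this configures the resource center, where configuration files and jar packages are uploaded. The settings to pay attention to are data.basedir.path, resource.storage.type, resource.storage.upload.base.path, resource.hdfs.root.user and resource.hdfs.fs.defaultFS. Other settings can be changed as needed or left at their defaults.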

data.basedir.path=/application/data

# resource view suffixs
#resource.view.suffixs=txt,log,sh,bat,conf,cfg,py,java,sql,xml,hql,properties,json,yml,yaml,ini,js

# resource storage type: HDFS, S3, OSS, NONE
resource.storage.type=HDFS
# resource store on HDFS/S3 path, resource file will store to this base path, self configuration, please make sure the directory exists on hdfs and have read write permissions. "/dolphinscheduler" is recommended
resource.storage.upload.base.path=/dolphinscheduler

# The AWS access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.access.key.id=minioadmin
# The AWS secret access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.secret.access.key=minioadmin
# The AWS Region to use. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.region=cn-north-1
# The name of the bucket. You need to create them by yourself. Otherwise, the system cannot start. All buckets in Amazon S3 share a single namespace; ensure the bucket is given a unique name.
resource.aws.s3.bucket.name=dolphinscheduler
# You need to set this parameter when private cloud s3. If S3 uses public cloud, you only need to set resource.aws.region or set to the endpoint of a public cloud such as S3.cn-north-1.amazonaws.com.cn
resource.aws.s3.endpoint=http://localhost:9000

# alibaba cloud access key id, required if you set resource.storage.type=OSS
resource.alibaba.cloud.access.key.id=<your-access-key-id>
# alibaba cloud access key secret, required if you set resource.storage.type=OSS
resource.alibaba.cloud.access.key.secret=<your-access-key-secret>
# alibaba cloud region, required if you set resource.storage.type=OSS
resource.alibaba.cloud.region=cn-hangzhou
# oss bucket name, required if you set resource.storage.type=OSS
resource.alibaba.cloud.oss.bucket.name=dolphinscheduler
# oss bucket endpoint, required if you set resource.storage.type=OSS
resource.alibaba.cloud.oss.endpoint=https://oss-cn-hangzhou.aliyuncs.com
# if resource.storage.type=HDFS, the user must have the permission to create directories under the HDFS root path
resource.hdfs.root.user=root
# if resource.storage.type=S3, the value like: s3a://dolphinscheduler; if resource.storage.type=HDFS and namenode HA is enabled, you need to copy core-site.xml and hdfs-site.xml to conf dir
resource.hdfs.fs.defaultFS=hdfs://ds1:8020

# whether to startup kerberos
hadoop.security.authentication.startup.state=false

# java.security.krb5.conf path
java.security.krb5.conf.path=/opt/krb5.conf

# login user from keytab username
login.user.keytab.username=hdfs-mycluster@ESZ.COM

# login user from keytab path
login.user.keytab.path=/opt/hdfs.headless.keytab

# kerberos expire time, the unit is hour
kerberos.expire.time=2


# resourcemanager port, the default value is 8088 if not specified
resource.manager.httpaddress.port=8088
# if resourcemanager HA is enabled, please set the HA IPs; if resourcemanager is single, keep this value empty
yarn.resourcemanager.ha.rm.ids=ds1
# if resourcemanager HA is enabled or not use resourcemanager, please keep the default value; If resourcemanager is single, you only need to replace ds1 to actual resourcemanager hostname
yarn.application.status.address=http://ds1:%s/ws/v1/cluster/apps/%s
# job history status url when application number threshold is reached(default 10000, maybe it was set to 1000)
yarn.job.history.status.address=http://ds1:19888/ws/v1/history/mapreduce/jobs/%s
# datasource encryption enable
datasource.encryption.enable=false

# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality option
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar

#data-quality.error.output.path=/tmp/data-quality-error-data

# Network IP gets priority, default inner outer

# Whether hive SQL is executed in the same session
support.hive.oneSession=true

# use sudo or not, if set true, executing user is tenant user and deploy user needs sudo permissions; if set false, executing user is the deploy user and doesn't need sudo permissions
sudo.enable=true
setTaskDirToTenant.enable=false
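
Because the same file has to be edited in two places, the api-server and worker-server copies can drift apart. The sketch below reads a key from a properties file and reports keys that differ between two files. The helper names are hypothetical, and it only handles plain key=value lines with no leading spaces.

```shell
#!/usr/bin/env bash
# Sketch: compare selected keys between two common.properties files.

# Print the value of the last uncommented "key=value" line for the given key.
get_prop() {
  awk -F= -v k="$2" '$1 == k { sub(/^[^=]*=/, ""); v = $0 } END { print v }' "$1"
}

# diff_props <file_a> <file_b> <key>...
# Prints each key whose value differs and returns 1 if any differ.
diff_props() {
  local a="$1" b="$2" key rc=0
  shift 2
  for key in "$@"; do
    if [ "$(get_prop "$a" "$key")" != "$(get_prop "$b" "$key")" ]; then
      echo "mismatch: $key"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (not run here), from the directory containing dolphinscheduler-3.1.8:
#   diff_props dolphinscheduler-3.1.8/api-server/conf/common.properties \
#              dolphinscheduler-3.1.8/worker-server/conf/common.properties \
#              data.basedir.path resource.storage.type \
#              resource.storage.upload.base.path resource.hdfs.root.user \
#              resource.hdfs.fs.defaultFS
```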

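4. Initialize the metadata

My metadata store is MySQL, so the MySQL driver first has to be copied into the libs directory of every DolphinScheduler module: api-server/libs, alert-server/libs, master-server/libs, worker-server/libs and tools/libs.
The dolphinscheduler database must be created in MySQL first. If you want a dedicated user, grant that user privileges. The commands are as follows.
Note: MySQL 5 and MySQL 8 differ in syntax, so adjust for your version. The example below is for MySQL 8.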

CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE USER '{user}'@'%' IDENTIFIED BY '{password}';
GRANT ALL PRIVILEGES ON dolphinscheduler.* TO '{user}'@'%';
CREATE USER '{user}'@'localhost' IDENTIFIED BY '{password}';
GRANT ALL PRIVILEGES ON dolphinscheduler.* TO '{user}'@'localhost';
FLUSH PRIVILEGES;
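
The driver-copy step described at the start of this section can be done with a short loop instead of five separate commands. This is a sketch: the function name is hypothetical and the jar path in the usage note is a placeholder for whichever driver file you downloaded.

```shell
#!/usr/bin/env bash
# Sketch: copy a MySQL JDBC driver jar into the libs directory of each module.
# Nothing here downloads the driver; pass the path to a jar you already have.

# copy_driver <path-to-jar> <dolphinscheduler-home>
copy_driver() {
  local jar="$1" ds_home="$2" module
  for module in api-server alert-server master-server worker-server tools; do
    cp "$jar" "$ds_home/$module/libs/" || return 1
  done
}

# Usage (not run here; the jar name is a placeholder):
#   copy_driver /path/to/mysql-connector.jar dolphinscheduler-3.1.8
```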

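Go to the dolphinscheduler-3.1.8 directory and run bash tools/bin/upgrade-schema.sh to initialize the schema, watching the output for errors. Once initialization succeeds, run bin/install.sh. DolphinScheduler installs itself to the installation path set in the configuration file and copies the installation to the configured worker-server nodes. Then open http://ds2:12345/dolphinscheduler/ui to check that you can log in (ds2 here is the API server defined in the configuration file). If the login page appears, startup succeeded. The initial username and password are admin/dolphinscheduler123.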
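
The services can take a little while to come up after install.sh finishes, so instead of refreshing the browser you can poll the UI address from the command line. This is a sketch that assumes curl is installed; the helper names are hypothetical, and the host and port follow the setup in this article.

```shell
#!/usr/bin/env bash
# Sketch: build the UI address and poll it until it answers.

# ds_ui_url <api-server-host> [port]
ds_ui_url() {
  local host="$1" port="${2:-12345}"
  echo "http://$host:$port/dolphinscheduler/ui"
}

# wait_for_ui <api-server-host> [tries]
# Polls every 5 seconds, up to `tries` times (default 30); returns 0 on success.
wait_for_ui() {
  local url tries="${2:-30}" i=0
  url="$(ds_ui_url "$1")"
  while [ "$i" -lt "$tries" ]; do
    curl -fs -o /dev/null "$url" && return 0
    i=$((i + 1))
    sleep 5
  done
  return 1
}

# Usage (not run here), from the dolphinscheduler-3.1.8 directory:
#   bash tools/bin/upgrade-schema.sh && bash bin/install.sh && wait_for_ui ds2
```
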
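After logging in, first create a project.

After creating the project, click its name to enter the workflow definition page.

Click Workflow Definition and create a workflow. The list on the left contains the task types you can drag onto the canvas. Taking a Shell task as an example, enter the node name and the script, then click Save.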
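
For a first test, the script of the Shell task can be very small. The sketch below is a made-up job that only prints a dated marker in place of a real command. As far as I understand, the task's status follows the script's exit code, so `set -e` is used to stop at the first failing command.

```shell
#!/usr/bin/env bash
# Sample body for a Shell task node. The job is hypothetical: it prints a
# dated marker where a real command (spark-submit, datax, ...) would go.
set -eu

# Print which date partition the job would process.
run_job() {
  local day="$1"
  echo "processing partition dt=$day"
}

# Use the first argument as the date, defaulting to today (YYYY-MM-DD).
run_job "${1:-$(date +%F)}"
```
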
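After saving, a dialog asks for the workflow name, tenant, execution policy and so on. Click Confirm and the workflow definition is complete.

The buttons to the right of the workflow are Edit, Run, Timing, Online, Copy, Timing Management, Workflow Tree View, Export and Version Info. The workflow must be brought online before it can be run.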
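When you click Run, a dialog appears with the notification strategy, process priority, group, environment name and other options, which you can set as needed. Click Confirm to run the workflow.

The status of the run can be viewed under Workflow Instances.

Under Task Instances you can view the logs of the task instances in the workflow.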
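Once the task has run successfully, use the Timing function in the workflow definition to set a time and frequency for automatic runs. After clicking Confirm, you still need to open Timing Management in the workflow definition and bring the new schedule online. Only then is the workflow's schedule in effect.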

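Summary

I have been using DolphinScheduler for a while now, from the earlier 2.7 to the current 3.x releases, and the deployment approach has changed somewhat. In the earlier 2.x versions all modules were bundled together; from 3.0 onward api-server, worker-server, master-server and alert-server are separate. With a scheduling platform in place, deploying the Spark and Flink jobs you write becomes much more visual, and you no longer have to check tasks one by one on the servers. For reasons of space I have not shown resource management (uploading scripts, program jar packages and so on), data source configuration, data quality and other features. For the details, download and deploy it and try the features yourself.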