Follow my CSDN blog: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/132575709
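OpenFold Multimer is a computational method for predicting the structures of protein multimers. It builds on OpenFold's monomer prediction framework and uses deep learning, combining sequence, evolutionary, and interaction information, to infer the interaction interfaces and spatial arrangement between proteins. OpenFold Multimer handles different kinds of multimers, including homodimers, heterodimers, homo-oligomers, and hetero-oligomers. Its advantage is that it can produce high-quality multimer structure predictions without any experimental data or templates.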
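Project: GitHub: aqlaboratory/openfold
Related articles:
- Feature preprocessing in the OpenFold Multimer training process for protein structure prediction
- Template logic in the open-source framework OpenFold and template search with HHsearch
- Protein complex structure prediction and bug fixes with the open-source framework OpenFold Multimer
- Evaluating fine-tuned models and inference logic trained with the open-source framework OpenFold
- Environment setup for OpenFold, the open-source trainable protein structure prediction framework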
1. Preprocessing
Prepare the MSA files from a completed search, using the script scripts/precompute_alignments.py.
1.1 Prepare mmcif_cache.json
Use the scripts/generate_mmcif_cache.py script to build the cache of the mmCIF files:
nohup python3 -u scripts/generate_mmcif_cache.py [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/ mmcif_cache.json --no_workers 128 > nohup.mmcif_cache.out &
tail -f nohup.mmcif_cache.out
Here, generate_mmcif_cache.py takes about 40 min to run, and the resulting mmcif_cache.json is about 252 MB. The output mmcif_cache.json holds per-entry PDB information, for example:
{
"4ewn": {
"release_date": "2012-12-05",
"chain_ids": ["D"],
"seqs": [
"MLAKRI..."
],
"no_chains": 1,
"resolution": 1.9
},
"5m9r": {
"release_date": "2017-02-22",
"chain_ids": ["A", "B"],
"seqs": [
"MQDNS...",
"MQDNS..."
],
"no_chains": 2,
"resolution": 1.44
},
# ...
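Once the cache exists, it can be inspected with a few lines of Python. The following is a minimal sketch, assuming the structure shown in the sample above (PDB ID mapped to release_date, chain_ids, seqs, no_chains, and resolution). The helper name filter_entries and the thresholds are hypothetical and not part of OpenFold.

```python
import json
from datetime import date


def filter_entries(cache, max_release_date, max_resolution):
    """Return sorted PDB IDs released on or before the cutoff date
    whose resolution is at or below max_resolution."""
    cutoff = date.fromisoformat(max_release_date)
    selected = []
    for pdb_id, entry in cache.items():
        released = date.fromisoformat(entry["release_date"])
        if released <= cutoff and entry["resolution"] <= max_resolution:
            selected.append(pdb_id)
    return sorted(selected)


# Two entries copied from the sample above (sequences truncated as in the article).
# For the real file: with open("mmcif_cache.json") as f: cache = json.load(f)
sample = json.loads("""
{
  "4ewn": {"release_date": "2012-12-05", "chain_ids": ["D"],
           "seqs": ["MLAKRI..."], "no_chains": 1, "resolution": 1.9},
  "5m9r": {"release_date": "2017-02-22", "chain_ids": ["A", "B"],
           "seqs": ["MQDNS...", "MQDNS..."], "no_chains": 2, "resolution": 1.44}
}
""")

print(filter_entries(sample, "2015-01-01", 2.5))  # prints ['4ewn']
```

This sketch assumes every entry has a numeric resolution; entries in the real cache that lack one would need extra handling.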
1.2 Prepare chain_data_cache.json
Use the scripts/generate_chain_data_cache.py script to build the cache of per-chain mmCIF data:
nohup python3 -u scripts/generate_chain_data_cache.py [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/ chain_data_cache.json --cluster_file clusters-by-entity-40.txt --no_workers 128 > nohup.chain_data_cache.out &
tail -f nohup.chain_data_cache.out
Here, generate_chain_data_cache.py takes about 2 h to run, and the resulting chain_data_cache.json is about 292 MB. The output chain_data_cache.json holds per-chain information, for example:
{
"1p2g_A": {
"release_date": "2003-09-02",
"seq": "SRPLS...",
"resolution": 2.3,
"cluster_size": -1
},
"7u5p_A": {
"release_date": "2022-06-22",
"seq": "MGAAA...",
"resolution": 3.14,
"cluster_size": -1
},
# ...
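The per-chain cache can be summarized in the same way. The sketch below assumes keys of the form "pdbid_chain" and the fields shown in the sample above; the helper names are hypothetical. It also counts entries with cluster_size of -1, which presumably marks chains for which no cluster size was found in the cluster file (an assumption based on the sample, not confirmed here).

```python
from collections import defaultdict


def group_chains_by_pdb(chain_cache):
    """Map each PDB ID to the list of chain IDs present in the cache."""
    grouped = defaultdict(list)
    for key in chain_cache:
        # Split on the last underscore: "1p2g_A" -> ("1p2g", "A")
        pdb_id, chain_id = key.rsplit("_", 1)
        grouped[pdb_id].append(chain_id)
    return dict(grouped)


def count_unclustered(chain_cache):
    """Count chains whose cluster_size is -1."""
    return sum(1 for entry in chain_cache.values() if entry["cluster_size"] == -1)


# Entries copied from the sample above (sequences truncated as in the article).
sample = {
    "1p2g_A": {"release_date": "2003-09-02", "seq": "SRPLS...",
               "resolution": 2.3, "cluster_size": -1},
    "7u5p_A": {"release_date": "2022-06-22", "seq": "MGAAA...",
               "resolution": 3.14, "cluster_size": -1},
}

print(group_chains_by_pdb(sample))
print(count_unclustered(sample))
```

If most chains in the real cache report -1, it may be worth checking that the file passed to --cluster_file matches the chain naming used in the cache.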
2. Configure the training script
The basic invocation of the training script train_openfold.py:
# In multi-GPU settings, the seed must be specified.
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
2021-10-10 \
--template_release_dates_cache_path mmcif_cache.json \
--precision bf16 \
--gpus 8 \
--replace_sampler_ddp=True \
--seed 4242022 \
--deepspeed_config_path deepspeed_config.json \
--checkpoint_every_epoch \
--resume_from_ckpt ckpt_dir/ \
--train_chain_data_cache_path chain_data_cache.json \
--obsolete_pdbs_file_path obsolete.dat
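The parameters are:
- mmcif_dir: [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/
- alignment_dir: the folder of alignment (feature) files
- template_mmcif_dir: [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/
- output_dir: the output folder
- max_template_date: the template cutoff date, 2021-10-10 in this example
- template_release_dates_cache_path: the mmcif_cache.json produced during preprocessing
- precision: the numeric precision
- gpus: the number of GPUs
- replace_sampler_ddp: a trainer option, set to True here
- seed: the random seed
- deepspeed_config_path: the DeepSpeed configuration; use the one that ships with the project
- checkpoint_every_epoch: save a checkpoint after every epoch
- resume_from_ckpt: resume training from a checkpoint; not needed for the first run
- train_chain_data_cache_path: the chain_data_cache.json produced during preprocessing
- obsolete_pdbs_file_path: [your folder]/af2-data-v230/pdb_mmcif/obsolete.dat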
Here, obsolete.dat lists obsolete PDB entries and maps them to their successors, for example:
LIST OF OBSOLETE COORDINATE ENTRIES AND SUCCESSORS
OBSLTE 31-JUL-94 116L 216L
OBSLTE 15-APR-98 125D 1AW6
OBSLTE 20-SEP-99 14PS 1QJB
OBSLTE 30-OCT-78 151C 251C
OBSLTE 15-JAN-91 156B 256B
# ...
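The layout above is simple enough to parse by hand. The following is a rough sketch, assuming whitespace-separated fields in the order shown (record tag, date, obsolete ID, then zero or more successor IDs). The function name parse_obsolete is hypothetical, and lowercasing the IDs to match the cache keys is a choice made here, not something OpenFold is confirmed to do.

```python
def parse_obsolete(lines):
    """Map each obsolete PDB ID to the list of its successor IDs."""
    mapping = {}
    for line in lines:
        fields = line.split()
        # Skip the header line and anything that is not an OBSLTE record.
        if len(fields) < 3 or fields[0] != "OBSLTE":
            continue
        old_id = fields[2].lower()
        successors = [pdb_id.lower() for pdb_id in fields[3:]]
        mapping[old_id] = successors
    return mapping


# Lines copied from the sample above.
# For the real file: with open("obsolete.dat") as f: mapping = parse_obsolete(f)
sample_lines = [
    "LIST OF OBSOLETE COORDINATE ENTRIES AND SUCCESSORS",
    "OBSLTE 31-JUL-94 116L 216L",
    "OBSLTE 15-APR-98 125D 1AW6",
    "OBSLTE 20-SEP-99 14PS 1QJB",
    "OBSLTE 30-OCT-78 151C 251C",
    "OBSLTE 15-JAN-91 156B 256B",
]

mapping = parse_obsolete(sample_lines)
print(mapping["116l"])  # prints ['216l']
```

A record with no successor column yields an empty list, so callers can tell "replaced by another entry" apart from "withdrawn without replacement".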
The updated training invocation of train_openfold.py (Monomer) is as follows:
python3 train_openfold.py \
--train_data_dir [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/ \
--train_alignment_dir mydata/alignment_dir/ \
--template_mmcif_dir [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/ \
--output_dir mydata/output_dir/ \
--max_template_date "2021-10-10" \
--template_release_dates_cache_path mmcif_cache.json \
--precision bf16 \
--gpus 1 \
--replace_sampler_ddp=True \
--seed 42 \
--deepspeed_config_path deepspeed_config.json \
--checkpoint_every_epoch \
--train_chain_data_cache_path chain_data_cache.json \
--obsolete_pdbs_file_path [your folder]/af2-data-v230/pdb_mmcif/obsolete.dat
Training log:
# ...
Loading extension module utils...
Time to load utils op: 0.0003807544708251953 seconds
| Name | Type | Params
----------------------------------------
0 | model | AlphaFold | 93.2 M
1 | loss | AlphaFoldLoss | 0
----------------------------------------
93.2 M Trainable params
0 Non-trainable params
93.2 M Total params
372.916 Total estimated model params size (MB)
/opt/conda/envs/openfold/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 10, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/opt/conda/envs/openfold/lib/python3.9/site-packages/pytorch_lightning/trainer/data_loading.py:489: UserWarning: One of given dataloaders is None and it will be skipped.
rank_zero_warn("One of given dataloaders is None and it will be skipped.")
Epoch 0: 0%| | 54/10000 [26:31<81:25:01, 29.47s/it, loss=132, v_num=]
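The numbers in this log can be sanity-checked with simple arithmetic. The sketch below takes the values from the log above (54 of 10000 iterations, 29.47 s/it, 93.2 M parameters); the 4 bytes per parameter figure is an assumption (32-bit weights) that happens to match the reported size estimate closely.

```python
def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    seconds = int(round(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


done, total, sec_per_it = 54, 10000, 29.47

elapsed = done * sec_per_it               # time spent so far
remaining = (total - done) * sec_per_it   # time left in this epoch
print(format_hms(elapsed))    # close to the 26:31 shown in the progress bar
print(format_hms(remaining))  # close to the 81:25:01 shown in the progress bar

# Model size estimate, assuming 4 bytes per parameter.
params = 93.2e6
size_mb = params * 4 / 1e6
print(size_mb)  # roughly 372.8, near the 372.916 MB in the log
```

At this rate one epoch of 10000 iterations on a single GPU takes more than three days, which is worth knowing before choosing --gpus 1.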
For Multimer, train_openfold.py takes these additional parameters:
- --config_preset "model_1_multimer_v3": selects the Multimer configuration
- --train_mmcif_data_cache_path mmcif_cache.json: the PDB entry cache
That is:
python3 train_openfold.py \
--train_data_dir [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/ \
--train_alignment_dir mydata/alignment_dir/ \
--train_mmcif_data_cache_path mmcif_cache.json \
--template_mmcif_dir [your folder]/af2-data-v230/pdb_mmcif/mmcif_files/ \
--output_dir mydata/output_dir/ \
--max_template_date "2021-10-10" \
--config_preset "model_1_multimer_v3" \
--template_release_dates_cache_path mmcif_cache.json \
--precision bf16 \
--gpus 1 \
--replace_sampler_ddp=True \
--seed 42 \
--deepspeed_config_path deepspeed_config.json \
--checkpoint_every_epoch \
--train_chain_data_cache_path chain_data_cache.json \
--obsolete_pdbs_file_path [your folder]/af2-data-v230/pdb_mmcif/obsolete.dat
3. Bug
Bug: Docker shared memory limit
Log:
RuntimeError: DataLoader worker (pid 30285) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The fix is to start the Docker container with the --shm-size parameter:
nvidia-docker run -it --name openfold-v3 --shm-size 72G -v [nfs]:[nfs] openfold:v1.03
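To confirm from inside the container that the larger limit took effect, the size of the shared memory mount can be read with the standard library. This is a minimal sketch, assuming shared memory is mounted at /dev/shm as is usual on Linux; the helper names and the 64 GiB threshold are hypothetical.

```python
import os
import shutil


def shm_total_gib(path="/dev/shm"):
    """Return the total size of the filesystem at path, in GiB."""
    return shutil.disk_usage(path).total / 2**30


def has_enough_shm(required_gib, path="/dev/shm"):
    """Check whether the filesystem at path is at least required_gib in size."""
    return shm_total_gib(path) >= required_gib


# Only query /dev/shm where it exists (it is absent on some platforms).
if os.path.isdir("/dev/shm"):
    print(f"/dev/shm total: {shm_total_gib():.1f} GiB")
    print("enough for training:", has_enough_shm(64))
```

If the reported size is still small, the container was probably started without --shm-size, since the Docker default is far below what 16 DataLoader workers need.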
Save the container as a Docker image and push it:
docker ps -a | grep openfold
# Commit the container under a tag
docker ps -l
docker commit [container id] openfold:v1.03
# Add the tag for the remote registry
docker tag openfold:v1.03 harbor.[ip].com/openfold:v1.03
docker images | grep "openfold"
# Push to the remote registry
docker push harbor.[ip].com/openfold:v1.03
References:
- CSDN - Limiting risk in Docker through resource controls
- Zhihu - The num_workers setting in DataLoader and Docker shared memory issues