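Introduction
Both .sdf and .msp files can store molecular information. An .sdf file can be read with RDKit, whereas an .msp file has to be read here as a plain text file.
Reading the files
Installing RDKit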
pip install rdkit
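Reading .sdf files
from rdkit import Chem
suppl_h = Chem.SDMolSupplier('../data/HMDB/f_hmdb.sdf')  # a supplier that yields one molecule per record
mols_h = [mol for mol in suppl_h if mol is not None]  # records RDKit cannot parse come back as None
# GetProp raises KeyError if a record has no 'SMILES' property
hmdb_smi = [mol.GetProp('SMILES') for mol in mols_h]
hmdb_smi_set = set(hmdb_smi)  # unique SMILES strings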
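A molecule object (mol) has a few commonly used methods:
mol.GetPropsAsDict()  # returns all of the molecule's property keys and their values as a dict
smi = 'COO'
mol.SetProp('SMILES', smi)  # sets (or overwrites) a property on mol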
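To make it clearer where a value such as GetProp('SMILES') comes from, here is a minimal sketch of how properties are laid out in an SDF record: each data item is a `> <KEY>` header line followed by its value and a blank line, and each record ends with `$$$$`. The function name `read_sdf_props` and the sample record are made up for illustration (the sample has an empty atom block), and the sketch uses only the standard library. For real files, RDKit remains the right tool.

```python
import re

def read_sdf_props(text):
    """Collect the `> <KEY>` data items of each SDF record as a dict."""
    records = []
    for block in text.split('$$$$'):  # '$$$$' terminates every record
        lines = block.splitlines()
        props = {}
        i = 0
        while i < len(lines):
            header = re.match(r'>.*<(.+?)>', lines[i])
            i += 1
            if header is None:
                continue  # not a data-item header (e.g. part of the mol block)
            values = []
            # a data item runs until the next blank line
            while i < len(lines) and lines[i].strip():
                values.append(lines[i].strip())
                i += 1
            props[header.group(1)] = '\n'.join(values)
        if props:  # records without any data items are skipped
            records.append(props)
    return records

# Hypothetical record, for illustration only
sample = """example
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SMILES>
CCO

> <NAME>
Ethanol

$$$$
"""

props = read_sdf_props(sample)
print(props)
```

The keys collected here correspond to what mol.GetPropsAsDict() exposes for a molecule loaded by RDKit, although RDKit may convert numeric values to numbers while this sketch keeps every value as a string.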
Reading .msp files
Read the .msp file as an ordinary text file; the work is mostly string manipulation.
import numpy as np
# NOTE: `spec` is never imported in the original post. It is assumed to be a module
# that exposes Spectrum(mz, intensities, metadata) and save_as_mgf(spectrums, path).

def read_msp2mgf(file_path, save_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()
    i = 0
    spectrums = []
    while i < len(lines):
        l = lines[i].strip()  # drop the trailing newline; blank lines become ''
        if l.startswith('Name:'):
            name = l.split(': ', 1)[-1]
        elif l.startswith('InChIKey:'):
            inchikey = l.split(': ', 1)[-1]
        elif l.startswith('SMILES:'):
            smiles = l.split(': ', 1)[-1]
        elif l.startswith('ExactMass:'):
            exactmass = float(l.split(': ', 1)[-1])
        elif l.startswith('Num Peaks:'):
            num_peaks = int(l.split(': ', 1)[-1])
        elif len(l) > 0 and ':' not in l:
            # First peak line: read "m/z intensity" pairs until a blank line or end of file
            mz, inten = [], []
            while i < len(lines) and lines[i].strip():
                fields = lines[i].split()
                mz.append(float(fields[0]))
                inten.append(float(fields[1]))
                i += 1
            mz = np.array(mz)
            inten = np.array(inten)
            metadata = {'Name': name, 'InChIKey': inchikey, 'SMILES': smiles,
                        'ExactMass': exactmass, 'Num Peaks': num_peaks}
            spectrum = spec.Spectrum(mz, inten, metadata)
            spectrums.append(spectrum)
        i += 1
    # save_path is used as a prefix, so it should end with a path separator
    spec.save_as_mgf(spectrums, save_path + 'MassBank1.mgf')
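The same line-by-line logic can be tried out without the `spec` dependency. The following is a minimal sketch using only the standard library: it assumes records are separated by blank lines, that metadata lines contain a colon, and that peak lines are whitespace-separated "m/z intensity" pairs. The function name `parse_msp` and the sample records (including their peak values) are made up for illustration.

```python
def parse_msp(text):
    """Parse MSP-formatted text into a list of dicts (metadata plus peaks)."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # a blank line closes the current record
            if current is not None:
                records.append(current)
                current = None
            continue
        if current is None:
            current = {'metadata': {}, 'mz': [], 'intensity': []}
        if ':' in line:
            # metadata line such as "Name: Ethanol"; values are kept as strings
            key, value = line.split(':', 1)
            current['metadata'][key.strip()] = value.strip()
        else:
            # peak line such as "31.0 100.0"
            fields = line.split()
            current['mz'].append(float(fields[0]))
            current['intensity'].append(float(fields[1]))
    if current is not None:
        records.append(current)  # the file may not end with a blank line
    return records

# Hypothetical records, for illustration only
sample = """Name: Ethanol
SMILES: CCO
ExactMass: 46.0419
Num Peaks: 2
31.0 100.0
45.0 51.5

Name: Methanol
SMILES: CO
ExactMass: 32.0262
Num Peaks: 1
31.0 100.0
"""

records = parse_msp(sample)
print(records[0]['metadata']['Name'], records[0]['mz'])
```

Keeping every metadata field in a dict avoids the undefined-variable error that fixed names such as inchikey would cause when a record lacks that field. The trade-off is that numeric fields like ExactMass must be converted by the caller.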
Conclusion
If you have any questions, feel free to discuss them in the comments.