Preface
I recently stumbled upon the InterPro database and found it quite good.
I had previously used Selenium to scrape AlphaFold data, so I wanted to try grabbing InterPro's domain data as well.
It turns out the official site already provides the code, which is very considerate.
Of course, batch downloading still requires some modification of the official code.
Step 1: Get the list of proteins to download
First, search for the proteins we need on the Browse - InterPro (ebi.ac.uk) page:
1. Select reviewed proteins (unreviewed entries are generally of lower quality, though you can use them if you want)
2. Select the corresponding species
3. Enter a keyword for the protein
4. Click the Export button
5. Click the Generate button
When generation finishes, the button changes to Download; click it to download the file.
The downloaded file:
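The export (protein-sequences.tsv) is a tab-separated table with one protein per row. Its first column is the protein accession, which is the only column the downloader below relies on; a hypothetical excerpt (header names and values here are illustrative, not copied from a real export; the accession is the one from the commented BASE_URL in the script):

Accession	Name	Source Organism	Length
A0A024R1R8	some protein	Homo sapiens	141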
Step 2: Download the domain data for each protein in the list
The official code can only download the domain TSV for a single protein.
Official code: Results - InterPro (ebi.ac.uk)
We modify it to read the list downloaded in Step 1, download the domain data for each protein in turn, and save each to its own file:
(Requirements to run: rename protein-sequences.tsv to export.tsv and put it in the same directory as the code below; in that directory, create a folder named domain to hold the output files)
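If you prefer to do this setup in code rather than by hand, a minimal sketch (file names follow the requirements above):

import os, shutil

os.makedirs("domain", exist_ok=True)                # folder for the per-protein output files
shutil.move("protein-sequences.tsv", "export.tsv")  # rename the exported list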
'''
Adapted from the code on the InterPro website.
Reads the InterPro search result export.tsv and downloads the domain
data for every protein listed in it.
'''
# standard library modules
import sys, errno, re, json, ssl, os
from urllib import request
from urllib.error import HTTPError
from time import sleep
# BASE_URL = "https://www.ebi.ac.uk:443/interpro/api/entry/InterPro/protein/reviewed/A0A024R1R8/?page_size=200"
def parse_items(items):
    if type(items) == list:
        return ",".join(items)
    return ""
def parse_member_databases(dbs):
    if type(dbs) == dict:
        return ";".join([f"{db}:{','.join(dbs[db])}" for db in dbs.keys()])
    return ""
def parse_go_terms(gos):
    if type(gos) == list:
        return ",".join([go["identifier"] for go in gos])
    return ""
def parse_locations(locations):
    if type(locations) == list:
        return ",".join(
            [",".join([f"{fragment['start']}..{fragment['end']}"
                       for fragment in location["fragments"]])
             for location in locations]
        )
    return ""
def parse_group_column(values, selector):
    return ",".join([parse_column(value, selector) for value in values])

def parse_column(value, selector):
    if value is None:
        return ""
    elif "member_databases" in selector:
        return parse_member_databases(value)
    elif "go_terms" in selector:
        return parse_go_terms(value)
    elif "children" in selector:
        return parse_items(value)
    elif "locations" in selector:
        return parse_locations(value)
    return str(value)
def download_to_file(url, file_path):
    # disable SSL verification to avoid config issues
    context = ssl._create_unverified_context()
    next = url
    attempts = 0
    # Open the output file once, so results from every page of the
    # paginated API response end up in the same file (opening it inside
    # the loop would truncate the file on each new page).
    with open(file_path, "w") as f:
        while next:
            try:
                req = request.Request(next, headers={"Accept": "application/json"})
                res = request.urlopen(req, context=context)
                # If the API times out due to a long-running query
                if res.status == 408:
                    # wait just over a minute
                    sleep(61)
                    # then continue this loop with the same URL
                    continue
                elif res.status == 204:
                    # no data so leave loop
                    break
                payload = json.loads(res.read().decode())
                next = payload["next"]
                attempts = 0
            except HTTPError as e:
                if e.code == 408:
                    sleep(61)
                    continue
                else:
                    # For any other HTTP error, retry 3 times before failing
                    if attempts < 3:
                        attempts += 1
                        sleep(61)
                        continue
                    else:
                        sys.stderr.write("LAST URL: " + next)
                        raise e
            # Write one tab-separated row per InterPro entry on this page
            for item in payload["results"]:
                f.write(parse_column(item["metadata"]["accession"], 'metadata.accession') + "\t")
                f.write(parse_column(item["metadata"]["name"], 'metadata.name') + "\t")
                f.write(parse_column(item["metadata"]["source_database"], 'metadata.source_database') + "\t")
                f.write(parse_column(item["metadata"]["type"], 'metadata.type') + "\t")
                f.write(parse_column(item["metadata"]["integrated"], 'metadata.integrated') + "\t")
                f.write(parse_column(item["metadata"]["member_databases"], 'metadata.member_databases') + "\t")
                f.write(parse_column(item["metadata"]["go_terms"], 'metadata.go_terms') + "\t")
                f.write(parse_column(item["proteins"][0]["accession"], 'proteins[0].accession') + "\t")
                f.write(parse_column(item["proteins"][0]["protein_length"], 'proteins[0].protein_length') + "\t")
                f.write(parse_column(item["proteins"][0]["entry_protein_locations"], 'proteins[0].entry_protein_locations') + "\t")
                f.write("\n")
            # Don't overload the server, give it time before asking for more
            sleep(1)
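# Example: the commented BASE_URL at the top of this script fetches a single
# reviewed protein; the equivalent call with this function would be:
# download_to_file(
#     "https://www.ebi.ac.uk:443/interpro/api/entry/InterPro/protein/reviewed/A0A024R1R8/?page_size=200",
#     os.path.join("domain", "A0A024R1R8.tsv"),
# )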
with open("export.tsv") as f:
# 丢弃第一行头文件
line = f.readline()
line = f.readline()
cnt = 0
while line:
cnt += 1
print(cnt)
protein_id = line.split("\t")[0]
url = f"https://www.ebi.ac.uk:443/interpro/api/entry/InterPro/protein/reviewed/{protein_id}/?page_size=200"
download_to_file(url,os.path.join('domain', protein_id+'.tsv'))
line = f.readline()
Run it and the downloads will start.
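Each output file is a TSV whose columns follow the write order in download_to_file: entry accession, entry name, source database, entry type, integrated, member databases, GO terms, protein accession, protein length, and entry locations. A minimal sketch for reading one back (the file name is just an example):

import csv, os

COLUMNS = ["entry_accession", "entry_name", "source_database", "entry_type",
           "integrated", "member_databases", "go_terms",
           "protein_accession", "protein_length", "entry_protein_locations"]

with open(os.path.join("domain", "A0A024R1R8.tsv")) as f:  # example file name
    for row in csv.reader(f, delimiter="\t"):
        # each row ends with a trailing tab, so csv yields one extra empty
        # field; zip() simply ignores it
        record = dict(zip(COLUMNS, row))
        print(record["entry_accession"], record["entry_protein_locations"])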
The code is available in a Gitee repository; feel free to use it: interpro-domain-downloader: download protein domain data from the InterPro database (gitee.com)