ncbi-datasets-cli-高效便捷下载NCBI数据

文章目录

  • 简介
  • 安装
  • ```datasets download```下载基因组/基因序列
    • 按照GCA list文件编号下载
    • 下载大基因组
    • genome完整参数
    • gene参数
  • ```datasets summary```下载元数据
  • ```dataformat```将json转换成表格格式
  • 通过json文件解析其他字段
  • 问题

简介

NCBI Datasets 可以轻松从 NCBI 数据库中收集数据。使用命令行界面(CLI)工具或 NCBI Datasets 网页界面查找和下载基因和基因组的序列、注释和元数据。如下是可用的工具:
在这里插入图片描述

安装

  • 使用conda安装Datasets CLI tools, datasetsand dataformat:
# 注意不是datasets而是ncbi-datasets-cli
$ conda install -c conda-forge ncbi-datasets-cli
(base) [yut@io02 ~]$ datasets --version
datasets version: 15.25.0

datasets download下载基因组/基因序列

datasets从 NCBI 下载所有生命领域的生物序列数据,dataformat将前者下载的数据包中的元数据从 JSON Lines 格式转换为其他格式。

使用datasets下载人类参考基因组 GRCh38 的基因组数据包:

$ datasets download genome taxon human --reference --filename human-reference.zip

使用 dataformat从下载的人类参考基因组 GRCh38 数据包中提取选定的元数据字段:

$ dataformat tsv genome --package human-reference.zip --fields organism-name,assminfo-name,accession,assminfo-submitter
Organism name	Assembly Name	Assembly Accession	Assembly Submitter
Homo sapiens	GRCh38.p14	GCF_000001405.40	Genome Reference Consortium

按照GCA list文件编号下载

(base) [yut@io02 02_Glacier_new_taxa]$ head 3.gca
GCF_020042285.1
GCF_020783315.1
GCF_024343615.1
(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile 3.gca --include gff3,rna,cds,protein,genome,seq-report --filename  3genome.zip
# --inputfile:输入GCA号的list,每行一个
# --filename:输出zip包名称,默认ncbi-dataset.zip

New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 3 genome records [================================================] 100% 3/3
Downloading: 3genome.zip    10.4MB valid zip archive
Validating package files [================================================] 100% 18/18

real    0m8.208s
user    0m0.652s
sys     0m0.234s
(base) [yut@io02 02_Glacier_new_taxa]$ ls
3genome.zip  download.log 

下载大基因组

下载大量基因组,首先下载压缩包,然后分三步访问数据。

  • 1.下载人基因组压缩包
datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
  • 2.解压
unzip human_GRCh38_dataset.zip -d my_human_dataset
  • 3.转换格式
datasets rehydrate --directory my_human_dataset/

genome完整参数

(base) [yut@io02 ~]$ datasets download genome  --help

Download a genome data package. Genome data packages may include genome, transcript and protein sequences, annotation and one or more data reports. Data packages are downloaded as a zip archive.

The default genome data package includes the following files:
  * <accession>_<assembly_name>_genomic.fna (genomic sequences)
  * assembly_data_report.jsonl (data report with genome assembly and annotation metadata)
  * dataset_catalog.json (a list of files and file types included in the data package)

Usage
  datasets download genome [flags]
  datasets download genome [command]

Sample Commands
  datasets download genome accession GCF_000001405.40 --chromosomes X,Y --include genome,gff3,rna
  datasets download genome taxon "bos taurus" --dehydrated
  datasets download genome taxon human --assembly-level chromosome,complete --dehydrated
  datasets download genome taxon mouse --search C57BL/6J --search "Broad Institute" --dehydrated

Available Commands
  accession   Download a genome data package by Assembly or BioProject accession
  taxon       Download a genome data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
      --annotated                Limit to annotated genomes
      --assembly-level string    Limit to genomes at one or more assembly levels (comma-separated):
                                   * chromosome
                                   * complete
                                   * contig
                                   * scaffold
                                    (default "[]")
      --assembly-source string   Limit to 'RefSeq' (GCF_) or 'GenBank' (GCA_) genomes (default "all")
      --chromosomes strings      Limit to a specified, comma-delimited list of chromosomes, or 'all' for all chromosomes
      --dehydrated               Download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
      --exclude-atypical         Exclude atypical assemblies
      --mag string               Limit to metagenome assembled genomes (only) or remove them from the results (exclude) (default "all")
      --preview                  Show information about the requested data package
      --reference                Limit to reference genomes
      --released-after string    Limit to genomes released on or after a specified date (MM/DD/YYYY)
      --released-before string   Limit to genomes released on or before a specified date (MM/DD/YYYY)
      --search strings           Limit results to genomes with specified text in the searchable fields:
                                 species and infraspecies, assembly name and submitter.
                                 To search multiple strings, use the flag multiple times.


Global Flags
      --api-key string    Specify an NCBI API key
      --debug             Emit debugging info
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --help              Print detailed help about a datasets command
      --no-progressbar    Hide progress bar
      --version           Print version of datasets

Use datasets download genome <command> --help for detailed help about a command.

gene参数

(base) [yut@io02 ~]$ datasets download gene --help

Download a gene data package.  Gene data packages include gene, transcript and protein sequences and one or more data reports. Data packages are downloaded as a zip archive.

The default gene data package for NM, NR, NP, XM, XR, XP and YP accessions:
  * rna.fna (transcript sequences)
  * protein.faa (protein sequences)
  * data_report.jsonl (data report with gene metadata)
  * dataset_catalog.json (a list of files and file types included in the data package)

Usage
  datasets download gene [flags]
  datasets download gene [command]

Sample Commands
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download gene gene-id 2778 --fasta-filter NC_000020.11,NM_001077490.3,NP_001070958.1

Available Commands
  gene-id     Download a gene data package by NCBI Gene ID
  symbol      Download a gene data package by gene symbol
  accession   Download a gene data package by RefSeq nucleotide or protein accession
  taxon       Download a gene data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
      --fasta-filter strings       Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions
      --fasta-filter-file string   Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions included in the specified file
      --preview                    Show information about the requested data package


Global Flags
      --api-key string    Specify an NCBI API key
      --debug             Emit debugging info
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --help              Print detailed help about a datasets command
      --no-progressbar    Hide progress bar
      --version           Print version of datasets

Use datasets download gene <command> --help for detailed help about a command.

datasets summary下载元数据

(base) [yut@io02 ~]$ datasets summary --help

Print a data report containing gene, genome or virus metadata in JSON format.

Usage
  datasets summary [flags]
  datasets summary [command]

Sample Commands
  datasets summary genome accession GCF_000001405.40
  datasets summary genome taxon "mus musculus"
  datasets summary gene gene-id 672
  datasets summary gene symbol brca1 --taxon mouse
  datasets summary gene accession NP_000483.3
  datasets summary virus genome accession NC_045512.2
  datasets summary virus genome taxon sars-cov-2 --host dog

Available Commands
  gene        Print a summary of a gene dataset
  genome      Print a data report containing genome metadata
  virus       Print a data report containing virus genome metadata

Global Flags
      --api-key string   Specify an NCBI API key
      --debug            Emit debugging info
      --help             Print detailed help about a datasets command
      --version          Print version of datasets

Use datasets summary <command> --help for detailed help about a command.

  • 实例
(base) [yut@io02 ~]$ datasets summary genome accession GCF_000001405.40
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"accession":"GCF_000001405.40","annotation_info":{"busco":{"busco_lineage":"primates_odb10","busco_ver":"4.1.4","complete":0.99187225,"duplicated":0.007256894,"fragmented":0.0015239477,"missing":0.0066037737,"single_copy":0.9846154,"total_count":"13780"},"method":"Best-placed RefSeq; Gnomon; RefSeqFE; cmsearch; tRNAscan-SE","name":"GCF_000001405.40-RS_2023_10","pipeline":"NCBI eukaryotic genome annotation pipeline","provider":"NCBI RefSeq","release_date":"2023-10-02","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2023_10.html","software_version":"10.2","stats":{"gene_counts":{"non_coding":22158,"other":413,"protein_coding":20080,"pseudogene":17001,"total":59652}},"status":"Updated annotation"},"assembly_info":{"assembly_level":"Chromosome","assembly_name":"GRCh38.p14","assembly_status":"current","assembly_type":"haploid-with-alt-loci","bioproject_accession":"PRJNA31257","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA31257","title":"The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"}]}],"blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCF_000001405.40","description":"Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)","paired_assembly":{"accession":"GCA_000001405.29","only_genbank":"4 unlocalized and unplaced scaffolds.","status":"current"},"refseq_category":"reference genome","release_date":"2022-02-03","submitter":"Genome Reference Consortium","synonym":"hg38"},"assembly_stats":{"contig_l50":18,"contig_n50":57879411,"gaps_between_scaffolds_count":349,"gc_count":"1374283647","gc_percent":41,"number_of_component_sequences":35611,"number_of_contigs":996,"number_of_organelles":1,"number_of_scaffolds":470,"scaffold_l50":16,"scaffold_n50":67794873,"total_number_of_chromosomes":24,"total_sequence_length":"3099441038","total_ungapped_length":"2948318359"},"current_accession":"GCF_000001405.40","organelle_info":[{"description":"Mitochondrion","submitter":"Genome Reference Consortium","total_seq_length":"16569"}],"organism":{"common_name":"human","organism_name":"Homo sapiens","tax_id":9606},"paired_accession":"GCA_000001405.29","source_database":"SOURCE_DATABASE_REFSEQ"}],"total_count": 1}

(base) [yut@io02 ~]$ datasets summary gene gene-id 672
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"gene":{"annotations":[{"annotation_name":"GCF_000001405.40-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_000001405.40","assembly_name":"GRCh38.p14","genomic_locations":[{"genomic_accession_version":"NC_000017.11","genomic_range":{"begin":"43044295","end":"43170327","orientation":"minus"},"sequence_name":"17"}]},{"annotation_name":"GCF_009914755.1-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_009914755.1","assembly_name":"T2T-CHM13v2.0","genomic_locations":[{"genomic_accession_version":"NC_060941.1","genomic_range":{"begin":"43902857","end":"44029084","orientation":"minus"},"sequence_name":"17"}]}],"chromosomes":["17"],"common_name":"human","description":"BRCA1 DNA repair associated","ensembl_gene_ids":["ENSG00000012048"],"gene_groups":[{"id":"672","method":"NCBI Ortholog"}],"gene_id":"672","nomenclature_authority":{"authority":"HGNC","identifier":"HGNC:1100"},"omim_ids":["113705"],"orientation":"minus","protein_count":368,"reference_standards":[{"gene_range":{"accession_version":"NG_005905.2","range":[{"begin":"92501","end":"173689","orientation":"plus"}]},"type":"REFSEQ_GENE"}],"swiss_prot_accessions":["P38398"],"symbol":"BRCA1","synonyms":["IRIS","PSCP","BRCAI","BRCC1","FANCS","PNCA4","RNF53","BROVCA1","PPP1R53"],"tax_id":"9606","taxname":"Homo sapiens","transcript_count":368,"transcript_type_counts":[{"count":368,"type":"PROTEIN_CODING"}],"type":"PROTEIN_CODING"},"query":["672"]}],"total_count": 1}
  • 下载结果为json格式

dataformat将json转换成表格格式

(base) [yut@io02 ~]$ dataformat tsv

Convert data to TSV format.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv [command]

Report Commands
  genome             Convert Genome Assembly Data Report into TSV format
  genome-seq         Convert Genome Assembly Sequence Report into TSV format
  gene               Convert Gene Report into TSV format
  gene-product       Convert Gene Product Report into TSV format
  virus-genome       Convert Virus Data Report into TSV format
  virus-annotation   Convert Virus Annotation Report into TSV format
  microbigge         Convert MicroBIGG-E Data Report into TSV format
  prok-gene          Convert Prokaryote Gene Report into TSV format
  prok-gene-location Convert Prokaryote Gene Location Report into TSV format
  genome-annotations Convert Genome Annotation Report into TSV format

Flags
      --elide-header   Do not output header
  -h, --help           help for tsv



Global Flags
      --force   Force dataformat to run without type check prompt

Use dataformat tsv <command> --help for detailed help about a command.

(base) [yut@io02 ~]$ dataformat tsv gene
Error: --inputfile and/or --packagefile must be specified, or data can be read from standard input
Usage
  dataformat tsv gene [flags]

Examples
  dataformat tsv gene --inputfile gene_package/ncbi_dataset/data/data_report.jsonl
  dataformat tsv gene --package genes.zip

Flags
      --fields strings     Comma-separated list of fields (default annotation-assembly-accession,annotation-assembly-name,annotation-genomic-range-accession,annotation-genomic-range-exon-order,annotation-genomic-range-exon-orientation,annotation-genomic-range-exon-start,annotation-genomic-range-exon-stop,annotation-genomic-range-range-order,annotation-genomic-range-range-orientation,annotation-genomic-range-range-start,annotation-genomic-range-range-stop,annotation-genomic-range-seq-name,annotation-release-date,annotation-release-name,chromosomes,common-name,description,ensembl-geneids,gene-id,gene-type,genomic-region-gene-range-accession,genomic-region-gene-range-range-order,genomic-region-gene-range-range-orientation,genomic-region-gene-range-range-start,genomic-region-gene-range-range-stop,genomic-region-genomic-region-type,group-id,group-method,name-authority,name-id,omim-ids,orientation,protein-count,ref-standard-gene-range-accession,ref-standard-gene-range-range-order,ref-standard-gene-range-range-orientation,ref-standard-gene-range-range-start,ref-standard-gene-range-range-stop,ref-standard-genomic-region-type,replaced-gene-id,rna-type,swissprot-accessions,symbol,synonyms,tax-id,tax-name,transcript-count)
                               - annotation-assembly-accession
                               - annotation-assembly-name
                               - annotation-genomic-range-accession
                               - annotation-genomic-range-exon-order
                               - annotation-genomic-range-exon-orientation
                               - annotation-genomic-range-exon-start
                               - annotation-genomic-range-exon-stop
                               - annotation-genomic-range-range-order
                               - annotation-genomic-range-range-orientation
                               - annotation-genomic-range-range-start
                               - annotation-genomic-range-range-stop
                               - annotation-genomic-range-seq-name
                               - annotation-release-date
                               - annotation-release-name
                               - chromosomes
                               - common-name
                               - description
                               - ensembl-geneids
                               - gene-id
                               - gene-type
                               - genomic-region-gene-range-accession
                               - genomic-region-gene-range-range-order
                               - genomic-region-gene-range-range-orientation
                               - genomic-region-gene-range-range-start
                               - genomic-region-gene-range-range-stop
                               - genomic-region-genomic-region-type
                               - group-id
                               - group-method
                               - name-authority
                               - name-id
                               - omim-ids
                               - orientation
                               - protein-count
                               - ref-standard-gene-range-accession
                               - ref-standard-gene-range-range-order
                               - ref-standard-gene-range-range-orientation
                               - ref-standard-gene-range-range-start
                               - ref-standard-gene-range-range-stop
                               - ref-standard-genomic-region-type
                               - replaced-gene-id
                               - rna-type
                               - swissprot-accessions
                               - symbol
                               - synonyms
                               - tax-id
                               - tax-name
                               - transcript-count
  -h, --help               help for gene
      --inputfile string   Input file (default "ncbi_dataset/data/data_report.jsonl")
      --package string     Data package (zip archive), inputfile parameter is relative to the root path inside the archive



Global Flags
      --elide-header   Do not output header
      --force          Force dataformat to run without type check prompt

  • 实例
(base) [yut@io02 ~]$ datasets summary gene gene-id 672  --as-json-lines |dataformat tsv gene
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Annotation Assembly Accession   Annotation Assembly Name        Annotation Genomic Range Accession      Annotation Genomic Range Exons Order    Annotation Genomic Range Exons Orientation      Annotation Genomic Range Exons Start    Annotation Genomic Range Exons Stop     Annotation Genomic Range Order        Annotation Genomic Range Orientation    Annotation Genomic Range Start  Annotation Genomic Range Stop   Annotation Genomic Range Seq Name       Annotation Release Date Annotation Release Name Chromosomes     Common Name     Description     Ensembl GeneIDs NCBI GeneID   Gene Type       Genomic Region Gene Range Sequence Accession    Genomic Region Gene Range Order Genomic Region Gene Range Orientation   Genomic Region Gene Range Start Genomic Region Gene Range Stop  Genomic Region Genomic Region Type      Gene Group Identifier   Gene Group Method     Nomenclature Authority  Nomenclature ID OMIM IDs        Orientation     Proteins        Reference Standard Gene Range Sequence Accession        Reference Standard Gene Range Order     Reference Standard Gene Range Orientation       Reference Standard Gene Range Start     Reference Standard Gene Range Stop    Reference Standard Genomic Region Type  Replaced NCBI GeneID    RNA Type        SwissProt Accessions    Symbol  Synonyms        Taxonomic ID    Taxonomic Name  Transcripts
GCF_000001405.40        GRCh38.p14      NC_000017.11                                            minus   43044295        43170327        17      2023-10-02      GCF_000001405.40-RS_2023_10     17      human   BRCA1 DNA repair associated     ENSG00000012048 672     PROTEIN_CODING       672      NCBI Ortholog   HGNC    HGNC:1100       113705  minus   368     NG_005905.2             plus    92501   173689  REFSEQ_GENE                     P38398  BRCA1   IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606    Homo sapiens    368
GCF_009914755.1 T2T-CHM13v2.0   NC_060941.1                                             minus   43902857        44029084        17      2023-10-02      GCF_009914755.1-RS_2023_10      17      human   BRCA1 DNA repair associated     ENSG00000012048 672     PROTEIN_CODING               672      NCBI Ortholog   HGNC    HGNC:1100       113705  minus   368     NG_005905.2             plus    92501   173689  REFSEQ_GENE                     P38398  BRCA1   IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606    Homo sapiens    368

(base) [yut@io02 ~]$ datasets summary gene gene-id 672  --as-json-lines |dataformat tsv gene --fields gene-id,gene-type,symbol
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
NCBI GeneID     Gene Type       Symbol
672     PROTEIN_CODING  BRCA1

# --as-json-lines必须加上
# --fields指定需要的字段,多个空格隔开

通过json文件解析其他字段

  • 某些字段无法通过dataformat提取出来,可先保存成json文件,然后通过下面脚本解析:
(base) [yut@node01 ~]$ cat dataset.json
{"accession":"GCA_013141435.1","annotation_info":{"method":"Best-placed reference protein set; GeneMarkS-2+","name":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","pipeline":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","provider":"NCBI","release_date":"2020-05-14","software_version":"4.11","stats":{"gene_counts":{"non_coding":27,"protein_coding":2566,"pseudogene":17,"total":2610}}},"assembly_info":{"assembly_level":"Contig","assembly_method":"MetaSPAdes v. 3.10.1","assembly_name":"ASM1314143v1","assembly_status":"current","assembly_type":"haploid","bioproject_accession":"PRJNA622654","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA622654","title":"Metagenomic profiling of ammonia and methane-oxidizing microorganisms in a Dutch drinking water treatment plant"}]}],"biosample":{"accession":"SAMN14539096","attributes":[{"name":"isolation_source","value":"Primary rapid sand filter"},{"name":"collection_date","value":"not applicable"},{"name":"geo_loc_name","value":"Netherlands"},{"name":"lat_lon","value":"not applicable"},{"name":"isolate","value":"P-RSF-IL-07"},{"name":"depth","value":"not applicable"},{"name":"env_broad_scale","value":"drinking water treatment plant"},{"name":"env_local_scale","value":"Primary rapid sand filter"},{"name":"env_medium","value":"not applicable"},{"name":"metagenomic","value":"1"},{"name":"environmental-sample","value":"1"},{"name":"sample_type","value":"metagenomic assembly"},{"name":"metagenome-source","value":"drinking water metagenome"},{"name":"derived_from","value":"This BioSample is a metagenomic assembly obtained from the drinking water metagenome BioSample:SAMN14524263, SAMN14524264, SAMN14524265, SAMN14524266"}],"bioprojects":[{"accession":"PRJNA622654"}],"description":{"comment":"Keywords: GSC:MIxS;MIMAG:6.0","organism":{"organism_name":"Ferruginibacter sp.","tax_id":1940288},"title":"MIMAG Metagenome-assembled Genome sample from Ferruginibacter sp."},"last_updated":"2020-05-19T00:50:12.857","models":["MIMAG.water"],"owner":{"contacts":[{}],"name":"Radboud University"},"package":"MIMAG.water.6.0","publication_date":"2020-05-19T00:50:12.857","sample_ids":[{"label":"Sample name","value":"Ferruginibacter sp. P-RSF-IL-07"}],"status":{"status":"live","when":"2020-05-19T00:50:12.857"},"submission_date":"2020-04-04T13:17:04.950"},"comments":"The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/","genome_notes":["derived from metagenome"],"release_date":"2020-05-21","sequencing_tech":"Illumina MiSeq","submitter":"Radboud University"},"assembly_stats":{"contig_l50":10,"contig_n50":104094,"gc_count":"978119","gc_percent":32,"genome_coverage":"270.4x","number_of_component_sequences":43,"number_of_contigs":43,"total_sequence_length":"3056910","total_ungapped_length":"3056910"},"average_nucleotide_identity":{"best_ani_match":{"ani":79.65,"assembly":"GCA_003426875.1","assembly_coverage":0.01,"category":"type","organism_name":"Lutibacter oceani","type_assembly_coverage":0.01},"category":"category_na","comment":"na","match_status":"low_coverage","submitted_organism":"Ferruginibacter sp.","submitted_species":"Ferruginibacter sp.","taxonomy_check_status":"Inconclusive"},"current_accession":"GCA_013141435.1","organism":{"infraspecific_names":{"isolate":"P-RSF-IL-07"},"organism_name":"Ferruginibacter sp.","tax_id":1940288},"source_database":"SOURCE_DATABASE_GENBANK","wgs_info":{"master_wgs_url":"https://www.ncbi.nlm.nih.gov/nuccore/JABFQZ000000000.1","wgs_contigs_url":"https://www.ncbi.nlm.nih.gov/Traces/wgs/JABFQZ01","wgs_project_accession":"JABFQZ01"}}

(base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py  *json
Save result in output.csv
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly
(base) [yut@node01 ~]$ cat ~/Software/Important_scripts/Parse_dataset_genome_json_metadata.py
#!/usr/bin/env python
import argparse
import json
import pandas as pd

# 创建参数解析器
parser = argparse.ArgumentParser(description='Parse JSON data')
parser.add_argument('json_file', help='Path to the JSON file')

# 解析参数
args = parser.parse_args()

# 读取JSON文件
with open(args.json_file, 'r') as file:
    json_str = file.read()

# 解析JSON
data = json.loads(json_str)

# 获取env_broad_scale字段的值
# 获取所需字段的值
accession = data["accession"]
geo_loc_name = data["assembly_info"]["biosample"]["attributes"][2]["value"]
lat_lon = data["assembly_info"]["biosample"]["attributes"][3]["value"]
collection_date = data['assembly_info']['biosample']['attributes'][1]['value']
env_broad_scale = data["assembly_info"]["biosample"]["attributes"][6]["value"]
env_local_scale = data['assembly_info']['biosample']['attributes'][7]['value']
env_medium = data['assembly_info']['biosample']['attributes'][8]['value']
sample_type = data['assembly_info']['biosample']['attributes'][11]['value']

# output
# 创建DataFrame
df = pd.DataFrame({
    'Accession': [accession],
    'Geo Location Name': [geo_loc_name],
    'Latitude and Longitude': [lat_lon],
    'Collection date' : [collection_date],
    'Env broad scale' : [env_broad_scale],
    'Env local scale'  : [env_local_scale],
    'Env medium' : [env_medium],
    'Sample type' : [sample_type]
})

# 将DataFrame保存为CSV文件
df.to_csv('output.csv', index=False)
print('Save result in output.csv ')

# run
$ (base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py  dataset.json
Save result in output.csv
# output
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly

问题

  • Error: Internal error (invalid zip archive)并且没有输出文件
(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile GTDB_R214_63_Ferruginibacter_genus.GCA --include gff3,rna,cds,protein,genome,seq-report
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 63 genome records [================================================] 100% 63/63
Downloading: ncbi_dataset.zip    146MB done
Validating package files [===========>------------------------------------]  28% 70/252
Error: Internal error (invalid zip archive). Please try again

上述问题可能是输入编号既包括GCA又包括GCF编号,解决办法是将两者分开下载,或者等到Validating package files停掉命令

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/153571.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

牛客 —— 链表中倒数第k个结点(C语言,快慢指针,配图)

目录 1. 思路1&#xff1a;倒数第K个节点&#xff0c;就是整数第N-K1的节点 2. 思路2&#xff1a;快慢指针 1. 思路1&#xff1a;倒数第K个节点&#xff0c;就是整数第N-K1的节点 链表中&#xff0c;一共有N个节点&#xff0c;如果我们想要得出倒数第K个节点&#xff0c;我们…

windows server 华夏ERP部署手册

软件包准备&#xff1a; .安装MySql 找到mysql程序双击进行安装,进入这个页面 选择Server only点击Next 进入到下图,点击execute&#xff0c;等待完成&#xff0c;点击下一步 点击install安装插件 安装完插件点击下一步 等待程序加载完成,点击下一步 继续下一步 进行下一步 进行…

不加家长好友,如何私密发成绩?

身为老师的你&#xff0c;是否经常收到家长们的询问&#xff0c;要求你告知他们孩子的成绩&#xff1f;而你却因为规定&#xff0c;不能直接将成绩公布&#xff1f;那么&#xff0c;如何解决这个问题呢&#xff1f; 成绩查询系统。是专门为学生和家长提供成绩查询服务的系统。可…

智慧农业新篇章:拓世法宝AI智能直播一体机助力乡村振兴与农业可持续发展

随着乡村振兴战略的深入推进&#xff0c;农业发展日益成为国家关注的焦点。在这一大背景下&#xff0c;助农项目的兴起成为支持乡村振兴的一项重要举措。 乡村振兴战略的实施&#xff0c;得益于《关于推动文化产业赋能乡村振兴的意见》、《关于全面推进乡村振兴加快农业农村现…

vscode Prettier配置

常用配置项&#xff1a; .prettierrc.json 是 Prettier 格式化工具的配置文件 {"printWidth": 200, // 指定行的最大长度"tabWidth": 2, // 指定缩进的空格数"useTabs": false, // 是否使用制表符进行缩进&#xff0c;默认为 false"singl…

Behave介绍和快速示例

Behave是一个用于行为驱动开发 (Behavior-Driven Development, BDD) 的 Python 库。使用 Behave&#xff0c;可以编写自然语言格式的使用场景来描述软件的行为&#xff0c;然后用 Python 实现这些场景下的步骤&#xff0c;形成可直接运行的测试。 Behave的目标是帮助用户、开发…

RT-DETR算法优化改进:自研独家创新BSAM注意力 ,基于CBAM升级 | 注意力机制大作战

💡💡💡本文全网首发独家改进:提出新颖的注意力BSAM(BiLevel Spatial Attention Module),创新度极佳,适合科研创新,效果秒杀CBAM,Channel Attention+Spartial Attention升级为新颖的 BiLevel Attention+Spartial Attention 1)作为注意力BSAM使用; 推荐指数:…

B/S麻醉临床信息系统(手麻系统)源码

手术麻醉系统是一套以数字形式与医院信息系统&#xff08;如HIS、EMR、LIS、PACS等&#xff09;和医疗设备等软、硬件集成并获取围手术期相关信息的计算机系统&#xff0c;其核心是对围手术期患者信息自动采集、储存、分析并呈现。该系统通过整合围手术期中病人信息、人员信息、…

DBeaver还原mysql数据库

DBeaver还原mysql数据库 DBEaver还原mysql数据库新建一个要还原的数据库选择工具》恢复数据库 DBEaver还原mysql数据库 新建一个要还原的数据库 选中数据库,右键新建一个数据库&#xff0c;字符集和排序规则默认的即可 选择工具》恢复数据库 选中刚刚创建好的数据库&#x…

springboot326校园体育场馆(设施)使用管理网站

交流学习&#xff1a; 更多项目&#xff1a; 全网最全的Java成品项目列表 https://docs.qq.com/doc/DUXdsVlhIdVlsemdX 演示 项目功能演示&#xff1a; ————————————————

Garmin佳明发布quatix 7 Pro航海商务智能腕表,于陆地之外乘风破浪

全新航海商务智能腕表&#xff0c;专为水上活动爱好者设计 搭载1.3英寸AMOLED屏幕&#xff0c;内置LED手电筒&#xff0c;长达16天电池续航 【2023年11月16日】今日&#xff0c;专业运动智能可穿戴设备及创新航海设备品牌佳明&#xff08;纽交所代码&#xff1a;GRMN&#xff…

Linux/麒麟系统上部署Vue+SpringBoot前后端分离项目

目录 1. 前端准备工作 1.1 在项目根目录创建两份环境配置文件 1.2 环境配置 2. 后端准备工作 2.1 在项目resources目录创建两份环境配置文件 2.2 环境配置 3. 前后端打包 3.1 前端打包 3.2 后端打包 4、服务器前后端配置及部署 4.1 下载、安装、启动Nginx 4.2 前端项目部署…

学而优则“创”西电学子助力openGauss教学“破圈”,一举斩获金奖

在你的大学生涯&#xff0c;是否有过发现某本教材作者就是本校老师的经历&#xff1f;是否曾经为在课堂上见到作者本人而感到些许骄傲&#xff1f;实际上&#xff0c;这样的巧合在很多专业领域常有发生&#xff0c;因为一线教师往往既是知识的传道者&#xff0c;也是知识的生产…

Kafka 集群如何实现数据同步?

哈喽大家好&#xff0c;我是咸鱼 最近这段时间比较忙&#xff0c;将近一周没更新文章&#xff0c;再不更新我那为数不多的粉丝量就要库库往下掉了 T﹏T 刚好最近在学 Kafka&#xff0c;于是决定写篇跟 Kafka 相关的文章&#xff08;文中有不对的地方欢迎大家指出&#xff09;…

简单高效的定制移动固态硬盘,稳定易用的办公小助手

我在平时的工作中&#xff0c;常常需要处理各种大文件和数据&#xff0c;如果硬盘速度跟不上&#xff0c;那工作效率就会大幅降低。今年固态硬盘的价格大幅下降&#xff0c;有很多国产品牌加入其中&#xff0c;很容易找到一款性价比高的固态硬盘&#xff0c;搭配硬盘盒使用&…

EV代码签名证书

为了增强软件的安全性和可信度&#xff0c;EV代码签名证书&#xff08;Extended Validation Code Signing Certificate&#xff09;成为了一种具有最高级别保障的关键工具。 EV代码签名证书是一种由受信任的证书颁发机构&#xff08;CA&#xff09;或证书供应商提供的高级别代…

免费最强下载工具IDM,没有之一

IDM(Internet Download Manager)下载工具是我见过的最强下载工具&#xff0c;没有之一。主要以下特点&#xff1a; 下载程度超快实时检测下载行为下载任何文件探测视频下载地址&#xff0c;几分钟下载高清视频可多进程下载&#xff0c;可多线程下载 IDM官网地址&#xff1a;下…

性能测试知多少---系统架构分析

之前有对性能需求进行过分析&#xff0c;那篇主要从项目业务、背景等角度如何抽丝剥茧的将项目的需求抽离出来。在我们进行需求的时候也需要对被测项目的架构有一定的认识&#xff0c;如果不了解被测系统的架构&#xff0c;那么在后期的性能分析与调优阶段将无从下手。 简单系…

八股文-TCP的三次握手

TCP协议是一种面向连接、可靠传输的协议&#xff0c;而建立连接的过程就是著名的三次握手。这个过程保证了通信的双方能够同步信息&#xff0c;确保后续的数据传输是可靠和有序的。本文将深入解析TCP三次握手的步骤及其意义。 漫画TCP的三次握手 TCP连接的建立采用了三次握手的…