How to brute-force download specific data files with huggingface_hub

See the docs here:
https://huggingface.co/docs/huggingface_hub/guides/download

As the guide shows, downloading a single file and downloading an entire repository are both possible.

Here is an example of downloading with snapshot_download:

https://qq742971636.blog.csdn.net/article/details/135150482

snapshot_download downloads all the files by repeatedly calling hf_hub_download; hf_hub_download is the function responsible for fetching a single file.
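
For orientation, here is a minimal sketch of both entry points (the repo id and filename are hypothetical placeholders, not from this post):

from huggingface_hub import hf_hub_download, snapshot_download

# Fetch a single file from a (hypothetical) dataset repo.
path = hf_hub_download(
    repo_id="some-org/some-dataset",
    filename="data/part-00000.parquet",
    repo_type="dataset",
)
print(path)  # local path of that one file

# Fetch a whole repo snapshot, optionally filtered with glob patterns.
folder = snapshot_download(
    repo_id="some-org/some-dataset",
    repo_type="dataset",
    allow_patterns=["*.parquet"],  # only download parquet files
)
print(folder)  # local snapshot folder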

One downright stupid thing about snapshot_download: even if a file already exists in your local directory, running snapshot_download downloads it all over again. The documentation explains why: if local_dir_use_symlinks=False and the blob files are not in the cache directory, the files are downloaded and placed directly under local_dir. This means that if you ever need to download them again, they will be re-downloaded entirely.

    If `local_dir` is provided, the file structure from the repo will be replicated in this location. You can configure
    how you want to move those files:
      - If `local_dir_use_symlinks="auto"` (default), files are downloaded and stored in the cache directory as blob
        files. Small files (<5MB) are duplicated in `local_dir` while a symlink is created for bigger files. The goal
        is to be able to manually edit and save small files without corrupting the cache while saving disk space for
        binary files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
        environment variable.
      - If `local_dir_use_symlinks=True`, files are downloaded, stored in the cache directory and symlinked in `local_dir`.
        This is optimal in terms of disk usage but files must not be manually edited.
      - If `local_dir_use_symlinks=False` and the blob files exist in the cache directory, they are duplicated in the
        local dir. This means disk usage is not optimized.
      - Finally, if `local_dir_use_symlinks=False` and the blob files do not exist in the cache directory, then the
        files are downloaded and directly placed under `local_dir`. This means if you need to download them again later,
        they will be re-downloaded entirely.
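
To make the three modes concrete, a hedged sketch (repo id and directory are placeholders):

from huggingface_hub import snapshot_download

# "auto" (default): big files are symlinked into local_dir, small files are copied.
snapshot_download("some-org/some-dataset", repo_type="dataset",
                  local_dir="./data", local_dir_use_symlinks="auto")

# True: everything is symlinked; best for disk usage, but files must not be edited.
snapshot_download("some-org/some-dataset", repo_type="dataset",
                  local_dir="./data", local_dir_use_symlinks=True)

# False: real files land under local_dir and are not cached, so a later
# re-download starts from scratch (exactly the behavior complained about above).
snapshot_download("some-org/some-dataset", repo_type="dataset",
                  local_dir="./data", local_dir_use_symlinks=False)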

What a mess. The already-present files have to be removed from filtered_repo_files, otherwise every run re-downloads everything and the whole exercise is pointless. So I went straight for the source code.


Edit the file /opt/xiedong/miniconda3/envs/datacomp/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py:


import os
from pathlib import Path
from typing import Dict, List, Literal, Optional, Union

from tqdm.auto import tqdm as base_tqdm
from tqdm.contrib.concurrent import thread_map

from .constants import (
    DEFAULT_ETAG_TIMEOUT,
    DEFAULT_REVISION,
    HF_HUB_CACHE,
    HF_HUB_ENABLE_HF_TRANSFER,
    REPO_TYPES,
)
from .file_download import REGEX_COMMIT_HASH, hf_hub_download, repo_folder_name
from .hf_api import HfApi
from .utils import filter_repo_objects, logging, validate_hf_hub_args
from .utils import tqdm as hf_tqdm

logger = logging.get_logger(__name__)


@validate_hf_hub_args
def snapshot_download(
        repo_id: str,
        *,
        repo_type: Optional[str] = None,
        revision: Optional[str] = None,
        cache_dir: Union[str, Path, None] = None,
        local_dir: Union[str, Path, None] = None,
        local_dir_use_symlinks: Union[bool, Literal["auto"]] = "auto",
        library_name: Optional[str] = None,
        library_version: Optional[str] = None,
        user_agent: Optional[Union[Dict, str]] = None,
        proxies: Optional[Dict] = None,
        etag_timeout: float = DEFAULT_ETAG_TIMEOUT,
        resume_download: bool = False,
        force_download: bool = False,
        token: Optional[Union[bool, str]] = None,
        local_files_only: bool = False,
        allow_patterns: Optional[Union[List[str], str]] = None,
        ignore_patterns: Optional[Union[List[str], str]] = None,
        max_workers: int = 8,
        tqdm_class: Optional[base_tqdm] = None,
        endpoint: Optional[str] = None,
) -> str:
    """Download repo files.

    Download a whole snapshot of a repo's files at the specified revision. This is useful when you want all files from
    a repo, because you don't know which ones you will need a priori. All files are nested inside a folder in order
    to keep their actual filename relative to that folder. You can also filter which files to download using
    `allow_patterns` and `ignore_patterns`.

    If `local_dir` is provided, the file structure from the repo will be replicated in this location. You can configure
    how you want to move those files:
      - If `local_dir_use_symlinks="auto"` (default), files are downloaded and stored in the cache directory as blob
        files. Small files (<5MB) are duplicated in `local_dir` while a symlink is created for bigger files. The goal
        is to be able to manually edit and save small files without corrupting the cache while saving disk space for
        binary files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
        environment variable.
      - If `local_dir_use_symlinks=True`, files are downloaded, stored in the cache directory and symlinked in `local_dir`.
        This is optimal in terms of disk usage but files must not be manually edited.
      - If `local_dir_use_symlinks=False` and the blob files exist in the cache directory, they are duplicated in the
        local dir. This means disk usage is not optimized.
      - Finally, if `local_dir_use_symlinks=False` and the blob files do not exist in the cache directory, then the
        files are downloaded and directly placed under `local_dir`. This means if you need to download them again later,
        they will be re-downloaded entirely.

    An alternative would be to clone the repo but this requires git and git-lfs to be installed and properly
    configured. It is also not possible to filter which files to download when cloning a repository using git.

    Args:
        repo_id (`str`):
            A user or an organization name and a repo name separated by a `/`.
        repo_type (`str`, *optional*):
            Set to `"dataset"` or `"space"` if downloading from a dataset or space,
            `None` or `"model"` if downloading from a model. Default is `None`.
        revision (`str`, *optional*):
            An optional Git revision id which can be a branch name, a tag, or a
            commit hash.
        cache_dir (`str`, `Path`, *optional*):
            Path to the folder where cached files are stored.
        local_dir (`str` or `Path`, *optional*):
            If provided, the downloaded files will be placed under this directory, either as symlinks (default) or
            regular files (see description for more details).
        local_dir_use_symlinks (`"auto"` or `bool`, defaults to `"auto"`):
            To be used with `local_dir`. If set to "auto", the cache directory will be used and the file will be either
            duplicated or symlinked to the local directory depending on its size. If set to `True`, a symlink will be
            created, no matter the file size. If set to `False`, the file will either be duplicated from cache (if
            already exists) or downloaded from the Hub and not cached. See description for more details.
        library_name (`str`, *optional*):
            The name of the library to which the object corresponds.
        library_version (`str`, *optional*):
            The version of the library.
        user_agent (`str`, `dict`, *optional*):
            The user-agent info in the form of a dictionary or a string.
        proxies (`dict`, *optional*):
            Dictionary mapping protocol to the URL of the proxy passed to
            `requests.request`.
        etag_timeout (`float`, *optional*, defaults to `10`):
            When fetching ETag, how many seconds to wait for the server to send
            data before giving up which is passed to `requests.request`.
        resume_download (`bool`, *optional*, defaults to `False`):
            If `True`, resume a previously interrupted download.
        force_download (`bool`, *optional*, defaults to `False`):
            Whether the file should be downloaded even if it already exists in the local cache.
        token (`str`, `bool`, *optional*):
            A token to be used for the download.
                - If `True`, the token is read from the HuggingFace config
                  folder.
                - If a string, it's used as the authentication token.
        local_files_only (`bool`, *optional*, defaults to `False`):
            If `True`, avoid downloading the file and return the path to the
            local cached file if it exists.
        allow_patterns (`List[str]` or `str`, *optional*):
            If provided, only files matching at least one pattern are downloaded.
        ignore_patterns (`List[str]` or `str`, *optional*):
            If provided, files matching any of the patterns are not downloaded.
        max_workers (`int`, *optional*):
            Number of concurrent threads to download files (1 thread = 1 file download).
            Defaults to 8.
        tqdm_class (`tqdm`, *optional*):
            If provided, overwrites the default behavior for the progress bar. Passed
            argument must inherit from `tqdm.auto.tqdm` or at least mimic its behavior.
            Note that the `tqdm_class` is not passed to each individual download.
            Defaults to the custom HF progress bar that can be disabled by setting
            `HF_HUB_DISABLE_PROGRESS_BARS` environment variable.

    Returns:
        Local folder path (string) of repo snapshot

    <Tip>

    Raises the following errors:

    - [`EnvironmentError`](https://docs.python.org/3/library/exceptions.html#EnvironmentError)
      if `token=True` and the token cannot be found.
    - [`OSError`](https://docs.python.org/3/library/exceptions.html#OSError) if
      ETag cannot be determined.
    - [`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError)
      if some parameter value is invalid

    </Tip>
    """
    if cache_dir is None:
        cache_dir = HF_HUB_CACHE
    if revision is None:
        revision = DEFAULT_REVISION
    if isinstance(cache_dir, Path):
        cache_dir = str(cache_dir)

    if repo_type is None:
        repo_type = "model"
    if repo_type not in REPO_TYPES:
        raise ValueError(f"Invalid repo type: {repo_type}. Accepted repo types are: {str(REPO_TYPES)}")

    storage_folder = os.path.join(cache_dir, repo_folder_name(repo_id=repo_id, repo_type=repo_type))
    print(f"storage_folder: {storage_folder}")

    # if we have no internet connection we will look for an
    # appropriate folder in the cache
    # If the specified revision is a commit hash, look inside "snapshots".
    # If the specified revision is a branch or tag, look inside "refs".
    if local_files_only:
        if REGEX_COMMIT_HASH.match(revision):
            commit_hash = revision
        else:
            # retrieve commit_hash from file
            ref_path = os.path.join(storage_folder, "refs", revision)
            with open(ref_path) as f:
                commit_hash = f.read()

        snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

        if os.path.exists(snapshot_folder):
            return snapshot_folder

        raise ValueError(
            "Cannot find an appropriate cached snapshot folder for the specified"
            " revision on the local disk and outgoing traffic has been disabled. To"
            " enable repo look-ups and downloads online, set 'local_files_only' to"
            " False."
        )
    print(f"revision: {revision}")
    # if we have internet connection we retrieve the correct folder name from the huggingface api
    api = HfApi(library_name=library_name, library_version=library_version, user_agent=user_agent, endpoint=endpoint)
    repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision, token=token)
    assert repo_info.sha is not None, "Repo info returned from server must have a revision sha."
    assert repo_info.siblings is not None, "Repo info returned from server must have a siblings list."

    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=allow_patterns,
            ignore_patterns=ignore_patterns,
        )
    )
    # print(f"filtered_repo_files: {filtered_repo_files}")

    commit_hash = repo_info.sha
    print(f"commit_hash: {commit_hash}")
    snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)
    print(f"snapshot_folder: {snapshot_folder}")
    # if passed revision is not identical to commit_hash
    # then revision has to be a branch name or tag name.
    # In that case store a ref.
    if revision != commit_hash:
        ref_path = os.path.join(storage_folder, "refs", revision)
        os.makedirs(os.path.dirname(ref_path), exist_ok=True)
        with open(ref_path, "w") as f:
            f.write(commit_hash)
        print(f"ref_path: {ref_path}")

    # we pass the commit_hash to hf_hub_download
    # so no network call happens if we already
    # have the file locally.
    def _inner_hf_hub_download(repo_file: str):
        return hf_hub_download(
            repo_id,
            filename=repo_file,
            repo_type=repo_type,
            revision=commit_hash,
            endpoint=endpoint,
            cache_dir=cache_dir,
            local_dir=local_dir,
            local_dir_use_symlinks=local_dir_use_symlinks,
            library_name=library_name,
            library_version=library_version,
            user_agent=user_agent,
            proxies=proxies,
            etag_timeout=etag_timeout,
            resume_download=resume_download,
            force_download=force_download,
            token=token,
        )

    # Filter filtered_repo_files: drop files that already exist in local_dir.
    # NOTE: os.listdir() is not recursive, so only top-level files are skipped.
    if local_dir is not None and os.path.isdir(local_dir):
        files_exist_in_local_dir = os.listdir(local_dir)
        # print(f"files_exist_in_local_dir: {files_exist_in_local_dir}")
        filtered_repo_files = [file for file in filtered_repo_files if file not in files_exist_in_local_dir]
    filtered_repo_files.sort()
    print(f"len(filtered_repo_files): {len(filtered_repo_files)}")

    if len(filtered_repo_files) == 0:
        # Nothing left to download: either every file already exists in local_dir
        # or the allow_patterns/ignore_patterns matched no files.
        print("No files left to download.")
    elif HF_HUB_ENABLE_HF_TRANSFER:
        print("单线程下载")
        # print(f"filtered_repo_files: {filtered_repo_files}")
        # when using hf_transfer we don't want extra parallelism
        # from the one hf_transfer provides
        for file in filtered_repo_files:
            _inner_hf_hub_download(file)
    else:
        print("多线程下载")
        print(f"max_workers: {max_workers}")
        # print(f"filtered_repo_files: {filtered_repo_files}")
        thread_map(
            _inner_hf_hub_download,
            filtered_repo_files,
            desc=f"Fetching {len(filtered_repo_files)} files",
            max_workers=max_workers,
            # User can use its own tqdm class or the default one from `huggingface_hub.utils`
            tqdm_class=tqdm_class or hf_tqdm,
        )

    if local_dir is not None:
        return str(os.path.realpath(local_dir))
    return snapshot_folder


With this change, files that have already been downloaded are skipped, which is much more comfortable.
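
For example, a run against the patched module might look like this (repo id and directory are hypothetical):

from huggingface_hub import snapshot_download

# Files already present at the top level of local_dir are now skipped.
snapshot_download(
    repo_id="some-org/some-dataset",  # hypothetical repo
    repo_type="dataset",
    local_dir="/opt/xiedong/data",    # hypothetical directory
    local_dir_use_symlinks=False,
    max_workers=8,
)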

After the download finishes, check each file with a hash (e.g. MD5) to confirm it came down intact; for any file that fails the check, fetch that single file directly with hf_hub_download.
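
A minimal verification sketch, assuming an expected_md5 mapping (filename to hash) is available from somewhere; the mapping, repo id, and paths below are all hypothetical:

import hashlib
import os

from huggingface_hub import hf_hub_download

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in chunks and return its MD5 hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical filename -> expected-MD5 mapping.
expected_md5 = {"part-00000.parquet": "0123456789abcdef0123456789abcdef"}

local_dir = "/opt/xiedong/data"  # hypothetical directory
for name, want in expected_md5.items():
    path = os.path.join(local_dir, name)
    if not os.path.exists(path) or md5_of(path) != want:
        # Re-download just this one missing or corrupted file.
        hf_hub_download(
            repo_id="some-org/some-dataset",  # hypothetical repo
            filename=name,
            repo_type="dataset",
            local_dir=local_dir,
            local_dir_use_symlinks=False,
            force_download=True,
        )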

