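Datasets provides two kinds of dataset objects: Dataset and ✨ IterableDataset ✨.
- A Dataset offers fast random access to rows and is memory-mapped, so even a large dataset needs relatively little memory to load.
- An IterableDataset is suited to very large datasets, including ones that cannot be fully downloaded to disk or do not fit in memory. It lets you start accessing and using the data before the download has finished.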
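The difference between the two access patterns can be sketched in plain Python, without the library. This is only an analogy: stream_rows is a hypothetical stand-in for a streamed source, and the list stands in for random access (the real Dataset uses memory mapping rather than an in-memory list).

```python
from typing import Iterator


def stream_rows() -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset: rows are produced lazily."""
    for i in range(5):
        yield {"text": f"review {i}", "label": i % 2}


# Random-access analogue: any row is reachable by position, including from the end.
rows = list(stream_rows())
last_row = rows[-1]

# Iterable analogue: rows can only be pulled in order, one at a time.
stream = stream_rows()
first_row = next(stream)

# Indexing into the stream is not supported at all.
try:
    stream[0]
    supports_indexing = True
except TypeError:
    supports_indexing = False
```

The trade-off is the usual one: random access needs the data to be addressable up front, while iteration can start as soon as the first row is available.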
0 Loading the data
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")
dataset
'''
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
'''
1 Dataset
1.1 Indexing
dataset[0]
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'label': 1}
'''
dataset[-1]
'''
{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .',
'label': 0}
'''
dataset[0]['text']
'''
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
'''
dataset['text']  # indexing by column name returns the whole 'text' column
1.2 Slicing
dataset[:3]
'''
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'effective but too-tepid biopic'],
'label': [1, 1, 1]}
'''
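Note that a slice comes back as a dictionary of columns, not as a list of rows. If row-wise records are more convenient, the batch can be transposed with plain Python. A minimal sketch, using a small made-up batch in place of the real output:

```python
# A column-oriented batch shaped like the result of dataset[:3].
# The texts and labels here are made up for illustration.
batch = {
    "text": ["first review", "second review", "third review"],
    "label": [1, 1, 0],
}

# zip(*batch.values()) walks all columns in parallel, yielding one tuple per row;
# pairing each tuple with the column names rebuilds a dict per row.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

for row in rows:
    print(row)
```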
2 IterableDataset
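When streaming=True is set, load_dataset returns an IterableDataset.
An IterableDataset behaves differently from a Dataset:
- It does not support random access.
- Elements can only be obtained one at a time by iterating, for example with next(iter(...)) or a for loop.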
from datasets import load_dataset
iter_dataset = load_dataset("rotten_tomatoes", split="train", streaming=True)
iter_dataset
'''
IterableDataset({
    features: ['text', 'label'],
    n_shards: 1
})
'''
for i in iter_dataset:
    print(i)  # print the first example only
    break
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
'''
2.1 Creating an IterableDataset from an existing Dataset
iter_dataset2 = dataset.to_iterable_dataset()
for i in iter_dataset2:
    print(i)  # print the first example only
    break
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
'''
2.2 Taking a fixed number of examples
list(iter_dataset2.take(3))
'''
[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'label': 1},
{'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'label': 1},
{'text': 'effective but too-tepid biopic', 'label': 1}]
'''
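Taking the first few examples from a stream reads only as many rows as are requested. The same idea can be sketched in plain Python with itertools.islice; this is an analogy rather than the library's implementation, and stream_rows and produced are hypothetical names used only here.

```python
from itertools import islice

produced = []  # records which rows the source actually generated


def stream_rows():
    """Hypothetical stand-in for a streamed dataset of 10 rows."""
    for i in range(10):
        produced.append(i)
        yield {"text": f"review {i}", "label": i % 2}


# Pull the first three rows; the remaining seven are never generated.
first_three = list(islice(stream_rows(), 3))

print(first_three)
print(produced)
```

Because only the requested rows are produced, this pattern is handy for a quick look at a dataset that is too large to download in full.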