The Transformers Package in Detail


Transformers

Official documentation: https://huggingface.co/docs/transformers/index

🤗 Transformers provides APIs to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you time from training a model from scratch. The models can be used across different modalities such as:

  • 📝 Text: text classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages.
  • 🖼️ Images: image classification, object detection, and segmentation.
  • 🗣️ Audio: speech recognition and audio classification.
  • 🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

安装

Pip

pip install transformers

Conda

conda install -c huggingface transformers

Pipeline usage

Pipeline

  1. Start by creating a pipeline() and specify an inference task:
from transformers import pipeline

generator = pipeline(task="text-generation")
  2. Pass your input text to the pipeline():
generator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone")
[{'generated_text': 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Seven for the Iron-priests at the door to the east, and thirteen for the Lord Kings at the end of the mountain'}]

generator(
    [
        "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
        "Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne",
    ]
)

generator(
    "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
    num_return_sequences=2,
)

Choose a model and tokenizer

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

from transformers import pipeline

generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

generator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone")

Audio pipeline

from transformers import pipeline

audio_classifier = pipeline(
    task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

audio_classifier("jfk_moon_speech.wav")

Vision pipeline

from transformers import pipeline

vision_classifier = pipeline(task="image-classification")
vision_classifier(
    images="<https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg>"
)

Load pretrained instances with an AutoClass

AutoTokenizer

Nearly every NLP task begins with a tokenizer. A tokenizer converts your input into a format the model can process.

Load a tokenizer with AutoTokenizer.from_pretrained()

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sequence = "In a hole in the ground there lived a hobbit."
print(tokenizer(sequence))

{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

AutoFeatureExtractor

For audio and vision tasks, a feature extractor processes the audio signal or image into the correct input format.

Load a feature extractor with AutoFeatureExtractor.from_pretrained()

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

AutoProcessor

Multimodal tasks require a processor that combines two kinds of preprocessing tools. For example, the LayoutLMV2 model requires a feature extractor to handle images and a tokenizer to handle text; a processor combines both of them.

Load a processor with AutoProcessor.from_pretrained()

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")

AutoModel

Finally, the AutoModelFor classes let you load a pretrained model for a given task (see the official documentation for a complete list of available tasks). For example, load a model for sequence classification with AutoModelForSequenceClassification.from_pretrained():

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Easily reuse the same checkpoint to load an architecture for a different task:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")

Generally, we recommend using the AutoTokenizer class and the AutoModelFor classes to load pretrained instances of models. This will ensure you load the correct architecture every time.
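
To see how these pieces fit together, here is a minimal end-to-end sketch (the checkpoint name and example sentence are only illustrative; any sequence classification checkpoint works the same way):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint: a DistilBERT model fine-tuned for sentiment analysis
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Tokenize a sentence and run it through the model
inputs = tokenizer("Using pretrained models saves a lot of compute.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to a human-readable label
predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])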

Preprocess

Before you can use your data in a model, it needs to be processed into a format the model accepts. A model does not understand raw text, images, or audio; these inputs need to be converted into numbers and assembled into tensors.

  • Preprocess text data with a tokenizer.
  • Preprocess image or audio data with a feature extractor.
  • Preprocess data for multimodal tasks with a processor.

NLP

The main tool for processing text data is a tokenizer. A tokenizer first splits text into tokens according to a set of rules. The tokens are then converted into numbers, which are used to build the tensors that serve as model inputs. Any additional inputs the model requires are also added by the tokenizer.

If you plan on using a pretrained model, make sure to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus and uses the same token-to-index mapping (usually referred to as the vocab) as during pretraining.

Get started quickly by loading a pretrained tokenizer with the AutoTokenizer class. This downloads the vocab the model was pretrained with.

Tokenize

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The tokenizer returns a dictionary with three important items: input_ids, token_type_ids, and attention_mask. You can decode the input_ids to return the original input:

tokenizer.decode(encoded_input["input_ids"])

If you have several sentences to process, pass them to the tokenizer as a list:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}

Pad

This brings us to an important topic. When you process a batch of sentences, they are not always the same length. This is a problem because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to sentences with fewer tokens.

Set the padding parameter to True to pad the shorter sequences in the batch to match the longest one:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Truncation

On the other hand, sometimes a sequence may be too long for a model to handle. In this case, you will need to truncate the sequence to a shorter length.

Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the return_tensors parameter to "pt" for PyTorch or "tf" for TensorFlow:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
{'input_ids': tensor([[  101,   153,  7719, 21490,  1122,  1114,  9582,  1623,   102],
                      [  101,  5226,  1122,  9649,  1199,  2610,  1236,   102,     0]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 1, 0]])}

Audio

Audio inputs are preprocessed differently than text inputs, but the end goal remains the same: create numerical sequences the model can understand. A feature extractor is designed to extract features from raw image or audio data and convert them into tensors.

from datasets import load_dataset, Audio

dataset = load_dataset("superb", "ks")

dataset["train"][0]["audio"]
{'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00592041,
        -0.00405884, -0.00253296], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav',
 'sampling_rate': 16000}

This returns three items:

  • array is the speech signal loaded - and potentially resampled - as a 1D array.
  • path points to the location of the audio file.
  • sampling_rate refers to how many data points in the speech signal are measured per second.

Resample

# LJ Speech is originally sampled at 22050 Hz
lj_speech = load_dataset("lj_speech", split="train")
lj_speech[0]["audio"]

# Resample the audio to the 16 kHz rate expected by the pretrained model
lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))

lj_speech[0]["audio"]

Feature extractor

The next step is to load a feature extractor to normalize and pad the input. When padding text data, a 0 is added for shorter sequences. The same idea applies to audio data: the audio feature extractor will add 0s - interpreted as silence - to the array.

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
audio_input = [dataset["train"][0]["audio"]["array"]]
feature_extractor(audio_input, sampling_rate=16000)

Pad and truncate

dataset["train"][0]["audio"]["array"].shape

dataset["train"][1]["audio"]["array"].shape

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        padding=True,
        max_length=1000000,
        truncation=True,
    )
    return inputs

processed_dataset = preprocess_function(dataset["train"][:5])

processed_dataset["input_values"][0].shape

processed_dataset["input_values"][1].shape

Vision

from datasets import load_dataset

dataset = load_dataset("food101", split="train[:100]")

dataset[0]["image"]

Feature extractor

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

Data augmentation

from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor

normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
_transforms = Compose(
    [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
)

def transforms(examples):
    examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples

dataset.set_transform(transforms)

dataset[0]["image"]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F1A7B0630D0>,
 'label': 6,
 'pixel_values': tensor([[[ 0.0353,  0.0745,  0.1216,  ..., -0.9922, -0.9922, -0.9922],
          [-0.0196,  0.0667,  0.1294,  ..., -0.9765, -0.9843, -0.9922],
          [ 0.0196,  0.0824,  0.1137,  ..., -0.9765, -0.9686, -0.8667],
          ...,
          [ 0.0275,  0.0745,  0.0510,  ..., -0.1137, -0.1216, -0.0824],
          [ 0.0667,  0.0824,  0.0667,  ..., -0.0588, -0.0745, -0.0980],
          [ 0.0353,  0.0353,  0.0431,  ..., -0.0039, -0.0039, -0.0588]],

         [[ 0.2078,  0.2471,  0.2863,  ..., -0.9451, -0.9373, -0.9451],
          [ 0.1608,  0.2471,  0.3098,  ..., -0.9373, -0.9451, -0.9373],
          [ 0.2078,  0.2706,  0.3020,  ..., -0.9608, -0.9373, -0.8275],
          ...,
          [-0.0353,  0.0118, -0.0039,  ..., -0.2392, -0.2471, -0.2078],
          [ 0.0196,  0.0353,  0.0196,  ..., -0.1843, -0.2000, -0.2235],
          [-0.0118, -0.0039, -0.0039,  ..., -0.0980, -0.0980, -0.1529]],

         [[ 0.3961,  0.4431,  0.4980,  ..., -0.9216, -0.9137, -0.9216],
          [ 0.3569,  0.4510,  0.5216,  ..., -0.9059, -0.9137, -0.9137],
          [ 0.4118,  0.4745,  0.5216,  ..., -0.9137, -0.8902, -0.7804],
          ...,
          [-0.2314, -0.1922, -0.2078,  ..., -0.4196, -0.4275, -0.3882],
          [-0.1843, -0.1686, -0.2000,  ..., -0.3647, -0.3804, -0.4039],
          [-0.1922, -0.1922, -0.1922,  ..., -0.2941, -0.2863, -0.3412]]])}

Multimodal

Processors are used for multimodal tasks. Here you will combine everything you have learned so far and apply your skills to an automatic speech recognition (ASR) task. This means you will need:

  • A feature extractor to preprocess the audio data.
  • A tokenizer to process the text.

from datasets import load_dataset, Audio

lj_speech = load_dataset("lj_speech", split="train")

lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])

lj_speech[0]["audio"]

lj_speech[0]["text"]

lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))

Processor

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

def prepare_dataset(example):
    audio = example["audio"]

    example["input_values"] = processor(audio["array"], sampling_rate=16000)

    with processor.as_target_processor():
        example["labels"] = processor(example["text"]).input_ids
    return example

prepare_dataset(lj_speech[0])

Fine-tune a pretrained model

There are significant benefits to using a pretrained model. It reduces computation costs and your carbon footprint, and allows you to use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique.

Before fine-tuning a pretrained model, download a dataset and prepare it for training.

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\\\nThe cashier took my friends\\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\\\"serving off their orders\\\\" when they didn\\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\\\nThe manager was rude when giving me my order. She didn\\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\\\nI\\'ve eaten at various McDonalds restaurants for over 30 years. I\\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Train

🤗 Transformers provides a Trainer class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Training hyperparameters

Next, create a TrainingArguments class which contains all the hyperparameters you can tune, as well as flags for activating different training options. You can start with the default training hyperparameters, but feel free to experiment with them to find your optimal settings.

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

Metrics

Trainer does not automatically evaluate model performance during training. You will need to pass Trainer a function to compute and report metrics. The 🤗 Datasets library provides a simple accuracy function (https://huggingface.co/metrics/accuracy) that you can load with the load_metric function:

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")
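
The Trainer created below expects a compute_metrics function. Following the official quickstart, it can be defined like this, converting the logits to predictions before passing them to the metric:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)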

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Trainer

Create a Trainer object with your model, training arguments, training and test datasets, and evaluation function:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Train in native PyTorch

Trainer takes care of the training loop and allows you to fine-tune a model in a single line of code. For users who prefer to write their own training loop, you can also fine-tune a 🤗 Transformers model in native PyTorch.

import torch

# Free memory from the Trainer run above before starting the native PyTorch loop
del model
del trainer
torch.cuda.empty_cache()

tokenized_datasets = tokenized_datasets.remove_columns(["text"])

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

DataLoader

from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Optimizer and learning rate scheduler

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
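
The training loop below moves each batch to a device, so define the device and place the model on it first (the same setup that the 🤗 Accelerate diff later in this document removes):

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)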

Training loop

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Metrics

metric = load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

Distributed training with 🤗 Accelerate

As models get bigger, parallelism has emerged as a strategy for training larger models on limited hardware and accelerating training speed by several orders of magnitude. At Hugging Face, we created the 🤗 Accelerate library to help users easily train a 🤗 Transformers model on any kind of distributed setup, whether it is multiple GPUs on one machine or multiple GPUs across several machines.

Setup

Get started by installing 🤗 Accelerate:

pip install accelerate

Then import and create an Accelerator object (https://huggingface.co/docs/accelerate/accelerator.html#accelerate.Accelerator). The Accelerator will automatically detect your type of distributed setup and initialize all the necessary components for training. You don’t need to explicitly place your model on a device.

from accelerate import Accelerator

accelerator = Accelerator()

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

Backward

for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

As you can see in the following code, you only need to add four additional lines of code to your training loop to enable distributed training!

+ from accelerate import Accelerator
  from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

+ accelerator = Accelerator()

  model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  optimizer = AdamW(model.parameters(), lr=3e-5)

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)

+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

  num_epochs = 3
  num_training_steps = num_epochs * len(train_dataloader)
  lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=num_training_steps
  )

  progress_bar = tqdm(range(num_training_steps))

  model.train()
  for epoch in range(num_epochs):
      for batch in train_dataloader:
-         batch = {k: v.to(device) for k, v in batch.items()}
          outputs = model(**batch)
          loss = outputs.loss
-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
          lr_scheduler.step()
          optimizer.zero_grad()
          progress_bar.update(1)

Train

accelerate config

accelerate launch train.py

from accelerate import notebook_launcher

notebook_launcher(training_function)

Supported models and frameworks

Supported models

  1. ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  2. BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
  3. BARThez (from École polytechnique) released with the paper BARThez: a Skilled Pretrained French Sequence-to-Sequence Model by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
  4. BARTpho (from VinAI Research) released with the paper BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
  5. BEiT (from Microsoft) released with the paper BEiT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong, Furu Wei.
  6. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  7. BERTweet (from VinAI Research) released with the paper BERTweet: A pre-trained language model for English Tweets by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen.
  8. BERT For Sequence Generation (from Google) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
  9. BigBird-RoBERTa (from Google Research) released with the paper Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
  10. BigBird-Pegasus (from Google Research) released with the paper Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
  11. Blenderbot (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  12. BlenderbotSmall (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  13. BORT (from Alexa) released with the paper Optimal Subarchitecture Extraction For BERT by Adrian de Wynter and Daniel J. Perry.
  14. ByT5 (from Google Research) released with the paper ByT5: Towards a token-free future with pre-trained byte-to-byte models by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
  15. CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
  16. CANINE (from Google Research) released with the paper CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
  17. ConvNeXT (from Facebook AI) released with the paper A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
  18. CLIP (from OpenAI) released with the paper Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
  19. ConvBERT (from YituTech) released with the paper ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
  20. CPM (from Tsinghua University) released with the paper CPM: A Large-scale Generative Chinese Pre-trained Language Model by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
  21. CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong and Richard Socher.
  22. Data2Vec (from Facebook) released with the paper Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
  23. DeBERTa (from Microsoft) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
  24. DeBERTa-v2 (from Microsoft) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
  25. DeiT (from Facebook) released with the paper Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
  26. DETR (from Facebook) released with the paper End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
  27. DialoGPT (from Microsoft Research) released with the paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
  28. DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.
  29. DPR (from Facebook) released with the paper Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
  30. EncoderDecoder (from Google Research) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
  31. ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
  32. FlauBERT (from CNRS) released with the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
  33. FNet (from Google Research) released with the paper FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
  34. Funnel Transformer (from CMU/Google Brain) released with the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
  35. GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
  36. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.
  37. GPT-J (from EleutherAI) released in the repository kingoflolz/mesh-transformer-jax by Ben Wang and Aran Komatsuzaki.
  38. GPT Neo (from EleutherAI) released in the repository EleutherAI/gpt-neo by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
  39. Hubert (from Facebook) released with the paper HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
  40. I-BERT (from Berkeley) released with the paper I-BERT: Integer-only BERT Quantization by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
  41. ImageGPT (from OpenAI) released with the paper Generative Pretraining from Pixels by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
  42. LayoutLM (from Microsoft Research Asia) released with the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
  43. LayoutLMv2 (from Microsoft Research Asia) released with the paper LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
  44. LayoutXLM (from Microsoft Research Asia) released with the paper LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
  45. LED (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
  46. Longformer (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
  47. LUKE (from Studio Ousia) released with the paper LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
  48. mLUKE (from Studio Ousia) released with the paper mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
  49. LXMERT (from UNC Chapel Hill) released with the paper LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering by Hao Tan and Mohit Bansal.
  50. M2M100 (from Facebook) released with the paper Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
  51. MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.
  52. MaskFormer (from Meta and UIUC) released with the paper Per-Pixel Classification is Not All You Need for Semantic Segmentation by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
  53. MBart (from Facebook) released with the paper Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
  54. MBart-50 (from Facebook) released with the paper Multilingual Translation with Extensible Multilingual Pretraining and Finetuning by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
  55. Megatron-BERT (from NVIDIA) released with the paper Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
  56. Megatron-GPT2 (from NVIDIA) released with the paper Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
  57. MPNet (from Microsoft Research) released with the paper MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
  58. MT5 (from Google AI) released with the paper mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
  59. Nyströmformer (from the University of Wisconsin - Madison) released with the paper Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
  60. Pegasus (from Google) released with the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
  61. Perceiver IO (from Deepmind) released with the paper Perceiver IO: A General Architecture for Structured Inputs & Outputs by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
  62. PhoBERT (from VinAI Research) released with the paper PhoBERT: Pre-trained language models for Vietnamese by Dat Quoc Nguyen and Anh Tuan Nguyen.
  63. PLBart (from UCLA NLP) released with the paper Unified Pre-training for Program Understanding and Generation by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
  64. PoolFormer (from Sea AI Labs) released with the paper MetaFormer is Actually What You Need for Vision by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
  65. ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
  66. QDQBert (from NVIDIA) released with the paper Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
  67. REALM (from Google Research) released with the paper REALM: Retrieval-Augmented Language Model Pre-Training by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.
  68. Reformer (from Google Research) released with the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
  69. RemBERT (from Google Research) released with the paper Rethinking embedding coupling in pre-trained language models by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
  70. RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
  71. RoFormer (from ZhuiyiTechnology), released together with the paper RoFormer: Enhanced Transformer with Rotary Position Embedding by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
  72. SegFormer (from NVIDIA) released with the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
  73. SEW (from ASAPP) released with the paper Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
  74. SEW-D (from ASAPP) released with the paper Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
  75. SpeechToTextTransformer (from Facebook), released together with the paper fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
  76. SpeechToTextTransformer2 (from Facebook), released together with the paper Large-Scale Self- and Semi-Supervised Learning for Speech Translation by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
  77. Splinter (from Tel Aviv University), released together with the paper Few-Shot Question Answering by Pretraining Span Selection by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
  78. SqueezeBert (from Berkeley) released with the paper SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
  79. Swin Transformer (from Microsoft) released with the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  80. T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
  81. T5v1.1 (from Google AI) released in the repository google-research/text-to-text-transfer-transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
  82. TAPAS (from Google AI) released with the paper TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
  83. Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
  84. TrOCR (from Microsoft), released together with the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
  85. UniSpeech (from Microsoft Research) released with the paper UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
  86. UniSpeechSat (from Microsoft Research) released with the paper UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
  87. ViLT (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Wonjae Kim, Bokyung Son, Ildoo Kim.
  88. Vision Transformer (ViT) (from Google AI) released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
  89. ViTMAE (from Meta AI) released with the paper Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
  90. VisualBERT (from UCLA NLP) released with the paper VisualBERT: A Simple and Performant Baseline for Vision and Language by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
  91. WavLM (from Microsoft Research) released with the paper WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
  92. Wav2Vec2 (from Facebook AI) released with the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
  93. Wav2Vec2Phoneme (from Facebook AI) released with the paper Simple and Effective Zero-shot Cross-lingual Phoneme Recognition by Qiantong Xu, Alexei Baevski, Michael Auli.
  94. XGLM (From Facebook AI) released with the paper Few-shot Learning with Multilingual Language Models by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
  95. XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  96. XLM-ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
  97. XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
  98. XLM-RoBERTa-XL (from Facebook AI), released together with the paper Larger-Scale Transformers for Multilingual Masked Language Modeling by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
  99. XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
  100. XLSR-Wav2Vec2 (from Facebook AI) released with the paper Unsupervised Cross-Lingual Representation Learning For Speech Recognition by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
  101. XLS-R (from Facebook AI) released with the paper XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
  102. YOSO (from the University of Wisconsin - Madison) released with the paper You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.

Supported frameworks

The table below represents the current support in the library for each of those models: whether they have a Python tokenizer (called a “slow” tokenizer), a “fast” tokenizer backed by the 🤗 Tokenizers library, and whether they have support in Jax (via Flax), PyTorch, and/or TensorFlow.

Model comparison

👇Downstream Tasks

Text classification

Text classification is a common NLP task that assigns a label or class to text. Some of today's largest companies use text classification widely in production for many practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a positive, negative, or neutral label to a sequence of text.
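
As a quick illustration, sentiment analysis works out of the box through a pipeline. A minimal sketch (the pipeline downloads a default English sentiment model when none is specified, and the example sentence is made up):

from transformers import pipeline

classifier = pipeline(task="sentiment-analysis")
classifier("This movie was surprisingly good!")
# returns a list like [{'label': 'POSITIVE', 'score': ...}]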

Load IMDb dataset

Load the IMDb dataset from the 🤗 Datasets library:

from datasets import load_dataset

imdb = load_dataset("imdb")
imdb["test"][0]

{
    "label": 0,
    "text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \\"Gene Roddenberry's Earth...\\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}

There are two fields in this dataset:

  • text: a string containing the text of the movie review.
  • label: a value that can either be 0 for a negative review or 1 for a positive review.

Preprocess

Load the DistilBERT tokenizer to process the text field:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)

Use DataCollatorWithPadding to create a batch of examples. It will also dynamically pad your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Train

Load DistilBERT with AutoModelForSequenceClassification along with the number of expected labels:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments.
  2. Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
  3. Call train() to fine-tune your model.

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Token classification

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is named entity recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.
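
For a quick look at NER output before any fine-tuning, a token-classification pipeline can be used. A minimal sketch with the pipeline's default model (the example sentence is made up):

from transformers import pipeline

# aggregation_strategy="simple" groups subword pieces into whole entities
ner = pipeline(task="ner", aggregation_strategy="simple")
ner("Hugging Face is based in New York City.")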

Load WNUT 17 dataset

Load the WNUT 17 dataset from the 🤗 Datasets library:

from datasets import load_dataset

wnut = load_dataset("wnut_17")

wnut["train"][0]
{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}

Each number in ner_tags represents an entity. Convert the number to a label name for more information:

label_list = wnut["train"].features[f"ner_tags"].feature.names
label_list
[
    "O",
    "B-corporation",
    "I-corporation",
    "B-creative-work",
    "I-creative-work",
    "B-group",
    "I-group",
    "B-location",
    "I-location",
    "B-person",
    "I-person",
    "B-product",
    "I-product",
]

Each ner_tag describes an entity, such as a corporation, location, or person. The letter prefix of each ner_tag indicates the token position of the entity:

  • B- indicates the beginning of an entity.
  • I- indicates a token is contained inside the same entity (for example, the State token is part of an entity like Empire State Building).
  • O indicates the token doesn't correspond to any entity.

Preprocess

Load the DistilBERT tokenizer to process the tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']

Adding the special tokens [CLS] and [SEP] and subword tokenization create a mismatch between the input and the labels. A single word corresponding to a single label may be split into two subwords. You will need to realign the tokens and labels by:

  1. Mapping all tokens to their corresponding word with the word_ids method (https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Encoding.word_ids).
  2. Assigning the label -100 to the special tokens [CLS] and [SEP] so the PyTorch loss function ignores them.
  3. Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT's maximum input length:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Train

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# WNUT 17 has 13 labels (see the label_list above)
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=13)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Question answering

Question answering tasks return an answer given a question. There are two common forms of question answering:

  • Extractive: extract the answer from the given context.
  • Abstractive: generate an answer from the context that correctly answers the question. (A short extractive pipeline sketch follows this list.)
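
Here is the extractive sketch mentioned above, using a question-answering pipeline with its default model (the question and context reuse the SQuAD example shown later in this section):

from transformers import pipeline

question_answerer = pipeline(task="question-answering")
question_answerer(
    question="To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    context="It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.",
)
# returns a dict with 'answer', 'score', 'start', and 'end'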

Load SQuAD dataset

Load the SQuAD dataset from the 🤗 Datasets library:

from datasets import load_dataset

squad = load_dataset("squad")

squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

Preprocess

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering that you should be aware of:

  1. Some examples in a dataset may have a context that exceeds the maximum input length of the model. Truncate only the context by setting truncation="only_second".
  2. Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
  3. With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offsets corresponds to the question and which corresponds to the context.

Here is how you can create a function to truncate and map the start and end tokens of the answer to the context:

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

Use the 🤗 Datasets map function (https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and remove the columns you don't need:

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Use DefaultDataCollator to create a batch of examples. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply additional preprocessing such as padding.

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

Train

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Language modeling

Language modeling predicts words in a sentence. There are two forms of language modeling:

  • Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left.
  • Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. (See the short pipeline sketch after this list.)
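
Here is the pipeline sketch referenced above, contrasting the two forms with the same checkpoints used later in this section (distilgpt2 for causal language modeling, distilroberta-base for masked language modeling):

from transformers import pipeline

# Causal language modeling: continue a prompt from left to right
generator = pipeline(task="text-generation", model="distilgpt2")
generator("Language modeling predicts", max_length=20)

# Masked language modeling: fill in the token hidden behind <mask>
fill_mask = pipeline(task="fill-mask", model="distilroberta-base")
fill_mask("Language modeling predicts the <mask> word in a sentence.")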

Load the ELI5 dataset

Load only the first 5000 rows of the ELI5 dataset from the 🤗 Datasets library since it is pretty large:

from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")

eli5 = eli5.train_test_split(test_size=0.2)
eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\\n\\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\\n\\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['<http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg>']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

Preprocess

For causal language modeling, load the DistilGPT2 tokenizer to process the text subfield:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

For masked language modeling, load the DistilRoBERTa tokenizer instead:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

Extract the text subfield from its nested structure with the [flatten](<https://huggingface.co/docs/datasets/process.html#flatten>) method:

eli5 = eli5.flatten()
eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\\n\\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\\n\\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['<http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg>'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column, as indicated by the answers prefix. Notice that answers.text is a list. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)

Use the 🤗 Datasets [map](<https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map>) function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and increase the number of processes with num_proc. Remove the columns you don't need:

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Now you need a second preprocessing function to capture text truncated from any lengthy examples, to prevent loss of information. This preprocessing function should:

  • Concatenate all the text.
  • Split the concatenated text into smaller chunks defined by block_size.
block_size = 128

def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Apply the group_texts function over the entire dataset:

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

For causal language modeling, use DataCollatorForLanguageModeling with the end-of-sequence token as the padding token and mlm=False. This will use the inputs as labels shifted to the right by one element:

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

For masked language modeling, use the same DataCollatorForLanguageModeling, except you should specify mlm_probability to randomly mask tokens each time you iterate over the data.

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

Train

Causal language modeling

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()
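
Once training finishes, a quick generation sketch with the fine-tuned model (the prompt is illustrative):

prompt = "Somatic hypermutation allows the immune system to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
output_ids = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))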

Masked language modeling

from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()
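
Likewise, the fine-tuned masked language model can be inspected with a fill-mask pipeline; a minimal sketch (the sentence is illustrative, and <mask> is the RoBERTa mask token):

from transformers import pipeline

fill_mask = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)
fill_mask("The Milky Way is a <mask> galaxy.")
# returns the top candidate tokens for the masked position along with their scores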

Translation

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks as well.

Load the OPUS Books dataset

Load the OPUS Books dataset from the 🤗 Datasets library:

from datasets import load_dataset

books = load_dataset("opus_books", "en-fr")

Split this dataset into train and test sets:

books = books["train"].train_test_split(test_size=0.2)
books["train"][0]
{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}

Preprocess

Load the T5 tokenizer to process the language pairs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

The preprocessing function needs to:

  1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  2. Tokenize the input (English) and target (French) separately. You can't tokenize French text with a tokenizer pretrained on an English vocabulary. A context manager will help set the tokenizer to French first, before tokenizing it.
  3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
tokenized_books = books.map(preprocess_function, batched=True)
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)  # `model` is loaded in the Train section below

Train

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
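
After training, translations can be generated by reusing the same task prefix; a minimal sketch (the sentence is illustrative):

text = prefix + "Legumes share resources with nitrogen-fixing bacteria."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))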

Summarization

Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

  • Extractive: extract the most relevant information from a document.
  • Abstractive: generate new text that captures the most relevant information.

Load the BillSum dataset

Load the BillSum dataset from the 🤗 Datasets library:

from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
billsum["train"][0]
{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
 'text': 'The people of the State of California do enact as follows:\\n\\n\\nSECTION 1.\\nSection 10295.35 is added to the Public Contract Code, to read:\\n10295.35.\\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more wi

The text field is the input and the summary field is the target.

Preprocess

Load the T5 tokenizer to process text and summary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

The preprocessing function needs to:

  1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  2. Use a context manager with the as_target_tokenizer() function to tokenize the labels in parallel with the inputs.
  3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
tokenized_billsum = billsum.map(preprocess_function, batched=True)
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)  # `model` is loaded in the Train section below

Train

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
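
A minimal inference sketch for the fine-tuned summarizer, using the pipeline API on one test example (the text is sliced only to keep the quick check short):

from transformers import pipeline

summarizer = pipeline(task="summarization", model=model, tokenizer=tokenizer)
summarizer(billsum["test"][0]["text"][:2000], max_length=128)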

Multiple choice

A multiple choice task is similar to question answering, except several candidate answers are provided along with a context. The model is trained to select the correct answer from multiple inputs given a context.

Load the SWAG dataset

Load the SWAG dataset from the 🤗 Datasets library:

from datasets import load_dataset

swag = load_dataset("swag", "regular")
swag["train"][0]
{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

Preprocess

Load the BERT tokenizer to process the start of each sentence and the four possible endings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The preprocessing function needs to:

  1. Make four copies of the sent1 field so you can combine each of them with sent2 to recreate how a sentence starts.
  2. Combine sent2 with each of the four possible sentence endings.
  3. Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has corresponding input_ids, attention_mask, and labels fields.
ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]

    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
tokenized_swag = swag.map(preprocess_function, batched=True)
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

Train

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_swag["train"],
    eval_dataset=tokenized_swag["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
)

trainer.train()
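
At inference time, the four candidate endings of an example are scored jointly. A minimal sketch using one validation example; the reshaping mirrors what DataCollatorForMultipleChoice does:

example = swag["validation"][0]
first_sentences = [example["sent1"]] * 4
second_sentences = [f'{example["sent2"]} {example[end]}' for end in ending_names]

encoded = tokenizer(first_sentences, second_sentences, padding=True, return_tensors="pt")
# AutoModelForMultipleChoice expects tensors of shape (batch_size, num_choices, seq_len)
batch = {k: v.unsqueeze(0).to(model.device) for k, v in encoded.items()}
predicted = model(**batch).logits.argmax(dim=-1).item()
print(predicted, example[f"ending{predicted}"])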

👇Guides

Tokenizers

The PreTrainedTokenizerFast depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.pre_tokenizer = Whitespace()
files = [...]
tokenizer.train(files, trainer)

Loading directly from the tokenizer object

from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

Loading from a JSON file

tokenizer.save("tokenizer.json")
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
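
Either way, the wrapped tokenizer can then be used like any other 🤗 Transformers tokenizer. A quick sanity check (the exact tokens depend on the files the tokenizer was trained on):

encoding = fast_tokenizer("Hello, how are you?")
print(encoding.tokens())
print(encoding["input_ids"])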

Create a custom architecture

An [AutoClass](<https://huggingface.co/docs/transformers/model_doc/auto>) automatically infers the model architecture and downloads pretrained configuration and weights. Generally, we recommend using an AutoClass to produce checkpoint-agnostic code. But users who want more control over specific model parameters can create a custom 🤗 Transformers model from just a few base classes. This could be particularly useful for anyone interested in studying, training, or experimenting with a 🤗 Transformers model.

Configuration

A configuration refers to a model’s specific attributes. Each model configuration has different attributes; for instance, all NLP models have the hidden_size, num_attention_heads, num_hidden_layers and vocab_size attributes in common. These attributes specify the number of attention heads or hidden layers to construct a model with.

Get a closer look at DistilBERT by accessing DistilBertConfig to inspect its attributes:

from transformers import DistilBertConfig

my_config = DistilBertConfig(activation="relu", attention_dropout=0.4)
print(my_config)
DistilBertConfig {
  "activation": "relu",
  "attention_dropout": 0.4,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.16.2",
  "vocab_size": 30522
}
my_config = DistilBertConfig.from_pretrained("distilbert-base-uncased", activation="relu", attention_dropout=0.4)
my_config.save_pretrained(save_directory="./your_model_save_path")
my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json")

Model

The next step is to create a model. The model - also loosely referred to as the architecture - defines what each layer is doing and what operations are happening. Attributes like num_hidden_layers from the configuration are used to define the architecture. Every model shares the base class PreTrainedModel and a few common methods like resizing input embeddings and pruning self-attention heads. In addition, all models are also either a [torch.nn.Module](<https://pytorch.org/docs/stable/generated/torch.nn.Module.html>), [tf.keras.Model](<https://www.tensorflow.org/api_docs/python/tf/keras/Model>) or [flax.linen.Module](<https://flax.readthedocs.io/en/latest/flax.linen.html#module>) subclass. This means models are compatible with each of their respective framework’s usage.

from transformers import DistilBertModel

my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json")
model = DistilBertModel(my_config)
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config)

Tokenizer

The last base class you need before using a model for textual data is a tokenizer to convert raw text to tensors. There are two types of tokenizers you can use with 🤗 Transformers:

  • PreTrainedTokenizer: a Python implementation of a tokenizer.
  • PreTrainedTokenizerFast: a tokenizer from our Rust-based 🤗 Tokenizers library. This tokenizer type is significantly faster - especially during batch tokenization - due to its Rust implementation. The fast tokenizer also offers additional methods like offset mapping, which maps tokens to their original words or characters.

Both tokenizers support common methods such as encoding and decoding, adding new tokens, and managing special tokens.
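
For example, a brief sketch of those common methods with a pretrained DistilBERT fast tokenizer (the added token is purely illustrative):

from transformers import DistilBertTokenizerFast

tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
ids = tok.encode("Do not meddle in the affairs of wizards.")  # encoding
print(tok.decode(ids))                                        # decoding
tok.add_tokens(["[CUSTOM]"])                                  # adding a new token
print(tok.convert_tokens_to_ids("[CUSTOM]"))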

If you trained your own tokenizer, you can create one from your vocabulary file:

from transformers import DistilBertTokenizer

my_tokenizer = DistilBertTokenizer(vocab_file="my_vocab_file.txt", do_lower_case=False, padding_side="left")
from transformers import DistilBertTokenizer

slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
from transformers import DistilBertTokenizerFast

fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

Feature Extractor

A feature extractor processes audio or image inputs. It inherits from the base FeatureExtractionMixin class, and may also inherit from the ImageFeatureExtractionMixin class for processing image features or the SequenceFeatureExtractor class for processing audio inputs.

Depending on whether you are working on an audio or vision task, create a feature extractor associated with the model you’re using. For example, create a default ViTFeatureExtractor if you are using ViT for image classification:

from transformers import ViTFeatureExtractor

vit_extractor = ViTFeatureExtractor()
print(vit_extractor)
ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}
from transformers import ViTFeatureExtractor

my_vit_extractor = ViTFeatureExtractor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3])
print(my_vit_extractor)
ViTFeatureExtractor {
  "do_normalize": false,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.3,
    0.3,
    0.3
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": "PIL.Image.BOX",
  "size": 224
}
from transformers import Wav2Vec2FeatureExtractor

w2v2_extractor = Wav2Vec2FeatureExtractor()
print(w2v2_extractor)
Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

Processor

For models that support multimodal tasks, 🤗 Transformers offers a processor class that conveniently wraps a feature extractor and tokenizer into a single object. For example, let’s use the Wav2Vec2Processor for an automatic speech recognition task (ASR). ASR transcribes audio to text, so you will need a feature extractor and a tokenizer.

Create a feature extractor to handle the audio inputs:

from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(padding_value=1.0, do_normalize=True)
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer(vocab_file="my_vocab_file.txt")
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
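
The processor can then be called directly on raw audio. A minimal sketch with a placeholder waveform (one second of silence at 16 kHz):

import numpy as np

dummy_audio = np.zeros(16000, dtype=np.float32)  # placeholder waveform
inputs = processor(dummy_audio, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)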

Multilingual models for inference

There are several multilingual models in 🤗 Transformers, and their inference usage differs from monolingual models. Not all multilingual model usage is different, though. Some models, like bert-base-multilingual-uncased, can be used just like a monolingual model.

XLM

XLM has ten different checkpoints, only one of which is monolingual. The nine remaining model checkpoints can be split into two categories: the checkpoints that use language embeddings and those that don't.

XLM with language embeddings

The following XLM models use language embeddings to specify the language used at inference:

  • xlm-mlm-ende-1024 (masked language modeling, English-German)
  • xlm-mlm-enfr-1024 (masked language modeling, English-French)
  • xlm-mlm-enro-1024 (masked language modeling, English-Romanian)
  • xlm-mlm-xnli15-1024 (masked language modeling, XNLI languages)
  • xlm-mlm-tlm-xnli15-1024 (masked language modeling + translation, XNLI languages)
  • xlm-clm-enfr-1024 (causal language modeling, English-French)
  • xlm-clm-ende-1024 (causal language modeling, English-German)

Language embeddings are represented as a tensor of the same shape as the input_ids passed to the model. The values in these tensors depend on the language used and are identified by the tokenizer's lang2id and id2lang attributes.

In this example, load the xlm-clm-enfr-1024 checkpoint (causal language modeling, English-French):

import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

The lang2id attribute of the tokenizer displays this model's languages and their ids:

print(tokenizer.lang2id)
{'en': 0, 'fr': 1}
input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1

Set the language id to "en" and use it to define the language embedding. The language embedding is a tensor filled with 0 since that is the language id for English. This tensor should be the same size as input_ids.

language_id = tokenizer.lang2id["en"]  # 0
langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

# We reshape it to be of size (batch_size, sequence_length)
langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)
outputs = model(input_ids, langs=langs)

The run_generation.py script can generate text with language embeddings using the xlm-clm checkpoints.

XLM without language embeddings

The following XLM models do not require language embeddings during inference:

  • xlm-mlm-17-1280 (masked language modeling, 17 languages)
  • xlm-mlm-100-1280 (masked language modeling, 100 languages)

Unlike the previous XLM checkpoints, these models are used for generic sentence representations.

BERT

The following BERT models can be used for multilingual tasks:

  • bert-base-multilingual-uncased (masked language modeling + next sentence prediction, 102 languages)
  • bert-base-multilingual-cased (masked language modeling + next sentence prediction, 104 languages)

These models do not require language embeddings during inference. They should identify the language from the context and infer accordingly.
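
A quick sketch showing the same checkpoint filling masks in two languages without any language id:

from transformers import pipeline

fill_mask = pipeline(task="fill-mask", model="bert-base-multilingual-cased")
fill_mask("Paris is the [MASK] of France.")
fill_mask("Paris est la [MASK] de la France.")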

XLM-RoBERTa

The following XLM-RoBERTa models can be used for multilingual tasks:

  • xlm-roberta-base (masked language modeling, 100 languages)
  • xlm-roberta-large (masked language modeling, 100 languages)

XLM-RoBERTa was trained on 2.5TB of newly created and cleaned CommonCrawl data in 100 languages. It provides strong gains over previously released multilingual models like mBERT or XLM on downstream tasks such as classification, sequence labeling, and question answering.

M2M100

The following M2M100 models can be used for multilingual translation:

  • facebook/m2m100_418M (translation)
  • facebook/m2m100_1.2B (translation)

In this example, load the facebook/m2m100_418M checkpoint to translate Chinese to English. You can set the source language in the tokenizer:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒."

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

MBart

The following MBart models can be used for multilingual translation:

  • facebook/mbart-large-50-one-to-many-mmt (one-to-many multilingual machine translation, 50 languages)
  • facebook/mbart-large-50-many-to-many-mmt (many-to-many multilingual machine translation, 50 languages)
  • facebook/mbart-large-50-many-to-one-mmt (many-to-one multilingual machine translation, 50 languages)
  • facebook/mbart-large-50 (multilingual translation, 50 languages)
  • facebook/mbart-large-cc25

In this example, load the facebook/mbart-large-50-many-to-many-mmt checkpoint to translate Finnish to English. You can set the source language in the tokenizer:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia."

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

encoded_fi = tokenizer(fi_text, return_tensors="pt")

generated_tokens = model.generate(**encoded_fi, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

Datasets

Official documentation: Datasets (huggingface.co)

A dataset downloaded with 🤗 Datasets is generally a DatasetDict object, which is split into separate Dataset objects for the train, validation, and test sets. Each Dataset object has a number of feature fields, and these fields make up the dictionary of that Dataset. Building a Dataset class is similar to building a torch Dataset, except that __getitem__ returns a dictionary containing both the training inputs and the label, accessed by key.
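
A small sketch of that structure, using the GLUE/MRPC dataset that also appears in the Quick Start below:

from datasets import load_dataset

datasets = load_dataset('glue', 'mrpc')   # a DatasetDict keyed by split name
print(datasets.keys())                    # dict_keys(['train', 'validation', 'test'])
train_ds = datasets['train']              # a Dataset
print(train_ds[0])                        # a dict mapping feature names to values for one row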

🤗 Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for natural language processing (NLP), computer vision, and audio tasks.

Load a dataset in a single line of code, and use the powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads and without any memory constraints for optimal speed and efficiency. It is also deeply integrated with the Hugging Face Hub, making it easy to load and share datasets with the wider NLP community. There are currently more than 2658 datasets and more than 34 metrics available.

Quick Start

Load a dataset and a model

Begin by loading the Microsoft Research Paraphrase Corpus (MRPC) training dataset from the General Language Understanding Evaluation (GLUE) benchmark. MRPC is a corpus of human-annotated sentence pairs used to train a model to determine whether the pairs of sentences are semantically equivalent.

from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc', split='train')

Load the model and its associated tokenizer:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Tokenize the dataset

The next step is to tokenize the text in order to build sequences of integers the model can understand. Encode the entire dataset with Dataset.map(), and truncate and pad the inputs to the maximum length of the model. This ensures the appropriate tensor batches are built.

def encode(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length')

dataset = dataset.map(encode, batched=True)
dataset[0]
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([  101,  7277,  2180,  5303,  4806,  1117,  1711,   117,  2292, 1119,  1270,   107,  1103,  7737,   107,   117,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102, 11336,  6732, 3384,  1106,  1140,  1112,  1178,   107,  1103,  7737,   107, 117,  7277,  2180,  5303,  4806,  1117,  1711,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102]),
'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}

Note: the dataset now has three new columns: input_ids, token_type_ids, and attention_mask.

Format the dataset

The dataset needs to be formatted accordingly. Three changes are required:

  1. Rename the label column to labels, which is the expected input name in BertForSequenceClassification:
dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
  2. Retrieve the actual tensors from the Dataset object instead of the current Python objects.
  3. Filter the dataset to only return the model inputs: input_ids, token_type_ids, and attention_mask.

Dataset.set_format() completes the last two steps on-the-fly. After the format is set, wrap the dataset in a torch.utils.data.DataLoader:

import torch
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
next(iter(dataloader))
{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
                        [1, 1, 1,  ..., 0, 0, 0],
                        [1, 1, 1,  ..., 0, 0, 0],
                        ...,
                        [1, 1, 1,  ..., 0, 0, 0],
                        [1, 1, 1,  ..., 0, 0, 0],
                        [1, 1, 1,  ..., 0, 0, 0]]),
'input_ids': tensor([[  101,  7277,  2180,  ...,     0,     0,     0],
                  [  101, 10684,  2599,  ...,     0,     0,     0],
                  [  101,  1220,  1125,  ...,     0,     0,     0],
                  ...,
                  [  101, 16944,  1107,  ...,     0,     0,     0],
                  [  101,  1109, 11896,  ...,     0,     0,     0],
                  [  101,  1109,  4173,  ...,     0,     0,     0]]),
'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]),
'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
                     [0, 0, 0,  ..., 0, 0, 0],
                     [0, 0, 0,  ..., 0, 0, 0],
                     ...,
                     [0, 0, 0,  ..., 0, 0, 0],
                     [0, 0, 0,  ..., 0, 0, 0],
                     [0, 0, 0,  ..., 0, 0, 0]])}

Train the model

Finally, create a simple training loop and start training:

from tqdm import tqdm
device = 'cuda' if torch.cuda.is_available() else 'cpu' 
model.train().to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
for epoch in range(3):
    for i, batch in enumerate(tqdm(dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 10 == 0:
            print(f"loss: {loss}")

Installation

pip

The most straightforward way to install 🤗 Datasets is with pip:

pip install datasets

conda

🤗 Datasets can also be installed with conda, a package management system:

conda install -c huggingface -c conda-forge datasets

Hugging Face Hub

Load a dataset

Before taking the time to download a dataset, it is often helpful to quickly get all the relevant information about it. The load_dataset_builder() method allows you to inspect the attributes of a dataset without downloading it.

from datasets import load_dataset_builder
dataset_builder = load_dataset_builder('imdb')
print(dataset_builder.cache_dir)
/Users/thomwolf/.cache/huggingface/datasets/imdb/plain_text/1.0.0/fdc76b18d5506f14b0646729b8d371880ef1bc48a26d00835a7f3da44004b676
print(dataset_builder.info.features)
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
print(dataset_builder.info.splits)
{'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')}

Once you are happy with the dataset you want, load it in a single line with load_dataset():

from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc', split='train')

Select a configuration

Some datasets, like the General Language Understanding Evaluation (GLUE) benchmark, are actually made up of several datasets. These sub-datasets are called configurations, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, 🤗 Datasets will raise a ValueError and remind you to select a configuration. Use the get_dataset_config_names() function to retrieve a list of all the possible configurations available to a dataset:

from datasets import get_dataset_config_names

configs = get_dataset_config_names("glue")
print(configs)
# ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']

Incorrect way to load a configuration:

from datasets import load_dataset
dataset = load_dataset('glue')

ValueError: Config name is missing.
Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
Example of usage:
        *load_dataset('glue', 'cola')*

Correct way to load a configuration:

dataset = load_dataset('glue', 'sst2')
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0...
Downloading: 100%|██████████████████████████████████████████████████████████████| 7.44M/7.44M [00:01<00:00, 7.03MB/s]
Dataset glue downloaded and prepared to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data.
print(dataset)
{'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 67349),
    'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 872),
    'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 1821)
}

Select a split

A split is a specific subset of a dataset, like train and test. Make sure to select a split when you load a dataset. If you don't supply a split argument, 🤗 Datasets will return a dictionary containing all the subsets of the dataset.

from datasets import load_dataset
datasets = load_dataset('glue', 'mrpc')
print(datasets)
{train: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 3668
})
validation: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 408
})
test: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 1725
})
}

You can list the split names for a dataset, or a specific configuration, with the get_dataset_split_names() method:

from datasets import get_dataset_split_names
get_dataset_split_names('sent_comp')
['validation', 'train']
get_dataset_split_names('glue', 'cola')
['test', 'train', 'validation']

The Dataset object

This section gets you familiar with the Dataset object. Learn about the metadata stored inside a Dataset object, and the basics of querying a Dataset object to return rows and columns.

A Dataset object is returned when you load an instance of a dataset. This object behaves like a normal Python container.

from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc', split='train')

Metadata

The Dataset object contains a lot of useful information about your dataset. For example, access DatasetInfo to return a short description of the dataset, the authors, and even the dataset size. This gives you a quick snapshot of the dataset's most important attributes.

dataset.info
DatasetInfo(
    description='GLUE, the General Language Understanding Evaluation benchmark\\n(<https://gluebenchmark.com/>) is a collection of resources for training,\\nevaluating, and analyzing natural language understanding systems.\\n\\n', 
    citation='@inproceedings{dolan2005automatically,\\n  title={Automatically constructing a corpus of sentential paraphrases},\\n  author={Dolan, William B and Brockett, Chris},\\n  booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\\n  year={2005}\\n}\\n@inproceedings{wang2019glue,\\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\\n  note={In the Proceedings of ICLR.},\\n  year={2019}\\n}\\n', homepage='<https://www.microsoft.com/en-us/download/details.aspx?id=52398>', 
    license='', 
    features={'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}, post_processed=None, supervised_keys=None, builder_name='glue', config_name='mrpc', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=943851, num_examples=3668, dataset_name='glue'), 'validation': SplitInfo(name='validation', num_bytes=105887, num_examples=408, dataset_name='glue'), 'test': SplitInfo(name='test', num_bytes=442418, num_examples=1725, dataset_name='glue')}, 
    download_checksums={'<https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv>': {'num_bytes': 6222, 'checksum': '971d7767d81b997fd9060ade0ec23c4fc31cbb226a55d1bd4a1bac474eb81dc7'}, '<https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt>': {'num_bytes': 1047044, 'checksum': '60a9b09084528f0673eedee2b69cb941920f0b8cd0eeccefc464a98768457f89'}, '<https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt>': {'num_bytes': 441275, 'checksum': 'a04e271090879aaba6423d65b94950c089298587d9c084bf9cd7439bd785f784'}}, 
    download_size=1494541, 
    post_processing_size=None, 
    dataset_size=1492156, 
    size_in_bytes=2986697
)

Specific attributes of the dataset can also be requested directly:

dataset.split
NamedSplit('train')
dataset.description
'GLUE, the General Language Understanding Evaluation benchmark\\n(<https://gluebenchmark.com/>) is a collection of resources for training,\\nevaluating, and analyzing natural language understanding systems.\\n\\n'
dataset.citation
'@inproceedings{dolan2005automatically,\\n  title={Automatically constructing a corpus of sentential paraphrases},\\n  author={Dolan, William B and Brockett, Chris},\\n  booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\\n  year={2005}\\n}\\n@inproceedings{wang2019glue,\\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\\n  note={In the Proceedings of ICLR.},\\n  year={2019}\\n}\\n\\nNote that each GLUE dataset has its own citation. Please see the source to see\\nthe correct citation for each contained dataset.'
dataset.homepage
'<https://www.microsoft.com/en-us/download/details.aspx?id=52398>'

Features and columns

A dataset is a table of rows and typed columns. Querying a dataset returns a Python dictionary, where the keys correspond to the column names and the values correspond to the column values:

dataset[0]
{'idx': 0,
'label': 1,
'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

Return the number of rows and columns with the following standard attributes:

dataset.shape
(3668, 4)
dataset.num_columns
4
dataset.num_rows
3668
len(dataset)
3668

List the column names with Dataset.column_names:

dataset.column_names
['idx', 'label', 'sentence1', 'sentence2']

Get detailed information about the columns with Dataset.features:

dataset.features
{'idx': Value(dtype='int32', id=None),
    'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
    'sentence1': Value(dtype='string', id=None),
    'sentence2': Value(dtype='string', id=None),
}

Return even more specific information about a feature, like ClassLabel:

dataset.features['label'].num_classes
2
dataset.features['label'].names
['not_equivalent', 'equivalent']
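
A ClassLabel feature can also convert between label ids and label names, for instance:

dataset.features['label'].int2str(1)
'equivalent'
dataset.features['label'].str2int('not_equivalent')
0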

Rows, slices, batches, and columns

Get several rows of your dataset at a time with slice notation or a list of indices:

dataset[:3]
{'idx': [0, 1, 2],
    'label': [1, 0, 1],
    'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],
    'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."]
}
dataset[[1, 3, 5]]
{'idx': [1, 3, 5],
    'label': [0, 0, 1], 
    'sentence1': ["Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .'],
    'sentence2': ["Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier ."]
}

Querying by the column name returns its values. For example, if you only want the first three examples:

dataset['sentence1'][:3]
['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .']

Depending on how you query the Dataset object, the format returned will be different:

  • A single row like dataset[0] returns a Python dictionary of values.
  • A batch like dataset[5:10] returns a Python dictionary of lists of values.
  • A column like dataset['sentence1'] returns a Python list of values.

Train with 🤗 Datasets

Tokenize a dataset and use it with a framework like PyTorch or TensorFlow. By default, all the dataset columns are returned as Python objects. But you can bridge the gap between a Python object and a machine learning framework by setting the format of a dataset. Formatting casts the columns into compatible PyTorch or TensorFlow types.

Typically, you will want to modify the structure and content of your dataset before using it to train a model. For example, you may want to remove a column or cast it as a different type. 🤗 Datasets provides the necessary tools for this, but since every dataset is different, the processing approach will vary from dataset to dataset.

Tokenize

Tokenization divides text into individual words called tokens. Tokens are converted into numbers, which is what the model receives as its input.

First, install 🤗 Transformers:

pip install transformers

Next, import the tokenizer. It is important to use the tokenizer associated with the model you are using so the text is split in the same way. In this example, load the BERT tokenizer because you are using the BERT model:

from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

Now you can tokenize the sentence1 field of the dataset:

encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1']), batched=True)
encoded_dataset.column_names
['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask']
encoded_dataset[0]
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': [  101,  7277,  2180,  5303,  4806,  1117,  1711,   117,  2292, 1119,  1270,   107,  1103,  7737,   107,   117,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}

The tokenization process creates three new columns: input_ids, token_type_ids, and attention_mask. These are the inputs to the model.

PyTorch

If you are using PyTorch, set the format with Dataset.set_format(), which accepts two main arguments:

  1. type defines the type of column to cast to. For example, torch returns PyTorch tensors.
  2. columns specifies which columns should be formatted.

After the format is set, wrap the dataset in torch.utils.data.DataLoader. Your dataset is now ready for use in a training loop!

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset('glue', 'mrpc', split='train')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
next(iter(dataloader))
{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
                         ...,
                         [1, 1, 1,  ..., 0, 0, 0]]),
'input_ids': tensor([[  101,  7277,  2180,  ...,     0,     0,     0],
                    ...,
                    [  101,  1109,  4173,  ...,     0,     0,     0]]),
'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]),
'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
                         ...,
                         [0, 0, 0,  ..., 0, 0, 0]])}

Evaluate predictions

🤗 Datasets provides various common and NLP-specific metrics for you to measure your model's performance. In this section of the tutorial, you will load a metric and use it to evaluate your model's predictions.

You can see what metrics are available with list_metrics():

from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list)
28
print(metrics_list)
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor', 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli']

Load a metric

It is very easy to load a metric with 🤗 Datasets. In fact, you will notice that it is very similar to loading a dataset! Load a metric from the Hub with load_metric():

from datasets import load_metric
metric = load_metric('glue', 'mrpc')

This will load the metric associated with the MRPC dataset from the GLUE benchmark.

Select a configuration

If you are using a benchmark dataset, you need to select the metric associated with the configuration you are using. Select a metric configuration by providing the configuration name:

metric = load_metric('glue', 'mrpc')

The Metric object

Before you begin using a Metric object, you should get to know it a little better. As with a dataset, you can return some basic information about a metric. For example, access the inputs_description parameter in MetricInfo:

print(metric.inputs_description)
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:
    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}
    ...
    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}
    ...

Notice that for the MRPC configuration, the metric expects the input format to be zero or one. For a complete list of attributes you can return with your metric, take a look at MetricInfo.

Compute the metric

Once you have loaded a metric, you are ready to use it to evaluate a model's predictions. Provide the model predictions and references to compute():

model_predictions = model(model_inputs)
final_score = metric.compute(predictions=model_predictions, references=gold_references)
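
A toy example with hard-coded predictions and references, to show the expected input and output format:

from datasets import load_metric
metric = load_metric('glue', 'mrpc')
metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
{'accuracy': 0.75, 'f1': 0.6666666666666666}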

Fine-tuning practice

NER

  • Basic fine-tuning workflow for the PKU Chinese word segmentation dataset:
import torch
from torch.utils.data import Dataset, DataLoader, random_split
from data_process import DataGenerator
from transformers import AutoTokenizer, BertTokenizer, BertModel, AutoModelForTokenClassification, DataCollatorForTokenClassification
from transformers import TrainingArguments, Trainer

torch.manual_seed(0)

# Load the pretrained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("/nfs/volume-1280-3/rushin/work/models/hfl/chinese-roberta-wwm-ext")
model = AutoModelForTokenClassification.from_pretrained("/nfs/volume-1280-3/rushin/work/models/hfl/chinese-roberta-wwm-ext", num_labels=4)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Load data: a list of texts (corpus) and a list of label lists (labels)
data_path = "../seg-data/training/pku_training.utf8"
generator = DataGenerator(data_path)
corpus, labels = generator.generate_train_data()

# Split into training and validation sets
l = len(corpus)
train_size = int(l*0.8)

train_corpus = corpus[:train_size]
valid_corpus = corpus[train_size:]
train_labels = labels[:train_size]
valid_labels = labels[train_size:]

# Build the train/validation encodings: the dict returned by the tokenizer plus a labels field
def tokenize_and_align_labels(corpus, corpus_labels):
    tokenized_inputs = tokenizer(corpus, padding=True, truncation=True, max_length=512)
    labels = []
    for i, label in enumerate(corpus_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

train_encoding = tokenize_and_align_labels(train_corpus, train_labels)
valid_encoding = tokenize_and_align_labels(valid_corpus, valid_labels)

# Define the Dataset class
class TokenDataset(Dataset):
    def __init__(self, encoding):
        super().__init__()
        self.encoding = encoding

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encoding.items()}
        return item

    def __len__(self):
        return len(self.encoding["labels"]) 

# Create the training and validation dataset objects
train_data = TokenDataset(train_encoding)
valid_data = TokenDataset(valid_encoding)

# Set the training arguments
training_args = TrainingArguments(
    output_dir="./results/chinese-roberta-wwm-ext",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=300,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training
trainer.train()
  • Summary of the fine-tuning process:
    • First, determine which fields the dataset needs for the task and build a task-specific dataset class; this is the key step.
      • Analyze the dataset's fields and tokenize the text, turning tokens into input_ids.
      • Generate the labels field the model needs from the original data.
    • Once the basic data is built, set the training arguments and train with the standard transformers template, as sketched below.
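
As a final illustration, a minimal, hypothetical inference sketch for the fine-tuned segmentation model above. The id-to-tag mapping is an assumption about how DataGenerator encodes its four labels, so treat this as a template rather than a drop-in script:

# Assumed mapping from label id to segmentation tag; adjust to match data_process.DataGenerator.
id2tag = {0: "B", 1: "M", 2: "E", 3: "S"}

text = "今天天气很好"
inputs = tokenizer(list(text), is_split_into_words=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = inputs.word_ids(0)
tags = [id2tag[p] for p, w in zip(pred_ids, word_ids) if w is not None]
print(list(zip(text, tags)))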

Author: 杰克成
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 杰克成 when reposting!