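GPT2 Explained: From Zero to Code in Practice

This post is mainly compiled from the book 《从零构建大模型》 (Build a Large Language Model (From Scratch)). It follows the book's implementation wherever possible and adds my own understanding.
It mainly covers the part of the book that implements GPT2.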

TIP

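Prerequisites: to follow this post you should understand the Transformer model, know the difference between encoder-only and decoder-only models, and be familiar with the PyTorch framework and Python.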

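1. A first look at GPT2

GPT stands for Generative Pre-trained Transformer. It is a family of Transformer-based language models widely used in natural language processing.
The core idea of GPT is to pre-train on a large text corpus so that the model learns the regularities of language and can generate coherent, natural text.
GPT is widely used for text writing, dialogue generation, question answering, and similar tasks.
GPT2 was released in 2019 and is a deep learning model with a decoder-only architecture.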

2. Key components of GPT2

This section walks through each key component of GPT2.

2.1 Importing the libraries:

python

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import tiktoken

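The tiktoken library: in these notes it is used only to load the GPT2 tokenizer, so its usage is not covered in detail.
- A fast tokenizer library developed by OpenAI, with its core implemented in Rust
- Uses the BPE (byte pair encoding) tokenization algorithm
- Main functions: encoding (text -> token ids), decoding (token ids -> text), and counting tokens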

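2.2 LayerNorm: normalization layer

2.2.1 Understanding normalization

Layer normalization (layer norm) is a normalization technique, proposed in 2016, that stabilizes the training of deep neural networks.
The normalization step rescales the data to have mean 0 and variance 1.
Why normalization is needed: during training, a neural network can suffer from vanishing or exploding gradients, unstable training, and sensitivity to the learning rate, all of which hurt training.
What normalization does: it speeds up convergence, allows larger learning rates, reduces dependence on initialization, and provides a slight regularization effect.
Layer norm: normalizes over the feature dimension of each token, independently of the other samples in the batch.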

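2.2.2 LayerNorm computation:

Overview: for an input x, subtract the mean, divide by the standard deviation (the square root of the variance), then scale and shift. Here H is the number of features (the embedding dimension), ε is a small constant that prevents division by zero, and γ and β are learnable parameters.

1. Compute the mean:
   μ = (1/H) * Σ(x_i)

2. Compute the variance:
   σ² = (1/H) * Σ((x_i - μ)²)

3. Normalize:
   x̂ = (x - μ) / √(σ² + ε)

4. Scale and shift: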
   y = γ * x̂ + β
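The four steps above can be traced by hand on a single feature vector. The sketch below uses plain Python rather than PyTorch so that every arithmetic step is visible; the function name `layer_norm` and the sample vector are illustrative assumptions, not part of the book's code.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector x (a list of floats) following steps 1-4."""
    h = len(x)
    mean = sum(x) / h                                         # step 1: mean
    var = sum((v - mean) ** 2 for v in x) / h                 # step 2: biased variance (divide by H)
    x_hat = [(v - mean) / math.sqrt(var + eps) for v in x]    # step 3: normalize
    return [g * v + b for g, v, b in zip(gamma, x_hat, beta)]  # step 4: scale and shift

x = [2.0, 4.0, 6.0, 8.0]                    # hypothetical 4-dimensional token embedding
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 4) for v in y])
```

With γ = 1 and β = 0 the output has mean 0 and variance very close to 1; the small gap from exactly 1 comes from ε.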

2.2.3 LayerNorm code implementation:

python

class LayerNorm(nn.Module):
  def __init__(self, emb_dim):
    super().__init__()
    self.eps = 1e-5
    self.scale = nn.Parameter(torch.ones(emb_dim))
    self.shift = nn.Parameter(torch.zeros(emb_dim))

  def forward(self, x):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)

    norm_x = (x - mean) / torch.sqrt(var + self.eps)
    res = self.scale * norm_x + self.shift

    return res

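2.2.4 Pre-LN and Post-LN:

Pre-LN: normalization is applied before the attention (or feed-forward) sublayer and its dropout, inside the residual branch, so the residual connection adds the un-normalized input back. This is the approach used by modern large models, and training is more stable.
Post-LN: normalization is applied after the sublayer output has been added to the residual connection. This is the approach of earlier Transformer models, and it often requires learning-rate warm-up.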
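The difference between the two orderings is easiest to see as function composition. The sketch below is only a schematic: `norm` and `sublayer` are hypothetical scalar stand-ins (not real LayerNorm or attention), chosen so the two results can be checked by hand.

```python
def pre_ln_step(x, sublayer, norm):
    # Pre-LN: normalize first, run the sublayer, then add the untouched input back
    return x + sublayer(norm(x))

def post_ln_step(x, sublayer, norm):
    # Post-LN: run the sublayer, add the residual, then normalize the sum
    return norm(x + sublayer(x))

norm = lambda v: v / 10.0       # hypothetical stand-in for a normalization layer
sublayer = lambda v: 2.0 * v    # hypothetical stand-in for attention or feed-forward

x = 5.0
print(pre_ln_step(x, sublayer, norm))    # 5.0 + 2.0 * 0.5
print(post_ln_step(x, sublayer, norm))   # (5.0 + 10.0) / 10.0
```

In the Pre-LN form the input x reaches the output through a plain addition, which is one common explanation for why gradients flow more easily through deep stacks of such blocks.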

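2.2.5 Other questions

Why layer norm rather than batch norm: in LLM training the sequence length varies, which batch norm handles poorly; layer norm does not depend on the batch size, so it is more stable and behaves the same during training and inference.
When to use layer norm: Transformer models, RNN/LSTM networks, training with small batch sizes, and tasks with variable sequence lengths.
When to use batch norm: image classification and other CV tasks where the batch size is large and stable.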

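2.3 GELU activation function

An activation function is a non-linear transformation in a neural network; it introduces non-linearity so that the model can learn complex non-linear relationships.
Common activation functions: Sigmoid, Tanh, ReLU, and others.
Advantages of GELU:
- Smoothness: differentiable everywhere (unlike ReLU, which has a kink at 0), with smoothly varying gradients, which helps optimization
- Non-monotonicity: slightly non-monotonic in the negative region, which helps the model learn complex patterns

Mathematical form: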

GELU(x) = x * Φ(x)

where Φ(x) is the cumulative distribution function (CDF) of the standard normal distribution:

Φ(x) = ∫_{-∞}^{x} (1/√(2π)) * e^(-t²/2) dt
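The definition above can be evaluated directly, because the standard normal CDF can be written with the error function: Φ(x) = 0.5 * (1 + erf(x / √2)). The following is a minimal plain-Python sketch; the names `phi` and `gelu_exact` are illustrative.

```python
import math

def phi(x):
    # Standard normal CDF expressed through the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_exact(x):
    # GELU(x) = x * Phi(x)
    return x * phi(x)

for x in (-2.0, -0.75, 0.0, 1.0):
    print(x, round(gelu_exact(x), 4))
```

The printed values show the non-monotonic dip mentioned above: the output at x = -0.75 is lower than at x = -2.0, even though -0.75 is the larger input.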

2.3.1 Code implementation

python

# Implemented with the tanh approximation
class GELU(nn.Module):
  def __init__(self):
    super().__init__()

  def forward(self, x):
    return 0.5 * x * (1 + torch.tanh(
      torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * torch.pow(x, 3))
    ))

2.3.2 PyTorch implementation

python

import torch
import torch.nn as nn
import torch.nn.functional as F

# Option 1: use the nn.GELU() module (recommended)
gelu = nn.GELU()
x = torch.randn(32, 128, 768)
y = gelu(x)

# Option 2: use the F.gelu() function
y = F.gelu(x)                        # exact implementation (default)
y = F.gelu(x, approximate='none')    # exact implementation
y = F.gelu(x, approximate='tanh')    # tanh approximation

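2.3.3 Comparison of GELU approximations

Approximation	Speed	Accuracy	Use case
Exact	Slow	Highest	Research and high-precision settings
Tanh approximation	Medium	Good	Production environments
Sigmoid approximation	Fast	Medium	Compute-constrained settings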
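The accuracy column can be checked numerically. The sketch below compares both approximations against the exact form on a grid over [-5, 5]; it assumes the commonly cited sigmoid form x * sigmoid(1.702 * x), and the function names and grid are illustrative choices.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Same tanh approximation as the GELU class in section 2.3.1
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    # Assumed sigmoid approximation: x * sigmoid(1.702 * x)
    return x / (1.0 + math.exp(-1.702 * x))

grid = [i / 100.0 for i in range(-500, 501)]
err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)
print(f"max |tanh - exact|    = {err_tanh:.5f}")
print(f"max |sigmoid - exact| = {err_sigmoid:.5f}")
```

On this grid the tanh approximation stays much closer to the exact function than the sigmoid one, which matches the ordering in the table. Speed is not measured here.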

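2.4 Multi-head attention

2.4.1 Understanding attention and multi-head attention

When we read the sentence "The animal didn't cross the street because it was too tired", we know that "it" refers to "the animal"; we associate the two.
This is attention: it lets the model focus on the important parts of a sequence, dynamically compute how strongly different positions relate to each other, and aggregate information weighted by those relations.

2.4.2 Key concepts

Attention computation: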

Attention(Q, K, V) = softmax(QK^T / √d_k) V

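Q: query, what the current position is looking for
K: key, the index that queries are matched against
V: value, the actual information content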

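Computation steps:

1. Apply three linear transformations to the input sequence X to obtain Q, K, and V:
	Q = W_q(X)
	K = W_k(X)
	V = W_v(X)
2. Measure the similarity between Q and K by taking their dot product and scaling it (dividing by √d_k), which gives the attention scores

3. Apply softmax to the attention scores to obtain the attention weights

4. Multiply the attention weights by V to aggregate the information as a weighted sum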
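Steps 2 to 4 can be followed on a tiny example. The sketch below is plain Python on nested lists, skips the linear projections of step 1 by supplying Q, K, and V directly, and applies no causal mask; the helper names and the toy matrices are assumptions made for illustration.

```python
import math

def softmax(row):
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    # Step 2: scaled dot product between every query and every key
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K] for q in Q]
    # Step 3: softmax over each row gives the attention weights
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted sum of the value vectors
    d_v = len(V[0])
    out = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)] for w_row in weights]
    return weights, out

Q = [[1.0, 0.0], [0.0, 1.0]]      # two tokens, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
weights, out = attention(Q, K, V)
print(weights)
print(out)
```

Each query matches its own key best, so each output row leans toward the corresponding value vector while still mixing in the other one.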

TIP

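Multi-head attention: after Q, K, and V are computed, the d_out dimension is split into num_heads chunks of size head_dim = d_out / num_heads, one chunk per head.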

2.4.3 Code implementation

python

# Multi-head attention
class MultiHeadAttention(nn.Module):
  def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
    super().__init__()
    # Check that d_out is divisible by num_heads
    assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

    self.d_out = d_out
    self.num_heads = num_heads
    self.head_dim = d_out // num_heads

    self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    # out_proj: linear layer that projects the concatenated multi-head output
    self.out_proj = nn.Linear(d_out, d_out)

    self.dropout = nn.Dropout(dropout)

    # Mask: 1 strictly above the diagonal, 0 elsewhere (the diagonal itself is 0)
    self.register_buffer("mask",
                         torch.triu(torch.ones(context_length, context_length), diagonal=1))

  def forward(self, x):
    b, num_tokens, d_in = x.shape

    # Compute q, k, v; shape: [batch, num_tokens, d_out]
    queries = self.W_query(x)
    keys = self.W_key(x)
    values = self.W_value(x)

    # Split the d_out dimension into heads; shape: [batch, num_tokens, num_heads, head_dim]
    queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
    keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
    values = values.view(b, num_tokens, self.num_heads, self.head_dim)

    # Swap dimensions; shape becomes [batch, num_heads, num_tokens, head_dim]
    queries = queries.transpose(1, 2)
    keys = keys.transpose(1, 2)
    values = values.transpose(1, 2)

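    # Attention scores: keys.transpose(2, 3) transposes the last two dimensions of keys
    attn_scores = queries @ keys.transpose(2, 3)

    # Convert self.mask to bool
    # Take only [:num_tokens, :num_tokens], because num_tokens may be smaller than context_length
    # context_length is the maximum number of tokens; num_tokens is the actual number
    mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

    # Fill the masked (upper-triangular) positions of the scores with negative infinity
    attn_scores.masked_fill_(mask_bool, -torch.inf)

    # Scale attn_scores, then apply softmax
    # keys.shape[-1]**0.5 is the square root of d_k (here head_dim)
    # The higher the dimension, the larger the variance of the dot products, hence the scaling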
    attn_weights = torch.softmax(
      attn_scores / keys.shape[-1]**0.5, dim=-1
    )

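    # Apply dropout to the attention weights
    # During training, dropout rescales the surviving values by 1 / (1 - p)
    attn_weights = self.dropout(attn_weights)

    # Compute the context vectors
    # attention weights @ V gives the context vectors; then swap back to [batch, num_tokens, num_heads, head_dim]
    context_vec = (attn_weights @ values).transpose(1, 2)

    # Merge the heads
    # contiguous(): copies the tensor into contiguous memory, which view() needs after transpose()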
    context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)

    context_vec = self.out_proj(context_vec)

    return context_vec
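The effect of the causal mask in the forward pass above can be reproduced on a small score matrix. This is a plain-Python sketch of the masking-then-softmax step only (scaling and dropout are left out); the function name and the scores are made up for illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Positions j > i are future tokens: replace their score with -inf
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)                              # finite, since the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]     # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [
    [0.5, 1.0, 2.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second splits its weight over the first two positions, and only the last row spreads weight across all three.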

2.5 Feed-forward layer

python

# Feed-forward (fully connected) layers
# Dimension changes:
#	1. emb_dim --> 4*emb_dim
#	2. 4*emb_dim --> emb_dim
class FeedForward(nn.Module):
  def __init__(self, cfg):
    super().__init__()

    self.layers = nn.Sequential(
      nn.Linear(cfg["emb_dim"], 4*cfg["emb_dim"]),
      GELU(),
      nn.Linear(4*cfg["emb_dim"], cfg["emb_dim"]),
    )

  def forward(self, x):
    return self.layers(x)

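2.6 Transformer block

GPT is a decoder-only model.
GPT2 (the 124M-parameter version implemented here) stacks the Transformer decoder block 12 times.

INFO

The decoder block is fairly simple to implement. With some PyTorch experience and a basic understanding of the Transformer it is easy to follow, so the code is given directly without further discussion; it can be read alongside the GPT2 architecture diagram.

2.6.1 Transformer block code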

python

class TransformerBlock(nn.Module):
  def __init__(self, cfg):
    super().__init__()
    # Attention
    self.att = MultiHeadAttention(
      d_in=cfg["emb_dim"],
      d_out=cfg["emb_dim"],
      context_length=cfg["context_length"],
      num_heads=cfg["n_heads"],
      dropout=cfg["drop_rate"],
      qkv_bias=cfg["qkv_bias"]
    )

    self.ff = FeedForward(cfg)
    # Two normalization layers
    self.norm1 = LayerNorm(cfg["emb_dim"])
    self.norm2 = LayerNorm(cfg["emb_dim"])
    self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

  def forward(self, x):
    # Pre-LN: normalize first; this is more stable, even early in training, so warm-up can be skipped
    shortcut = x
    x = self.norm1(x)
    x = self.att(x)
    x = self.drop_shortcut(x)
    x = x + shortcut

    shortcut = x
    x = self.norm2(x)
    x = self.ff(x)
    x = self.drop_shortcut(x)
    x = x + shortcut

    return x

2.7 GPT2 model implementation

Again, the code is given without further commentary.

python

# GPT2
class GPTModel(nn.Module):
  def __init__(self, cfg):
    super().__init__()
    self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
    self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
    self.drop_emb = nn.Dropout(cfg["drop_rate"])

    self.trf_blocks = nn.Sequential(
      # Build a list of n_layers blocks
      # The * unpacks the list so that its elements are passed to Sequential as separate arguments
      *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
    )

    self.final_norm = LayerNorm(cfg["emb_dim"])

    self.out_head = nn.Linear(
      cfg["emb_dim"], cfg["vocab_size"], bias=False
    )

  def forward(self, in_idx):
    batch_size, seq_len = in_idx.shape
    tok_embeds = self.tok_emb(in_idx)
    pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
    x = tok_embeds + pos_embeds
    x = self.drop_emb(x)
    x = self.trf_blocks(x)
    x = self.final_norm(x)
    logits = self.out_head(x)
    return logits
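As a sanity check on the architecture, the number of trainable parameters of GPTModel can be worked out by hand from the layer shapes. The sketch below does this in plain Python, assuming the values from GPT_CONFIG_124M in section 4.1 (vocab_size 50257, context_length 256, emb_dim 768, 12 layers, no qkv bias); the function name `count_gpt_params` is illustrative.

```python
def count_gpt_params(vocab_size, context_length, emb_dim, n_layers, qkv_bias=False):
    emb = vocab_size * emb_dim + context_length * emb_dim            # tok_emb + pos_emb
    qkv = 3 * (emb_dim * emb_dim + (emb_dim if qkv_bias else 0))     # W_query, W_key, W_value
    out_proj = emb_dim * emb_dim + emb_dim                           # nn.Linear with bias
    ff = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
    norms = 2 * (2 * emb_dim)                                        # scale + shift for norm1 and norm2
    block = qkv + out_proj + ff + norms
    final_norm = 2 * emb_dim
    out_head = emb_dim * vocab_size                                  # bias=False
    return emb + n_layers * block + final_norm + out_head

total = count_gpt_params(50257, 256, 768, 12)
tied = total - 768 * 50257    # if out_head reused the tok_emb matrix (weight tying)
print(f"{total:,} parameters, {tied:,} with tied output weights")
```

The count without weight tying is well above 124M because out_head and tok_emb are separate matrices in this implementation; subtracting one of them gives a figure close to the 124M in the configuration name. Dropout layers and the registered mask buffer contribute no trainable parameters.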

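3. Helper modules

INFO

The helper modules load the dataset, build the dataloader, and compute the loss. This post focuses on the GPT architecture and its core modules, so the rest is not covered here; see the book for a more detailed explanation.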

python

# -------------------------------------------
class GPTDatasetV1(Dataset):
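  # max_length: length of each training sample in tokens, i.e. how many tokens the model sees at once
  # (also called the "context window" or "sequence length"); larger values need more GPU memory but capture longer dependencies
  # stride: step size of the sliding window; if stride = max_length the windows do not overlap
  # If stride < max_length the windows overlap, which produces more training samples

  def __init__(self, txt, tokenizer, max_length, stride):
    # input_ids: the model input x
    # target_ids: the model target y
    self.input_ids = []
    self.target_ids = []
    # Convert txt to token ids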
    token_ids = tokenizer.encode(txt)

    for i in range(0, len(token_ids)-max_length, stride):
      input_chunk = token_ids[i:i+max_length]
      target_chunk = token_ids[i+1:i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]

# -------------------------------------------
def create_dataloader_v1(txt, batch_size=4,
                        max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
  tokenizer = tiktoken.get_encoding('gpt2')
  dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

  dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=shuffle,
    drop_last=drop_last,
    num_workers=num_workers
  )

  return dataloader

# -------------------------------------------
# Loss for a single batch
# cross_entropy expects raw logits, so do not apply softmax yourself
def calc_loss_batch(input_batch, target_batch, model, device):
  input_batch = input_batch.to(device)
  target_batch = target_batch.to(device)
  logits = model(input_batch)
  loss = torch.nn.functional.cross_entropy(
    logits.flatten(0, 1),
    target_batch.flatten()
  )
  return loss

# -------------------------------------------
def calc_loss_loader(data_loader, model, device, num_batches=None):
  total_loss = 0.

  if len(data_loader) == 0:
    return float("nan")
  elif num_batches is None:
    num_batches = len(data_loader)
  else:
    num_batches = min(num_batches, len(data_loader))

  for i, (input_batch, target_batch) in enumerate(data_loader):
    if i < num_batches:
      loss = calc_loss_batch(input_batch, target_batch, model, device)
      total_loss += loss.item()
    else:
      break
  return total_loss / num_batches

# -------------------------------------------
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
  model.eval()
  with torch.no_grad():
    train_loss = calc_loss_loader(
      train_loader, model, device, num_batches=eval_iter
    )

    val_loss = calc_loss_loader(
      val_loader, model, device, num_batches=eval_iter
    )
  model.train()
  return train_loss, val_loss

# -------------------------------------------
def text_to_token_ids(text, tokenizer):
  encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
  encoded_tensor = torch.tensor(encoded).unsqueeze(0)
  return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
  flat = token_ids.squeeze(0)
  return tokenizer.decode(flat.tolist())

def generate_text_simple(model, idx, max_new_tokens, context_size):
  for _ in range(max_new_tokens):
    idx_cond = idx[:, -context_size:]

    with torch.no_grad():
      logits = model(idx_cond)

    logits = logits[:, -1, :]
    probas = torch.softmax(logits, dim=-1)
    idx_next = torch.argmax(probas, dim=-1, keepdim=True)
    idx = torch.cat((idx, idx_next), dim=1)
  return idx

def generate_and_print_sample(model, tokenizer, device, start_context):
  model.eval()

  context_size = model.pos_emb.weight.shape[0]
  encoded = text_to_token_ids(start_context, tokenizer).to(device)

  with torch.no_grad():
    token_ids = generate_text_simple(
      model=model, idx=encoded,
      max_new_tokens=50, context_size=context_size
    )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))
  model.train()
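The sliding-window sampling in GPTDatasetV1 is easier to see on a short list of integers. The sketch below repeats the same loop in plain Python without a tokenizer or tensors; `sliding_windows` and the toy token ids are illustrative stand-ins.

```python
def sliding_windows(token_ids, max_length, stride):
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]   # the same window shifted right by one token
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10))   # stand-in for the output of tokenizer.encode(...)

print("stride = max_length (no overlap):")
for inputs, targets in sliding_windows(token_ids, max_length=4, stride=4):
    print(inputs, "->", targets)

print("stride < max_length (overlap):")
for inputs, targets in sliding_windows(token_ids, max_length=4, stride=2):
    print(inputs, "->", targets)
```

Each target is the next token for the corresponding input position, and halving the stride yields more (overlapping) samples from the same text.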

4. Training GPT2

4.1 Training

python

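GPT_CONFIG_124M = {
  "vocab_size": 50257,      # vocabulary size
  "context_length": 256,    # context length (shortened here; the released GPT2 uses 1024)
  "emb_dim": 768,           # embedding dimension
  "n_heads": 12,            # number of attention heads
  "n_layers": 12,           # number of stacked layers
  "drop_rate": 0.1,         # dropout rate
  "qkv_bias": False         # whether the q, k, v projections use a bias term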
}

# Train
def train_model_simple(model, train_loader, val_loader,
                       optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer
                      ):
  train_losses, val_losses, track_tokens_seen = [], [], []
  tokens_seen, global_step = 0, -1

  for epoch in range(num_epochs):
    model.train()

    for input_batch, target_batch in train_loader:
      optimizer.zero_grad()
      loss = calc_loss_batch(
        input_batch, target_batch, model, device
      )

      loss.backward()
      optimizer.step()
      tokens_seen += input_batch.numel()
      global_step += 1

      if global_step % eval_freq == 0:
        train_loss, val_loss = evaluate_model(model, train_loader, val_loader, device, eval_iter)
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        track_tokens_seen.append(tokens_seen)
        print(f"Ep {epoch+1} (Step {global_step:06d}): "
              f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

    generate_and_print_sample(model, tokenizer, device, start_context)
  return train_losses, val_losses, track_tokens_seen

tokenizer = tiktoken.get_encoding('gpt2')

file_path = 'the-verdict.txt'
with open(file_path, 'r', encoding='utf-8') as f:
  text_data = f.read()

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print('Characters:', total_characters)
print('Tokens:', total_tokens)

train_ratio = 0.9
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

train_loader = create_dataloader_v1(
  train_data,
  batch_size=2,
  max_length=GPT_CONFIG_124M["context_length"],
  stride=GPT_CONFIG_124M["context_length"],
  drop_last=True,
  shuffle=True,
  num_workers=0
)

val_loader = create_dataloader_v1(
  val_data,
  batch_size=2,
  max_length=GPT_CONFIG_124M["context_length"],
  stride=GPT_CONFIG_124M["context_length"],
  drop_last=False,
  shuffle=False,
  num_workers=0
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GPTModel(GPT_CONFIG_124M)
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
  model, train_loader, val_loader, optimizer, device,
  num_epochs=num_epochs, eval_freq=5, eval_iter=5,
  start_context="Every effort moves you", tokenizer=tokenizer
)

4.2 Plotting the loss

python

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
  fig, ax1 = plt.subplots(figsize=(5, 3))
  ax1.plot(epochs_seen, train_losses, label="Training loss")
  ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
  ax1.set_xlabel("Epochs")
  ax1.set_ylabel("Loss")
  ax1.legend(loc="upper right")
  ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
  ax2 = ax1.twiny()
  ax2.plot(tokens_seen, train_losses, alpha=0)
  ax2.set_xlabel("Tokens seen")
  fig.tight_layout()
  plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

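5. Complete code

TIP

Copy the code below in full and it will run.

5.1 Complete training code

WARNING

Mind the location of the training file the-verdict.txt: the script opens it by relative path, so it must be in the working directory.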

python

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import tiktoken

# --------- Module and training function definitions: start
class LayerNorm(nn.Module):
  def __init__(self, emb_dim):
    super().__init__()
    self.eps = 1e-5
    self.scale = nn.Parameter(torch.ones(emb_dim))
    self.shift = nn.Parameter(torch.zeros(emb_dim))

  def forward(self, x):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)

    norm_x = (x - mean) / torch.sqrt(var + self.eps)
    res = self.scale * norm_x + self.shift

    return res

# -------------------------------------------
class GELU(nn.Module):
  def __init__(self):
    super().__init__()

  def forward(self, x):
    return 0.5 * x * (1 + torch.tanh(
      torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * torch.pow(x, 3))
    ))

# -------------------------------------------
class FeedForward(nn.Module):
  def __init__(self, cfg):
    super().__init__()

    self.layers = nn.Sequential(
      nn.Linear(cfg["emb_dim"], 4*cfg["emb_dim"]),
      GELU(),
      nn.Linear(4*cfg["emb_dim"], cfg["emb_dim"]),
    )

  def forward(self, x):
    return self.layers(x)

# -------------------------------------------
# Multi-head attention
class MultiHeadAttention(nn.Module):
  def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
    super().__init__()
    # Check that d_out is divisible by num_heads
    assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

    self.d_out = d_out
    self.num_heads = num_heads
    self.head_dim = d_out // num_heads

    self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    # out_proj: linear layer that projects the concatenated multi-head output
    self.out_proj = nn.Linear(d_out, d_out)

    self.dropout = nn.Dropout(dropout)

    # Mask: 1 strictly above the diagonal, 0 elsewhere (the diagonal itself is 0)
    self.register_buffer("mask",
                         torch.triu(torch.ones(context_length, context_length), diagonal=1))

  def forward(self, x):
    b, num_tokens, d_in = x.shape

    # Compute q, k, v; shape: [batch, num_tokens, d_out]
    queries = self.W_query(x)
    keys = self.W_key(x)
    values = self.W_value(x)

    # Split the d_out dimension into heads; shape: [batch, num_tokens, num_heads, head_dim]
    queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
    keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
    values = values.view(b, num_tokens, self.num_heads, self.head_dim)

    # Swap dimensions; shape becomes [batch, num_heads, num_tokens, head_dim]
    queries = queries.transpose(1, 2)
    keys = keys.transpose(1, 2)
    values = values.transpose(1, 2)

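    # Attention scores: keys.transpose(2, 3) transposes the last two dimensions of keys
    attn_scores = queries @ keys.transpose(2, 3)

    # Convert self.mask to bool
    # Take only [:num_tokens, :num_tokens], because num_tokens may be smaller than context_length
    # context_length is the maximum number of tokens; num_tokens is the actual number
    mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

    # Fill the masked (upper-triangular) positions of the scores with negative infinity
    attn_scores.masked_fill_(mask_bool, -torch.inf)

    # Scale attn_scores, then apply softmax
    # keys.shape[-1]**0.5 is the square root of d_k (here head_dim)
    # The higher the dimension, the larger the variance of the dot products, hence the scaling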
    attn_weights = torch.softmax(
      attn_scores / keys.shape[-1]**0.5, dim=-1
    )

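    # Apply dropout to the attention weights
    # During training, dropout rescales the surviving values by 1 / (1 - p)
    attn_weights = self.dropout(attn_weights)

    # Compute the context vectors
    # attention weights @ V gives the context vectors; then swap back to [batch, num_tokens, num_heads, head_dim]
    context_vec = (attn_weights @ values).transpose(1, 2)

    # Merge the heads
    # contiguous(): copies the tensor into contiguous memory, which view() needs after transpose()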
    context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)

    context_vec = self.out_proj(context_vec)

    return context_vec

# -------------------------------------------
class TransformerBlock(nn.Module):
  def __init__(self, cfg):
    super().__init__()
    # Attention
    self.att = MultiHeadAttention(
      d_in=cfg["emb_dim"],
      d_out=cfg["emb_dim"],
      context_length=cfg["context_length"],
      num_heads=cfg["n_heads"],
      dropout=cfg["drop_rate"],
      qkv_bias=cfg["qkv_bias"]
    )

    self.ff = FeedForward(cfg)
    # Two normalization layers
    self.norm1 = LayerNorm(cfg["emb_dim"])
    self.norm2 = LayerNorm(cfg["emb_dim"])
    self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

  def forward(self, x):
    # Pre-LN: normalize first; this is more stable, even early in training, so warm-up can be skipped
    shortcut = x
    x = self.norm1(x)
    x = self.att(x)
    x = self.drop_shortcut(x)
    x = x + shortcut

    shortcut = x
    x = self.norm2(x)
    x = self.ff(x)
    x = self.drop_shortcut(x)
    x = x + shortcut

    return x

# -------------------------------------------
# GPT2
class GPTModel(nn.Module):
  def __init__(self, cfg):
    super().__init__()
    self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
    self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
    self.drop_emb = nn.Dropout(cfg["drop_rate"])

    self.trf_blocks = nn.Sequential(
      # Build a list of n_layers blocks
      # The * unpacks the list so that its elements are passed to Sequential as separate arguments
      *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
    )

    self.final_norm = LayerNorm(cfg["emb_dim"])

    self.out_head = nn.Linear(
      cfg["emb_dim"], cfg["vocab_size"], bias=False
    )

  def forward(self, in_idx):
    batch_size, seq_len = in_idx.shape
    tok_embeds = self.tok_emb(in_idx)
    pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
    x = tok_embeds + pos_embeds
    x = self.drop_emb(x)
    x = self.trf_blocks(x)
    x = self.final_norm(x)
    logits = self.out_head(x)
    return logits

# -------------------------------------------
class GPTDatasetV1(Dataset):
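  # max_length: length of each training sample in tokens, i.e. how many tokens the model sees at once
  # (also called the "context window" or "sequence length"); larger values need more GPU memory but capture longer dependencies
  # stride: step size of the sliding window; if stride = max_length the windows do not overlap
  # If stride < max_length the windows overlap, which produces more training samples

  def __init__(self, txt, tokenizer, max_length, stride):
    # input_ids: the model input x
    # target_ids: the model target y
    self.input_ids = []
    self.target_ids = []
    # Convert txt to token ids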
    token_ids = tokenizer.encode(txt)

    for i in range(0, len(token_ids)-max_length, stride):
      input_chunk = token_ids[i:i+max_length]
      target_chunk = token_ids[i+1:i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]

# -------------------------------------------
def create_dataloader_v1(txt, batch_size=4,
                        max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
  tokenizer = tiktoken.get_encoding('gpt2')
  dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

  dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=shuffle,
    drop_last=drop_last,
    num_workers=num_workers
  )

  return dataloader

# -------------------------------------------
# Loss for a single batch
# cross_entropy expects raw logits, so do not apply softmax yourself
def calc_loss_batch(input_batch, target_batch, model, device):
  input_batch = input_batch.to(device)
  target_batch = target_batch.to(device)
  logits = model(input_batch)
  loss = torch.nn.functional.cross_entropy(
    logits.flatten(0, 1),
    target_batch.flatten()
  )
  return loss

# -------------------------------------------
def calc_loss_loader(data_loader, model, device, num_batches=None):
  total_loss = 0.

  if len(data_loader) == 0:
    return float("nan")
  elif num_batches is None:
    num_batches = len(data_loader)
  else:
    num_batches = min(num_batches, len(data_loader))

  for i, (input_batch, target_batch) in enumerate(data_loader):
    if i < num_batches:
      loss = calc_loss_batch(input_batch, target_batch, model, device)
      total_loss += loss.item()
    else:
      break
  return total_loss / num_batches

# -------------------------------------------
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
  model.eval()
  with torch.no_grad():
    train_loss = calc_loss_loader(
      train_loader, model, device, num_batches=eval_iter
    )

    val_loss = calc_loss_loader(
      val_loader, model, device, num_batches=eval_iter
    )
  model.train()
  return train_loss, val_loss

# -------------------------------------------
def text_to_token_ids(text, tokenizer):
  encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
  encoded_tensor = torch.tensor(encoded).unsqueeze(0)
  return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
  flat = token_ids.squeeze(0)
  return tokenizer.decode(flat.tolist())

def generate_text_simple(model, idx, max_new_tokens, context_size):
  for _ in range(max_new_tokens):
    idx_cond = idx[:, -context_size:]

    with torch.no_grad():
      logits = model(idx_cond)

    logits = logits[:, -1, :]
    probas = torch.softmax(logits, dim=-1)
    idx_next = torch.argmax(probas, dim=-1, keepdim=True)
    idx = torch.cat((idx, idx_next), dim=1)
  return idx

def generate_and_print_sample(model, tokenizer, device, start_context):
  model.eval()

  context_size = model.pos_emb.weight.shape[0]
  encoded = text_to_token_ids(start_context, tokenizer).to(device)

  with torch.no_grad():
    token_ids = generate_text_simple(
      model=model, idx=encoded,
      max_new_tokens=50, context_size=context_size
    )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))
  model.train()


# -------------------------------------------
# Train
def train_model_simple(model, train_loader, val_loader,
                       optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer
                      ):
  train_losses, val_losses, track_tokens_seen = [], [], []
  tokens_seen, global_step = 0, -1

  for epoch in range(num_epochs):
    model.train()

    for input_batch, target_batch in train_loader:
      optimizer.zero_grad()
      loss = calc_loss_batch(
        input_batch, target_batch, model, device
      )

      loss.backward()
      optimizer.step()
      tokens_seen += input_batch.numel()
      global_step += 1

      if global_step % eval_freq == 0:
        train_loss, val_loss = evaluate_model(model, train_loader, val_loader, device, eval_iter)
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        track_tokens_seen.append(tokens_seen)
        print(f"Ep {epoch+1} (Step {global_step:06d}): "
              f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

    generate_and_print_sample(model, tokenizer, device, start_context)
  return train_losses, val_losses, track_tokens_seen
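# --------- Module and training function definitions: end

# ------------------------- Training: start
GPT_CONFIG_124M = {
  "vocab_size": 50257,      # vocabulary size
  "context_length": 256,    # context length (shortened here; the released GPT2 uses 1024)
  "emb_dim": 768,           # embedding dimension
  "n_heads": 12,            # number of attention heads
  "n_layers": 12,           # number of stacked layers
  "drop_rate": 0.1,         # dropout rate
  "qkv_bias": False         # whether the q, k, v projections use a bias term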
}

tokenizer = tiktoken.get_encoding('gpt2')

file_path = 'the-verdict.txt'
with open(file_path, 'r', encoding='utf-8') as f:
  text_data = f.read()

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print('Characters:', total_characters)
print('Tokens:', total_tokens)

train_ratio = 0.9
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

train_loader = create_dataloader_v1(
  train_data,
  batch_size=2,
  max_length=GPT_CONFIG_124M["context_length"],
  stride=GPT_CONFIG_124M["context_length"],
  drop_last=True,
  shuffle=True,
  num_workers=0
)

val_loader = create_dataloader_v1(
  val_data,
  batch_size=2,
  max_length=GPT_CONFIG_124M["context_length"],
  stride=GPT_CONFIG_124M["context_length"],
  drop_last=False,
  shuffle=False,
  num_workers=0
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GPTModel(GPT_CONFIG_124M)
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
  model, train_loader, val_loader, optimizer, device,
  num_epochs=num_epochs, eval_freq=5, eval_iter=5,
  start_context="Every effort moves you", tokenizer=tokenizer
)
# ------------------------- Training: end

5.2 Plotting the loss

python

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
  fig, ax1 = plt.subplots(figsize=(5, 3))
  ax1.plot(epochs_seen, train_losses, label="Training loss")
  ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
  ax1.set_xlabel("Epochs")
  ax1.set_ylabel("Loss")
  ax1.legend(loc="upper right")
  ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
  ax2 = ax1.twiny()
  ax2.plot(tokens_seen, train_losses, alpha=0)
  ax2.set_xlabel("Tokens seen")
  fig.tight_layout()
  plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

5.3 Training text: the-verdict.txt

TIP

Copy the following text into the file the-verdict.txt.

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"?

Well!--even through the prism of Hermia's tears I felt able to face the fact with equanimity. Poor Jack Gisburn! The women had made him--it was fitting that they should mourn him. Among his own sex fewer regrets were heard, and in his own trade hardly a murmur. Professional jealousy? Perhaps. If it were, the honour of the craft was vindicated by little Claude Nutley, who, in all good faith, brought out in the Burlington a very handsome "obituary" on Jack--one of those showy articles stocked with random technicalities that I have heard (I won't say by whom) compared to Gisburn's painting. And so--his resolve being apparently irrevocable--the discussion gradually died out, and, as Mrs. Thwing had predicted, the price of "Gisburns" went up.

It was not till three years later that, in the course of a few weeks' idling on the Riviera, it suddenly occurred to me to wonder why Gisburn had given up his painting. On reflection, it really was a tempting problem. To accuse his wife would have been too easy--his fair sitters had been denied the solace of saying that Mrs. Gisburn had "dragged him down." For Mrs. Gisburn--as such--had not existed till nearly a year after Jack's resolve had been taken. It might be that he had married her--since he liked his ease--because he didn't want to go on painting; but it would have been hard to prove that he had given up his painting because he had married her.

Of course, if she had not dragged him down, she had equally, as Miss Croft contended, failed to "lift him up"--she had not led him back to the easel. To put the brush into his hand again--what a vocation for a wife! But Mrs. Gisburn appeared to have disdained it--and I felt it might be interesting to find out why.

The desultory life of the Riviera lends itself to such purely academic speculations; and having, on my way to Monte Carlo, caught a glimpse of Jack's balustraded terraces between the pines, I had myself borne thither the next day.

I found the couple at tea beneath their palm-trees; and Mrs. Gisburn's welcome was so genial that, in the ensuing weeks, I claimed it frequently. It was not that my hostess was "interesting": on that point I could have given Miss Croft the fullest reassurance. It was just because she was _not_ interesting--if I may be pardoned the bull--that I found her so. For Jack, all his life, had been surrounded by interesting women: they had fostered his art, it had been reared in the hot-house of their adulation. And it was therefore instructive to note what effect the "deadening atmosphere of mediocrity" (I quote Miss Croft) was having on him.

I have mentioned that Mrs. Gisburn was rich; and it was immediately perceptible that her husband was extracting from this circumstance a delicate but substantial satisfaction. It is, as a rule, the people who scorn money who get most out of it; and Jack's elegant disdain of his wife's big balance enabled him, with an appearance of perfect good-breeding, to transmute it into objects of art and luxury. To the latter, I must add, he remained relatively indifferent; but he was buying Renaissance bronzes and eighteenth-century pictures with a discrimination that bespoke the amplest resources.

"Money's only excuse is to put beauty into circulation," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gisburn, beaming on him, added for my enlightenment: "Jack is so morbidly sensitive to every form of beauty."

Poor Jack! It had always been his fate to have women say such things of him: the fact should be set down in extenuation. What struck me now was that, for the first time, he resented the tone. I had seen him, so often, basking under similar tributes--was it the conjugal note that robbed them of their savour? No--for, oddly enough, it became apparent that he was fond of Mrs. Gisburn--fond enough not to see her absurdity. It was his own absurdity he seemed to be wincing under--his own attitude as an object for garlands and incense.

"My dear, since I've chucked painting people don't say that stuff about me--they say it about Victor Grindle," was his only protest, as he rose from the table and strolled out onto the sunlit terrace.

I glanced after him, struck by his last word. Victor Grindle was, in fact, becoming the man of the moment--as Jack himself, one might put it, had been the man of the hour. The younger artist was said to have formed himself at my friend's feet, and I wondered if a tinge of jealousy underlay the latter's mysterious abdication. But no--for it was not till after that event that the _rose Dubarry_ drawing-rooms had begun to display their "Grindles."

I turned to Mrs. Gisburn, who had lingered to give a lump of sugar to her spaniel in the dining-room.

"Why _has_ he chucked painting?" I asked abruptly.

She raised her eyebrows with a hint of good-humoured surprise.

"Oh, he doesn't _have_ to now, you know; and I want him to enjoy himself," she said quite simply.

I looked about the spacious white-panelled room, with its _famille-verte_ vases repeating the tones of the pale damask curtains, and its eighteenth-century pastels in delicate faded frames.

"Has he chucked his pictures too? I haven't seen a single one in the house."

A slight shade of constraint crossed Mrs. Gisburn's open countenance. "It's his ridiculous modesty, you know. He says they're not fit to have about; he's sent them all away except one--my portrait--and that I have to keep upstairs."

His ridiculous modesty--Jack's modesty about his pictures? My curiosity was growing like the bean-stalk. I said persuasively to my hostess: "I must really see your portrait, you know."

She glanced out almost timorously at the terrace where her husband, lounging in a hooded chair, had lit a cigar and drawn the Russian deerhound's head between his knees.

"Well, come while he's not looking," she said, with a laugh that tried to hide her nervousness; and I followed her between the marble Emperors of the hall, and up the wide stairs with terra-cotta nymphs poised among flowers at each landing.

In the dimmest corner of her boudoir, amid a profusion of delicate and distinguished objects, hung one of the familiar oval canvases, in the inevitable garlanded frame. The mere outline of the frame called up all Gisburn's past!

Mrs. Gisburn drew back the window-curtains, moved aside a _jardiniere_ full of pink azaleas, pushed an arm-chair away, and said: "If you stand here you can just manage to see it. I had it over the mantel-piece, but he wouldn't let it stay."

Yes--I could just manage to see it--the first portrait of Jack's I had ever had to strain my eyes over! Usually they had the place of honour--say the central panel in a pale yellow or _rose Dubarry_ drawing-room, or a monumental easel placed so that it took the light through curtains of old Venetian point. The more modest place became the picture better; yet, as my eyes grew accustomed to the half-light, all the characteristic qualities came out--all the hesitations disguised as audacities, the tricks of prestidigitation by which, with such consummate skill, he managed to divert attention from the real business of the picture to some pretty irrelevance of detail. Mrs. Gisburn, presenting a neutral surface to work on--forming, as it were, so inevitably the background of her own picture--had lent herself in an unusual degree to the display of this false virtuosity. The picture was one of Jack's "strongest," as his admirers would have put it--it represented, on his part, a swelling of muscles, a congesting of veins, a balancing, straddling and straining, that reminded one of the circus-clown's ironic efforts to lift a feather. It met, in short, at every point the demand of lovely woman to be painted "strongly" because she was tired of being painted "sweetly"--and yet not to lose an atom of the sweetness.

"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride. "The last but one," she corrected herself--"but the other doesn't count, because he destroyed it."

"Destroyed it?" I was about to follow up this clue when I heard a footstep and saw Jack himself on the threshold.

As he stood there, his hands in the pockets of his velveteen coat, the thin brown waves of hair pushed back from his white forehead, his lean sunburnt cheeks furrowed by a smile that lifted the tips of a self-confident moustache, I felt to what a degree he had the same quality as his pictures--the quality of looking cleverer than he was.

His wife glanced at him deprecatingly, but his eyes travelled past her to the portrait.

"Mr. Rickham wanted to see it," she began, as if excusing herself. He shrugged his shoulders, still smiling.

"Oh, Rickham found me out long ago," he said lightly; then, passing his arm through mine: "Come and see the rest of the house."

He showed it to me with a kind of naive suburban pride: the bath-rooms, the speaking-tubes, the dress-closets, the trouser-presses--all the complex simplifications of the millionaire's domestic economy. And whenever my wonder paid the expected tribute he said, throwing out his chest a little: "Yes, I really don't see how people manage to live without that."

Well--it was just the end one might have foreseen for him. Only he was, through it all and in spite of it all--as he had been through, and in spite of, his pictures--so handsome, so charming, so disarming, that one longed to cry out: "Be dissatisfied with your leisure!" as once one had longed to say: "Be dissatisfied with your work!"

But, with the cry on my lips, my diagnosis suffered an unexpected check.

"This is my own lair," he said, leading me into a dark plain room at the end of the florid vista. It was square and brown and leathery: no "effects"; no bric-a-brac, none of the air of posing for reproduction in a picture weekly--above all, no least sign of ever having been used as a studio.

The fact brought home to me the absolute finality of Jack's break with his old life.

"Don't you ever dabble with paint any more?" I asked, still looking about for a trace of such activity.

"Never," he said briefly.

"Or water-colour--or etching?"

His confident eyes grew dim, and his cheeks paled a little under their handsome sunburn.

"Never think of it, my dear fellow--any more than if I'd never touched a brush."

And his tone told me in a flash that he never thought of anything else.

I moved away, instinctively embarrassed by my unexpected discovery; and as I turned, my eye fell on a small picture above the mantel-piece--the only object breaking the plain oak panelling of the room.

"Oh, by Jove!" I said.

It was a sketch of a donkey--an old tired donkey, standing in the rain under a wall.

"By Jove--a Stroud!" I cried.

He was silent; but I felt him close behind me, breathing a little quickly.

"What a wonder! Made with a dozen lines--but on everlasting foundations. You lucky chap, where did you get it?"

He answered slowly: "Mrs. Stroud gave it to me."

"Ah--I didn't know you even knew the Strouds. He was such an inflexible hermit."

"I didn't--till after. . . . She sent for me to paint him when he was dead."

"When he was dead? You?"

I must have let a little too much amazement escape through my surprise, for he answered with a deprecating laugh: "Yes--she's an awful simpleton, you know, Mrs. Stroud. Her only idea was to have him done by a fashionable painter--ah, poor Stroud! She thought it the surest way of proclaiming his greatness--of forcing it on a purblind public. And at the moment I was _the_ fashionable painter."

"Ah, poor Stroud--as you say. Was _that_ his history?"

"That was his history. She believed in him, gloried in him--or thought she did. But she couldn't bear not to have all the drawing-rooms with her. She couldn't bear the fact that, on varnishing days, one could always get near enough to see his pictures. Poor woman! She's just a fragment groping for other fragments. Stroud is the only whole I ever knew."

"You ever knew? But you just said--"

Gisburn had a curious smile in his eyes.

"Oh, I knew him, and he knew me--only it happened after he was dead."

I dropped my voice instinctively. "When she sent for you?"

"Yes--quite insensible to the irony. She wanted him vindicated--and by me!"

He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I couldn't look at that thing--couldn't face it. But I forced myself to put it here; and now it's cured me--cured me. That's the reason why I don't dabble any more, my dear Rickham; or rather Stroud himself is the reason."

For the first time my idle curiosity about my companion turned into a serious desire to understand him better.

"I wish you'd tell me how it happened," I said.

He stood looking up at the sketch, and twirling between his fingers a cigarette he had forgotten to light. Suddenly he turned toward me.

"I'd rather like to tell you--because I've always suspected you of loathing my work."

I made a deprecating gesture, which he negatived with a good-humoured shrug.

"Oh, I didn't care a straw when I believed in myself--and now it's an added tie between us!"

He laughed slightly, without bitterness, and pushed one of the deep arm-chairs forward. "There: make yourself comfortable--and here are the cigars you like."

He placed them at my elbow and continued to wander up and down the room, stopping now and then beneath the picture.

"How it happened? I can tell you in five minutes--and it didn't take much longer to happen. . . . I can remember now how surprised and pleased I was when I got Mrs. Stroud's note. Of course, deep down, I had always _felt_ there was no one like him--only I had gone with the stream, echoed the usual platitudes about him, till I half got to think he was a failure, one of the kind that are left behind. By Jove, and he _was_ left behind--because he had come to stay! The rest of us had to let ourselves be swept along or go under, but he was high above the current--on everlasting foundations, as you say.

"Well, I went off to the house in my most egregious mood--rather moved, Lord forgive me, at the pathos of poor Stroud's career of failure being crowned by the glory of my painting him! Of course I meant to do the picture for nothing--I told Mrs. Stroud so when she began to stammer something about her poverty. I remember getting off a prodigious phrase about the honour being _mine_--oh, I was princely, my dear Rickham! I was posing to myself like one of my own sitters.

"Then I was taken up and left alone with him. I had sent all my traps in advance, and I had only to set up the easel and get to work. He had been dead only twenty-four hours, and he died suddenly, of heart disease, so that there had been no preliminary work of destruction--his face was clear and untouched. I had met him once or twice, years before, and thought him insignificant and dingy. Now I saw that he was superb.

"I was glad at first, with a merely aesthetic satisfaction: glad to have my hand on such a 'subject.' Then his strange life-likeness began to affect me queerly--as I blocked the head in I felt as if he were watching me do it. The sensation was followed by the thought: if he _were_ watching me, what would he say to my way of working? My strokes began to go a little wild--I felt nervous and uncertain.

"Once, when I looked up, I seemed to see a smile behind his close grayish beard--as if he had the secret, and were amusing himself by holding it back from me. That exasperated me still more. The secret? Why, I had a secret worth twenty of his! I dashed at the canvas furiously, and tried some of my bravura tricks. But they failed me, they crumbled. I saw that he wasn't watching the showy bits--I couldn't distract his attention; he just kept his eyes on the hard passages between. Those were the ones I had always shirked, or covered up with some lying paint. And how he saw through my lies!

"I looked up again, and caught sight of that sketch of the donkey hanging on the wall near his bed. His wife told me afterward it was the last thing he had done--just a note taken with a shaking hand, when he was down in Devonshire recovering from a previous heart attack. Just a note! But it tells his whole history. There are years of patient scornful persistence in every line. A man who had swum with the current could never have learned that mighty up-stream stroke. . . .

"I turned back to my work, and went on groping and muddling; then I looked at the donkey again. I saw that, when Stroud laid in the first stroke, he knew just what the end would be. He had possessed his subject, absorbed it, recreated it. When had I done that with any of my things? They hadn't been born of me--I had just adopted them. . . .

"Hang it, Rickham, with that face watching me I couldn't do another stroke. The plain truth was, I didn't know where to put it--_I had never known_. Only, with my sitters and my public, a showy splash of colour covered up the fact--I just threw paint into their faces. . . . Well, paint was the one medium those dead eyes could see through--see straight to the tottering foundations underneath. Don't you know how, in talking a foreign language, even fluently, one says half the time not what one wants to but what one can? Well--that was the way I painted; and as he lay there and watched me, the thing they called my 'technique' collapsed like a house of cards. He didn't sneer, you understand, poor Stroud--he just lay there quietly watching, and on his lips, through the gray beard, I seemed to hear the question: 'Are you sure you know where you're coming out?'

"If I could have painted that face, with that question on it, I should have done a great thing. The next greatest thing was to see that I couldn't--and that grace was given me. But, oh, at that minute, Rickham, was there anything on earth I wouldn't have given to have Stroud alive before me, and to hear him say: 'It's not too late--I'll show you how'?

"It _was_ too late--it would have been, even if he'd been alive. I packed up my traps, and went down and told Mrs. Stroud. Of course I didn't tell her _that_--it would have been Greek to her. I simply said I couldn't paint him, that I was too moved. She rather liked the idea--she's so romantic! It was that that made her give me the donkey. But she was terribly upset at not getting the portrait--she did so want him 'done' by some one showy! At first I was afraid she wouldn't let me off--and at my wits' end I suggested Grindle. Yes, it was I who started Grindle: I told Mrs. Stroud he was the 'coming' man, and she told somebody else, and so it got to be true. . . . And he painted Stroud without wincing; and she hung the picture among her husband's things. . . ."

He flung himself down in the arm-chair near mine, laid back his head, and clasping his arms beneath it, looked up at the picture above the chimney-piece.

"I like to fancy that Stroud himself would have given it to me, if he'd been able to say what he thought that day."

And, in answer to a question I put half-mechanically--"Begin again?" he flashed out. "When the one thing that brings me anywhere near him is that I knew enough to leave off?"

He stood up and laid his hand on my shoulder with a laugh. "Only the irony of it is that I _am_ still painting--since Grindle's doing it for me! The Strouds stand alone, and happen once--but there's no exterminating our kind of art."

