Word Embeddings: Encoding Lexical Semantics#

Created On: Apr 08, 2017 | Last Updated: Sep 14, 2021 | Last Verified: Nov 05, 2024

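Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, the features are almost always words! But how should a word be represented in a computer? You could store its ASCII character representation, but that only tells you what the word is; it says little about what the word means (you might be able to infer its part of speech from its affixes, or some properties from its capitalization, but not much more). Moreover, in what sense could you combine such representations? We often want dense outputs from our neural networks, where the inputs are \(|V|\)-dimensional, with \(V\) our vocabulary, but the outputs often have only a few dimensions (for instance, if we are only predicting a handful of labels). How do we get from a massive-dimensional space to a much smaller one?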

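How about using a one-hot encoding instead of ASCII representations? That is, we represent the word \(w\) by

\[\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements} \]

where the 1 is in a location unique to \(w\). Any other word has its 1 in some other location and 0 everywhere else.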
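As a minimal sketch of the one-hot idea, assuming a made-up four-word vocabulary (the words and their ordering are illustrative, not taken from any dataset), each word can be mapped to a vector of length \(|V|\) with a single 1:

```python
# Toy vocabulary; the words and their ordering are illustrative assumptions.
vocab = ["mathematician", "physicist", "store", "ran"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word, word_to_ix):
    """Return a list of |V| zeros with a single 1 at the word's index."""
    vec = [0] * len(word_to_ix)
    vec[word_to_ix[word]] = 1
    return vec


print(one_hot("physicist", word_to_ix))  # [0, 1, 0, 0]
```

The vector grows with the vocabulary, and no two distinct words ever share a non-zero position.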

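This representation has an enormous drawback beyond its sheer size: it treats all words as independent entities with no relation to one another. What we really want is some notion of similarity between words. Why? Let's look at an example.

Suppose we are building a language model, and suppose we have seen the sentences

The mathematician ran to the store.
The physicist ran to the store.
The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:

The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much better if we could use the following two facts:

We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
We have seen mathematician in the same role in this new unseen sentence as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity, not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven't. This example of course relies on a fundamental linguistic assumption: that words appearing in similar contexts are related to each other semantically. This is called the distributional hypothesis.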

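Getting Dense Word Embeddings#

How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the "is able to run" semantic attribute. Think of some other attributes, and imagine how you might score some common words on them.

If each attribute is a dimension, then we might give each word a vector, like this: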

\[ q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run}, \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\]

\[ q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run}, \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\]

Then we can get a measure of similarity between these words by taking their dot product:

\[\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician} \]

Although it is more common to normalize by the lengths:

\[ \text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}} {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\]

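where \(\phi\) is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) have similarity 1, and extremely dissimilar words (words whose embeddings point in opposite directions) have similarity -1.

You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors, where any two distinct words have similarity 0 and each word has been given its own unique semantic attribute. The new vectors are dense, which is to say their entries are (typically) non-zero.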
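The normalized similarity can be worked through numerically. The sketch below assumes that only the three attribute values printed in the vectors above are used (the elided components are simply dropped), and the helper name `cosine_similarity` is introduced here for illustration:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Only the three components shown in the text; the elided ones are dropped (assumption).
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]
sim_dense = cosine_similarity(q_physicist, q_mathematician)
print(round(sim_dense, 3))

# Two distinct one-hot vectors share no non-zero position, so their similarity is 0.
sim_one_hot = cosine_similarity([1, 0, 0], [0, 1, 0])
print(sim_one_hot)
```

With these truncated vectors the two words agree on the first two attributes and disagree on the third, so the similarity is positive but well below 1.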

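But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them by hand. So why not just let the word embeddings be parameters in our model, and then update them during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.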
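To make "embeddings are parameters" concrete, here is a minimal sketch with assumed toy sizes and a made-up loss (neither comes from the tutorial's own examples). It shows that the embedding table is an ordinary trainable parameter and that one optimizer step changes only the rows that took part in the loss:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy setup (assumed sizes): 3 words, 4 latent dimensions.
embeds = nn.Embedding(3, 4)
optimizer = optim.SGD(embeds.parameters(), lr=0.1)

# The embedding table is an ordinary trainable parameter.
print(isinstance(embeds.weight, nn.Parameter), embeds.weight.requires_grad)

before = embeds.weight.detach().clone()

# A made-up loss that only involves word 1: the sum of squares of its embedding.
loss = embeds(torch.tensor([1])).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embeds.weight.detach()
changed_rows = (before != after).any(dim=1)
print(changed_rows)  # only the row for word 1 has been updated
```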

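In summary, word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand. You can embed other things too: part-of-speech tags, parse trees, anything! The idea of feature embeddings is central to the field.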

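Word Embeddings in PyTorch#

Before we get to a worked example and an exercise, a few quick notes about how to use embeddings in PyTorch and in deep learning programming in general. Just as we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These indices are the keys into a lookup table. That is, embeddings are stored as a \(|V| \times D\) matrix, where \(D\) is the dimensionality of the embeddings, such that the word assigned index \(i\) has its embedding stored in the \(i\)th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named word_to_ix.

The module that allows you to use embeddings is torch.nn.Embedding, which takes two arguments: the vocabulary size and the dimensionality of the embeddings.

To index into this table, pass a tensor of integer indices; the code here uses torch.long, the dtype of torch.LongTensor (the indices are integers, not floats).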

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator object at 0x7f5e6c5849b0>

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)
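The claim that word \(i\) lives in row \(i\) of the matrix can be checked directly. The following sketch reuses the two-word vocabulary from the example above; the comparison against a one-hot matrix product is an illustration added here, not something the tutorial's own code does:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)

ix = torch.tensor([word_to_ix["world"]], dtype=torch.long)
looked_up = embeds(ix)                      # shape (1, 5)
row = embeds.weight[word_to_ix["world"]]    # shape (5,)

# Equivalent (but wasteful) formulation: one-hot vector times the embedding matrix.
one_hot = F.one_hot(ix, num_classes=len(word_to_ix)).float()  # shape (1, 2)
via_matmul = one_hot @ embeds.weight        # shape (1, 5)

print(torch.equal(looked_up[0], row))
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids ever materializing the \(|V|\)-dimensional one-hot vector, which is why an index tensor is all the module needs.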

An Example: N-Gram Language Modeling#

Recall that in an n-gram language model, given a sequence of words \(w\), we want to compute

\[P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} ) \]

where \(w_i\) is the \(i\)th word of the sequence.
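For intuition about the quantity being modeled, here is a sketch of the simplest possible estimator: relative frequencies of counted trigrams. This is a count-based stand-in, not the neural model trained in this section, and the corpus and the helper name `prob` are made up for illustration:

```python
from collections import Counter

# Tiny made-up corpus (assumption); whitespace tokenization for simplicity.
tokens = "the dog ran . the dog sat . the dog ran .".split()
n = 3  # trigram model: condition on the previous n - 1 = 2 words

context_counts = Counter()
ngram_counts = Counter()
for i in range(n - 1, len(tokens)):
    context = tuple(tokens[i - n + 1:i])
    context_counts[context] += 1
    ngram_counts[context + (tokens[i],)] += 1


def prob(word, context):
    """Relative-frequency estimate of P(word | context)."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]


print(prob("ran", ["the", "dog"]))
```

An estimator like this assigns zero probability to any context it has never counted, which is the sparsity problem that learned embeddings are meant to soften.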

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.
# Each tuple is ([ word_i-1, ..., word_i-CONTEXT_SIZE ], target word)
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]
# Print the first 3, just so you can see what they look like.
print(ngrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])

[(['forty', 'When'], 'winters'), (['winters', 'forty'], 'shall'), (['shall', 'winters'], 'besiege')]
[521.7561130523682, 519.0209546089172, 516.304979801178, 513.6080112457275, 510.9292929172516, 508.2673349380493, 505.61936259269714, 502.98635244369507, 500.36709904670715, 497.759113073349]
tensor([-1.1993, -0.0479, -2.0012, -0.4916, -1.5450, -0.1737,  0.7280,  0.0577,
        -0.7076, -0.5266], grad_fn=<SelectBackward0>)

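Exercise: Computing Word Embeddings: Continuous Bag-of-Words#

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict a word given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are then used to initialize the embeddings of some more complicated model. This is usually referred to as pretraining embeddings. It almost always helps performance by a couple of percent.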
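Initializing a larger model from pretrained vectors can be sketched with `nn.Embedding.from_pretrained`. The tensor below is random and merely stands in for vectors that would come from an earlier CBOW run, and the sizes are assumptions:

```python
import torch
import torch.nn as nn

# Stand-in for embeddings produced by an earlier training run (random here, an assumption).
torch.manual_seed(0)
pretrained = torch.randn(4, 3)  # 4 words, 3 dimensions

# freeze=False keeps the vectors trainable so the downstream model can fine-tune them.
embeds = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(torch.equal(embeds.weight.detach(), pretrained))
print(embeds.weight.requires_grad)
```

Leaving `freeze` at its default of `True` would instead keep the pretrained vectors fixed while the rest of the model trains.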

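The CBOW model is as follows. Given a target word \(w_i\) and a context window of \(N\) words on each side, \(w_{i-1}, \dots, w_{i-N}\) and \(w_{i+1}, \dots, w_{i+N}\), referring to all context words collectively as \(C\), CBOW tries to minimize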

\[-\log p(w_i | C) = -\log \text{Softmax}\left(A(\sum_{w \in C} q_w) + b\right) \]

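where \(q_w\) is the embedding for word \(w\), and \(A\) and \(b\) are the learned weight matrix and bias of an affine map from the embedding space to \(|V|\) scores, one per vocabulary word.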
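One possible reading of this objective, written with loose modules rather than wrapped in a class, is sketched below. The sizes and word indices are made up, and the variable names `embeddings` and `affine` are introduced here for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 4                       # assumed toy sizes
embeddings = nn.Embedding(vocab_size, embedding_dim)   # rows are the q_w
affine = nn.Linear(embedding_dim, vocab_size)          # computes A x + b

context_idxs = torch.tensor([0, 2, 3, 5])  # indices of the words in C (made up)
target_idx = torch.tensor([1])             # index of w_i (made up)

summed = embeddings(context_idxs).sum(dim=0, keepdim=True)   # (1, embedding_dim)
log_probs = F.log_softmax(affine(summed), dim=1)             # (1, vocab_size)
loss = F.nll_loss(log_probs, target_idx)                     # -log p(w_i | C)

print(log_probs.shape, loss.item())
```

Because the context embeddings are summed, the order of the words in \(C\) does not affect the prediction, which is the "bag of words" part of the name.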

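Implement this model in PyTorch by filling in the class below. Some tips:

Think about which parameters you need to define.
Make sure you know what shape each operation expects. Use .view() if you need to reshape.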
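The shape bookkeeping in the second tip can be tried out on a toy embedding table. The sizes and indices below are assumptions chosen only to make the shapes easy to read:

```python
import torch
import torch.nn as nn

embeds = nn.Embedding(10, 3)              # assumed toy sizes: 10 words, 3 dimensions
context = torch.tensor([4, 7, 1, 9])      # four made-up context indices

looked_up = embeds(context)
print(looked_up.shape)                    # torch.Size([4, 3])

flat = looked_up.view((1, -1))            # concatenate the four vectors into one row
print(flat.shape)                         # torch.Size([1, 12])

summed = looked_up.sum(dim=0).view(1, -1) # or sum them first, then add a batch dimension
print(summed.shape)                       # torch.Size([1, 3])
```

Passing -1 to .view() lets PyTorch infer that dimension from the number of elements, so the same call works for any context size.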

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# Create your model and train. Here are some functions to help you make
# the data ready for use by your module.


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example

[(['are', 'We', 'to', 'study'], 'about'), (['about', 'are', 'study', 'the'], 'to'), (['to', 'about', 'the', 'idea'], 'study'), (['study', 'to', 'idea', 'of'], 'the'), (['the', 'study', 'of', 'a'], 'idea')]

tensor([25, 23, 26, 13])
