1
00:00:00,040 --> 00:00:00,560
Okay.
好的。

2
00:00:00,600 --> 00:00:03,840
The time has arrived for us to cover vectors.
现在是我们讨论向量的时候了。

3
00:00:04,040 --> 00:00:05,880
This is the big idea.
这是一个伟大的想法。

4
00:00:05,880 --> 00:00:07,640
It's the big idea behind RAG.
这是 RAG 背后的核心思想。

5
00:00:07,760 --> 00:00:11,600
It's the idea which allows us to do this sort of fuzzy lookup.
正是这个想法使我们能够进行这种模糊查找。

6
00:00:11,600 --> 00:00:15,680
Given a user's question, what might be the relevant context
给定用户的问题，哪些可能是相关的上下文，

7
00:00:15,680 --> 00:00:18,200
that's got something to do with this question?
也就是与这个问题有关的内容？

8
00:00:18,240 --> 00:00:20,160
It all comes down to these things called vectors.
这一切都归结为这些称为向量的东西。

9
00:00:20,400 --> 00:00:24,600
I imagine some of you are super familiar with this already, in which case put me on 2x and zip through
我想你们中的一些人已经对此非常熟悉了，那样的话，请开两倍速快速看完

10
00:00:24,640 --> 00:00:24,960
it.
它。

11
00:00:24,960 --> 00:00:27,320
But for some of you, this is going to be fascinating.
但对于你们中的一些人来说，这将会很有趣。

12
00:00:27,480 --> 00:00:33,000
A whole new world as we launch into the world of vector embeddings and encoders.
当我们进入向量嵌入和编码器的世界时，这是一个全新的世界。

13
00:00:33,640 --> 00:00:38,320
In the last four and a half weeks, we've experimented with a number of different LLMs: with GPT, with
在过去的四周半里，我们尝试了许多不同的 LLM，包括 GPT、

14
00:00:38,360 --> 00:00:41,680
Claude, with Gemini, DeepSeek, to name a few.
Claude、Gemini、DeepSeek，仅举几例。

15
00:00:41,840 --> 00:00:44,160
Grok, or both Groks.
还有 Grok，或者说两个 Grok。

16
00:00:44,280 --> 00:00:49,600
So one of the things that they all have in common is that they are all just one particular
所以它们的共同点之一就是它们都只是一个特定的

17
00:00:49,600 --> 00:00:51,400
flavor of LLM.
类型的 LLM。

18
00:00:51,440 --> 00:00:56,280
There are, in fact, two different types of LLM, and we've only been looking at one of them, which
事实上，LLM 有两种不同的类型，我们只研究了其中一种，

19
00:00:56,280 --> 00:00:58,880
is by far the most common, which is the one you see all over the place.
这是迄今为止最常见的一种，随处可见。

20
00:00:59,080 --> 00:01:04,080
And its official name is an autoregressive LLM.
它的正式名称是自回归 LLM。

21
00:01:04,320 --> 00:01:05,120
It's a mouthful.
真够拗口的。

22
00:01:05,240 --> 00:01:11,200
Autoregressive LLM means this is a model which is trained to do one particular job.
自回归 LLM 意味着这是一个经过训练来完成一项特定工作的模型。

23
00:01:11,320 --> 00:01:16,440
It's trained to take an input sequence and to predict the next token
它经过训练，接收一个输入序列，并预测

24
00:01:16,440 --> 00:01:19,320
that should come after this input sequence.
应该出现在这个输入序列之后的下一个标记。

25
00:01:19,440 --> 00:01:24,520
It's regressive in that it looks back over this input and predicts what should come next.
它是回归性的，因为它会回顾这个输入并预测接下来会发生什么。

26
00:01:24,520 --> 00:01:26,800
And as you know, it does it one token at a time.
如您所知，它一次只生成一个标记。

27
00:01:26,840 --> 00:01:28,760
It predicts just the next token.
它预测下一个标记。

28
00:01:28,760 --> 00:01:33,440
And then the trick is we then feed in a new input that's the original input with that extra token on
然后技巧是我们输入一个新的输入，该输入是带有额外标记的原始输入

29
00:01:33,440 --> 00:01:35,840
the end of it, and we get it to generate the next token.
它的末尾，我们用它来生成下一个令牌。

30
00:01:35,840 --> 00:01:37,880
And that's repeated again and again.
这一次又一次地重复。

31
00:01:37,880 --> 00:01:41,720
And that is what an autoregressive LLM does.
这就是自回归 LLM 所做的事情。

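A minimal sketch of that predict, append, repeat loop. The lookup table `NEXT_TOKEN` and the helper names are hypothetical stand-ins for illustration: a real autoregressive LLM scores every token in its vocabulary using the whole input sequence, not just the last token.

```python
# Toy stand-in for an autoregressive LLM: a hypothetical lookup table that
# maps the most recent token to the next one. A real model would use the
# whole input sequence to score every token in its vocabulary.
NEXT_TOKEN = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}


def predict_next_token(sequence):
    """Return one predicted token for the given input sequence."""
    return NEXT_TOKEN.get(sequence[-1], "<end>")


def generate(prompt, max_new_tokens):
    """Predict one token, append it to the input, and repeat."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = predict_next_token(sequence)
        if token == "<end>":
            break
        # The new input is the original input with the extra token on the end.
        sequence.append(token)
    return sequence


print(generate(["the"], 4))  # ['the', 'cat', 'sat', 'on', 'the']
```

Each pass through the loop produces exactly one token, which is why generation happens one token at a time.
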
32
00:01:42,040 --> 00:01:44,960
And the reason it does that is because that's how it's been trained.
它之所以这样做，是因为它就是这样接受训练的。

33
00:01:44,960 --> 00:01:46,920
The architecture supports that.
这就是架构支持的。

34
00:01:46,920 --> 00:01:51,200
And then it's been given lots of training data where there's an input sequence and the next token.
然后它会获得大量训练数据，其中有输入序列和下一个标记。

35
00:01:51,200 --> 00:01:53,160
And so it's trained to get good at that.
所以它经过训练可以擅长这一点。

36
00:01:53,160 --> 00:01:55,440
That's called an autoregressive LLM.
这就是所谓的自回归 LLM。

37
00:01:55,440 --> 00:02:00,560
And if you're ever just told about an LLM without any framing, it's almost certainly an autoregressive
如果有人只是提到 LLM 而没有任何说明，那几乎肯定是自回归

38
00:02:00,560 --> 00:02:01,080
LLM.
LLM。

39
00:02:01,120 --> 00:02:02,480
That's what they mostly are.
他们大多都是这样。

40
00:02:02,720 --> 00:02:06,720
But there is another kind. And the other kind goes by many different names.
但还有另一种，另一种有许多不同的名称。

41
00:02:06,720 --> 00:02:09,160
You sometimes hear it called an encoder.
有时您会听到它被称为编码器。

42
00:02:09,200 --> 00:02:14,200
You sometimes hear them called an embedding model or a vector embedding model.
有时您会听到它们被称为嵌入模型或向量嵌入模型。

43
00:02:14,400 --> 00:02:16,880
But this is a model, an autoencoder.
但这是一个模型，一个自动编码器。

44
00:02:16,880 --> 00:02:24,280
It's an LLM which does not generate the next token based on looking back, but rather it takes a full
它是一种 LLM，它不是通过回顾来生成下一个标记，而是接收完整的

45
00:02:24,280 --> 00:02:31,120
input sequence, and it produces one output that is meant to reflect the full input.
输入序列，它会产生一个旨在反映完整输入的输出。

46
00:02:31,120 --> 00:02:35,960
It's some output based on the full input that's passed in.
它是基于传入的完整输入的一些输出。

47
00:02:36,480 --> 00:02:40,480
Not just what's to come next, but something that reflects it all.
不仅仅是接下来会发生什么，而是反映这一切的事情。

48
00:02:40,960 --> 00:02:42,680
Okay, I'm being a bit abstract here.
好吧，我在这里有点抽象。

49
00:02:42,720 --> 00:02:44,080
What does that mean concretely?
这具体意味着什么？

50
00:02:44,120 --> 00:02:45,160
What would be examples?
有哪些例子？

51
00:02:45,160 --> 00:02:46,600
Why would you want to do that?
你为什么要这么做？

52
00:02:46,640 --> 00:02:46,880
Okay.
好的。

53
00:02:46,920 --> 00:02:51,080
So to get concrete there are some very obvious things you might want to use this for.
因此，为了具体起见，您可能想用它来做一些非常明显的事情。

54
00:02:51,120 --> 00:02:55,000
This idea of taking a big input and coming up with one thing.
就是这种接收一大段输入、得出一个结果的想法。

55
00:02:55,040 --> 00:02:59,000
And the most obvious application is classification.
最明显的应用是分类。

56
00:02:59,080 --> 00:03:01,920
Take an input and then classify it.
获取输入，然后对其进行分类。

57
00:03:01,920 --> 00:03:04,280
This whole input to mean something.
这整个输入是有意义的。

58
00:03:04,360 --> 00:03:07,520
And one example of classification is sentiment analysis.
分类的一个例子是情感分析。

59
00:03:07,560 --> 00:03:12,880
Take a whole sentence and say whether this is happy or sad, positive or negative,
拿一个完整的句子来说，无论这是快乐还是悲伤，积极还是消极，

60
00:03:12,880 --> 00:03:16,280
and other forms of classification like what does this pertain to?
以及其他形式的分类，例如这属于什么？

61
00:03:16,520 --> 00:03:22,280
Anything like that would be an example of where you might use an autoencoding LLM, and there are
任何类似的任务都是你可以使用自编码 LLM 的例子，并且

62
00:03:22,280 --> 00:03:23,840
lots of other things like that.
还有很多类似的事情。

63
00:03:23,840 --> 00:03:28,440
Whenever you might want to come up with one output that reflects the entire input.
每当您可能想要提出一个反映整个输入的输出时。

64
00:03:28,440 --> 00:03:33,880
But in particular, there is one function that is important for us, one application that we care
但特别是，有一项功能对我们来说很重要，也就是我们现在

65
00:03:33,880 --> 00:03:34,680
about right now.
所关心的一种应用。

66
00:03:34,840 --> 00:03:43,040
And that is when you use, uh, one of these autoencoder models to come up with a set of numbers that
那就是当你使用，呃，其中一个自动编码器模型来得出一组数字

67
00:03:43,040 --> 00:03:48,360
best represents this input, a bunch of numbers, maybe it's ten numbers.
最好代表这个输入，一堆数字，也许是十个数字。

68
00:03:48,360 --> 00:03:53,600
And these ten numbers are just meant to reflect the meaning of this input in some way.
而这十个数字只是为了以某种方式反映这个输入的含义。

69
00:03:53,800 --> 00:03:59,000
And you could think of those ten numbers as just a list of numbers that in some way is meant
你可以将这十个数字视为只是一个数字列表，这些数字在某种程度上意味着

70
00:03:59,000 --> 00:04:05,040
to represent the concept being covered here by this sentence, this sequence of tokens.
来表示这个句子、这个标记序列所涵盖的概念。

71
00:04:05,280 --> 00:04:10,280
And when you do that, because it's ten different numbers, you could think of that as being what
当你这样做时，因为有十个不同的数字，你可以把它看作是

72
00:04:10,280 --> 00:04:14,680
a mathematician would call a vector, a set of numbers that represents a point in space.
数学家所说的向量，即表示空间中一个点的一组数字。

73
00:04:14,680 --> 00:04:20,400
If it were three numbers, it would literally be like an x, y, and a z coordinate to a point in space.
如果它是三个数字，那么它实际上就像空间中某个点的 x、y 和 z 坐标。

74
00:04:20,400 --> 00:04:26,520
If it's ten numbers, then we'd say it's a point in ten-dimensional space, which sounds
如果它是十个数字，那么我们会说它是十维空间中的一个点，这听起来

75
00:04:26,520 --> 00:04:27,200
very fancy.
非常花哨。

76
00:04:27,200 --> 00:04:32,360
It's just saying it's ten different numbers that can be thought of as coordinates to this point.
它只是说，这十个不同的数字可以看作这个点的坐标。

77
00:04:32,560 --> 00:04:36,000
But this is called a vector embedding.
但是，这被称为向量嵌入。

78
00:04:36,000 --> 00:04:42,160
When you take an input sequence, a set of input tokens, and then you map that to a bunch of
当您获取一组输入序列、一组输入标记，然后将其映射到一堆

79
00:04:42,160 --> 00:04:42,920
numbers.
数字。

80
00:04:42,920 --> 00:04:45,600
And there are a number of examples of these used in practice.
并且有许多在实践中使用的例子。

81
00:04:45,600 --> 00:04:48,320
One of the very first was called BERT.
最早的模型之一叫 BERT。

82
00:04:48,360 --> 00:04:50,680
It was created by Google in 2018.
它是由谷歌于 2018 年创建的。

83
00:04:50,680 --> 00:04:51,840
It predates GPT.
它早于 GPT。

84
00:04:51,960 --> 00:04:53,000
It was way back.
那是很久以前的事了。

85
00:04:53,280 --> 00:04:55,520
It was before we even used the expression LLM.
那是在我们使用“LLM”一词之前。

86
00:04:55,960 --> 00:05:03,080
But BERT stood for Bidirectional Encoder Representations from Transformers.
BERT 代表 Bidirectional Encoder Representations from Transformers（来自 Transformer 的双向编码器表示）。

87
00:05:03,360 --> 00:05:05,280
And yeah, it was very popular then.
是的，当时它非常流行。

88
00:05:05,280 --> 00:05:08,880
I remember being amazed by it when I first played with it back then.
我记得当我第一次玩它时，我感到很惊讶。

89
00:05:08,880 --> 00:05:12,530
And it's still used today, but very, very popular.
它至今仍在使用，但非常非常受欢迎。

90
00:05:13,090 --> 00:05:18,010
OpenAI embeddings is a whole category of embedding models from OpenAI.
OpenAI 嵌入是 OpenAI 的一整类嵌入模型。

91
00:05:18,250 --> 00:05:23,170
The one that I use is called
我使用的那个叫作

92
00:05:23,570 --> 00:05:24,770
text-embedding-3-large.
text-embedding-3-large。

93
00:05:24,770 --> 00:05:26,930
And there's also a 3-small as well.
还有一个 3-small。

94
00:05:27,290 --> 00:05:33,050
And then there's a very popular open source model that we'll also be experimenting with, called
还有一个非常流行的开源模型，我们也会用它做实验，叫作

95
00:05:33,090 --> 00:05:35,570
all-MiniLM-L6-v2.
all-MiniLM-L6-v2。

96
00:05:35,690 --> 00:05:42,850
So these are all examples of autoencoding LLMs, also known as embedding models, that are able to
这些都是自编码 LLM 的示例，也称为嵌入模型，能够

97
00:05:42,890 --> 00:05:50,450
take an input, a series of tokens and turn it into a set of numbers that we call a vector.
接受一个输入，一系列标记，并将其转换为一组我们称为向量的数字。

98
00:05:50,450 --> 00:05:55,010
And I just want to tackle something that is a common source of confusion for people that first meet
我想先解决一个初次接触这些内容的人常有的困惑，

99
00:05:55,010 --> 00:06:01,930
this, which is people say, okay, so what's the difference between tokens and vectors?
那就是人们会问：好吧，那么标记和向量有什么区别呢？

100
00:06:02,290 --> 00:06:05,810
And they are super different and it's very easy to explain.
它们非常不同，而且很容易解释。

101
00:06:05,810 --> 00:06:07,730
So let me demystify that right away.
那么让我立即揭开这个神秘面纱。

102
00:06:07,730 --> 00:06:12,170
One way to put it is that tokens are inputs and vectors are outputs.
一种说法是，令牌是输入，向量是输出。

103
00:06:12,250 --> 00:06:20,730
But basically, tokens are just a super simple way of saying we can't put words into a model.
但基本上，令牌只是一种超级简单的方式，表示我们不能将单词放入模型中。

104
00:06:20,770 --> 00:06:24,170
A model is a mathematical thing that adds things up and multiplies them.
模型是一种将事物相加和相乘的数学事物。

105
00:06:24,210 --> 00:06:25,530
It can't take a word.
它无法接收一个单词。

106
00:06:25,610 --> 00:06:30,410
So at the very beginning, we have to start by having like a number to represent each word, like the
所以一开始，我们必须先用一个数字来代表每个单词，比如

107
00:06:30,410 --> 00:06:35,530
word "and" might be number one, "the" might be word number two, and so on, so that we
单词“and”可能是 1 号，“the”可能是 2 号，依此类推，这样我们

108
00:06:35,530 --> 00:06:40,250
can map words to single numbers, and that's what tokens are.
就可以把单词映射为单个数字，这就是标记。

109
00:06:40,290 --> 00:06:43,250
They're just a numeric representation of the input.
它们只是输入的数字表示。

110
00:06:43,290 --> 00:06:45,450
It's just the input in another form.
这只是另一种形式的输入。

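A minimal sketch of that word-to-number mapping. The vocabulary `VOCAB`, the placeholder `UNKNOWN_ID` and the function name are assumptions made up for illustration; they do not come from any real tokenizer.

```python
# Hypothetical word-level vocabulary: each known word gets one integer ID.
VOCAB = {"and": 1, "the": 2, "cat": 3, "sat": 4}
UNKNOWN_ID = 0  # assumed placeholder for words missing from the vocabulary


def words_to_ids(text):
    """Map each whitespace-separated word to its integer ID."""
    return [VOCAB.get(word, UNKNOWN_ID) for word in text.lower().split()]


print(words_to_ids("The cat sat"))        # [2, 3, 4]
print(words_to_ids("The cat and Pablo"))  # [2, 3, 1, 0]
```

The same text always maps to the same IDs, so this really is just the input in another form. A name such as "Pablo" falls outside this tiny vocabulary and collapses to the placeholder ID.
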
111
00:06:45,650 --> 00:06:50,490
And it turns out that if you try and map every word to one of these numbers, then you have some
事实证明，如果你尝试将每个单词映射到这些数字之一，那么你就会遇到一些

112
00:06:50,530 --> 00:06:55,130
issues with how many words there are, like what happens when you run out of vocabulary?
与单词数量有关的问题，比如当你的词汇表用完时会发生什么？

113
00:06:55,290 --> 00:06:57,930
Uh, what do you do with proper names and things like that?
呃，你如何处理专有名称之类的东西？

114
00:06:57,970 --> 00:07:02,890
At one point, we tried to have every character, every letter mapped to a number, but then things
在某一时刻，我们试图将每个字符、每个字母映射到一个数字，但后来事情

115
00:07:02,890 --> 00:07:04,930
got really complicated, really fast.
变得非常复杂，非常快。

116
00:07:04,930 --> 00:07:10,330
And so there's this nice, happy medium, as we explored all the way back in week one, which is
所以就有了这个不错的折中方案，正如我们早在第一周探讨过的，那就是

117
00:07:10,330 --> 00:07:15,770
taking fragments of words instead of words or letters, and that just turns out to work really, really
采用单词片段而不是单词或字母，事实证明这样做的效果真的非常

118
00:07:15,770 --> 00:07:16,250
well.
好。

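A rough sketch of the word-fragment idea, using a greedy longest-match split. The fragment table `FRAGMENTS` is invented for illustration, and greedy matching is a simplification: real tokenizers learn their fragments from data and apply their own rules.

```python
# Hypothetical fragment vocabulary. Single letters act as a fallback so that
# any lowercase a-z word can be split; real tokenizers learn fragments from data.
FRAGMENTS = {"token": 1, "iz": 2, "er": 3, "s": 4, "un": 5, "believ": 6, "able": 7}
for offset, letter in enumerate("abcdefghijklmnopqrstuvwxyz"):
    FRAGMENTS.setdefault(letter, 100 + offset)


def split_into_fragments(word):
    """Greedy longest-match split of a lowercase a-z word into known fragments."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in FRAGMENTS:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no fragment covers {word[start]!r}")
    return pieces


def encode(word):
    """Word -> list of integer token IDs, one per fragment."""
    return [FRAGMENTS[piece] for piece in split_into_fragments(word)]


print(split_into_fragments("tokenizers"))  # ['token', 'iz', 'er', 's']
print(encode("unbelievable"))              # [5, 6, 7]
```

Common fragments keep sequences short, while the single-letter fallback means a rare word or proper name can still be represented instead of running out of vocabulary.
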
119
00:07:16,250 --> 00:07:21,890
So there is some hocus pocus about the way that you create these tokens from text.
因此，关于从文本创建这些标记的方式有一些诡计。

120
00:07:21,890 --> 00:07:27,210
But it's basically just a sort of mapping and a few rules, nothing fancy to it.
但是它基本上只是一种映射和一些规则，没有什么花哨的。

121
00:07:27,250 --> 00:07:33,970
Tokenizers are very simplistic bits of code, and the fact that you have to map text into tokens, it's
分词器是非常简单的代码，事实上，您必须将文本映射到标记中，这是

122
00:07:33,970 --> 00:07:39,890
just something that you go through to come up with numbers that can be the inputs into a model. That
只是为了得出可以作为模型输入的数字而必须经过的一步。这

123
00:07:39,890 --> 00:07:40,890
is tokens.
就是标记。

124
00:07:40,890 --> 00:07:42,250
They are the inputs.
它们是输入。

125
00:07:42,570 --> 00:07:49,330
Vectors are what comes out the other side after a model, an LLM, has taken in these simplistic tokens.
向量是模型之后从另一边出来的东西，LLM 接受了这些简单的标记。

126
00:07:49,330 --> 00:07:51,690
And it's got some meaning from it.
它有一些意义。

127
00:07:51,730 --> 00:07:55,210
It's understood the relationship between these different words.
理解了这些不同单词之间的关系。

128
00:07:55,250 --> 00:07:59,570
It's understood what point you're trying to make, how to represent this information.
人们知道你想表达什么、想表达什么观点、如何表示这些信息。

129
00:07:59,570 --> 00:08:05,210
And it's turned it into a series of numbers that reflect the meaning of the input.
它将其转化为一系列反映输入含义的数字。

130
00:08:05,210 --> 00:08:08,930
And it is an output of the model and that is a vector.
它是模型的输出，也是一个向量。

131
00:08:08,930 --> 00:08:10,610
So it always starts with tokens.
所以它总是以令牌开始。

132
00:08:10,650 --> 00:08:16,890
Even encoding models take tokens as their inputs, text goes to tokens, goes into the model and what
即使是编码模型也以标记作为输入：文本变成标记，标记进入模型，然后

133
00:08:16,890 --> 00:08:19,050
comes out is the vector.
出来的就是向量。

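A toy sketch of that pipeline: text goes to tokens, tokens go into the "model", and one vector comes out. Everything here is an assumption for illustration: the vocabulary, the three-number token vectors, and the use of mean pooling as a stand-in for the model. A real embedding model learns its numbers and runs transformer layers before producing the output vector.

```python
# Hypothetical vocabulary and per-token vectors (3 numbers each). In a real
# embedding model these numbers are learned, and transformer layers mix
# information between tokens before anything is pooled.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
TOKEN_VECTORS = [
    [1.0, 0.0, 0.0],  # "the"
    [0.0, 2.0, 0.0],  # "cat"
    [0.0, 1.0, 3.0],  # "sat"
]


def tokenize(text):
    """Text -> token IDs (the model's input)."""
    return [VOCAB[word] for word in text.lower().split()]


def embed(text):
    """Text -> tokens -> one fixed-size vector (the model's output)."""
    token_ids = tokenize(text)
    vectors = [TOKEN_VECTORS[token_id] for token_id in token_ids]
    # Mean pooling: average each coordinate across all tokens.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(tokenize("the cat sat"))  # [0, 1, 2]
print(embed("the cat"))         # [0.5, 1.0, 0.0]
```

Note that the output always has three numbers no matter how many tokens go in: the tokens are the input, and the single vector is the output that reflects the whole input.
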
134
00:08:19,370 --> 00:08:20,610
And I know what you're thinking.
我知道你在想什么。

135
00:08:20,610 --> 00:08:25,610
The pros are thinking, okay, but it's more complicated than that because actually vectors aren't just
专业人士在想，好吧，但事情比这更复杂，因为实际上向量不仅仅是

136
00:08:25,650 --> 00:08:26,690
the outputs.
输出。

137
00:08:26,810 --> 00:08:33,970
Vectors are also what happens inside an LLM. Between the various layers, you can think of the information
向量也存在于 LLM 内部：在各层之间，你可以把

138
00:08:33,970 --> 00:08:38,570
that's moving from one layer to another as being like vectors at each point.
从一层传到另一层的信息看作每一步的向量。

139
00:08:38,570 --> 00:08:45,250
And so, in fact, in many ways, the tokens that come in even to an autoregressive model, they start
因此，事实上，在很多方面，甚至进入自回归模型的标记，它们开始

140
00:08:45,250 --> 00:08:50,610
by being turned into vectors in terms of the internal representation of an LLM.
先被转化为向量，作为 LLM 的内部表示。

141
00:08:50,730 --> 00:08:54,690
But that's a pro thing that you don't need to worry about if you didn't follow that.
但这是一件非常专业的事情，如果您不遵循这一点，您无需担心。

142
00:08:54,690 --> 00:09:00,010
All you need to really understand is that tokens are just a simple way to represent text that comes
您需要真正理解的是，标记只是表示出现的文本的简单方法

143
00:09:00,010 --> 00:09:00,850
in the input.
在输入中。

144
00:09:01,090 --> 00:09:08,170
Vectors are a sort of advanced internal transformer representation of the information as
向量是信息在 Transformer 内部的一种高级表示，表现为

145
00:09:08,170 --> 00:09:10,490
a series of different numbers in a vector.
一个向量中的一系列不同数字。

146
00:09:10,730 --> 00:09:13,210
Those are vectors, those are outputs.
这些是向量，那些是输出。

147
00:09:13,210 --> 00:09:15,530
Or maybe they are intermediate outputs.
或者它们也可能是中间输出。

148
00:09:15,530 --> 00:09:18,810
But that is the key difference between tokens and vectors.
但这是标记和向量之间的主要区别。