1
00:00:00,120 --> 00:00:01,240
Well, here we are.
好吧，我们到了。

2
00:00:01,280 --> 00:00:02,360
It is week seven.
现在是第七周。

3
00:00:02,360 --> 00:00:03,560
Day five.
第五天。

4
00:00:03,600 --> 00:00:06,000
Somehow it's the big reveal.
不知怎的，这是一个重大的揭露。

5
00:00:06,040 --> 00:00:07,480
The results day.
出结果的日子。

6
00:00:07,880 --> 00:00:09,440
Uh, but first, a reminder.
呃，但首先，提醒一下。

7
00:00:09,440 --> 00:00:14,920
You can already work with open-source and closed-source models, with tool calling, with assistants,
您已经可以使用开源和闭源模型、工具调用、助手，

8
00:00:14,920 --> 00:00:21,320
with RAG, and with a five-step strategy for solving a business problem, including curating data, fine-tuning a
以及 RAG，还有解决业务问题的五步策略，包括整理数据、微调

9
00:00:21,320 --> 00:00:27,960
frontier model, and now running QLoRA for an open-source model, including hyperparameter optimization,
前沿模型，以及现在为开源模型运行 QLoRA，包括超参数优化、

10
00:00:28,000 --> 00:00:31,040
monitoring the training, and the fabulous Weights & Biases.
监控训练，以及出色的 Weights & Biases。

11
00:00:31,440 --> 00:00:36,720
Hopefully you didn't get as overexcited as me, but at least you enjoyed the process of
希望你没有像我一样过度兴奋，但至少你享受了

12
00:00:36,760 --> 00:00:37,520
doing it.
做这件事的过程。

13
00:00:38,160 --> 00:00:40,640
And today, of course, it's results day.
当然，今天是结果日。

14
00:00:40,680 --> 00:00:46,480
We're going to run inference on a model which has been fine-tuned.
我们将对经过微调的模型进行推理。

15
00:00:46,480 --> 00:00:48,200
So with LoRA matrices and stuff.
所以会涉及 LoRA 矩阵之类的东西。

16
00:00:48,200 --> 00:00:53,280
So there are some things to cover there, and we'll wrap up so that you're in a position where
所以有一些事情需要涵盖并总结，你知道，所以你处于这样的位置：

17
00:00:53,280 --> 00:00:55,280
you could confidently do all of this yourself.
您可以自信地自己完成这一切。

18
00:00:55,480 --> 00:00:56,920
That's the big part of it.
这是其中最重要的部分。

19
00:00:57,120 --> 00:00:59,560
And I'm excited to show you the results.
嗯，我很高兴向您展示结果。

20
00:00:59,600 --> 00:01:05,760
Of course, what we're trying to get towards is something that is close to the performance
当然，我们正在努力实现的是接近性能的东西

21
00:01:05,850 --> 00:01:07,290
of a frontier model.
的前沿模型。

22
00:01:07,290 --> 00:01:14,450
Recognizing we're working with Llama 3.2, a 3-billion-parameter model, a tiny model by comparison
认识到我们正在使用 Llama 3.2，这是一个拥有 30 亿参数的模型，相比之下，这是一个很小的模型

23
00:01:14,450 --> 00:01:20,330
to multi-trillion-parameter frontier models, recognizing that we've quantized it down to four bits so
到数万亿参数的前沿模型，认识到我们已经将其量化为四位，因此

24
00:01:20,330 --> 00:01:23,490
it could run on a mobile phone pretty much for sure.
它几乎肯定可以在手机上运行。

25
00:01:23,650 --> 00:01:25,330
And so it's very small.
所以它很小。

26
00:01:25,370 --> 00:01:28,730
The question is, can we get close to frontier performance at least?
问题是，我们至少能接近前沿性能吗？

27
00:01:28,770 --> 00:01:29,850
Can it beat Ed?
能打败艾德吗？

28
00:01:29,890 --> 00:01:32,930
It should at least be able to beat my abilities.
至少应该能够击败我的能力。

29
00:01:32,930 --> 00:01:36,450
Maybe it could get close to GPT-4.1 nano or something like that.
也许它可以接近 GPT-4.1 nano 或类似的模型。

30
00:01:36,490 --> 00:01:41,090
That's what we're striving for when you're trying to fine-tune an open-source model, because
当您尝试微调开源模型时，这就是我们所努力的目标，因为

31
00:01:41,090 --> 00:01:43,290
it's free and it's something that you could use.
它是免费的，并且是您可以使用的东西。

32
00:01:43,610 --> 00:01:44,090
Okay.
好的。

33
00:01:44,490 --> 00:01:51,290
But first, before we run inference, I do want to close the loop on something
但首先，在我们进行推理之前，我确实想把一件事讲完整，

34
00:01:51,290 --> 00:01:57,090
that I've had hanging out there for a while, which is explaining what exactly this loss calculation is
这件事我已经搁置了一段时间，那就是解释这个损失计算到底是什么，

35
00:01:57,090 --> 00:01:58,650
that we keep covering.
也就是我们一直在讲的那个。

36
00:01:58,690 --> 00:02:00,450
What is the loss?
损失是什么？

37
00:02:00,450 --> 00:02:01,690
How is it calculated?
是如何计算的？

38
00:02:01,690 --> 00:02:04,050
Just a bit more detail on training.
只是关于训练的更多细节。

39
00:02:04,050 --> 00:02:07,490
So a few theory slides before we get into action. Okay.
在我们开始行动之前先看一些理论幻灯片。好的。

40
00:02:07,530 --> 00:02:08,290
Here we go.
开始了。

41
00:02:08,530 --> 00:02:11,100
You remember the four steps of training.
您还记得训练的四个步骤。

42
00:02:11,100 --> 00:02:12,220
You know them well now.
你现在很了解它们了。

43
00:02:12,260 --> 00:02:13,900
We coded them one by one.
我们将它们一一编码。

44
00:02:14,180 --> 00:02:15,460
Uh, a week ago.
呃，一周前。

45
00:02:15,460 --> 00:02:18,580
But now we know that you first do a forward pass.
但现在我们知道，首先要进行前向传播。

46
00:02:18,660 --> 00:02:23,620
It's when you have some input data and you predict what will be the next token by going through that
当你有一些输入数据并且你通过这些数据来预测下一个标记是什么时

47
00:02:23,620 --> 00:02:29,380
input data. Then the loss calculation: some calculation that says how wrong we were.
输入数据。然后是损失计算：一种衡量我们错得有多离谱的计算。

48
00:02:29,860 --> 00:02:35,100
Then you do the backward pass, which is when we say: okay, if we tweak each of these parameters, does it
然后进行反向传播，也就是我们问：如果微调每一个参数，它会

49
00:02:35,100 --> 00:02:37,020
make the loss better or worse?
使损失变得更好还是更糟？

50
00:02:37,300 --> 00:02:42,780
And then we take a tiny step in the direction that reduces loss.
然后我们朝着减少损失的方向迈出一小步。

51
00:02:42,820 --> 00:02:44,460
It's called the optimizer.
它被称为优化器。

52
00:02:44,500 --> 00:02:49,540
You remember we used an optimizer called Adam; AdamW, I think we used, which is very popular.
您还记得我们使用了一个名为 Adam 的优化器，我想我们用的是 AdamW，它非常流行。

53
00:02:49,580 --> 00:02:53,700
The original one is called SGD, stochastic gradient descent.
最初的一种称为 SGD，即随机梯度下降。

54
00:02:53,740 --> 00:02:58,740
That's the one which basically just says: take the learning rate times the gradients and go the
这基本上就是说：用学习率乘以梯度，然后朝

55
00:02:58,740 --> 00:03:01,380
other way. Negative learning rate times the gradients: take a step.
相反的方向走。负的学习率乘以梯度，迈出一步。

56
00:03:01,380 --> 00:03:01,900
That's it.
就是这样。
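
The SGD rule just described (step by the negative learning rate times the gradients) can be sketched as follows. This is only an illustration: the parameter values, gradient values and learning rate are hypothetical numbers, not taken from the course code.

```python
import numpy as np

def sgd_step(params, grads, learning_rate):
    """Plain SGD: move each parameter a small step against its gradient."""
    return params - learning_rate * grads

# Hypothetical values, purely for illustration.
params = np.array([0.5, -1.0, 2.0])
grads = np.array([0.2, -0.4, 0.0])   # d(loss)/d(param) from a backward pass
new_params = sgd_step(params, grads, learning_rate=0.1)
print(new_params)
```

A parameter with zero gradient stays where it is, and the others move in whichever direction lowers the loss. AdamW follows the same idea but adapts the step per parameter, which this sketch does not attempt to show.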

57
00:03:02,300 --> 00:03:04,460
So those are the four steps in training.
这就是训练的四个步骤。

58
00:03:04,620 --> 00:03:06,500
Uh, you've experienced this a few times now.
呃，你已经经历过几次这样的事情了。

59
00:03:06,500 --> 00:03:09,780
And what we're going to dig into is this loss calculation.
我们要深入研究的是这个损失计算。

60
00:03:09,780 --> 00:03:12,860
And just a bit more perspective on this whole idea.
对这整个想法有更多的看法。

61
00:03:13,100 --> 00:03:15,710
So again the first step is the forward pass.
因此，第一步还是前向传播。

62
00:03:15,710 --> 00:03:20,070
So there's an input prompt, an input sequence like "Price is $",
所以有一个输入提示，一个输入序列，比如 "Price is $"，

63
00:03:20,070 --> 00:03:21,950
and then the next token
然后下一个标记

64
00:03:21,950 --> 00:03:24,230
is what it's got to predict. That is an input sequence.
就是它要预测的。这就是输入序列。

65
00:03:24,230 --> 00:03:26,190
It's turned into token IDs.
它变成了令牌 ID。

66
00:03:26,310 --> 00:03:34,790
Those token IDs are fed into a neural network which consists of a base Llama 3.2 and then some LoRA
这些令牌 ID 被输入神经网络，该网络由基础 Llama 3.2 和一些 LoRA

67
00:03:34,790 --> 00:03:41,110
adapters, which are able to adapt, to slightly tune Llama, and which get
适配器组成，这些适配器能够适应、稍微调整 Llama，并且会被

68
00:03:41,110 --> 00:03:41,790
applied.
应用上去。

69
00:03:42,110 --> 00:03:46,230
What comes spitting out at the end is the predicted next token.
最后吐出的是预测的下一个标记。

70
00:03:46,230 --> 00:03:49,470
Let's say it says the price is $99.
假设它说价格是 99 美元。

71
00:03:49,590 --> 00:03:52,710
It predicts 99 is the most likely next token.
它预测 99 是最有可能的下一个标记。
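
As a rough sketch of what "a frozen base model plus LoRA adapters" means inside one layer of the forward pass: the base weight W stays fixed, and a small low-rank product B·A is added on top. The sizes, the scaling factor alpha and the eight-entry toy "vocabulary" below are assumptions made for illustration; a real Llama 3.2 layer is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4        # hypothetical toy sizes

W = rng.normal(size=(d_out, d_in))        # frozen base weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01     # trainable LoRA matrix
B = np.zeros((d_out, r))                  # trainable LoRA matrix, starts at zero

def lora_forward(x, W, A, B, alpha, r):
    """Base output plus the scaled low-rank LoRA update."""
    base = W @ x
    update = (alpha / r) * (B @ (A @ x))
    return base + update

x = rng.normal(size=d_in)                 # stand-in for one token's hidden state
logits = lora_forward(x, W, A, B, alpha, r)
predicted_token_id = int(np.argmax(logits))  # most likely "next token" in the toy vocabulary
print(predicted_token_id)
```

Because B starts at zero, the adapted layer initially gives exactly the base model's output; training then moves only A and B.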

72
00:03:52,710 --> 00:03:53,990
That's the forward pass.
这就是前向传播。

73
00:03:54,150 --> 00:03:56,630
The loss calculation comes next.
接下来是损失计算。

74
00:03:56,670 --> 00:04:00,510
The loss calculation takes the predicted next token, 99.
损失计算取预测的下一个标记 99。

75
00:04:00,550 --> 00:04:04,110
Let's say this thing actually cost $89.
假设这个东西实际上花了 89 美元。

76
00:04:04,110 --> 00:04:05,990
We're off by $10.
我们差了 10 美元。

77
00:04:06,270 --> 00:04:08,830
So we need to calculate some loss.
所以我们需要计算一些损失。

78
00:04:09,030 --> 00:04:15,030
And that loss has got to be related to how wrong we were. If we were exactly right,
这个损失必须与我们错得有多离谱相关。如果我们完全正确，

79
00:04:15,030 --> 00:04:20,570
If it was a perfect prediction, then the loss should be zero representing no error.
如果这是一个完美的预测，那么损失应该为零，代表没有错误。

80
00:04:20,610 --> 00:04:21,610
It was perfect.
太完美了。

81
00:04:21,890 --> 00:04:29,810
The more wrong we are, the worse we are, the higher the loss should be.
如果我们错得越多，我们的情况就越糟糕，损失就应该越大。

82
00:04:30,970 --> 00:04:31,530
Okay.
好的。

83
00:04:31,570 --> 00:04:33,410
A bigger number means a worse result.
数字越大意味着结果越差。

84
00:04:33,450 --> 00:04:33,730
Sure.
当然。

85
00:04:33,730 --> 00:04:35,930
So you might be thinking, okay, why not?
你可能会想，好吧，为什么不呢？

86
00:04:35,970 --> 00:04:37,610
How about we take the difference between them?
我们取它们之间的差值怎么样？

87
00:04:37,610 --> 00:04:38,610
The absolute error.
绝对误差。

88
00:04:38,610 --> 00:04:40,530
That's what we've been using all this time.
这就是我们一直在使用的。

89
00:04:40,730 --> 00:04:44,650
Or maybe the mean squared error that linear regression uses.
或者可能是线性回归使用的均方误差。

90
00:04:45,090 --> 00:04:46,890
But there's a problem with that that we will come to.
但这有一个问题，我们稍后会讲到。

91
00:04:47,250 --> 00:04:48,570
But you get the general idea.
但你已经了解了总体思路。

92
00:04:48,570 --> 00:04:49,930
That's what loss is all about.
这就是损失的意义所在。
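
The two candidate losses mentioned above can be written out with the numbers from the example (predicted 99, actual 89). This is only a sketch of those two formulas; the second pair of prices is made up, and the speaker notes that this approach has a problem, which the sketch does not address.

```python
predicted, actual = 99.0, 89.0

absolute_error = abs(predicted - actual)      # off by $10
squared_error = (predicted - actual) ** 2     # squaring punishes big misses more

def mean_absolute_error(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def mean_squared_error(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# A second, perfectly predicted item (hypothetical) contributes zero loss.
preds, targets = [99.0, 50.0], [89.0, 50.0]
print(absolute_error, squared_error)
print(mean_absolute_error(preds, targets), mean_squared_error(preds, targets))
```

Both behave as required so far: a perfect prediction gives zero, and the further off the prediction, the larger the number.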

93
00:04:50,250 --> 00:04:56,410
Then we do something called the backward pass, which again is saying, okay, let's go to every single
然后我们做一些叫做向后传递的事情，这又是说，好吧，让我们看看每一个

94
00:04:56,410 --> 00:05:01,490
parameter and say if we were to twiddle it a bit, would that make loss bigger or smaller.
参数并说如果我们稍微调整一下，损失会更大还是更小。

95
00:05:01,690 --> 00:05:07,850
And that, of course, is called calculating the gradients, because it's a calculus thing. And we
这当然被称为计算梯度，因为这属于微积分。而且我们

96
00:05:07,850 --> 00:05:12,210
only have to do that for the LoRA adapters, because we've frozen all of the base model.
只需对 LoRA 适配器执行此操作，因为我们已经冻结了整个基础模型。

97
00:05:12,210 --> 00:05:18,170
So we're only doing this tweaking for these smaller LoRA adapters. They are being subjected
所以我们只对这些较小的 LoRA 适配器进行这种调整。它们正在接受

98
00:05:18,170 --> 00:05:20,770
to gradient based optimization.
基于梯度的优化。

99
00:05:20,770 --> 00:05:22,050
It's how you'd explain it.
这就是你要解释的方式。

100
00:05:22,050 --> 00:05:29,180
And doing this would be a very laborious and complex calculation involving tons and
这样做是一项非常费力且复杂的计算，涉及大量和

101
00:05:29,180 --> 00:05:30,460
tons of maths.
大量的数学。

102
00:05:30,460 --> 00:05:32,380
But there is a clever trick.
但有一个巧妙的技巧。

103
00:05:32,580 --> 00:05:38,980
A trick that's known as backprop, or backpropagation, which is a way to calculate the gradients,
一种称为 backprop（即反向传播）的技巧，这是一种计算梯度的方法，

104
00:05:39,020 --> 00:05:45,180
to calculate how sensitive everything is to being changed slightly, by working from the
计算一切对微小改变的敏感度，做法是从

105
00:05:45,220 --> 00:05:50,060
bottom of the neural network, working all the way back up to the inputs again, and calculating all
神经网络的底部，再次回到输入，并计算所有

106
00:05:50,060 --> 00:05:53,380
the gradients as a function of the gradients that came before.
梯度作为之前梯度的函数。

107
00:05:53,660 --> 00:05:58,420
And it's because it uses a mathematical trick called the chain rule, which is something that you might
这是因为它使用了一种称为链式法则的数学技巧，您可能会

108
00:05:58,420 --> 00:06:03,700
remember from high school, like a calculus thing, which is how you're able to calculate gradients
记得高中时，就像微积分一样，这就是你如何计算梯度的

109
00:06:03,700 --> 00:06:05,620
in terms of other gradients.
就其他梯度而言。

110
00:06:05,660 --> 00:06:11,860
And so it uses this approach of repeatedly applying the chain rule working backwards.
因此它使用了这种反复应用反向链式法则的方法。

111
00:06:11,980 --> 00:06:15,060
And this was something that was invented some time ago.
这是很久以前就发明的东西。

112
00:06:15,060 --> 00:06:21,620
But this trick is one of the secrets that has made training neural networks
但这个技巧是训练神经网络的秘密之一

113
00:06:21,620 --> 00:06:22,740
so effective.
如此有效。

114
00:06:22,740 --> 00:06:27,660
It's such an efficient way to do it that an operation that would have been very, very time consuming
这是一种非常有效的方法，以至于原本会非常非常耗时的操作

115
00:06:27,700 --> 00:06:29,300
can happen rapidly.
可能会迅速发生。

116
00:06:29,300 --> 00:06:32,630
And particularly it can happen very efficiently, in parallel rapidly.
特别是它可以非常高效、快速地并行发生。

117
00:06:32,750 --> 00:06:37,790
And so we're able to calculate the gradients with this thing called backpropagation, or backprop.
因此我们能够用这种称为反向传播（backprop）的方法来计算梯度。

118
00:06:37,790 --> 00:06:40,590
And all of that is known as the backward pass.
所有这些都称为向后传递。
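
To make the chain-rule idea concrete, here is a minimal sketch on a toy network with just two weights. The network shape (a tanh unit followed by a linear unit), the squared-error loss and all the numbers are assumptions chosen for illustration; the gradients are worked backwards from the loss, each one reusing the gradient computed just before it, and then checked against a brute-force "twiddle it a bit" estimate.

```python
import math

def forward(x, w1, w2):
    h = math.tanh(w1 * x)    # hidden activation
    y = w2 * h               # network output
    return h, y

def loss_fn(y, target):
    return (y - target) ** 2

def backward(x, w1, w2, target):
    """Chain rule, applied from the loss back towards the input."""
    h, y = forward(x, w1, w2)
    dL_dy = 2.0 * (y - target)           # loss w.r.t. output
    dL_dw2 = dL_dy * h                   # reuse dL_dy
    dL_dh = dL_dy * w2                   # reuse dL_dy
    dL_dw1 = dL_dh * (1.0 - h ** 2) * x  # reuse dL_dh; tanh' = 1 - tanh^2
    return dL_dw1, dL_dw2

def numeric_grad(f, v, eps=1e-6):
    """Brute force: nudge the value both ways and see how the loss moves."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)

x, w1, w2, target = 0.5, 0.8, -1.2, 0.3   # hypothetical values
g1, g2 = backward(x, w1, w2, target)
n1 = numeric_grad(lambda w: loss_fn(forward(x, w, w2)[1], target), w1)
n2 = numeric_grad(lambda w: loss_fn(forward(x, w1, w)[1], target), w2)
print(g1, n1)
print(g2, n2)
```

The brute-force check needs two extra forward passes per parameter, which is why it would be hopeless for billions of parameters, whereas the backward pass gets every gradient in one sweep.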

119
00:06:40,750 --> 00:06:41,150
Okay.
好的。

120
00:06:41,190 --> 00:06:42,230
So that's backprop.
这就是反向传播。

121
00:06:42,230 --> 00:06:43,470
And you've probably heard that expression.
你可能听说过这个表达方式。

122
00:06:43,470 --> 00:06:47,270
If you want to know more about backprop then then it's easy to look it up and get more detail.
如果您想了解更多有关反向传播的信息，那么很容易查找并获得更多详细信息。

123
00:06:47,270 --> 00:06:49,190
But I don't want to get too into the gory detail here.
但我不想在这里过多讨论血淋淋的细节。

124
00:06:49,190 --> 00:06:51,990
Actually, the algorithm itself is very old.
实际上算法本身已经非常古老了。

125
00:06:52,030 --> 00:06:59,470
I think it's from the 1970s or so, but it was in the 1980s, I think 1986, that there was
我想它出现在 1970 年代左右，但在 1980 年代，我想是 1986 年，才有了

126
00:06:59,510 --> 00:07:00,470
the very famous paper.
那篇非常有名的论文。

127
00:07:00,470 --> 00:07:05,110
One of the authors was Geoff Hinton, who is considered one of the godfathers of
其中一位作者是杰夫·辛顿 (Geoff Hinton)，他被认为是

128
00:07:05,150 --> 00:07:10,910
modern AI, and that really brought it to the forefront for use in neural networks
现代人工智能的教父之一，这篇论文确实把它推到了神经网络应用的最前沿，

129
00:07:10,910 --> 00:07:11,510
in AI.
也就是在人工智能领域。

130
00:07:11,790 --> 00:07:13,750
So backprop is super essential.
嗯，所以反向传播非常重要。

131
00:07:13,790 --> 00:07:14,350
All right.
好的。

132
00:07:14,510 --> 00:07:16,390
And then optimization is the fourth step.
然后优化是第四步。

133
00:07:16,390 --> 00:07:21,230
And this is where we take all of the adapters and we slightly step them in the direction
这就是我们取出所有适配器，并让它们朝某个方向稍微移动一步的地方，

134
00:07:21,230 --> 00:07:24,630
that's going to mean that next time the loss will be less. Okay.
这个方向意味着下次损失会更小。好的。

135
00:07:24,670 --> 00:07:25,150
You get it.
你明白了。

136
00:07:25,150 --> 00:07:26,790
You know about these four steps.
这四个步骤你都知道。
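
Putting the four steps together, a toy training loop might look like the sketch below. It is not the course's training code: it fits a three-weight linear model on made-up data, uses mean squared error as a stand-in loss, writes the gradient out by hand, and takes a plain full-batch gradient step rather than using AdamW.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))              # made-up inputs
true_w = np.array([2.0, -1.0, 0.5])       # weights we hope to recover
y = X @ true_w                            # made-up targets

w = np.zeros(3)                           # trainable parameters
learning_rate = 0.1
losses = []

for step in range(200):
    preds = X @ w                         # 1. forward pass
    error = preds - y
    loss = np.mean(error ** 2)            # 2. loss calculation (MSE stand-in)
    grads = 2.0 * X.T @ error / len(y)    # 3. backward pass (gradient of the loss)
    w = w - learning_rate * grads         # 4. optimization: small step downhill
    losses.append(loss)

print(losses[0], losses[-1])
print(w)
```

The loss shrinks step after step and the learned weights end up close to the ones used to generate the data, which is the same pattern a falling training-loss curve shows during fine-tuning.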

137
00:07:26,790 --> 00:07:34,790
So now let me go one level deeper so that I can explain what exactly is the loss calculation
现在让我更深入地解释一下损失计算到底是什么

138
00:07:34,790 --> 00:07:35,270
itself.
本身。