1
00:00:00,160 --> 00:00:05,040
It is week four, day two, and we're getting straight into it already.
这是第四周的第二天，我们已经直接进入正题了。

2
00:00:05,080 --> 00:00:06,800
You know how to code with frontier models.
您知道如何使用前沿模型进行编码。

3
00:00:06,800 --> 00:00:11,600
You know how to build solutions with Hugging Face and you know how to compare different benchmarks.
您知道如何使用 Hugging Face 构建解决方案，并且知道如何比较不同的基准。

4
00:00:11,640 --> 00:00:14,520
You know about the good parts and the bad parts.
你知道好的部分和坏的部分。

5
00:00:14,960 --> 00:00:18,640
Today you're gonna know how to navigate different leaderboards.
今天您将了解如何浏览不同的排行榜。

6
00:00:18,640 --> 00:00:20,360
I am such a fan of leaderboards.
我非常喜欢排行榜。

7
00:00:20,360 --> 00:00:26,440
In fact, when I was interviewed by my great friend Jon Krohn on the Data Science podcast, he described
事实上，当我在数据科学播客上接受我的好朋友 Jon Krohn 的采访时，他

8
00:00:26,440 --> 00:00:30,120
me as a "leader bore" because of how much I go on about leaderboards.
把我称为“leader bore”（排行榜话痨），因为我总是没完没了地谈论排行榜。

9
00:00:30,160 --> 00:00:31,960
And that is what we're gonna do today.
这就是我们今天要做的。

10
00:00:32,240 --> 00:00:38,080
Uh, and I'm going to give you real-world use cases of LLMs solving commercial problems quickly as well.
呃，我还将向您提供 LLM 快速解决商业问题的真实用例。

11
00:00:38,080 --> 00:00:43,280
And then we'll go through how you choose the right LLM for your project.
然后我们将介绍如何为您的项目选择合适的 LLM。

12
00:00:43,280 --> 00:00:44,400
Let's get started.
让我们开始吧。

13
00:00:44,400 --> 00:00:48,200
So I'm going to start by walking you through five different leaderboards.
因此，我将首先带您浏览五个不同的排行榜。

14
00:00:48,360 --> 00:00:49,600
I'll talk about what they all are.
我来谈谈它们都是什么。

15
00:00:49,600 --> 00:00:52,480
And then we're going to go and check them out and look at some of them.
然后我们将去检查它们并查看其中的一些。

16
00:00:52,720 --> 00:00:58,920
These change from time to time, but as of now, and pretty much for a while now, these are the
这些会不时发生变化，但到目前为止，而且已经有相当一段时间了，这些是

17
00:00:58,920 --> 00:01:04,490
frontrunners when it comes to understanding the different performance of different models out there.
在了解不同模型的不同性能方面处于领先地位。

18
00:01:04,530 --> 00:01:08,410
The first of them is called Artificial Analysis.
第一个叫做 Artificial Analysis。

19
00:01:08,450 --> 00:01:13,890
It's a West Coast AI company that runs this leaderboard, and it is unbelievably good.
运营这个排行榜的是一家西海岸人工智能公司，它的表现令人难以置信。

20
00:01:14,010 --> 00:01:15,450
It's unbelievably good.
这真是令人难以置信的好。

21
00:01:15,450 --> 00:01:22,850
It's so clear, and it has such a great way of dimensioning how you think about what makes
它是如此清晰，它有一种非常好的方式来划分维度，帮助你思考是什么让

22
00:01:22,890 --> 00:01:29,690
LLMs good and bad across many things like intelligence and cost and speed, and you're going to be blown
LLM 在智能、成本和速度等许多方面有好有坏，你会被它

23
00:01:29,690 --> 00:01:31,410
away if you haven't seen it before. It's great.
惊艳到，如果您以前没有看过的话。它真的很棒。

24
00:01:31,410 --> 00:01:34,570
You should bookmark it or commit it to memory.
您应该将其添加为书签或将其牢记在心。

25
00:01:34,610 --> 00:01:43,010
Whatever. Artificial Analysis, that is where it is. Vellum is a nice New York-based AI company that has a nice
无论如何，Artificial Analysis，就是这个网站。Vellum 是一家位于纽约的优秀人工智能公司，它有一个很好的

26
00:01:43,010 --> 00:01:46,810
leaderboard that has a bunch of good features for comparing different LLMs.
排行榜，它有很多用于比较不同 LLM 的好功能。

27
00:01:46,810 --> 00:01:52,010
But what's particularly useful is it has one place where you can see the API cost and context window
但特别有用的是它有一个地方可以让您查看 API 成本和上下文窗口

28
00:01:52,010 --> 00:01:55,690
side by side for all major providers, which is super useful.
对于所有主要提供商来说并排，这非常有用。

29
00:01:55,690 --> 00:01:58,010
So I look at Vellum all the time.
所以我一直在看 Vellum。

30
00:01:58,330 --> 00:02:04,140
I should probably mention about a month ago, the CEO of Vellum contacted me on LinkedIn and said,
我可能应该提一下，大约一个月前，Vellum 的首席执行官在 LinkedIn 上联系我并说：

31
00:02:04,140 --> 00:02:07,980
"I hear you talk about our leaderboard on your course," which was cool.
“我听说你在课程中谈到了我们的排行榜”，这很酷。

32
00:02:08,180 --> 00:02:09,260
Uh, so thank you.
嗯，所以谢谢你。

33
00:02:09,420 --> 00:02:10,620
And yes, I do.
是的，我确实谈到了。

34
00:02:10,660 --> 00:02:11,540
It is a good leaderboard.
这是一个很好的排行榜。

35
00:02:11,540 --> 00:02:16,540
And he also did say, by the way, you know, we're better known for having a whole product, uh, that
他也确实说过，顺便说一句，你知道，我们因拥有完整的产品而闻名，呃，

36
00:02:16,580 --> 00:02:21,660
runs AI and allows engineers to build it into production and have lots of good stuff there.
运行人工智能并允许工程师将其构建到生产中并在那里拥有很多好东西。

37
00:02:21,660 --> 00:02:29,060
So while looking at the leaderboard, check out their main product offering at Vellum. And then
因此，在查看排行榜时，也请看看 Vellum 的主要产品。然后是

38
00:02:29,060 --> 00:02:29,900
scale.com.
scale.com。

39
00:02:30,020 --> 00:02:37,180
Obviously, they're a very well-known, popular AI startup, and now a company partly owned by Meta,
显然，他们是一家非常知名的热门人工智能初创公司，现在是一家由 Meta 部分持股的公司，

40
00:02:37,380 --> 00:02:42,700
and they have a set of leaderboards called the SEAL leaderboards that are expert, very specialized
他们有一套排行榜，称为 SEAL 排行榜，是专家级的、非常专业的

41
00:02:42,740 --> 00:02:47,100
leaderboards that we will enjoy looking at in a minute. Hugging Face,
排行榜，我们稍后会很乐意去看一看。Hugging Face，

42
00:02:47,220 --> 00:02:49,420
of course, they have their own set of leaderboards.
当然，他们有自己的一套排行榜。

43
00:02:49,420 --> 00:02:50,940
They're all Hugging Face Spaces.
它们都是 Hugging Face Spaces。

44
00:02:51,300 --> 00:02:53,660
They used to be the go-to leaderboard.
它们曾经是首选的排行榜。

45
00:02:53,660 --> 00:02:59,980
The Hugging Face open leaderboard was where everyone used to go, but they've stopped updating it
Hugging Face 的开放排行榜是每个人过去常去的地方，但他们已经停止更新了，

46
00:03:00,020 --> 00:03:03,980
for various reasons, possibly because people were gaming it too much.
原因有很多，可能是因为人们过度钻它的空子。

47
00:03:04,060 --> 00:03:08,180
So no longer is that the one to go to, but they have lots of others.
所以不再是那个可以去的地方，但他们还有很多其他地方。

48
00:03:08,180 --> 00:03:11,340
And so we'll take a quick, quick browse through some of them.
因此，我们将快速浏览其中的一些内容。

49
00:03:11,340 --> 00:03:14,460
And then the last one I'm going to mention is called LiveBench.
我要提到的最后一个叫做 LiveBench。

50
00:03:14,460 --> 00:03:20,740
It's a particular leaderboard that focuses on that question about dataset leakage, dataset
这是一个特定的排行榜，重点关注有关数据集泄漏、数据集的问题

51
00:03:20,740 --> 00:03:21,860
contamination.
污染。

52
00:03:21,860 --> 00:03:28,300
And they have a special approach for making sure that they are measuring true raw performance of models,
他们有一种特殊的方法来确保他们测量模型的真实原始性能，

53
00:03:28,300 --> 00:03:31,460
and they are not susceptible to this contamination issue.
而且它们不易受到这种污染问题的影响。

54
00:03:31,820 --> 00:03:34,740
So those are five great leaderboards.
这就是五个很棒的排行榜。

55
00:03:35,100 --> 00:03:37,100
I think we should go and take a look at them now.
我想我们现在应该去看看它们。

56
00:03:37,100 --> 00:03:40,780
It's generally never a good idea to pick favorites, but it's never stopped me before.
一般来说，挑选最喜欢的东西从来都不是一个好主意，但这从来没有阻止过我。

57
00:03:40,820 --> 00:03:42,220
My favorite leaderboard.
我最喜欢的排行榜。

58
00:03:42,340 --> 00:03:46,180
My favorite leaderboard is artificialanalysis.ai.
我最喜欢的排行榜是 artificialanalysis.ai。

59
00:03:46,540 --> 00:03:47,580
Here it is.
这里是。

60
00:03:47,740 --> 00:03:49,020
This is the leaderboard.
这是排行榜。

61
00:03:49,020 --> 00:03:53,420
And the first time I heard about this, when it had only just been created, it was a student that reached out
我第一次听说这个，是在它刚刚创建的时候，有一位学生联系

62
00:03:53,420 --> 00:03:55,060
to me and said, "Have you seen this new leaderboard?"
我说：“你看到这个新的排行榜了吗？”

63
00:03:55,220 --> 00:03:59,020
Uh, and uh, yeah, it is absolutely phenomenal.
呃，呃，是的，这绝对是惊人的。

64
00:03:59,110 --> 00:04:01,510
I would say it's the most popular one out there.
我想说这是最受欢迎的一种。

65
00:04:01,790 --> 00:04:09,430
Uh, and, uh, this is a company that independently analyzes LLMs, and they have a number of charts
呃，还有，呃，这是一家独立分析 LLM 的公司，他们有很多图表

66
00:04:09,430 --> 00:04:13,390
that rank them and places where you can dive in and learn about them.
对它们进行排名，以及您可以深入了解它们的地方。

67
00:04:13,390 --> 00:04:19,510
And there's some charts up at the top here, intelligence, speed and price, the sort of three big
顶部有一些图表，智力、速度和价格，三大要素

68
00:04:19,510 --> 00:04:21,110
ones that you want to start with.
那些你想开始的。

69
00:04:21,310 --> 00:04:24,590
Uh, and then each of them has a sort of breakdown with more detail.
呃，然后每个人都有更详细的细分。

70
00:04:24,590 --> 00:04:30,950
So we'll go for the more detailed ones, starting with the Artificial Analysis Intelligence Index.
因此，我们将看更详细的图表，从 Artificial Analysis Intelligence Index 开始。

71
00:04:30,950 --> 00:04:34,510
It incorporates ten different evaluations.
它包含十种不同的评估。

72
00:04:34,510 --> 00:04:41,390
You remember we talked about MMLU-Pro, the difficult version of the, uh, Massive Multitask Language
您还记得我们谈论过 MMLU-Pro，也就是大规模多任务语言

73
00:04:41,390 --> 00:04:46,310
Understanding. GPQA Diamond is the particular one.
理解的困难版本。GPQA Diamond 是这里用的那一个。

74
00:04:46,310 --> 00:04:53,070
This is the Google-Proof Q&A. Humanity's Last Exam we talked about. LiveCodeBench.
这是防谷歌搜索的问答（Google-Proof Q&A）。Humanity's Last Exam 我们谈过。还有 LiveCodeBench。

75
00:04:53,270 --> 00:04:55,390
Uh, and we talked about AIME, I think.
呃，我想我们还讨论了 AIME。

76
00:04:55,430 --> 00:05:00,440
And the others are a bunch of other ones that all combine together into this overall score.
其余的是一些其他评估，所有这些结合起来得到这个总分，

77
00:05:00,640 --> 00:05:04,720
That is how powerful a particular LLM is.
也就是某个特定 LLM 的强大程度。

78
00:05:04,720 --> 00:05:06,360
And this is the results as of now.
这是目前的结果。

79
00:05:06,360 --> 00:05:08,480
But you will have more up to date results.
但您将获得更多最新结果。

80
00:05:08,480 --> 00:05:09,720
There'll be more models since then.
从那时起将会有更多的模型。

81
00:05:09,720 --> 00:05:12,080
And so you should go here and shouldn't listen to what I say.
所以你应该去这里而不应该听我说的话。

82
00:05:12,080 --> 00:05:13,360
You should look at what you're seeing.
你应该看看你所看到的。

83
00:05:13,360 --> 00:05:19,760
But for me right now, the model that scores the highest in this sort of this assembly of different
但现在对我来说，在这种不同的组合中得分最高的模型

84
00:05:19,760 --> 00:05:24,040
metrics, which is sort of weighted towards agentic coding a bit.
指标，它有点偏向智能体编码（agentic coding）。

85
00:05:24,120 --> 00:05:30,840
So it's not a huge surprise that the Codex version of GPT-5, which is the one that's particularly
因此，GPT-5 的 Codex 版本，也就是特别

86
00:05:30,840 --> 00:05:37,840
optimized for agentic coding, is the one with the highest intelligence score, um, almost equal
针对智能体编码优化的那个，智能得分最高并不令人意外，嗯，几乎持平于

87
00:05:37,840 --> 00:05:40,760
with GPT-5 High, and "High"
GPT-5 High，这里的 High

88
00:05:40,800 --> 00:05:45,480
there means that it's been asked to reason with the highest mode set.
意味着它被要求以最高的推理模式进行推理。

89
00:05:45,840 --> 00:05:49,840
And then comes Mr. Musk's Grok 4, which was at the top.
然后是马斯克先生的 Grok 4，它位于顶部。

90
00:05:49,840 --> 00:05:52,880
It was in pole position before GPT-5 came along.
在 GPT-5 出现之前，它一直处于领先地位。

91
00:05:53,080 --> 00:05:55,240
And then the new Claude 4.5 Sonnet.
然后是新的 Claude 4.5 Sonnet。

92
00:05:55,240 --> 00:05:56,080
Well, it's new for me.
嗯，这对我来说是新的。

93
00:05:56,280 --> 00:06:02,130
Uh, that is, I mean, it's still a very, very high score, but it is slightly behind
呃，我的意思是，这仍然是非常非常高的分数，但是稍微落后于

94
00:06:02,170 --> 00:06:03,010
the others.
其他人。

95
00:06:03,410 --> 00:06:08,290
Then Grok 4 in fast mode, and then Gemini 2.5 Pro, which when it came out was in the top position.
然后是快速模式的 Grok 4，然后是 Gemini 2.5 Pro，它推出时处于顶部位置。

96
00:06:08,290 --> 00:06:10,690
You see how they each get replaced over time.
您会看到它们如何随着时间的推移而被替换。

97
00:06:10,850 --> 00:06:17,770
And then 4.1 Opus, the massive version of the next tier down of Claude, and then remarkably,
然后是 4.1 Opus，Claude 下一档的大型版本，然后值得注意的是，

98
00:06:17,810 --> 00:06:22,570
absolutely remarkably, in the next place is the first open source model.
绝对引人注目的是，排在下一位的是第一个开源模型。

99
00:06:22,570 --> 00:06:26,810
All the ones that I've mentioned up to now have been have been paid frontier closed source models.
到目前为止我提到的所有模型都是付费的前沿闭源模型。

100
00:06:26,850 --> 00:06:35,330
The first open source model is gpt-oss-120b, the big version of GPT OSS that I think we just
第一个开源模型是 gpt-oss-120b，也就是 GPT OSS 的大版本，我想我们刚刚

101
00:06:35,330 --> 00:06:42,010
saw winning a game of Connect Four, but only just, uh, and it comes in this place, and
看到它赢得了一局四子棋，不过只是险胜，呃，它排在这个位置，

102
00:06:42,010 --> 00:06:45,250
it is of course, super fast as well and very cheap.
当然，它也超级快而且非常便宜。

103
00:06:45,490 --> 00:06:52,290
And then Qwen3 from Alibaba Cloud comes next, and then comes, uh, upstart DeepSeek.
接下来是来自阿里云的 Qwen3，然后是，呃，新贵 DeepSeek。

104
00:06:52,530 --> 00:06:56,250
Uh, V3.2 Experimental is the latest version of it.
呃，V3.2 Experimental 是它的最新版本。

105
00:06:56,290 --> 00:06:58,050
Still an experimental pre-release
仍然是实验性的预发布版，

106
00:06:58,050 --> 00:07:00,650
for me; it's probably out there for you.
对我来说是这样，对您来说它可能已经正式发布了。

107
00:07:00,650 --> 00:07:02,490
Or maybe it's already DeepSeek V4.
或者也许已经是 DeepSeek V4 了。

108
00:07:02,530 --> 00:07:03,130
Who knows?
谁知道？

109
00:07:03,330 --> 00:07:05,450
Uh, so that comes next.
呃，那就是接下来的事情。

110
00:07:05,450 --> 00:07:05,810
All right.
好的。

111
00:07:05,810 --> 00:07:13,410
And so those those are the top most intelligent models that perform best across this range of different
因此，这些是最顶级的智能模型，在这些不同的范围内表现最佳

112
00:07:13,450 --> 00:07:14,450
benchmarks.
基准。

113
00:07:14,570 --> 00:07:21,210
And you can keep digging in and look at these to get a really good sense of what are the strongest models
您可以继续深入研究并查看这些模型，以真正了解什么是最强的模型

114
00:07:21,210 --> 00:07:22,370
on the planet.
在这个星球上。

115
00:07:22,610 --> 00:07:25,690
And then the next one is a kind of scary thing to look at.
然后下一个看起来有点可怕。

116
00:07:25,730 --> 00:07:32,810
This is showing you a timeline of how models have performed against these scores, looking back over
这向您展示了模型如何根据这些分数执行的时间表，回顾一下

117
00:07:32,850 --> 00:07:36,810
time, looking, uh, looking back over time.
时间，呃，回顾过去。

118
00:07:36,810 --> 00:07:42,570
I make it sound like it was decades ago, all the way back to November 2022.
我说得好像是几十年前的事，其实是一直追溯到 2022 年 11 月。

119
00:07:42,730 --> 00:07:47,410
Uh, and this is based on the same intelligence index.
呃，这也是基于同样的智力指数。

120
00:07:47,530 --> 00:07:53,010
And you can see how the models have, have grown over time against this index.
您可以根据该指数看到模型如何随着时间的推移而增长。

121
00:07:53,170 --> 00:07:58,740
And, you know, people talk about model performance saturating and slowing down.
而且，你知道，人们谈论模型性能正在饱和并且放缓。

122
00:07:58,740 --> 00:08:04,340
But you look at a diagram like that and I ask you if it looks like there's any signs of things slowing
但你看看这样的图表，我问你，它看起来像是有任何放缓

123
00:08:04,340 --> 00:08:04,820
down.
的迹象吗？

124
00:08:05,020 --> 00:08:06,820
It's quite terrifying.
这是相当可怕的。

125
00:08:07,220 --> 00:08:14,260
It's also very interesting to keep in mind that a lot of the early gains were due to training-time
同样值得注意的是，早期的很多进步都归功于训练时

126
00:08:14,260 --> 00:08:14,780
techniques.
技术。

127
00:08:14,780 --> 00:08:20,500
A lot of the more recent gains are due to inference-time techniques, now that we've started to see diminishing
最近的许多进展都归功于推理时技术，因为我们已经开始看到

128
00:08:20,500 --> 00:08:22,300
returns from training time.
训练时带来的收益在递减。

129
00:08:22,300 --> 00:08:29,060
And that's why people often like to sort of talk down the GPT-5 release, because it
这就是为什么人们经常喜欢贬低 GPT-5 的发布，因为

130
00:08:29,060 --> 00:08:34,820
didn't feel from chatting with it like there was a marked change from GPT-4.1.
与它聊天时并没有感觉到相比 GPT-4.1 有明显的变化。

131
00:08:35,060 --> 00:08:40,020
And really the difference, the reason why these models are doing so much better comes down a lot to
真正的区别是，这些模型做得更好的原因很大程度上在于

132
00:08:40,060 --> 00:08:46,020
the inference-time techniques: the reasoning, the tweaking, the prompts, the added tools.
推理时技术：推理、调整、提示、添加的工具。

133
00:08:46,020 --> 00:08:51,540
It's this stuff that's really moving the needle recently, and why this line has kept going with this steep
正是这些东西最近真正推动了进步，也是这条线一直保持如此陡峭

134
00:08:51,540 --> 00:08:52,420
trajectory.
轨迹的原因。

135
00:08:52,540 --> 00:08:57,620
If we took out the inference time techniques, I think you'd absolutely see this kind of leveling off.
如果我们去掉推理时间技术，我想你绝对会看到这种趋于平稳的情况。