1
00:00:00,080 --> 00:00:01,480
Now there's a constraint.

2
00:00:01,480 --> 00:00:05,520
When working with LMS, you need to be aware of, which is known as the context window.

3
00:00:05,520 --> 00:00:11,080
And the context window is the maximum number of tokens that any particular model can fit that it can

4
00:00:11,120 --> 00:00:11,960
it can look back on.

5
00:00:11,960 --> 00:00:15,240
It can consider for when it's generating the next token.

6
00:00:15,240 --> 00:00:19,120
It's the maximum length of the input that it can that it can handle.

7
00:00:19,120 --> 00:00:22,240
And if you try and pass in more input than that, then it will fail.

8
00:00:22,240 --> 00:00:23,960
It will say bigger than my context window.

9
00:00:23,960 --> 00:00:24,840
Can't do it.

10
00:00:25,400 --> 00:00:30,600
Now here's something to keep in mind that that thing that gets that needs to fit within the context

11
00:00:30,600 --> 00:00:34,520
window isn't just your message you've just given to the LM.

12
00:00:34,760 --> 00:00:39,920
But as you now understand, it's what the actual input sequence isn't just the most recent message,

13
00:00:39,920 --> 00:00:41,240
it's the whole conversation.

14
00:00:41,240 --> 00:00:41,800
So far.

15
00:00:41,960 --> 00:00:43,320
It's the first thing you said.

16
00:00:43,360 --> 00:00:46,200
Hi, my name is Ed and then nice to meet you, Ed.

17
00:00:46,240 --> 00:00:47,600
And then what's my name?

18
00:00:47,800 --> 00:00:50,520
All of that has to fit in the context window.

19
00:00:50,960 --> 00:00:55,840
And actually it's a little bit more than that too, because you have to pass in all of that.

20
00:00:55,840 --> 00:01:02,470
And then GPT generates the most likely next token, which is perhaps the word your is going to say.

21
00:01:02,470 --> 00:01:05,710
Your name is ed your, it produces your.

22
00:01:05,910 --> 00:01:12,670
And then the whole of that input sequence is fed back in again to GPT with the word your at the end

23
00:01:12,670 --> 00:01:16,550
of it, and it generates the next token of name.

24
00:01:16,990 --> 00:01:24,390
And then in goes all the input with your name and it generates is and then your name is Ed.

25
00:01:24,790 --> 00:01:28,590
It generates tokens one at a time and the whole input is passed back in.

26
00:01:28,590 --> 00:01:34,030
And that means that what you need to be able to fit in the context window is the original input, the

27
00:01:34,030 --> 00:01:39,470
replies, the next input, the final message you're giving it, and then all of the generated tokens

28
00:01:39,470 --> 00:01:44,510
that it comes up with up until and not including the very last token that it generates.

29
00:01:44,550 --> 00:01:47,630
All of that has got to fit in the context window.

30
00:01:47,630 --> 00:01:55,150
And so obviously the context window governs how much the background the model can remember about references

31
00:01:55,190 --> 00:01:57,110
and content and context.

32
00:01:57,110 --> 00:01:58,790
And as I gave the example before.

33
00:01:58,830 --> 00:02:03,990
If you're shoving all of the ticket prices to different cities in the prompt, then it allows it to

34
00:02:04,030 --> 00:02:07,190
be able to remember a certain number of those references.

35
00:02:07,470 --> 00:02:13,390
And it's particularly important for a technique called multi-shot prompting, which is a technique where

36
00:02:13,630 --> 00:02:20,750
in the input you give a series of examples of example questions and answers for it to kind of draw from,

37
00:02:20,750 --> 00:02:23,390
as it's trying to figure out what sequence to generate.

38
00:02:23,390 --> 00:02:26,870
And it's also important for techniques like Rag that of course we will get to.

39
00:02:26,950 --> 00:02:32,070
But many of these inference time techniques we talked about about training time and inference time scaling,

40
00:02:32,070 --> 00:02:37,790
many of these inference time techniques make hefty use of the context window, and that's why it's important

41
00:02:37,790 --> 00:02:38,710
to keep it in mind.

42
00:02:38,710 --> 00:02:43,830
And people that have used Claude code say, are very aware of what happens as you start to fill up that

43
00:02:43,830 --> 00:02:45,110
context window.

44
00:02:45,870 --> 00:02:50,990
And if you happen to have a question on the complete works of Shakespeare, well, you would need a

45
00:02:50,990 --> 00:02:54,430
context window of like a million tokens to be able to handle that.

46
00:02:54,430 --> 00:02:59,820
And as we'll see at the moment, it's only models like Gemini that are able to handle that much of a

47
00:02:59,860 --> 00:03:01,220
context window.

48
00:03:01,420 --> 00:03:04,700
And that brings me to everybody's favorite topic API costs.

49
00:03:04,700 --> 00:03:12,380
So look, the chat products like ChatGPT typically have a free tier and a paid tier or many paid tiers.

50
00:03:12,500 --> 00:03:15,300
They range from $20 a month to $200 a month.

51
00:03:15,460 --> 00:03:17,740
And that gives you a monthly subscription.

52
00:03:17,780 --> 00:03:23,540
It means that you can use it with no charge per use, but some kind of rate limiting that is completely

53
00:03:23,540 --> 00:03:25,860
unrelated to using the API.

54
00:03:26,140 --> 00:03:32,020
If you use the API, it doesn't matter whether or not you have a subscription, you pay per use of that

55
00:03:32,020 --> 00:03:38,500
API, and the idea is that you could use that to to build your own product, your own ChatGPT to rival

56
00:03:38,540 --> 00:03:44,900
ChatGPT, and you'd be monetizing it, but you'd also be paying per API call because you have to pay

57
00:03:44,900 --> 00:03:46,020
the compute bills.

58
00:03:46,380 --> 00:03:49,940
These these API costs go into paying for the inference compute.

59
00:03:49,980 --> 00:03:52,300
There's trillions of calculations happening.

60
00:03:52,300 --> 00:03:59,570
And presumably a little bit goes towards paying OpenAI back for the $100 million plus that they spent

61
00:03:59,570 --> 00:04:00,890
on training this model.

62
00:04:01,010 --> 00:04:02,970
So that's why there are costs.

63
00:04:03,130 --> 00:04:09,290
Typically, the cost depends on how many input tokens you passed in and the output tokens that you generate.

64
00:04:09,290 --> 00:04:11,210
And there's two little catches in there.

65
00:04:11,250 --> 00:04:17,450
One of them, of course, is that the input tokens needs to include the full sequence so far, including

66
00:04:17,450 --> 00:04:23,530
all of the fake memory that you've inserted in there and anything else you put in there like rag.

67
00:04:23,770 --> 00:04:27,170
And some people think that that sounds unfair because costs will start to accumulate.

68
00:04:27,170 --> 00:04:30,610
But again, remember you need it to do this compute.

69
00:04:30,610 --> 00:04:35,970
You need the transformer to look back on all of this in order to predict the most likely tokens.

70
00:04:35,970 --> 00:04:39,570
So you should be happy to pay the bill because you want the result.

71
00:04:39,570 --> 00:04:41,890
If you wish, you could just pass in the most recent message.

72
00:04:41,890 --> 00:04:42,930
Forget the history.

73
00:04:42,970 --> 00:04:44,290
Don't pay for that compute.

74
00:04:44,490 --> 00:04:47,130
But then your your results are not going to be good.

75
00:04:47,130 --> 00:04:48,130
So that's why.

76
00:04:48,130 --> 00:04:49,330
And it does make sense.

77
00:04:49,610 --> 00:04:57,530
The other catch is that when you pay for these output tokens that Includes any reasoning that the model

78
00:04:57,530 --> 00:05:02,570
is doing for these reasoning models that generate output that describes their thought process.

79
00:05:02,610 --> 00:05:07,650
Those are output tokens that you need to pay for, and in the case of gpts models, you actually don't

80
00:05:07,650 --> 00:05:09,130
even get to see these outputs.

81
00:05:09,130 --> 00:05:13,330
It's all it's all happening behind the scenes, but you still pay for them.

82
00:05:13,370 --> 00:05:16,450
And again, people sometimes think that that sounds a bit mean.

83
00:05:16,930 --> 00:05:21,330
But at the bottom line is that compute needs to happen, that processing needs to happen.

84
00:05:21,330 --> 00:05:23,610
Those those calculations need to happen.

85
00:05:23,610 --> 00:05:30,210
And so it's only fair that we do need to pay the compute cost for calculating that reasoning trace.

86
00:05:30,210 --> 00:05:35,570
But it can lead to a little bit of unpredictability associated with costs when you don't always get

87
00:05:35,570 --> 00:05:37,530
to see the reasoning that's happening.

88
00:05:37,610 --> 00:05:42,090
So those are the two catches to watch out for as well with API costs.

89
00:05:42,330 --> 00:05:46,170
One of the things that I'm going to cover a lot are things called leaderboards, which is where you

90
00:05:46,170 --> 00:05:50,890
get to see ranked comparisons of different llms, particularly in week four.

91
00:05:50,890 --> 00:05:55,720
We do that, but one that I'll show you right away is a leaderboard called vellum which is at vellum

92
00:05:57,120 --> 00:05:57,720
leaderboard.

93
00:05:57,720 --> 00:05:58,840
You should go check it out.

94
00:05:59,040 --> 00:06:02,320
And this is a really convenient it's got lots of leaderboards there.

95
00:06:02,320 --> 00:06:03,800
Lots of great things to look at.

96
00:06:03,800 --> 00:06:08,320
We'll look at many of them later, but in particular there's a very handy one right in the middle which

97
00:06:08,320 --> 00:06:14,440
shows you the context window and the API costs of many major models.

98
00:06:14,440 --> 00:06:21,880
And you'll see there that I'm showing that GPT five, it has a context window of 400,000 tokens.

99
00:06:22,040 --> 00:06:22,920
That's big.

100
00:06:22,960 --> 00:06:24,920
That's a lot that you can cram in there.

101
00:06:25,200 --> 00:06:31,960
The input cost is one point is $1.25, and the output cost is $10.

102
00:06:32,200 --> 00:06:35,400
And you might think, okay, that's quite a lot $10.

103
00:06:35,400 --> 00:06:40,680
But do keep in mind that that is $10 per million output tokens.

104
00:06:40,680 --> 00:06:46,600
So there's a there's a lot there, uh, that you can fit in a million output tokens basically give or

105
00:06:46,600 --> 00:06:47,040
take.

106
00:06:47,080 --> 00:06:51,640
You generate the complete works of Shakespeare and you've spent $10.

107
00:06:51,800 --> 00:06:54,760
Uh, and that would be a lot of content to generate.

108
00:06:54,760 --> 00:07:01,040
And so, you know, it's definitely worth understanding that these things are they come at a cost.

109
00:07:01,040 --> 00:07:06,880
And if you're looking to operate this at scale, where you have lots of concurrent conversations happening,

110
00:07:06,880 --> 00:07:11,720
then you need to take a lot of care to understand what are going to be your unit costs for each of your

111
00:07:11,720 --> 00:07:13,800
users, and how do you factor that in.

112
00:07:13,960 --> 00:07:20,120
But when it comes to individuals using if you're just experimenting, prompting yourself some of these

113
00:07:20,120 --> 00:07:26,600
models with just hi, my name is Ed, then divide those numbers by like like almost a million by like

114
00:07:26,640 --> 00:07:28,840
100,000 for ten tokens.

115
00:07:28,840 --> 00:07:36,760
You can see that the costs involved in individual small API calls are really very small indeed.

116
00:07:36,800 --> 00:07:42,240
And so whilst people I understand it's frustrating to have to put in the $5 up front to OpenAI, that's

117
00:07:42,240 --> 00:07:43,000
annoying.

118
00:07:43,000 --> 00:07:48,920
But recognize that the actual costs that you make for everyday API calls if you if you're not building

119
00:07:48,920 --> 00:07:54,430
a bigger, scalable system or an agent loop which can eat through the tokens, then the costs involved

120
00:07:54,430 --> 00:07:58,870
are relatively small, so it's important to get a handle on that and feel comfortable with what are

121
00:07:58,870 --> 00:08:00,030
these API costs.

122
00:08:00,030 --> 00:08:03,910
And I also want to mention that when you see GPT five there, that is the big version.

123
00:08:03,910 --> 00:08:09,230
If you scale all the way down to the tiny version, GPT five nano, if you work with that, then the

124
00:08:09,230 --> 00:08:17,150
input cost per million tokens is $0.05 $0.05 per million tokens.

125
00:08:17,150 --> 00:08:19,230
So for each token is a millionth of that.

126
00:08:19,270 --> 00:08:24,670
It's a number too small for me to say, and the output cost is 0.4 of a dollar per million.

127
00:08:24,710 --> 00:08:28,030
It's $0.40 per million output tokens.

128
00:08:28,030 --> 00:08:33,430
So if you got GPT five nano to generate the complete works of Shakespeare, you'll be spending less

129
00:08:33,430 --> 00:08:34,430
than a dollar.

130
00:08:35,110 --> 00:08:36,470
A couple of other things to mention.

131
00:08:36,470 --> 00:08:44,150
There's also this this idea called caching, which is that if you send in the same input twice within

132
00:08:44,150 --> 00:08:46,550
a few minutes, then you pay less.

133
00:08:46,550 --> 00:08:50,190
It's cheaper because some of that information is cached with GPT.

134
00:08:50,340 --> 00:08:51,540
That's automatic.

135
00:08:51,540 --> 00:08:52,820
With Claude, it depends.

136
00:08:53,140 --> 00:08:54,100
It can be.

137
00:08:54,140 --> 00:08:55,740
It's more of a story.

138
00:08:55,740 --> 00:09:02,500
But basically in many cases, there's a way if you're sending the same inputs frequently to models,

139
00:09:02,500 --> 00:09:06,460
there's some ways to get those costs even smaller for the input costs.

140
00:09:06,460 --> 00:09:08,100
But there are some some tricks to be aware of.

141
00:09:08,100 --> 00:09:09,580
And we'll talk about that another day.

142
00:09:09,940 --> 00:09:13,100
Another thing to point out is just take a look at some of those context windows.

143
00:09:13,100 --> 00:09:16,100
You'll see GPT five again 400,000.

144
00:09:16,300 --> 00:09:20,580
The Claude range has got a 200,000 context window.

145
00:09:20,780 --> 00:09:27,540
You'll see that GPT OS, the open source model, is around the 130,000 and famously Gemini.

146
00:09:27,580 --> 00:09:34,860
Gemini 2.5 flash has a 1 million token context window, which means, again, that you could almost

147
00:09:34,860 --> 00:09:41,860
fit the complete works of Shakespeare in in one prompt to Gemini and say, tell me, give me a give

148
00:09:41,900 --> 00:09:47,540
me a quote about from this play and it will be able to look back and do that, which is just insane.

149
00:09:47,700 --> 00:09:53,180
So it's good to get that perspective and have that kind of sense of what is the range of context windows

150
00:09:53,220 --> 00:09:55,460
offered by these different LMS?

151
00:09:55,500 --> 00:09:59,580
Goodness gracious, somehow you're already 10% of the way through.

152
00:09:59,620 --> 00:10:00,740
How about that?

153
00:10:00,980 --> 00:10:01,500
Wow.

154
00:10:01,900 --> 00:10:05,940
So what you could already do, you can already write code to call OpenAI and a llama.

155
00:10:05,980 --> 00:10:06,860
Make a summary.

156
00:10:06,900 --> 00:10:09,580
You can contrast the leading frontier models.

157
00:10:09,780 --> 00:10:14,380
And now at this point, you've had this introduction to transformers, to tokens to context, windows

158
00:10:14,380 --> 00:10:19,780
to API costs, and you've just got this good grounding, this good foundations to everything that will

159
00:10:19,820 --> 00:10:24,620
be working on, including the illusion of memory by the end of the next time.

160
00:10:24,620 --> 00:10:30,500
At the end of this week, you'll be confident with the OpenAI API, the chat completions, API, chat

161
00:10:30,500 --> 00:10:30,940
completions.

162
00:10:31,780 --> 00:10:34,500
You'll you'll know about this thing called one shot prompting.

163
00:10:34,540 --> 00:10:38,300
You'll know about streaming markdown JSON results.

164
00:10:38,460 --> 00:10:43,420
You'll have implemented a business solution in a matter of a few minutes.

165
00:10:43,420 --> 00:10:44,780
It's going to be all practical.

166
00:10:44,780 --> 00:10:46,060
It's going to be sleeves rolled up.

167
00:10:46,060 --> 00:10:47,220
We're going to be building.

168
00:10:47,900 --> 00:10:48,900
I will see you then.