1
00:00:00,080 --> 00:00:00,720
Okay.

2
00:00:01,200 --> 00:00:06,400
We are now going to talk about abstraction layers also just called frameworks.

3
00:00:06,440 --> 00:00:09,240
LM frameworks, they go by lots of names.

4
00:00:09,240 --> 00:00:14,920
And by far the most famous of them is one that we will come back to for sure in week five.

5
00:00:15,160 --> 00:00:20,560
It goes by the name Lang Chain, and I imagine many people take this course just for long.

6
00:00:20,600 --> 00:00:20,920
Chain.

7
00:00:21,000 --> 00:00:22,160
Now here's the thing.

8
00:00:22,440 --> 00:00:28,320
I myself am a bit of a I'm a bit of a lang chain naysayer.

9
00:00:28,320 --> 00:00:30,320
I'm not a huge fan of it.

10
00:00:30,320 --> 00:00:31,560
I'm gonna go along with it.

11
00:00:31,560 --> 00:00:32,880
I will do work with it.

12
00:00:32,920 --> 00:00:33,440
We will.

13
00:00:33,600 --> 00:00:35,320
I will absolutely teach it and I will be.

14
00:00:35,440 --> 00:00:38,200
There's some parts about it I really like, and I'll be clear about them.

15
00:00:38,200 --> 00:00:42,520
And there's some things that I, that I find a little bit more of a headache and I'll tell you them

16
00:00:42,520 --> 00:00:42,640
too.

17
00:00:42,680 --> 00:00:45,120
But I know there are big die hard fans out there.

18
00:00:45,120 --> 00:00:45,400
So.

19
00:00:45,400 --> 00:00:47,600
So don't get angry with me if you're a fan, I get it.

20
00:00:47,600 --> 00:00:49,240
I get that it has positives too.

21
00:00:49,280 --> 00:00:50,040
It's powerful.

22
00:00:50,040 --> 00:00:50,720
It's mighty.

23
00:00:50,760 --> 00:00:52,280
It's quite heavyweight.

24
00:00:52,480 --> 00:00:54,160
Um, so this is what it looks like.

25
00:00:54,160 --> 00:00:58,240
We're going to just just just ask GPT five nano to tell us a joke.

26
00:00:58,280 --> 00:00:59,400
Oh, let's go with mini.

27
00:00:59,440 --> 00:01:02,770
Let's, let's spend a little bit more and get a better joke.

28
00:01:02,770 --> 00:01:06,250
Lang chain OpenAI Import chat OpenAI.

29
00:01:06,930 --> 00:01:14,610
We create a chat OpenAI object, and then we say responses LLM invoke and we pass in the tell a joke,

30
00:01:14,770 --> 00:01:17,650
uh, collection and then we'll print the results.

31
00:01:17,650 --> 00:01:19,250
This is the the list of dicks.

32
00:01:19,450 --> 00:01:21,530
Um, and you might say this doesn't look too heavy.

33
00:01:21,570 --> 00:01:21,850
Wait.

34
00:01:21,850 --> 00:01:22,530
And sure.

35
00:01:22,690 --> 00:01:23,370
Not yet.

36
00:01:23,410 --> 00:01:23,730
Right.

37
00:01:23,970 --> 00:01:25,010
Wait till week five.

38
00:01:25,130 --> 00:01:28,610
Uh, um, but yeah, you know, there is quite a lot to learn.

39
00:01:28,890 --> 00:01:31,490
It's a it's a it's a it's a big abstraction layer.

40
00:01:31,490 --> 00:01:36,610
There are a lot of important abstractions which allow you to work in the Lang chain environment with

41
00:01:36,610 --> 00:01:38,770
lots of different models, but there is a fair bit to learn.

42
00:01:38,810 --> 00:01:39,290
Okay.

43
00:01:39,610 --> 00:01:43,250
Uh, how many LLM engineering students does it take to change a light bulb?

44
00:01:43,610 --> 00:01:50,330
One to ask the model for 10,000 possible ways, one to filter the outputs and the rest to argue over

45
00:01:50,330 --> 00:01:51,610
the exact prompt wording?

46
00:01:52,010 --> 00:01:53,010
That's not bad at all.

47
00:01:53,050 --> 00:01:59,450
I had that that had an unexpected twist at the end that I was all getting ready to be like, ah, but

48
00:01:59,450 --> 00:02:01,530
it was quite funny at the end of it.

49
00:02:01,730 --> 00:02:02,090
Uh.

50
00:02:02,460 --> 00:02:03,380
Very nice.

51
00:02:03,420 --> 00:02:03,780
Good.

52
00:02:03,820 --> 00:02:06,100
Good for, uh, GPT five mini.

53
00:02:06,260 --> 00:02:06,900
Uh, of course.

54
00:02:06,900 --> 00:02:08,500
It's got nothing to do with Lang Chain.

55
00:02:08,500 --> 00:02:10,340
Just that Lang chain helped us make the call.

56
00:02:10,340 --> 00:02:14,260
But quite respectable performance from GPT five mini.

57
00:02:14,380 --> 00:02:19,100
And this is your first look at using Lang Chain to call an LLM.

58
00:02:19,300 --> 00:02:24,860
And of course, the thing for Lang Chain is that you could then import other providers with Lang chain,

59
00:02:24,860 --> 00:02:30,140
underscore and another provider, and then create different objects here that you would connect to.

60
00:02:30,900 --> 00:02:37,700
But by contrast with lang chain, almost the opposite extreme is an abstraction layer called light LM.

61
00:02:38,060 --> 00:02:40,900
And a light LM does what it says on the tin.

62
00:02:40,940 --> 00:02:42,860
It is very light.

63
00:02:42,900 --> 00:02:48,380
It is a lightweight abstraction layer that just gives you a simple interface to any model.

64
00:02:48,380 --> 00:02:52,500
And personally, I really love light lm I use it a lot myself.

65
00:02:52,540 --> 00:02:55,700
It gives you such an easy way to switch between different models.

66
00:02:55,740 --> 00:03:00,620
I mean, you can also just use the OpenAI client library, but light LM is just particularly simple.

67
00:03:00,860 --> 00:03:02,140
So this is how it looks.

68
00:03:02,260 --> 00:03:04,590
You just import this one thing.

69
00:03:04,630 --> 00:03:05,550
Completion.

70
00:03:05,550 --> 00:03:08,350
And then you just say response is completion.

71
00:03:08,510 --> 00:03:10,910
It's a bit like OpenAI, but just the word completion.

72
00:03:10,910 --> 00:03:17,710
You pass in the model and the messages and then you get back the same as OpenAI response choices, zero

73
00:03:17,950 --> 00:03:19,470
message content.

74
00:03:19,790 --> 00:03:26,070
And with what you pass in here, you pass in a particular format, which is a provider slash, a model

75
00:03:26,070 --> 00:03:27,390
name and light.

76
00:03:27,390 --> 00:03:30,470
LM has all of the lists of all the different providers.

77
00:03:30,470 --> 00:03:34,990
And the great thing about this is that if you if you're interested in using bedrock, if you're using

78
00:03:34,990 --> 00:03:40,030
AWS and you want to call a model on bedrock, you can just use a different format here with bedrock

79
00:03:40,030 --> 00:03:42,030
Slash and called bedrock models.

80
00:03:42,030 --> 00:03:47,150
If you're using Azure, it's the same you can and vertex for Google.

81
00:03:47,150 --> 00:03:51,550
So for any of the managed services you can call all of them through light LM as well.

82
00:03:51,550 --> 00:03:53,150
It lets you connect to anything.

83
00:03:53,150 --> 00:03:55,350
So that makes it so convenient.

84
00:03:55,590 --> 00:03:57,350
Uh, so anyways, I should stop talking.

85
00:03:57,390 --> 00:04:02,270
And uh, this is, uh, going with the big GPT 4.1.

86
00:04:02,270 --> 00:04:04,880
So, uh, the prior model.

87
00:04:05,760 --> 00:04:08,880
Why did the LM engineering student break up with their language model?

88
00:04:08,880 --> 00:04:12,720
Because it kept predicting their next move.

89
00:04:13,200 --> 00:04:13,560
Yeah.

90
00:04:13,600 --> 00:04:14,320
That's right.

91
00:04:14,360 --> 00:04:15,240
The next token.

92
00:04:15,240 --> 00:04:16,400
Next move, I see.

93
00:04:16,440 --> 00:04:17,560
Yeah, that's fair enough.

94
00:04:17,560 --> 00:04:22,640
That's I, I would say that it's not done as well as five minutes, but that's respectable.

95
00:04:22,760 --> 00:04:23,160
Uh.

96
00:04:23,520 --> 00:04:24,520
Not bad.

97
00:04:24,880 --> 00:04:25,440
Okay.

98
00:04:25,480 --> 00:04:31,440
But the reason, the reason that I want to show you later is because in addition to doing this, it

99
00:04:31,440 --> 00:04:38,400
can also have this super useful utility which will tell you how many input tokens and output tokens

100
00:04:38,560 --> 00:04:40,920
and the cost of what you just did.

101
00:04:41,160 --> 00:04:49,440
So the cost of that little run that we just did right then was 0.023 $0.04.

102
00:04:49,600 --> 00:04:57,280
And just in case you're getting confused and you think I'm saying $0.0234 and that this, this API call

103
00:04:57,320 --> 00:04:59,600
cost, you think it cost $0.02.

104
00:04:59,600 --> 00:05:00,440
It didn't.

105
00:05:00,480 --> 00:05:10,370
It costed it costed uh uh two To, uh, hundredths of a cent it cost 2/10,000 of a dollar.

106
00:05:10,770 --> 00:05:12,290
Absolutely tiny.

107
00:05:12,450 --> 00:05:18,170
So people do tend to get quite, quite, uh, focused on API costs.

108
00:05:18,290 --> 00:05:22,610
And it's important to remember, as I said before, when you're thinking about the unit economics of

109
00:05:22,610 --> 00:05:27,610
a big set of APIs, then it matters to understand how much each one costs.

110
00:05:27,930 --> 00:05:32,810
But when you're dealing yourself individually with individual calls, even with a big model like GPT

111
00:05:32,850 --> 00:05:35,850
4.1, it's absolutely tiny.

112
00:05:35,890 --> 00:05:38,050
The numbers involved, you should always check them.

113
00:05:38,090 --> 00:05:44,690
Light LM gives you a way to check every single call, but you should always also monitor the websites

114
00:05:44,690 --> 00:05:46,050
the platforms themselves.

115
00:05:46,290 --> 00:05:51,010
But but do keep in mind that the typical costs are like per million tokens.

116
00:05:51,170 --> 00:05:54,850
They add up to be very small for your everyday API call.

117
00:05:55,050 --> 00:06:01,450
Okay, now I want to use light LM to to show off a professional feature a pro feature.

118
00:06:01,450 --> 00:06:06,460
So the pros you can set up, you can you can turn me down from two x to 1.5 x.

119
00:06:06,500 --> 00:06:07,980
I've got something here for you.

120
00:06:08,020 --> 00:06:08,900
Look at this.

121
00:06:09,180 --> 00:06:15,580
So let's suppose that I that I check out a file that I've got locally that happens to be called hamlet

122
00:06:15,580 --> 00:06:16,300
dot txt.

123
00:06:16,420 --> 00:06:17,740
I wonder what it might contain.

124
00:06:18,100 --> 00:06:22,860
It contains, of course, the complete, uh, play Hamlet by William Shakespeare.

125
00:06:23,140 --> 00:06:27,060
So I'm just going to read that into a variable called hamlet.

126
00:06:27,100 --> 00:06:32,740
And I just want to print out a little bit of it, which is when, uh, lettuce says, uh, speak, man.

127
00:06:32,900 --> 00:06:33,660
Uh, sorry, sorry.

128
00:06:33,700 --> 00:06:34,740
Uh, King says speak, man.

129
00:06:34,740 --> 00:06:36,260
And later says, where is my father?

130
00:06:36,260 --> 00:06:37,580
And the king says, dead.

131
00:06:38,020 --> 00:06:41,740
Uh, so that's just a little just an extract from it.

132
00:06:41,900 --> 00:06:44,260
So let's ask a question to a model.

133
00:06:44,420 --> 00:06:48,500
We'll say in hamlet, when Laertes asks, where is my father?

134
00:06:48,540 --> 00:06:49,820
What is the reply?

135
00:06:50,140 --> 00:06:55,100
So we're going to start by we're going to use, uh, the light lm completion.

136
00:06:55,260 --> 00:07:00,420
We're going to ask Gemini 2.5 flash light, exactly that question.

137
00:07:00,420 --> 00:07:01,860
And let's see what it says.

138
00:07:01,860 --> 00:07:02,980
Here comes the answer.

139
00:07:03,700 --> 00:07:07,140
He says A father blessed that hath a knowing son.

140
00:07:07,180 --> 00:07:08,220
Well, that's not true.

141
00:07:08,260 --> 00:07:08,820
Ha ha ha!

142
00:07:09,300 --> 00:07:10,060
It is dead.

143
00:07:10,300 --> 00:07:10,660
Uh.

144
00:07:11,100 --> 00:07:12,700
And, uh, the, uh.

145
00:07:12,740 --> 00:07:18,860
So so back comes a, uh, super confusing but confident answer.

146
00:07:18,860 --> 00:07:21,900
And of course, this is a warning sign for you.

147
00:07:21,900 --> 00:07:27,660
Recognize that when llms get it wrong, when they're backed into a corner, they have to answer something

148
00:07:27,660 --> 00:07:28,740
and they don't know.

149
00:07:28,780 --> 00:07:35,020
They will typically tend to confidently give the wrong answer, because that seems like the most likely

150
00:07:35,020 --> 00:07:35,940
next token.

151
00:07:35,940 --> 00:07:39,180
And so this is an example of a confident hallucination.

152
00:07:39,500 --> 00:07:44,140
Uh, while we're here, we can just quickly look at how many tokens are being used, and you'll see

153
00:07:44,140 --> 00:07:46,500
that a total number of tokens was 122.

154
00:07:46,540 --> 00:07:49,260
It was 19 input and 103 output tokens.

155
00:07:49,260 --> 00:07:53,580
And that total cost was 0.004 $0.03.

156
00:07:53,820 --> 00:07:55,020
Very cheap indeed.

157
00:07:55,180 --> 00:07:57,140
Uh, it was uh, yeah.

158
00:07:57,180 --> 00:08:03,260
Uh, this is, uh, not an expensive question to have asked, but that is about to change.

159
00:08:03,300 --> 00:08:03,580
Okay.

160
00:08:03,580 --> 00:08:10,310
What we're now going to do is I'm going to add to the question for context, here is the entire text

161
00:08:10,310 --> 00:08:11,070
of hamlet.

162
00:08:11,070 --> 00:08:13,230
And I'm going to shove in all of hamlet.

163
00:08:13,230 --> 00:08:19,030
So it's now a rather bigger question, and we're going to call Gemini Flashlight again.

164
00:08:19,190 --> 00:08:22,270
It's going to take a little bit longer and it gets the right answer.

165
00:08:22,310 --> 00:08:22,830
Dead.

166
00:08:22,950 --> 00:08:26,270
And this is an example of what happens when you put in the right context.

167
00:08:26,310 --> 00:08:33,390
If I print out what happened here, we had that many input tokens 23 output tokens and it costs half

168
00:08:33,430 --> 00:08:33,990
a cent.

169
00:08:34,030 --> 00:08:37,270
It's still not going to break the bank, but it did cost half a cent.

170
00:08:37,670 --> 00:08:44,630
Now I'm just going to run that same question a second time again and print the tokens again.

171
00:08:44,830 --> 00:08:49,590
And this time, when I print it again, you'll see that there's something different.

172
00:08:49,750 --> 00:08:52,550
It costs about five times less.

173
00:08:52,710 --> 00:08:57,390
It costs about 0.1% instead of 0.5 of a cent.

174
00:08:57,590 --> 00:09:01,150
And when I printed out the number of tokens, you'll see there's a difference there.

175
00:09:01,430 --> 00:09:05,310
Uh, the the same number of input tokens, 53,000.

176
00:09:05,430 --> 00:09:11,600
But look, Thousand 200 of them are called cached tokens.

177
00:09:11,760 --> 00:09:18,240
That means that it detected that we made a second call to the model within a few minutes, and it had

178
00:09:18,240 --> 00:09:23,720
already cached inputting those into the model, and as a result, you pay a lot less.

179
00:09:23,760 --> 00:09:26,800
In fact, about five times less on the input tokens.

180
00:09:27,000 --> 00:09:30,160
And so this is a pro feature called prompt caching.

181
00:09:30,160 --> 00:09:36,760
And it's something which can automatically allow you to pay less if you prompt with the same input context

182
00:09:36,800 --> 00:09:38,400
or similar input context.

183
00:09:38,520 --> 00:09:42,040
Um, several times in a row within a certain period of time.

184
00:09:42,040 --> 00:09:47,400
So there is a pro feature for you, and I've given a little bit more explanation about the different

185
00:09:47,400 --> 00:09:47,920
rules.

186
00:09:47,920 --> 00:09:51,840
There's some fine print for OpenAI and Anthropic and Gemini.

187
00:09:52,200 --> 00:09:59,080
The the important thing to know is that typically and certainly for OpenAI, the beginning of the prompt

188
00:09:59,080 --> 00:10:05,320
must match identically all the way up until the point where you no longer need to cache.

189
00:10:05,400 --> 00:10:09,330
So if you have something like today's date that you insert in your Certain your prompt.

190
00:10:09,330 --> 00:10:14,210
If you insert that the date and time at the very beginning, you will never get prompt caching because

191
00:10:14,210 --> 00:10:16,730
it will be different from the very start of your prompt.

192
00:10:16,730 --> 00:10:21,170
So you want to put that kind of information that might change at the end of your prompt, not at the

193
00:10:21,170 --> 00:10:24,610
beginning, and put your static content at the start.

194
00:10:24,610 --> 00:10:29,450
So if you have a number of different questions on hamlet, you shouldn't put the question and then all

195
00:10:29,490 --> 00:10:33,250
of Hamlet's text, you should always do it with Hamlet's text.

196
00:10:33,250 --> 00:10:37,970
And then the question, because that way this will all be cached every single time.

197
00:10:38,010 --> 00:10:41,490
So that's a trick to be aware of, to know about, or a trap, really.

198
00:10:41,650 --> 00:10:44,930
Um, with anthropic it doesn't work the same way.

199
00:10:44,970 --> 00:10:46,970
It won't automatically prompt cache.

200
00:10:47,010 --> 00:10:50,210
You have to tell it that you want to be doing caching.

201
00:10:50,210 --> 00:10:55,890
You want to prime the cache and you pay more when you prime the cache 25% more.

202
00:10:56,290 --> 00:11:00,730
But then when you reuse from the cache, you pay ten times less.

203
00:11:00,730 --> 00:11:06,130
It's a much bigger saving, so there's a little upfront cost and a much bigger saving.

204
00:11:06,370 --> 00:11:08,210
And Gemini actually supports both.

205
00:11:08,250 --> 00:11:12,780
There's like an implicit mode and an explicit mode, and I'm not going to go into it now.

206
00:11:12,780 --> 00:11:15,220
You can read all about it through these links.

207
00:11:15,220 --> 00:11:20,980
So if you're very cost conscious and you're sending big prompts with lots of input, context, and you're

208
00:11:20,980 --> 00:11:25,740
thinking carefully about the unit economics of the calls that you're making, then you should look into

209
00:11:25,780 --> 00:11:27,020
prompt caching.

210
00:11:27,060 --> 00:11:32,020
It's a great way to reduce the costs of your input context.

211
00:11:32,500 --> 00:11:33,100
All right.

212
00:11:33,100 --> 00:11:35,100
That's the end of the pro piece.

213
00:11:35,140 --> 00:11:40,900
And it's particularly useful because light LM makes it so easy to see what's going on, to see how many

214
00:11:41,260 --> 00:11:45,940
tokens you're using from the cache and to track the costs of each call.

215
00:11:45,980 --> 00:11:51,980
And I've built systems, production systems where I use light LM and I keep track because you can switch

216
00:11:51,980 --> 00:11:53,180
between different models.

217
00:11:53,180 --> 00:11:57,180
It keeps track of the tokens and spends for every user request.

218
00:11:57,180 --> 00:12:02,460
So we can have a sort of admin function where we can watch and see how much we're spending on each of

219
00:12:02,460 --> 00:12:03,340
our clients.

220
00:12:03,340 --> 00:12:09,420
And we can compare that to, to the, the, the revenues, and use that to be able to make sure that,

221
00:12:09,460 --> 00:12:11,620
that our unit economics are favorable.