1
00:00:00,150 --> 00:00:03,180
-: As I mentioned at the beginning of this section,

2
00:00:03,180 --> 00:00:05,760
whether you know how it works or not,

3
00:00:05,760 --> 00:00:09,060
you can still get quality prompts

4
00:00:09,060 --> 00:00:12,330
and quality outputs from ChatGPT,

5
00:00:12,330 --> 00:00:16,590
but understanding these nuances and these subtleties

6
00:00:16,590 --> 00:00:20,130
is just gonna give you a more well-rounded education

7
00:00:20,130 --> 00:00:24,090
on how to effectively use it for your own personal life.

8
00:00:24,090 --> 00:00:27,090
So let's now talk about

9
00:00:27,090 --> 00:00:30,600
how ChatGPT actually answers your questions.

10
00:00:30,600 --> 00:00:35,600
So GPT stands for Generative Pre-Trained Transformer,

11
00:00:37,020 --> 00:00:40,590
and it doesn't read words in the way that humans do,

12
00:00:40,590 --> 00:00:42,900
but it processes them

13
00:00:42,900 --> 00:00:47,100
through a series of computational steps.

14
00:00:47,100 --> 00:00:50,490
And this is gonna be an overview of how it computes

15
00:00:50,490 --> 00:00:54,360
and handles these words from your prompts.

16
00:00:54,360 --> 00:00:57,750
First, we have tokenization.

17
00:00:57,750 --> 00:01:01,680
Initially, the text that is inputted from a user

18
00:01:01,680 --> 00:01:04,950
is broken down into smaller pieces,

19
00:01:04,950 --> 00:01:09,450
much like splitting a sentence into individual words

20
00:01:09,450 --> 00:01:11,670
or even smaller parts.

21
00:01:11,670 --> 00:01:14,730
This process is called tokenization.

22
00:01:14,730 --> 00:01:19,260
For example, the sentence, ChatGPT is great,

23
00:01:19,260 --> 00:01:22,440
might be split down into three pieces,

24
00:01:22,440 --> 00:01:26,190
ChatGPT, is, and great.

25
00:01:26,190 --> 00:01:29,940
Each of these pieces is referred to as a token,

26
00:01:29,940 --> 00:01:33,540
and this token breakdown makes it easier for the model

27
00:01:33,540 --> 00:01:37,890
to analyze the text and understand its structure.
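
As a rough illustration of the splitting step just described, here is a minimal sketch in Python. The function name toy_tokenize is made up for this example, and it simply splits on words and punctuation. Real GPT models use a learned subword scheme, so their actual token boundaries will often differ from what this toy version produces.

```python
import re

def toy_tokenize(text):
    # Hypothetical helper: grab runs of word characters, or single
    # punctuation marks, and return them as a list of pieces.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("ChatGPT is great")
print(tokens)
```

Each item in the printed list plays the role of one token in the explanation above.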

28
00:01:37,890 --> 00:01:41,160
In short, it's taking a large amount of information

29
00:01:41,160 --> 00:01:43,530
and breaking it down into smaller chunks

30
00:01:43,530 --> 00:01:46,710
that it can dissect and further understand

31
00:01:46,710 --> 00:01:48,990
by looking at things individually

32
00:01:48,990 --> 00:01:51,960
and then looking at how these different parts and pieces

33
00:01:51,960 --> 00:01:53,670
relate to each other.

34
00:01:53,670 --> 00:01:56,760
This leads to the embedding phase.

35
00:01:56,760 --> 00:02:00,150
After the text is broken down into smaller pieces,

36
00:02:00,150 --> 00:02:04,560
each piece is then converted into a numerical form,

37
00:02:04,560 --> 00:02:06,540
like a unique code.

38
00:02:06,540 --> 00:02:10,229
So this way, the computer can understand

39
00:02:10,229 --> 00:02:12,810
and work with it more efficiently.

40
00:02:12,810 --> 00:02:15,330
This process is known as embedding.

41
00:02:15,330 --> 00:02:17,670
And it's like translating the words

42
00:02:17,670 --> 00:02:20,070
that you inputted into a language

43
00:02:20,070 --> 00:02:21,690
that the model can understand,

44
00:02:21,690 --> 00:02:24,840
where each word or individual piece

45
00:02:24,840 --> 00:02:28,770
gets its own numerical representation, called a vector.

46
00:02:28,770 --> 00:02:31,710
This way, GPT can process the text

47
00:02:31,710 --> 00:02:35,880
in a format that's suitable for mathematical operations,

48
00:02:35,880 --> 00:02:38,190
which are used in the later steps

49
00:02:38,190 --> 00:02:42,270
to analyze the text and generate a response.

50
00:02:42,270 --> 00:02:44,160
This is necessary because these models

51
00:02:44,160 --> 00:02:48,810
do not understand words in the same way a human mind does.

52
00:02:48,810 --> 00:02:51,120
So with these unique codes,

53
00:02:51,120 --> 00:02:55,470
the numerical values capture semantic

54
00:02:55,470 --> 00:02:58,500
and syntactic information.

55
00:02:58,500 --> 00:03:03,150
Semantic information relates to the meaning of words

56
00:03:03,150 --> 00:03:06,420
and their relationship with other words.

57
00:03:06,420 --> 00:03:09,390
For instance, consider the words doctor,

58
00:03:09,390 --> 00:03:11,940
nurse, and hospital.

59
00:03:11,940 --> 00:03:14,520
Semantically, these words are related

60
00:03:14,520 --> 00:03:17,820
because they all pertain to healthcare.

61
00:03:17,820 --> 00:03:19,440
In the embedding space,

62
00:03:19,440 --> 00:03:22,470
they might be represented by different vectors

63
00:03:22,470 --> 00:03:24,360
that are close to each other,

64
00:03:24,360 --> 00:03:27,360
indicating their related meanings to each other.

65
00:03:27,360 --> 00:03:30,120
On the other hand, a word like pineapple,

66
00:03:30,120 --> 00:03:32,340
which is unrelated to healthcare,

67
00:03:32,340 --> 00:03:37,200
might have a vector far away from the words doctor, nurse,

68
00:03:37,200 --> 00:03:38,460
and hospital.

69
00:03:38,460 --> 00:03:42,720
So it's a way of grouping information and relationships

70
00:03:42,720 --> 00:03:44,850
that words have with each other.
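
To make the idea of "close" and "far away" vectors a bit more concrete, here is a small sketch. The three-number vectors below are invented purely for illustration (real embeddings are learned and have hundreds or thousands of dimensions), and cosine similarity is one common way, assumed here, to measure how close two vectors are.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "doctor":    [0.9, 0.8, 0.1],
    "nurse":     [0.8, 0.9, 0.1],
    "hospital":  [0.7, 0.7, 0.2],
    "pineapple": [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    # Near 1.0 means the vectors point the same way; near 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for word in ("nurse", "hospital", "pineapple"):
    score = cosine_similarity(toy_embeddings["doctor"], toy_embeddings[word])
    print(f"doctor vs {word}: {score:.2f}")
```

With these invented numbers, the healthcare words score much higher against doctor than pineapple does.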

71
00:03:44,850 --> 00:03:46,892
For syntactic information,

72
00:03:46,892 --> 00:03:51,510
this is about the arrangements of the words and phrases

73
00:03:51,510 --> 00:03:55,470
to create well-formed sentences in a language.

74
00:03:55,470 --> 00:03:58,530
It's like the framework or rules that dictate

75
00:03:58,530 --> 00:04:01,470
how sentences are constructed.

76
00:04:01,470 --> 00:04:05,400
In English, the typical word ordering for a sentence

77
00:04:05,400 --> 00:04:10,400
is Subject-Verb-Object, or SVO.

78
00:04:10,410 --> 00:04:14,190
So in the sentence, John eats apples,

79
00:04:14,190 --> 00:04:18,089
John is the subject, eats is the verb,

80
00:04:18,089 --> 00:04:20,670
and apples is the object.

81
00:04:20,670 --> 00:04:24,510
Syntactic information helps identify

82
00:04:24,510 --> 00:04:29,510
the correct order of words that make meaningful sentences.

83
00:04:29,730 --> 00:04:32,910
In combination, these two types of information

84
00:04:32,910 --> 00:04:37,910
help GPT to understand the context and nuances

85
00:04:38,520 --> 00:04:40,680
of a user's input,

86
00:04:40,680 --> 00:04:44,910
which leads to step three, which is the encoding process.

87
00:04:44,910 --> 00:04:48,330
This is where tokens are fed into the GPT model

88
00:04:48,330 --> 00:04:52,980
and are essentially moving through a multi-layered network

89
00:04:52,980 --> 00:04:57,120
that refines the understanding of each token

90
00:04:57,120 --> 00:04:59,970
based on the tokens around it.

91
00:04:59,970 --> 00:05:02,880
So let's break down what this actually means.

92
00:05:02,880 --> 00:05:04,860
So first, we have our layers,

93
00:05:04,860 --> 00:05:07,980
and GPT has a series of these layers

94
00:05:07,980 --> 00:05:11,220
that are similar to the floors of a building,

95
00:05:11,220 --> 00:05:15,090
and each layer performs a special operation

96
00:05:15,090 --> 00:05:17,220
on the incoming tokens.

97
00:05:17,220 --> 00:05:20,814
So imagine each layer as a kind of workshop

98
00:05:20,814 --> 00:05:24,480
where tokens get refined, reshaped,

99
00:05:24,480 --> 00:05:29,100
based on the surrounding context of information.

100
00:05:29,100 --> 00:05:34,100
GPT-3, for example, has 175 billion parameters

101
00:05:35,190 --> 00:05:40,190
distributed across 96 transformer layers.

102
00:05:41,010 --> 00:05:45,150
Now, in each of these layers, tokens are transformed

103
00:05:45,150 --> 00:05:48,900
through very complex mathematical operations,

104
00:05:48,900 --> 00:05:52,500
and you can think of this as kind of a translation

105
00:05:52,500 --> 00:05:54,600
from one language to another,

106
00:05:54,600 --> 00:05:59,600
helping to build a richer understanding of each token.

107
00:06:00,090 --> 00:06:03,120
We then have the attention mechanism.

108
00:06:03,120 --> 00:06:07,680
So within each layer, there's a mechanism called attention,

109
00:06:07,680 --> 00:06:09,780
which allows each token

110
00:06:09,780 --> 00:06:14,780
to kind of look at other tokens in the input

111
00:06:15,210 --> 00:06:18,360
and adjust its own representation

112
00:06:18,360 --> 00:06:21,660
based on what it, quote, unquote, "sees,"

113
00:06:21,660 --> 00:06:23,880
from looking at other tokens.

114
00:06:23,880 --> 00:06:28,880
For instance, the word bank might adjust its representation

115
00:06:29,160 --> 00:06:32,250
based on whether the surrounding words

116
00:06:32,250 --> 00:06:36,300
relate to a financial institution bank

117
00:06:36,300 --> 00:06:39,570
or the side of a river bank.

118
00:06:39,570 --> 00:06:41,790
This type of attention mechanism

119
00:06:41,790 --> 00:06:44,461
is incredibly important for GPT

120
00:06:44,461 --> 00:06:49,461
to understand how all of the inputted words

121
00:06:49,950 --> 00:06:51,360
relate to each other.
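
Here is a heavily simplified sketch of the attention idea described above, in plain Python. The function name toy_self_attention and the two-number vectors are assumptions made for this example. It skips the learned projections, the multiple attention heads, and the masking that real GPT models use. It only shows the core move: each token scores every token, turns the scores into weights, and becomes a weighted blend of what it "sees."

```python
import math

def softmax(scores):
    # Turn raw scores into positive weights that sum to 1.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_self_attention(vectors):
    # Simplification: each vector is used directly as query, key, and value.
    dim = len(vectors[0])
    updated = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        blended = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        updated.append(blended)
    return updated

# Made-up vectors: "river" leans fully toward the first dimension,
# while "bank" starts out ambiguous between the two.
river = [1.0, 0.0]
bank = [0.5, 0.5]
new_river, new_bank = toy_self_attention([river, bank])
print(new_bank)
```

After the update, the bank vector has shifted toward the river vector, which is the kind of context-based adjustment the bank example is getting at.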

122
00:06:51,360 --> 00:06:56,040
This is all leading to what's called contextual adjustment.

123
00:06:56,040 --> 00:06:58,770
Now, the attention mechanism

124
00:06:58,770 --> 00:07:03,270
is helping in adjusting the representation

125
00:07:03,270 --> 00:07:06,540
of each token based on its context.

126
00:07:06,540 --> 00:07:11,540
So this is to make sure that the model understands each word

127
00:07:11,760 --> 00:07:13,500
in a way that makes sense,

128
00:07:13,500 --> 00:07:17,070
given all of the words around it.

129
00:07:17,070 --> 00:07:20,970
So now these more refined tokens

130
00:07:20,970 --> 00:07:23,460
move up to the next layer,

131
00:07:23,460 --> 00:07:26,130
and they go through a similar process

132
00:07:26,130 --> 00:07:27,900
again and again and again,

133
00:07:27,900 --> 00:07:32,220
with each of these layers that it passes through,

134
00:07:32,220 --> 00:07:35,490
there's a deeper understanding of the context

135
00:07:35,490 --> 00:07:38,463
and the information that the model gains

136
00:07:38,463 --> 00:07:40,500
through this process.

137
00:07:40,500 --> 00:07:43,890
So as these tokens are moving through these layers,

138
00:07:43,890 --> 00:07:48,150
they aggregate and gather more and more information

139
00:07:48,150 --> 00:07:50,040
surrounding the original tokens,

140
00:07:50,040 --> 00:07:53,640
which helps to continuously build a rich

141
00:07:53,640 --> 00:07:55,260
and contextual understanding

142
00:07:55,260 --> 00:07:58,410
of the entire input as a whole.

143
00:07:58,410 --> 00:08:01,800
So by the time the tokens reach the final layer,

144
00:08:01,800 --> 00:08:04,770
they have been significantly refined

145
00:08:04,770 --> 00:08:09,150
and carry a detailed understanding of the input text

146
00:08:09,150 --> 00:08:12,360
provided by the surrounding words.

147
00:08:12,360 --> 00:08:14,250
Each layer has contributed

148
00:08:14,250 --> 00:08:17,357
to building this deep understanding of the input,

149
00:08:17,357 --> 00:08:20,700
making the tokens ready for the next step,

150
00:08:20,700 --> 00:08:23,160
which is the output generation.

151
00:08:23,160 --> 00:08:25,470
After this multi-layered process

152
00:08:25,470 --> 00:08:28,440
of processing the user's input,

153
00:08:28,440 --> 00:08:33,440
GPT begins the task of creating a response.

154
00:08:33,480 --> 00:08:36,990
It starts with the tokens provided in the input

155
00:08:36,990 --> 00:08:41,990
and then works to predict what should come next

156
00:08:42,299 --> 00:08:44,520
one token at a time.

157
00:08:44,520 --> 00:08:47,100
So let's dive into this process.

158
00:08:47,100 --> 00:08:51,390
So we start with our prediction and selection.

159
00:08:51,390 --> 00:08:52,950
For each new token,

160
00:08:52,950 --> 00:08:56,850
GPT looks at all the tokens that have come before it,

161
00:08:56,850 --> 00:08:59,220
which includes the user's input

162
00:08:59,220 --> 00:09:01,710
on top of any other tokens

163
00:09:01,710 --> 00:09:05,820
that were previously generated inside that specific chat,

164
00:09:05,820 --> 00:09:10,410
and this is gonna help predict the most likely next token.

165
00:09:10,410 --> 00:09:13,350
It makes this prediction based on patterns

166
00:09:13,350 --> 00:09:16,080
it has learned from the vast amount of data

167
00:09:16,080 --> 00:09:17,670
it was trained on,

168
00:09:17,670 --> 00:09:21,720
and then GPT selects the one that it calculates

169
00:09:21,720 --> 00:09:24,900
to be the most likely next token

170
00:09:24,900 --> 00:09:28,260
based on its understanding from the training data.

171
00:09:28,260 --> 00:09:31,650
So when processing the input phrase,

172
00:09:31,650 --> 00:09:35,383
Albert Einstein was the world's most...,

173
00:09:36,299 --> 00:09:38,640
GPT would go through the following steps

174
00:09:38,640 --> 00:09:41,790
to predict and select the next word.

175
00:09:41,790 --> 00:09:44,160
First, it would go through the tokenization process

176
00:09:44,160 --> 00:09:45,180
that we mentioned.

177
00:09:45,180 --> 00:09:48,540
The input phrase is tokenized into individual tokens,

178
00:09:48,540 --> 00:09:52,170
aka Albert is separate, Einstein is separate,

179
00:09:52,170 --> 00:09:57,170
was, the, world's, and most are all individually looked at.

180
00:09:57,480 --> 00:09:58,770
Then we have embedding,

181
00:09:58,770 --> 00:10:02,370
where each token is converted into a numerical vector.

182
00:10:02,370 --> 00:10:05,940
Those values are then processed through the multiple layers

183
00:10:05,940 --> 00:10:07,530
of the GPT model.

184
00:10:07,530 --> 00:10:09,900
Through the self-attention mechanism,

185
00:10:09,900 --> 00:10:12,510
the model then identifies the relationship

186
00:10:12,510 --> 00:10:14,040
between those tokens,

187
00:10:14,040 --> 00:10:17,220
understanding, for instance, that Albert Einstein

188
00:10:17,220 --> 00:10:21,510
is referring to a notable individual in history.

189
00:10:21,510 --> 00:10:24,570
We then have the probability distribution.

190
00:10:24,570 --> 00:10:26,400
Based on the context,

191
00:10:26,400 --> 00:10:29,760
the model computes a probability distribution

192
00:10:29,760 --> 00:10:34,590
over the vocabulary for the next token

193
00:10:34,590 --> 00:10:36,540
after the word most.

194
00:10:36,540 --> 00:10:39,900
Some common continuations for this probability

195
00:10:39,900 --> 00:10:43,710
would likely be most famous, most influential,

196
00:10:43,710 --> 00:10:45,000
most brilliant.

197
00:10:45,000 --> 00:10:48,840
These would all receive high probabilities.

198
00:10:48,840 --> 00:10:52,020
Now, what exactly is a probability distribution?

199
00:10:52,020 --> 00:10:54,510
Simply put, it's like a rule

200
00:10:54,510 --> 00:10:58,530
that tells you how likely different outcomes are

201
00:10:58,530 --> 00:11:03,530
in a situation involving some sort of chance or randomness.

202
00:11:03,900 --> 00:11:07,620
So imagine you have a jar of colored candies,

203
00:11:07,620 --> 00:11:10,170
red, blue, and green.

204
00:11:10,170 --> 00:11:13,830
If there's an equal number of each color, all mixed together,

205
00:11:13,830 --> 00:11:16,110
the chance of you pulling out one color

206
00:11:16,110 --> 00:11:18,840
is gonna be the same, which is one out of three,

207
00:11:18,840 --> 00:11:23,220
or about 33.3%.

208
00:11:23,220 --> 00:11:26,640
Now, if you write down these chances for each color,

209
00:11:26,640 --> 00:11:29,760
we get a simple probability distribution.

210
00:11:29,760 --> 00:11:34,760
The chance of red is 33.3%, the chance of blue is 33.3%,

211
00:11:35,160 --> 00:11:38,538
and the chance of green is 33.3%.

212
00:11:38,538 --> 00:11:42,450
In the case of GPT generating a word,

213
00:11:42,450 --> 00:11:46,200
think of it like the model has a huge jar of words,

214
00:11:46,200 --> 00:11:49,380
and some words are more likely to be picked

215
00:11:49,380 --> 00:11:52,170
based on the previous words.

216
00:11:52,170 --> 00:11:56,010
The probability distribution is the rule that tells GPT

217
00:11:56,010 --> 00:11:59,040
how likely each word is to be picked next.

218
00:11:59,040 --> 00:12:01,980
Simply put, the probability distribution

219
00:12:01,980 --> 00:12:04,320
is the rule that tells GPT

220
00:12:04,320 --> 00:12:07,980
how likely each word is to be picked next.
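
A small sketch of what such a probability distribution could look like in code. The raw scores below are invented for this example, and softmax is the standard function assumed here for turning scores into probabilities that add up to 1.

```python
import math

def softmax(scores):
    # Convert raw scores into probabilities that sum to 1.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The candy jar: three colors with equal scores get equal chances.
candy_probs = softmax([0.0, 0.0, 0.0])

# Hypothetical scores for the word after "...was the world's most".
toy_scores = {"famous": 3.0, "influential": 2.5, "brilliant": 2.0, "pineapple": -4.0}
words = list(toy_scores)
probs = softmax([toy_scores[w] for w in words])
distribution = dict(zip(words, probs))

most_likely = max(distribution, key=distribution.get)
for word, p in distribution.items():
    print(f"{word}: {p:.3f}")
print("most likely:", most_likely)
```

The candy case comes out as three equal chances, while the word case is lopsided: a few words share almost all of the probability and an unrelated word gets next to none.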

221
00:12:07,980 --> 00:12:12,390
So GPT goes through this process, it chooses the next word,

222
00:12:12,390 --> 00:12:14,100
and then we have continuation,

223
00:12:14,100 --> 00:12:16,890
where the model continues this process

224
00:12:16,890 --> 00:12:21,630
for each subsequent token using the growing context

225
00:12:21,630 --> 00:12:25,590
to inform the prediction and selection process

226
00:12:25,590 --> 00:12:26,880
of the next token.

227
00:12:26,880 --> 00:12:31,170
For instance, after choosing the word influential,

228
00:12:31,170 --> 00:12:34,020
it might predict next that the proper word

229
00:12:34,020 --> 00:12:37,170
would be physicist, with a high probability,

230
00:12:37,170 --> 00:12:39,720
given that the context of the prompt

231
00:12:39,720 --> 00:12:42,030
is about Albert Einstein.

232
00:12:42,030 --> 00:12:44,820
With this, the model continues generating tokens

233
00:12:44,820 --> 00:12:47,640
until a stopping criterion is met,

234
00:12:47,640 --> 00:12:50,700
such as a special end-of-text token, a stop sequence,

235
00:12:50,700 --> 00:12:54,090
or the maximum token limit being reached.
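
The generate-one-token-at-a-time loop can be sketched like this. Everything here is a stand-in: the lookup table toy_next_token_probs and its numbers are invented, the "<end>" marker is a made-up stop signal, and the toy only looks at the last token, whereas a real model looks at all of the previous tokens.

```python
# Hypothetical table standing in for the model: for each token,
# made-up probabilities for what might come next.
toy_next_token_probs = {
    "most": {"influential": 0.6, "famous": 0.4},
    "influential": {"physicist": 0.9, "person": 0.1},
    "famous": {"physicist": 0.7, "scientist": 0.3},
    "physicist": {"<end>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):  # stop once the token limit is reached
        candidates = toy_next_token_probs.get(tokens[-1])
        if candidates is None:
            break  # the toy table has nothing to say about this token
        next_token = max(candidates, key=candidates.get)  # pick the most likely
        if next_token == "<end>":
            break  # made-up end marker acts as the stop signal
        tokens.append(next_token)
    return tokens

prompt = ["Albert", "Einstein", "was", "the", "world's", "most"]
print(" ".join(generate(prompt)))
```

Lowering max_new_tokens cuts the output short, which mirrors the token limit acting as a stopping condition.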

236
00:12:54,090 --> 00:12:55,830
This process continues

237
00:12:55,830 --> 00:12:59,010
until the model would likely create the phrase,

238
00:12:59,010 --> 00:13:03,150
Albert Einstein was the world's most influential physicist

239
00:13:03,150 --> 00:13:04,770
of the 20th century,

240
00:13:04,770 --> 00:13:07,650
whose work revolutionized our understanding

241
00:13:07,650 --> 00:13:10,800
of the fundamental laws of the universe.

242
00:13:10,800 --> 00:13:12,720
So it's really incredible to know

243
00:13:12,720 --> 00:13:16,290
that when you put in a prompt and instantly get your output,

244
00:13:16,290 --> 00:13:19,560
this is the process that's going on underneath the hood

245
00:13:19,560 --> 00:13:22,290
that involves billions of parameters,

246
00:13:22,290 --> 00:13:24,570
training data and information,

247
00:13:24,570 --> 00:13:27,990
that is synthesizing all this complex information

248
00:13:27,990 --> 00:13:30,453
and giving you your desired output.

