1
00:00:00,040 --> 00:00:04,640
And we will continue our exploration of tokenizers by looking at a few others. We're going to look at

2
00:00:04,640 --> 00:00:14,320
Phi-4 from Microsoft, DeepSeek 3.1 from Chinese upstart DeepSeek, and Qwen Coder 2.5 from Alibaba Cloud.

3
00:00:14,320 --> 00:00:16,440
So I will load in those variables.

4
00:00:16,480 --> 00:00:21,880
As always, it's the Hugging Face username, Microsoft, followed by the name of their model, Phi-4 mini

5
00:00:21,880 --> 00:00:22,880
instruct.

6
00:00:22,880 --> 00:00:28,000
So it's an instruct variant, and then we will take some text.

7
00:00:28,200 --> 00:00:35,440
"I am curiously excited to show Hugging Face Tokenizers in action to my LLM engineers."

8
00:00:35,560 --> 00:00:40,400
Let's turn that into tokens or really token IDs and then decode them.

9
00:00:40,400 --> 00:00:43,000
And we'll do it for Llama and then for Phi-4.

10
00:00:43,000 --> 00:00:47,720
So for Llama, you can see here there's a begin-of-text token.

11
00:00:47,720 --> 00:00:52,920
And then: I am curiously excited to show Hugging Face

12
00:00:52,960 --> 00:00:54,000
Tokenizers.

13
00:00:54,400 --> 00:00:58,680
And if you look down here, you'll notice we're looking at the Phi-4 version.

14
00:00:58,680 --> 00:01:00,880
So the numbers are completely different.

15
00:01:00,920 --> 00:01:03,520
That's the first thing to notice, just different IDs.

16
00:01:03,520 --> 00:01:06,910
It's been trained with different numbers and that doesn't matter at all.

17
00:01:06,950 --> 00:01:09,790
Whatever numbers it's been trained with just have to be used consistently.

18
00:01:10,190 --> 00:01:12,550
And it doesn't start with a special token.

19
00:01:12,550 --> 00:01:15,750
So they've got a different approach, a different structure.

20
00:01:15,750 --> 00:01:19,070
They've experimented, they've decided this suits their model better.

21
00:01:19,430 --> 00:01:22,990
And you'll also notice that mostly the words map the same way.

22
00:01:23,030 --> 00:01:26,630
Except look at "Hugging Face". When it came to Llama's

23
00:01:26,630 --> 00:01:34,790
tokenizer, it was split one way, but when it came to Phi-4, it was split another way.

24
00:01:35,070 --> 00:01:36,790
So different.

25
00:01:36,790 --> 00:01:39,390
Different tokenizers, slightly different behavior.

26
00:01:39,550 --> 00:01:43,910
And it's just worth appreciating that they're different but not worrying about it.

27
00:01:44,110 --> 00:01:49,270
And this is showing apply_chat_template.

28
00:01:49,510 --> 00:01:53,870
For Llama, we get these begin-of-text and start-header tokens.

29
00:01:53,910 --> 00:01:57,670
You'll notice Llama has something that they also insert in there.

30
00:01:57,830 --> 00:02:00,910
But then there are these headers.

31
00:02:01,230 --> 00:02:03,590
The Phi-4 one is quite a lot simpler.

32
00:02:03,590 --> 00:02:09,270
It has system, end, user, and assistant.

33
00:02:10,030 --> 00:02:11,380
And then the response.

34
00:02:11,380 --> 00:02:12,940
So that's the difference between them.

35
00:02:12,940 --> 00:02:15,780
The apply chat templates look completely different.

36
00:02:15,780 --> 00:02:20,300
And all that matters is that this was used consistently during training. Okay.

37
00:02:20,340 --> 00:02:23,420
And now we look at DeepSeek's tokenizer.

38
00:02:23,620 --> 00:02:29,460
And so what I've got here is just showing, side by side, Llama's tokens.

39
00:02:29,740 --> 00:02:31,340
And these are Phi-4's tokens.

40
00:02:31,340 --> 00:02:34,740
And these are DeepSeek's tokens for the same sentence.

41
00:02:34,780 --> 00:02:38,020
Just to illustrate again, the numbers are completely different.

42
00:02:38,020 --> 00:02:40,460
Different vocab, different IDs.

43
00:02:40,700 --> 00:02:43,660
All that matters is the consistency with the training data.

44
00:02:43,940 --> 00:02:48,100
Again, if you look at DeepSeek's chat template, it looks totally different.

45
00:02:48,220 --> 00:02:53,220
It has a special token for beginning of sentence and then user and then assistant.

46
00:02:53,420 --> 00:02:57,020
And the system prompt just comes at the front of the whole thing.

47
00:02:57,180 --> 00:02:58,700
That's how it's organized

48
00:02:58,700 --> 00:03:02,380
for DeepSeek. You can see these subtle differences between them.

49
00:03:02,740 --> 00:03:06,260
And then the final one I'm going to show you is a tokenizer for code.

50
00:03:06,420 --> 00:03:12,060
So this here is the tokenizer for Qwen Coder.

51
00:03:12,300 --> 00:03:17,800
And so we bring in the tokenizer. I've got some code: def hello_world(person), print

52
00:03:17,840 --> 00:03:18,960
Hello, person.

53
00:03:19,160 --> 00:03:21,960
A very complicated Python program there.

54
00:03:22,160 --> 00:03:29,280
And we are going to tokenize it and then just print out the results separately.

55
00:03:29,280 --> 00:03:29,600
Here you go.

56
00:03:29,640 --> 00:03:35,000
You see, what I've done here is I've put the number of a token and then what text it represents.

57
00:03:35,320 --> 00:03:42,200
And you can see that, for example, _world is something that maps to some token.

58
00:03:42,440 --> 00:03:46,720
And yeah, you can see that there are some constructs, like close-bracket

59
00:03:46,720 --> 00:03:48,400
colon, that get one token.

60
00:03:48,400 --> 00:03:51,240
And that seems to be quite an early token.

61
00:03:51,240 --> 00:03:57,960
And I haven't done a deep analysis on this, but I rather imagine that there are many more tokens

62
00:03:57,960 --> 00:04:04,240
in this tokenizer that are related to common coding constructs than there might be in a tokenizer that's

63
00:04:04,240 --> 00:04:06,080
not designed for a coding model.

64
00:04:06,080 --> 00:04:12,200
So they might have optimized their tokens to make sure that they can get the most meaning associated

65
00:04:12,200 --> 00:04:15,040
with the token in the way that they tokenize.

66
00:04:15,040 --> 00:04:19,760
So just worth understanding that the tokenizer is associated with the model.

67
00:04:19,800 --> 00:04:26,590
Different models might have a different strategy for how you convert natural language into these token

68
00:04:26,590 --> 00:04:30,870
IDs, because it's token IDs that models understand: the numbers.

69
00:04:31,030 --> 00:04:36,630
It's not as if language, or even lists of dicts, gets passed in to these models.

70
00:04:36,870 --> 00:04:38,750
So today was a bit of a gentle day.

71
00:04:38,790 --> 00:04:42,630
We just got through that pretty quickly, but I will make up for it tomorrow.

72
00:04:42,630 --> 00:04:47,030
Tomorrow will not be gentle. But here's what you can do now: you can code with frontier models.

73
00:04:47,030 --> 00:04:49,950
You can build a multi-modal AI assistant.

74
00:04:50,150 --> 00:04:55,870
But importantly, not only can you use Hugging Face pipelines, but you now know about tokenizers and

75
00:04:55,870 --> 00:05:01,390
how they turn language into numbers, words into numbers, and fragments of words into numbers.

76
00:05:01,950 --> 00:05:05,430
Next time, we then dig into Hugging Face.

77
00:05:05,430 --> 00:05:06,910
We use the models.

78
00:05:06,950 --> 00:05:11,990
We use Hugging Face models to generate some text, and it's basically doing the same thing as we did

79
00:05:11,990 --> 00:05:12,710
with the pipelines.

80
00:05:12,710 --> 00:05:19,150
But now we are actually driving the model inference ourselves, which is really fun.

81
00:05:19,350 --> 00:05:21,750
And there's other stuff for me to show you tomorrow.

82
00:05:21,750 --> 00:05:25,710
So prepare for a big day, and I will see you on the other side.