Original link:

https://www.youtube.com/watch?v=OLg2BKhwaSc

2023-08-28 16:17:10

Run a ChatGPT-like AI on Your Laptop Using LLaMA and Alpaca

video content Image generated by Wilowrid

Hello there .

My name's Gary Sims , and this is Gary explains now , obviously , we're going through quite a peak period with a generative , uh , a I we've got , you know , chat G BT you've got Bing you've got , which was a video I did yesterday .

And the thing is , all these , uh , chat models , all of these , uh uh uh machine model run up in the cloud A and all these kind of big GP U and everything , but did you know it is possible to run a pretty good , uh , to kind of chat GP T equivalent on your laptop ?

No need for any special graphics cards .

No need for any special hardware .

Just a normal laptop .

So if you want to find out more , please let me explain .

So I'm gonna take a quick look at the current language models and the new models that can be run on a laptop and then gonna give you a demonstration , it running so you can see how good it actually is .

And then right then we're gonna talk about why this is important .

It's more .

Trust me .

It's more than just an interesting exercise that somebody is able to hack together .

There's actually some important consequences for this .

OK , let's dive in to have a look .

OK , so to get these , uh , generative pretrained transformers GP T these language models running on a PC , not on a huge supercomputer today , we're gonna be looking at Lama and alpaca .

We'll get into those in a moment .

Let's just cover some of the basics .

All machine learning models .

Neural networks are used in two phases .

First , the training stage where the model is fed huge amounts of data which train the model to understand whatever it is that's being trained for , you know ?

Is this a cat ?

Is it not a cat ?

Is this a whole dog is on a hot dog ?

And then secondly , there's the inference stage where the model is interrogated or queried , and it gives a response , infers an answer based on the input that you can give it .

So , up until now , that's mainly been about classification .

So is this a picture of a cat ?

Yes or no ?

But of course , we also have things that , uh , generate an an output .

So you translate an audio into text .

So when we speak to the various assistants that there are on Android and on , you know , on these home smart home devices and so on .

At some point , there has to be a translation from the audio into some kind of text or some kind of internal representation so that the device can reply .

And now what is the real hot topic at the moment , of course , is these , uh , models that can generate text generate images .

I'm sure we're going to get other types of generation , uh , coming out , uh , all the time .

Now , uh , I saw a recent thing about how you can generate , you know , kind of robotic movements , uh , for , you know , factories for picking up boxes and things based on an input .

And then it generates the right sequence of movement .

So generates something based on the input .

Now , training is a single user task in that while the model is being trained , it isn't required to do anything else .

It just sits there and learns .

So there's one model .

It's just being trained .

It could take thousands of hours .

It can take crazy amounts of hardware .

You know , some of the numbers floating around for I think this was GP T three are like 285,000 CPU calls over 10,000 .

Uh , GP U calls where NVIDIA GP U calls have 400 gigabit per second network connectivity and these are all humming along for hours and hours and hours , days and days and days as all of this stuff is fed into the system and it is trained now , obviously , you're not going to be doing that on a PC of any kind because this is obviously a huge thing .

Now , inference is a multi user task in that lots of queries are happening at once over multiple copies of the trained model .

This requires lots of hardware , mainly because of scale .

Also , because of the the , you know , just the size of these models that they create 100 and 75 billion parameters .

I think we mentioned that a bit later in a further slide , uh , chat GP T is often unavailable or has been in the past due to demand because it just needs huge amounts of hardware to run this just because millions of people are using it now , uh , this is a common example of what would be used for both training and also for inference .

And just as a side note , NVIDIA really are coming out on top of this whole current , you know , uh , era of , uh , GP TS and that kind of stuff because they're all using NVIDIA , GP , US and NVIDIA have these various things like this is the , uh , NVIDIA H 100 or 606 140 gigabytes of total GP U memory .

Uh , and it's got four NVS switches so it can connect all these different bits inside of it .

56 core fourth generation intel processors , 30 terabytes of of SSD .

And this is one pod that you can buy like this .

And then these can be built together into a huge kind of supercomputer , and there could be many rows of these .

Now , of course , this is all happening somewhere in a data warehouse .

This is becomes available to companies like open A I via you know , Microsoft Azure .

Or , you know , other services that are offering these , uh , in the cloud .

Now la large language models chat GP T 3.5 GP T four are examples of what they call LL MS large language models 175 billion parameters in chat GP T 3.5 A parameter means a weight and a bias , the number of parameters in the neural network directly to the number of neurons and the number of connections between them .

So the more parameters there are , the the denser , the thicker , the greater it is .

And you need Supercomm .

As I said for both training , Uh , and for inference , 175 billion .

But now there are some new language models that have appeared .

First , we're going to talk about Llama , So this is the one from me .

That's the Facebook people , a collection of foundational language models ranging from 7 billion to 65 billion parameters .

So that's at least 2.5 times smaller than chat .

GP T 3.5 says that Lama , with a 13 billion parameters , outperformed chat GP T while being 10 times smaller .

So this is the point if we can .

We've got these huge language models and they're working surprisingly well .

Can we optimise in quote and bring these down to smaller sizes ?

Can we fine tune them ?

Can we cut off anything that's not really needed ?

And so that we get a more efficient , uh , model .

Now , if it's 10 times smaller , then you know , even just using simple math , you're gonna need a computer that's 10 times smaller .

So maybe you don't need a supercomputer to run , uh , one of these language models .

So there is an open source project called lama dot CPP .

It's a It's a llama C++ programme that allows you to query the llama model .

And the goal of the author was to run the model on a MacBook .

Uh , so it's a plain C implementation without any funny dependencies .

Uh , the Apple silicon is a first class citizen .

However , it does also support things like a VX two for X 86 archit .

I've tried this personally on some jet and hardware .

The Jets and or in the Jetson , Xavier NX works great on those , uh , and so it also runs only on the CPU .

So it will be interesting to see if people actually take this and , uh , make a GP U version of it .

This only runs on the CPU , and the way it can run on the CPU is by using four bit quantization .

So what is four bit quantization ?

So let's have a look in the next slide .

This is a process of reducing the precision of the parameters again , that's the weights and the biases and the connections we talk about so that they consume less memory .

In other words , the process of quantization is the process of taking a neural network , which generally uses 32 bit floats to represent the parameters and instead to convert them to a smaller representation .

In our case , four bit integers obviously a lot , lot less precision there , you know ?

Just think .

If you want to think about , you know , back in the days when we used to have computers , uh , back back in the 18 94 bit colour , if you show showed a picture of four bit colour , it look pixelated .

The colours were not very good .

Nowadays , of course , we use 24 bit and 32 bit colour in just about everything , so that's not a problem .

But so to imagine that kind of idea .

But take it down to a neural network , uh , more blocky , you know , less less colours .

It doesn't look so good , so it it's not a good it's a good representation .

However , if you see a four bit colour picture of a cat .

You still recognise it as a cat , and the neural network can still work now .

The result is it's smaller , faster and more efficient .

But the drawbacks are that it's an accuracy , degradation and lower precision , which means you can also lose , uh , information .

So when we come to the test in a minute , you have to remember that this is one that's been reduced down to four bits , and also it's running on a PC and see what we can achieve .

So here's a question for you .

What case do you use on your phone ?

I recommend Phoenix cases .

They are ultra slim .

Why put on a big , thick , chunky case ?

Run all that design that the manufacturer has spent so much time perfecting ?

You really should check out Phoenix cases , and there's a link in the description below .

If you do use it , it's an affiliate link , which means you help out this channel and the other one we're gonna In fact , this is one I'm gonna use for the demo .

Is Alpaca lama alpaca ?

They're all kind of going along and say now alpaca is a is a is a is the llama model , but it's been fine tuned to train to make it a kind of , uh , an instruction follow model .

So you give it an instruction and it follows it .

So it's trained on 52,000 instruction follow demonstrations .

And so there's another project called alpaca dot CBP , a set of modifications to llama dot CPP to work with the alpaca and to add a chat chat interface .

All the links to these github projects will be , uh , in the description below .

They're really easy to set up if you know anything about just compiling a piece of code from chat GP T to downloading the files that you need for the models or the weights and the parameters we were just talking about .

It's actually pretty easy to get yourself up and running , and the instructions are pretty good .

OK , so we're gonna do this now .

We're going to run out of Packer , and I want to just point out that this is just running on an A laptop , not a very expensive one .

It's got a rise on 55 55,000 500 U at six core 12 threads , 16 gigs of Ram .

It's running Linux , and in fact , when it's running , it uses about four gigabytes of Ram .

Now , when you use some of the other llama models , they can go up quite big .

I did run the 30 billion parameter one on a jet and or in , uh , because it has 32 gigabytes of memory and that did work on there .

But just just to show you how good alpaca is on an Asus laptop , let's do that now .

OK , so here we are inside of the project , you just some C files and some binary files and some other things that you need here .

And basically there's a little programme called Chat , so you just run chat and it will go ahead and load up the model into memory .

As I said , it takes about four gigs , and then it will give you a prompt .

So now we can type .

So this is just like chat GP T in that I can just ask it questions , So I prepared some questions .

Let's go with the first one .

What are the best places to visit in New York ?

Remember ?

I'm asking this of a language model , which is running on a laptop in four gigabytes of memory .

So if this can come up with anything , that's even reasonable English , you know , and even has something , you know , that's actually quite good .

This is astounding .

So here we go .

The top destination for tourists include Times Square , Central Park , the Metropolitan Museum of Art , among others Where they go .

Not a very long , but that is a correct answer .

And so I'm running this on a laptop , not on a supercomputer .

Somewhere up in the cloud .

OK , let's see .

Let's make these but more difficult explain what we're gonna ask you about .

Pinocchio is something I've been doing in the other ones .

Explain out my other videos on on chat , GP , T and Bar and so on .

Explain the plot of Pinocchio in one sentence .

Let's see now it doesn't I haven't told you what Pinocchio is .

I'm assuming it knows what Pinocchio is .

It's got the story built into it .

There .

Pinocchio is a wooden pan , becoming a home but prove himself worth by competing various tasks before he can transform to Real boy again .

The story follows his adventures and Miss Avengers , where as the lessons that I learned along the way .

Absolutely right .

So there you go .

I've just This is so amazing .

He's running on my laptop .

I've just I don't need you know , all these huge things running up in these huge , uh , cloud things .

Um , I'll talk more about why this is important .

At the very end .

I do stick around to the end because this is the this is really important to understand where where these generative models are going and what it means for us as consumers .

Anyway , let's ask another one .

What is WiFi six E Now , uh , I don't know what the cut off date for this is .

How much does it know about WiFi six E WiFi with the registered trademark six or 802 to 11 .

A X was released on the six and represents upgrading and wireless technology , offering faster speeds .

And now that's all true .

But of course , the six E version is what six a hour ago is an extension that has support for up to eight hours antennas on each access point .

Of course , the thing about six years if it runs in six gigahertz , not 2.4 gigahertz or five gigahertz .

So OK , the model didn't know that .

But the tech generated is readable English .

It's about WiFi .

It's got some interesting facts in there .

Of course .

Remember , this is just running in four gigabytes of memory .

So I'm impressed .

Of course , this isn't good enough for , you know , a final product , but this just shows you the direction things are going .

As I said , Wait till the end .

We'll talk more about what this is .

It's 5 a.m. You're asleep and was allowed to knock at the door at a friend who has come over to have breakfast .

In your fridge are eggs , bread , milk , juice , a jar of honey .

What do you open first ?

And I've done this on the other videos about chat GP T bar and so on .

Of course , one is that you should open the front door first because your friend standing , this is a kind of a kid run .

Other people reply .

You should open your eyes first because you are asleep .

So anyway , let's just see what this is .

The point is , if this can even understand a little bit about what I'm saying here , this is just absolutely amazing .

It may not get the trick .

The joke about the fact that you should open your eyes first or open the door first .

However , if you can understand right , opening up the fridge for food can be tricky when there's so much variety in one place .

Absolutely .

If I had to choose between opening an egg or some freshly baked bread from from from my friend's visit , it would have to go with eggs .

Egg whites are a great source of protein and vitamins , blah , blah , blah , blah , blah .

OK , so , uh , now it's spewing out a bit of extra tech here that's come from this inference process instruction .

Write down three things you can do today in order to make your life easier tomorrow and it's look , it's it's running on Luke so clearly there's some work to be done here about not letting this thing , you know , kind of wax lyrical when because it's just generating text and it can kind of trick itself into into generating more text , which is what we're seeing now .

OK , the final test for this model is write a list of five names for characters in a sci-fi novel .

Let's see whether it can understand what a sci-fi novel is , whether it can generate names for me or not .

One doctor , Catherine Smith PhD , a brilliant scientist who had dedicated her life to finding the cure for an epidemic that is plaguing rage ravaging humanity's population centre .

Around there , she works over .

OK , so yeah , she's a good character .

Brilliant doctor John MD , Katherine's husband .

Oh , look , they done a relationship between the first character and the second character .

He's also trying to to find a cure .

This is very interesting .

So , uh , it's interesting They connected the two together There .

Three .

Captain James Jones , a brave captain of And now remember , this is all coming out like T .

GP does a word at a time , but this is all coming out my laptop .

Dr .

Sarah Smith PhD .

Catherine's daughter .

We've all got a whole family connection thing going on here .

She's quite clever .

It's me up a lot for maybe this is a real book , I don't know , but some of you sci-fi fans who've read more than I am maybe these are real characters .

I don't know .

This is quite interesting .

Um , and who we got ?

Number five .

Commander John Jones , a brave commander of the United States to explore New World .

Fantastic .

I think that's absolutely amazing .

For what you know , we expect so much from chat GP T .

But this is running on my laptop anyway .

Let's talk about that now .

Why is it important ?

This is just running on my laptop .

OK , so why is this important ?

Well , in the kind of general history of computing , we always get this swinging back and forth between running things locally and running things remotely .

You know , way , way back when we had kind of main frames , there was always terminals .

And then there were workstations that connected to the main frames .

And then ultimately we had the kind of the era of the PC where the power kind of came back to , you know , to the local user , and then as we go on , we've now got cloud services , and so everything's done up in the cloud .

But even things like that with Google's pixel phones , they want to bring a lot of the stuff back to the actual phone itself and not up to the cloud .

So there's always this swinging back and forth now , at the moment we're in this great boom period for generative a IS , and it's all running in the cloud .

But if we can actually generate models that are small enough , or particularly models that are good for a particular domain , so they're only experts in one little area .

And having those running on a smartphone on a laptop as an assistant , even in some kind of consumer gadget would be really , really good .

And the way this research is going , it's showing that it may not only be that these models run up in the cloud , but we could actually run them usefully and actually , uh , use them effectively on a local device and I find that really interesting .

OK , that's it .

My name is Gary Sims .

This is Gary explains .

I really hope you enjoyed this video .

If you did , please do give it a thumbs up .

If you like these kind of videos , why not stick around by subscribing to the channel ?

OK , that's it .

I'll see you in the next one

Partnership

Attention YouTube vloggers and media companies!
Are you looking for a way to reach a wider audience and get more views on your videos?
Our innovative video to text transcribing service can help you do just that.
We provide accurate transcriptions of your videos along with visual content that will help you attract new viewers and keep them engaged. Plus, our data analytics and ad campaign tools can help you monetize your content and maximize your revenue.
Let's partner up and take your video content to the next level!
Contact us today to learn more.