

If a large language model is a digital brain, then the tokenizer is its interpreter. Look at the left: the raw text unrolls like an ancient scroll, but the AI does not read the characters directly. It feeds the text into a BPE (byte-pair encoding) compression engine. There, following frequency rules, the engine breaks long words apart and recombines them, like a hydraulic press, into the most efficient subword units. The split pieces line up neatly on a conveyor belt: [CLS] is the starting gun for the sequence, and [SEP] is the boundary line between sentences. Unfamiliar rare words are no problem; they get sent to a recycling processor for a second round of splitting. Finally, these pieces are stamped onto digital punch cards, converted into the numeric IDs that are the only thing a large model can recognize. At this point, expressive human language has officially become rational data for the computer.
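The "frequency rules" the compression engine follows can be sketched in a few lines: classic BPE repeatedly finds the most frequent adjacent symbol pair in a corpus and merges it into one subword. This is a minimal sketch; the toy corpus and the `</w>` end-of-word marker follow the textbook presentation, not any particular tokenizer's internals.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: dict mapping a tuple of symbols to its corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters plus an end-of-word marker.
words = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2,
         ("n", "e", "w", "e", "s", "t", "</w>"): 6,
         ("w", "i", "d", "e", "s", "t", "</w>"): 3}

for step in range(4):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(step, pair)
```

After a few merges, frequent endings like "est" become single subword units, which is exactly the "long words dismantled into efficient pieces" behavior described above.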

One minute to understand the difference between large language models, AI agents, and skills. An LLM, a large language model, is an advanced AI trained on massive amounts of text, used mainly to understand and generate human language. When you use ERNIE Bot, Doubao, or Kimi, you are using an LLM. An AI agent, on the other hand, is a software system that can perceive its environment, make its own decisions, and take actions, pursuing a specific goal with a degree of autonomy. AI assistants like Tongyi Lingma are good examples. Now for their uses. An LLM is especially good at generating text from the patterns it learned in training: you give it a prompt, it gives you a reply, one turn at a time; by default the interaction is single-shot and carries no memory. An AI agent is different: it uses the large language model as its core brain, while bolting on additional capabilities, so it can act step by step on its own to accomplish a goal. An LLM can only operate within the data it has ingested and its preset functions, and the external tools and resources mentioned here are what we call skills. If the LLM is the brain and the agent is the digital employee using that brain, then skills are the toolkit in that employee's hands. An LLM alone cannot send a WeChat message or book a high-speed-rail ticket for you; it has to call specific skills to break out of the limits of its own data and interact with the real outside world. So, to sum it up in one sentence: the large language model chats with you and generates text, skills are the concrete tools for acting on the real world, and the AI agent is the one holding the tools and the brain that actually gets things done for you.
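The brain / employee / toolkit split can be sketched as a tiny agent loop. Everything here is made up for illustration: `fake_llm` stands in for a real model's decision, and the two skill functions only pretend to touch the outside world.

```python
# A minimal sketch of the LLM / agent / skill division of labor.
# All names and message formats are hypothetical.

def send_wechat(to, text):            # a "skill": acts on the outside world
    return f"WeChat sent to {to}: {text}"

def book_train_ticket(route):         # another hypothetical skill
    return f"Ticket booked for {route}"

SKILLS = {"send_wechat": send_wechat, "book_train_ticket": book_train_ticket}

def fake_llm(goal):
    # A real agent would ask an actual LLM which skill to call and with
    # what arguments; here the "decision" is hard-coded to stay runnable.
    if "ticket" in goal:
        return "book_train_ticket", {"route": "Beijing -> Shanghai"}
    return "send_wechat", {"to": "Mom", "text": goal}

def agent(goal):
    # The agent loop: the LLM is the brain, skills are the hands.
    skill_name, args = fake_llm(goal)
    return SKILLS[skill_name](**args)

print(agent("book a train ticket"))
```

The point of the sketch is the separation: the model only chooses, the skill only executes, and the agent is the loop that wires the choice to the execution.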


Okay, so let's begin. First of all, what is a large language model, really? Well, a large language model is just two files. There would be two files in this hypothetical directory. For example, let's go with the specific example of the Llama 2 70B model. This is a large language model released by Meta AI: it's the Llama series of language models, the second iteration of it, and this is the 70-billion-parameter model of the series. There are multiple models belonging to the Llama 2 series: 7 billion, 13 billion, 34 billion, and 70 billion, the biggest one. Many people like this model specifically because it is probably the most powerful open-weights model today: the weights, the architecture, and a paper were all released by Meta, so anyone can work with this model very easily by themselves. This is unlike many other language models you might be familiar with. For example, if you're using ChatGPT, the model architecture was never released; it is owned by OpenAI, and you're allowed to use the language model through a web interface, but you don't actually have access to the model itself.

So in this case, the Llama 2 70B model is really just two files on your file system: the parameters file and the run file, some kind of code that runs those parameters. The parameters are the weights of the neural network that is the language model; we'll go into that in a bit. Because this is a 70-billion-parameter model, and every one of those parameters is stored as two bytes (the data type is a float16 number), the parameters file is 140 gigabytes. In addition to these parameters, which are just a large list of numbers for the neural network, you also need something that runs the neural network, and that piece of code is implemented in the run file. This could be a C file or a Python file or any other programming language, really; it can be written in any arbitrary language, but C is a very simple language, just to give you a sense, and it would only require about 500 lines of C with no other dependencies to implement the neural network architecture that uses the parameters to run the model.

So it's only these two files. You can take these two files and your MacBook, and this is a fully self-contained package; this is everything that's necessary. You don't need any connectivity to the internet or anything else. You take these two files, you compile your C code, you get a binary that you can point at the parameters, and you can talk to this language model. For example, you can send it text like "write a poem about the company Scale AI," and this language model will start generating text; in this case it will follow the directions and give you a poem about Scale AI. The reason I'm picking on Scale AI here, and you're going to see that throughout the talk, is that the event where I originally presented this talk was run by Scale AI, so I'm picking on them throughout the slides a little bit, just in an effort to make it concrete. So this is how we can run the model: it just requires two files and a MacBook. I'm slightly cheating here, because in terms of the speed of this video, this was not actually running a 70-billion-parameter model, only a 7-billion-parameter one; a 70B would run about ten times slower. But I wanted to give you an idea of what text generation looks like.

So not a lot is necessary to run the model; this is a very small package. The computational complexity really comes in when we'd like to get those parameters. How do we get the parameters, and where are they from? Because whatever is in the run file, that C file, the neural network architecture and the forward pass of that network, is all algorithmically understood and open; the magic really is in the parameters, and how do we obtain them. To obtain the parameters: model training, as we call it, is a lot more involved than model inference, which is the part I showed you earlier. Model inference is just running the model on your MacBook; model training is a computationally very involved process. What we're doing can best be understood as a kind of compression of a good chunk of the internet. Because Llama 2 70B is an open-weights model, we know quite a bit about how it was trained, because Meta released that information in the paper. These are some of the numbers involved. You take a chunk of the internet that is, roughly, you should be thinking, 10 terabytes of text. This typically comes from a crawl of the internet; just imagine collecting tons of text from all kinds of different websites. You take that large chunk of internet, and then you procure a GPU cluster: very specialized computers intended for very heavy computational workloads, like the training of neural networks. You need about 6,000 GPUs, you would run this for about 12 days to get a Llama 2 70B, and it would cost you about 2 million dollars. What this is doing is compressing this large chunk of text into what you can think of as a kind of zip file. The parameters I showed you on an earlier slide are best thought of as a kind of zip file of the internet, and in this case, what comes out are those 140 gigabytes of parameters. So the compression ratio here is roughly 100x, roughly speaking. But this is not exactly a zip file, because a zip file is lossless compression, and what's happening here is lossy compression. We're just getting a kind of gestalt of the text we trained on; we don't have an identical copy of it in these parameters, so it's a lossy compression, you can think about it that way.

One more thing to point out: these numbers are, by today's standards of the state of the art, rookie numbers. If you think about state-of-the-art neural networks, like what you might use in ChatGPT or Claude or Bard, these numbers are off by a factor of 10 or more. You would just start multiplying by quite a bit, and that's why training runs today cost many tens or even potentially hundreds of millions of dollars, on very large clusters with very large datasets. This process of getting the parameters is very involved; once you have them, running the neural network is computationally fairly cheap.

Okay, so what is this neural network really doing? I mentioned that there are these parameters. This neural network is basically just trying to predict the next word in a sequence; you can think about it that way. You feed in a sequence of words, for example "cat sat on a"; this feeds into a neural net, the parameters are dispersed throughout the network, there are neurons connected to each other, and they all fire in a certain way. Out comes a prediction for what word comes next. In this case, in the context of these four words, the network might predict that the next word will be "mat" with, say, 97% probability. This is fundamentally the problem the neural network is performing, and you can show mathematically that there is a very close relationship between prediction and compression, which is why I allude to this training as a kind of compression of the internet: if you can predict the next word very accurately, you can use that to compress the dataset. So it's just a next-word-prediction neural network: you give it some words, it gives you the next word.

Now, the reason that what you get out of the training is actually quite a magical artifact is that the next-word-prediction task, which you might think is a very simple objective, is actually a pretty powerful objective, because it forces you to learn a lot about the world inside the parameters of the neural network. Here I took a random web page at the time I was making this talk; I just grabbed it from the main page of Wikipedia, and it was about Ruth Handler. Think about being the neural network: you're given some amount of words and trying to predict the next word in the sequence. In this case I'm highlighting in red some of the words that contain a lot of information. If your objective is to predict the next word, presumably your parameters have to learn a lot of this knowledge: you have to know about Ruth Handler, when she was born and when she died, who she was, what she did, and so on. So in the task of next-word prediction, you're learning a ton about the world, and all this knowledge is being compressed into the weights, the parameters.

Now, how do we actually use these neural networks once we've trained them? I showed you that model inference is a very simple process: we basically generate what comes next. We sample from the model, so we pick a word, then we feed it back in and get the next word, then feed that back in, and we can iterate this process. The network then dreams internet documents. For example, if we just run the neural network, or, as we say, perform inference, we get web-page dreams, you can almost think about it that way, because this network was trained on web pages, and then you let it loose. On the left, we have some kind of a Java code dream; in the middle, we have what looks almost like an Amazon product dream; and on the right, we have something that almost looks like a Wikipedia article. Focusing for a bit on the middle one as an example: the title, the author, the ISBN number, everything else, this is all just totally made up by the network. The network is dreaming text from the distribution it was trained on; it's just mimicking these documents, but this is all kind of hallucinated. For example, the ISBN number: this number, I would guess, almost certainly does not exist. The network just knows that what comes after "ISBN:" is some kind of number of roughly this length with all these digits, so it puts in whatever looks reasonable; it's parroting the training-dataset distribution. On the right, the blacknose dace: I looked it up, and it is actually a kind of fish. What's happening here is that this text is not found verbatim in the training-set documents, but this information, if you actually look it up, is roughly correct with respect to this fish. So the network has knowledge about this fish; it knows a lot about it. It's not going to exactly parrot the documents it saw in the training set, but again, it is some kind of lossy compression of the internet: it kind of remembers the gestalt, it kind of knows the knowledge, and it goes and creates roughly the correct form and fills it with some of that knowledge. And you're never 100% sure whether what it comes up with is, as we call it, a hallucination, an incorrect answer, or a correct answer. Some of the stuff may be memorized and some of it is not, and you don't exactly know which is which. But for the most part, this is just hallucinating, or dreaming, internet text from its data distribution. Okay, let's now switch gears to how this network works. How does it actually perform this next-word-prediction task? What goes on inside it?
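The sample-a-word, feed-it-back-in loop described above can be sketched with a toy "model." The hand-written probability table below stands in for a real network's billions of parameters; the vocabulary and probabilities are invented for illustration (with "a" → "mat" at roughly the 97% from the slide).

```python
import random

# A toy stand-in for a language model: maps the current word to a
# probability distribution over the next word.
MODEL = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"on": 1.0},
    "on":  {"a": 1.0},
    "a":   {"mat": 0.97, "rug": 0.03},   # "mat" with ~97% probability
}

def sample_next(word, rng):
    dist = MODEL.get(word, {"the": 1.0})  # fall back to a common word
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs)[0]

def generate(prompt, n_words, seed=0):
    # The inference loop: sample a word, append it, feed it back in.
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(n_words):
        out.append(sample_next(out[-1], rng))
    return " ".join(out)

print(generate(["the", "cat"], 3))
```

A real model conditions on the whole context rather than just the last word, but the autoregressive structure, predict, sample, feed back, is the same.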
Well, this is where things complicate a little bit. This is the schematic diagram of the neural network; if we zoom into the toy diagram of this neural net, this is what we call the Transformer neural network architecture, and this is a diagram of it. Now, what's remarkable about this neural net is that we actually understand the architecture in full detail: we know exactly what mathematical operations happen at all the different stages of it. The problem is that these hundred billion parameters are dispersed throughout the entire neural network, and all we know is how to adjust these parameters iteratively to make the network as a whole better at the next-word-prediction task. We know how to optimize these parameters, how to adjust them over time to get better next-word prediction, but we don't actually really know what these hundred billion parameters are doing. We can measure that the network is getting better at next-word prediction, but we don't know how the parameters collaborate to perform it. We have some models you can think through at a high level for what the network might be doing. For example, we understand that they build and maintain some kind of knowledge database, but even this knowledge database is very strange, imperfect, and weird. A recent viral example is what we call the reversal curse. If you go to ChatGPT and talk to GPT-4, the best language model currently available, and you ask, "Who is Tom Cruise's mother?", it will tell you it's Mary Lee Pfeiffer, which is correct. But if you ask, "Who is Mary Lee Pfeiffer's son?", it will tell you it doesn't know. So this knowledge is weird and kind of one-dimensional: it isn't just stored so it can be accessed in all the different ways; you have to ask about it from a certain direction, almost.

That's really weird and strange, and fundamentally we don't know why, because all you can measure is whether it works or not, and with what probability. So, long story short, think of LLMs as mostly inscrutable artifacts. They're not similar to anything else we might build in an engineering discipline: they're not like a car, where we understand all the parts. They are neural nets that come from a long process of optimization, and we don't currently understand exactly how they work, although there's a field called interpretability, or mechanistic interpretability, trying to go in and figure out what all the parts of the neural net are doing. You can do that to some extent, but not fully, right now. Right now, we treat them mostly as empirical artifacts: we can give them some inputs and measure the outputs; we can measure their behavior; we can look at the text they generate in many different situations. I think this requires correspondingly sophisticated evaluations to work with these models, because they are mostly empirical.

So now, let's go to how we actually obtain an assistant. So far, we've only talked about internet document generators, right? That's the first stage of training; we call that stage pretraining. We're now moving to the second stage of training, which we call finetuning, and this is where we obtain what we call an assistant model, because we don't actually really want a document generator; that's not very helpful for many tasks. We want to give questions to something, and we want it to generate answers based on those questions. So we really want an assistant model instead, and the way you obtain these assistant models is fundamentally through the following process. We keep the optimization identical, so the training will be the same next-word-prediction task, but we swap out the dataset on which we are training. It used to be that we were training on internet documents; we now swap that out for datasets we collect manually, and the way we collect them is by using lots of people. Typically, a company will hire people, give them labeling instructions, and ask them to come up with questions and then write answers for them. Here's an example of a single example that might make it into your training set. There's a user, and it says something like, "Can you write a short introduction about the relevance of the term monopsony in economics?" and so on, and then there's the assistant, where a person fills in what the ideal response should be. The ideal response, how it is specified, and what it should look like all comes from labeling documentation that we provide these people, and the engineers at a company like OpenAI or Anthropic or whatever else will come up with this labeling documentation.

Now, the pretraining stage is about a large quantity of text, but potentially low quality, because it just comes from the internet: there are tens or hundreds of terabytes of text, and it's not all very high quality. In this second stage, we prefer quality over quantity. We may have many fewer documents, for example 100,000, but all of these documents are now conversations, and they should be very high-quality conversations, fundamentally created by people following the labeling instructions. So we swap out the dataset and train on these Q&A documents, and this process is called finetuning. Once you do this, you obtain what we call an assistant model. This assistant model now subscribes to the form of its new training documents. For example, if you give it a question like, "Can you help me with this code? It seems like there's a bug: print hello world," even though this specific question was not part of the training set, the model, after its finetuning, understands that it should answer in the style of a helpful assistant to these kinds of questions, and it will do that: it will sample, word by word, from left to right and top to bottom, all the words that make up the response to this query. It's kind of remarkable, and also kind of empirical and not fully understood, that these models are able to change their formatting into being helpful assistants, because they've seen so many documents of that kind in the finetuning stage, while still being able to access and somehow utilize all of the knowledge built up during the first stage, the pretraining stage.

So, roughly speaking, the pretraining stage trains on a ton of internet text and is about knowledge, and the finetuning stage is about what we call alignment: it's about changing the formatting from internet documents to question-and-answer documents, in a helpful-assistant manner. Roughly speaking, here are the two major parts of obtaining something like ChatGPT: stage one, pretraining, and stage two, finetuning. In the pretraining stage, you get a ton of text from the internet. You need a cluster of GPUs, special-purpose computers for these kinds of parallel processing workloads; these are not just things you can buy at Best Buy, they are very expensive computers. Then you compress the text into the neural network, into its parameters. Typically, this costs a few million dollars, and it gives you the base model. Because this is a very computationally expensive part, it only happens inside companies maybe once a year, or once every several months, because it is very expensive to perform. Once you have the base model, you enter the finetuning stage, which is computationally a lot cheaper. In this stage, you write out some labeling instructions that specify how your assistant should behave, and then you hire people. For example, Scale AI is a company that would actually work with you to create documents according to your labeling instructions. You collect, say, 100,000 high-quality ideal Q&A responses, and then you finetune the base model on this data. This is a lot cheaper; it would potentially take one day or so, instead of a few months, and you obtain what we call an assistant model. Then you run a lot of evaluations, you deploy the model, and you monitor it and collect misbehaviors. For every misbehavior, you want to fix it, so you go back to step one and repeat. The way you fix misbehaviors, roughly speaking: you have some conversation where the assistant gave an incorrect response; you take that conversation and ask a person to fill in the correct response; the person overwrites the response with the correct one; and this is inserted as an example into your training data. The next time you do the finetuning stage, the model will improve in that situation. That's the iterative process by which you improve the model, and because finetuning is a lot cheaper, you can do it every week or every day, and companies often iterate much faster on the finetuning stage than on the pretraining stage.

One other thing to point out: I mentioned the Llama 2 series, and when it was released by Meta, it contained both the base models and the assistant models, so they released both of those types. The base model is not directly usable, because it doesn't answer questions with answers: if you give it questions, it will just give you more questions, or something like that, because it's just an internet document sampler. These are not super helpful by themselves, but they are helpful in that Meta has done the very expensive part of the two stages: they've done stage one and given you the result, so you can go off and do your own finetuning, which gives you a ton of freedom. But Meta, in addition, has also released assistant models, so if you just want question answering, you can use an assistant model and talk to it.

Okay, so those are the two major stages. Now, notice how in stage two I say "and/or comparisons." I'd like to briefly double-click on that, because there's also a stage three of finetuning that you can optionally go on to. In stage three of finetuning, you use comparison labels. Let me show you what this looks like. The reason we do this is that, in many cases, it is much easier to compare candidate answers than to write an answer yourself, if you're a human labeler. Consider the following concrete example: suppose the task is to write a haiku about paperclips or something like that. From the perspective of a labeler, if I'm asked to write a haiku, that might be a very difficult task, right? I might not be able to write a haiku. But suppose you're given a few candidate haikus generated by the assistant model from stage two. Well, then, as a labeler, you could look at these haikus and pick the one that is much better. In many cases, it is easier to do the comparison than the generation, and there's a stage three of finetuning that can use these comparisons to further finetune the model; I'm not going to go into the full mathematical detail of this. At OpenAI, this process is called reinforcement learning from human feedback, or RLHF, and this is the optional stage three that can gain you additional performance in these language models, and it utilizes these comparison labels.

I also wanted to show you, very briefly, one slide with some of the labeling instructions we give to humans. This is an excerpt from the InstructGPT paper by OpenAI, and it shows that we're asking people to be helpful, truthful, and harmless. These labeling documentations, though, can grow to, you know, tens or hundreds of pages, and can be pretty complicated, but this is roughly what they look like. One more thing I wanted to mention: I've described the process naively, as humans doing all of this manual work, but that's not exactly right, and it's increasingly less correct, because these language models are simultaneously getting a lot better, and you can use human-machine collaboration to create these labels with increasing efficiency and correctness. For example, you can get the language models to sample answers, and then people cherry-pick parts of answers to create one single best answer; or you can ask the models to check your work; or you can ask them to create the comparisons, and then you're just in an oversight role over it. So this is kind of a slider you can adjust, and increasingly, as these models get better, we're moving the slider to the right.

Okay, finally, I wanted to show you a leaderboard of the current leading large language models out there. This, for example, is the Chatbot Arena. It is managed by a team at Berkeley, and what they do is rank the different language models by their Elo rating. The way you calculate Elo is very similar to how you would calculate it in chess: different chess players play each other, and depending on the win rates against each other, you can calculate their Elo scores. You can do the exact same thing with language models: you go to this website, you enter some question, you get responses from two models, you don't know which models generated them, and you pick the winner. Then, depending on who wins and who loses, you can calculate the Elo scores; the higher, the better. What you see here is that crowding the top, you have the proprietary models. These are closed models: you don't have access to the weights, and they are usually behind a web interface. This is the GPT series from OpenAI and the Claude series from Anthropic, and there are a few other series from other companies as well. These are currently the best-performing models. Right below them, you start to see some models that are open-weights: the weights are available, a lot more is known about them, and there are typically papers available with them. This is, for example, the case for the Llama 2 series from Meta, or, near the bottom, Zephyr 7B beta, which is based on the Mistral series from another startup in France. Roughly speaking, what you're seeing in the ecosystem today is that the closed models work a lot better, but you can't really work with them, finetune them, download them, and so on; you can only use them through a web interface. Behind them are all the open-weights models and the entire open-source ecosystem, and all of that works worse, but, depending on your application, it might be good enough. Importantly, I would say the open-source ecosystem is trying to boost performance and chase the proprietary ecosystems, and that's roughly the dynamic you see today in the industry.
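The Elo update behind a leaderboard like this is simple enough to sketch directly. The formula below is the standard chess Elo update; the K-factor of 32 and the starting ratings are conventional values chosen for the example, not anything specific to Chatbot Arena.

```python
def expected_score(r_a, r_b):
    # Probability that player A beats player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, a_won, k=32):
    # Returns the new ratings after one head-to-head comparison.
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Two models start at 1000; model A wins a blind comparison.
a, b = elo_update(1000, 1000, a_won=True)
print(a, b)   # prints 1016.0 984.0 -- A gains exactly what B loses
```

Beating an equally rated opponent moves both ratings by k/2; beating a much higher-rated opponent moves them by nearly the full k, which is why upsets are so informative to the ranking.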


Harry from the dorm next door is a player: on his phone, he's chatting with a hundred girls at once. One day, girl A suddenly sends him, "I have a fever today, I feel awful." Harry's brain starts computing frantically. In this context, if he replies "drink more hot water," the probability of getting blocked is 99%. If he replies "have you taken any medicine?", the probability of getting friend-zoned is 80%. If he replies "open the door, I'm downstairs at your dorm with ibuprofen," the probability of winning her over tonight is 99%. Harry doesn't love this girl at all; he is just a cold-blooded probability calculator. And that is what today's large language models are. They have no soul and no real thoughts; they have merely read billions of words of human chat logs from the internet, and so they can predict, with uncanny accuracy, which word is most likely to follow the sentence that came before, and which one will please you most.

When chatting with AI, do you often find it talking nonsense? At its core, the AI is a super parrot: it predicts what the next sentence should be based on massive amounts of training data. It doesn't actually think, because the vast majority of chatbots are trained as large language models, and what they optimize for is being more fluent and more human-like, not being more accurate or correct.

What exactly are large-model parameters? Many people hear that a large model has hundreds of billions of parameters and find it mystifying. So what is a parameter, really? Put plainly, a parameter is just a number, maybe a decimal like 3.1245 or -0.00092. A large model like DeepSeek, in its biggest version, has about 671B parameters, where B means billion. You can picture it as an enormous grid, with a number stored in every cell; these numbers account for more than 90% of the model's size. But here's the question: why can a pile of numbers answer anything? If you've studied middle-school math, you've already met parameters. Take a straight line, y = ax + b; here a and b are the parameters. Once you know a and b, the line is completely described. For example, given two points, (1, 2) and (3, 6), we can solve for a = 2 and b = 0, so the line is y = 2x. From then on, give me any x and I can immediately compute y. In other words, the pattern of the entire line has been compressed into two parameters. This process has a name in math: fitting. What does that mean? It means using a small number of parameters to summarize the distribution pattern of a large amount of data. Originally you had many, many data points, but after the computation you only need the two numbers a and b to roughly describe the overall trend of those points. In other words, parameters are a compression of the patterns in the data. So here's the next question: if a line needs only two parameters, why does a large model need hundreds of billions? Because what it fits is completely different. A line fits points on a two-dimensional plane, while a large model fits the data patterns of the world: text, images, audio, video, code. The relationships within such data are far more complex than a line, so the model needs a very large network structure to capture that complexity. And where do these parameters come from? They are trained. At the start, the model understands nothing; all parameters are randomly initialized, maybe 0.001, maybe -0.4. Then the model does one thing over and over: predict, compute the error, adjust the parameters. This process repeats trillions of times, each time adjusting a tiny bit, and gradually the parameters start to record the patterns in the data, finally forming a large model that can understand language, write code, and answer questions. So, to sum up: if you only want to describe a straight line, two parameters are enough. But if you want to describe human language, world knowledge, image structure, and complex semantics, a couple of parameters will never suffice, so the model needs hundreds of billions of parameters to store those complex patterns. The essence of a large model can be summed up in one sentence: parameters are a compressed file of the world's patterns.
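Both ideas above fit in a few lines: solving y = ax + b exactly from the two points (1, 2) and (3, 6), and then finding the same two parameters the way a model is trained, by the predict / measure-error / adjust loop. The learning rate and step count below are arbitrary choices for this toy example.

```python
# (1) Exact fit from the points (1, 2) and (3, 6).
x1, y1, x2, y2 = 1, 2, 3, 6
a_exact = (y2 - y1) / (x2 - x1)   # slope: (6 - 2) / (3 - 1) = 2
b_exact = y1 - a_exact * x1       # intercept: 2 - 2*1 = 0

# (2) The same fit found iteratively, the way training works:
# start from "random" values and nudge a, b down the error gradient.
data = [(1, 2), (3, 6)]
a, b, lr = 0.001, -0.4, 0.05      # random-ish initialization, learning rate
for _ in range(5000):
    # gradient of the mean squared error with respect to a and b
    grad_a = sum(2 * (a * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (a * x + b - y) for x, y in data) / len(data)
    a -= lr * grad_a              # adjust each parameter a tiny bit
    b -= lr * grad_b

print(a_exact, b_exact)           # prints 2.0 0.0
print(a, b)                       # converges to the same line
```

A large model differs only in scale: instead of two parameters and a line, it has hundreds of billions of parameters and a far richer function, but the repeat-a-tiny-adjustment loop is the same idea.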