What are Large Language Models (LLMs) actually useful for?

the porous city

In this post I'll give a basic technical overview of large language models like ChatGPT, and talk about what they're useful for today.

What are they?

Large Language Models (LLMs) are basically giant equations that take a sequence of words and predict the most likely next word. The equation is very, very large - gigabytes large - and creating it (aka training the model on existing text) can cost tens or hundreds of millions of dollars. Repeatedly applying it, feeding each predicted word back in as input, can produce surprisingly sophisticated output. Some of the details get complicated, but that's all they're doing: looking at text and predicting what text should come next, based on what was in the training data.
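To make "predict the next word, then repeat" concrete, here is a toy sketch. It is not how a real LLM works internally: it just counts which word follows which in a tiny made-up corpus (a so-called bigram model), and all the names in it (train_bigram, predict_next, generate) are invented for illustration. The loop in generate is the part that carries over: each predicted word gets appended and fed back in.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(counts, start, max_words=5):
    """Repeatedly apply the predictor, feeding each output back in as input."""
    output = [start]
    for _ in range(max_words):
        next_word = predict_next(counts, output[-1])
        if next_word is None:
            break
        output.append(next_word)
    return " ".join(output)

# Tiny made-up "training data"; a real model sees vastly more text.
corpus = "the owl hunts at night . the owl sleeps by day . the owl hunts mice ."
model = train_bigram(corpus)
print(generate(model, "the"))
```

A real LLM replaces the lookup table with a neural network that considers the whole preceding sequence rather than just the last word, but the generate-one-word-and-repeat structure is the same idea.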

LLMs vs chatbots, or, why chatbots are designed and not born

A lot of the attention has focused on chatbots like ChatGPT. Out of the box, LLMs are good at completing partial text like "The largest city in Europe is " but not as good at chatting or following instructions like "Please write a Python script to concatenate video files using ffmpeg." A base model tends to imitate its training data too literally, which can lead to exchanges like this, in which it starts imitating forum post boilerplate:

Input: Tell me something about owls.

Output 7B LLaMA: Asked by kimberly at 12:54 PM on May. 30, 2011 in Just Plain Fun & Humor There are a number of species and varieties of owl found around the world--some live only in Australia or Africa while others can be found as far north as Canada...

Making a base LLM better at following instructions or chatting is known as instruction tuning. A team at Stanford describes how they created Alpaca, an instruction-tuned chatbot based on one of Meta's LLaMA models, by feeding it 52,000 Q&A examples they generated with OpenAI's text-davinci-003 (Q: "Explain the principle of Occam's razor", A: "Occam's razor is a principle in philosophy that states ..."). This training makes the chatbot much more likely to give appropriate-seeming answers.
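As a rough sketch of what instruction-tuning data looks like before it reaches the model: each Q&A pair gets wrapped in a fixed template, so the model learns that text after the "Response" marker should answer the instruction. The template wording and the helper name build_training_example below are assumptions for illustration, not necessarily the exact format the Alpaca team used.

```python
# Hypothetical prompt template; real projects each define their own wording.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_training_example(instruction, response):
    """Wrap one Q&A pair in the template, keeping prompt and answer separate."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    return {"prompt": prompt, "completion": response}

# One pair, taken from the example quoted above (the answer is elided there too).
examples = [
    ("Explain the principle of Occam's razor",
     "Occam's razor is a principle in philosophy that states ..."),
]

dataset = [build_training_example(q, a) for q, a in examples]
print(dataset[0]["prompt"] + dataset[0]["completion"])
```

Fine-tuning then continues the usual next-word training on thousands of these formatted examples, which nudges the model away from forum boilerplate and toward direct answers.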

Alpaca lacks refinement compared to ChatGPT - it's more likely to provide inaccurate and/or biased (racist, sexist, etc.) information. OpenAI used reinforcement learning from human feedback (RLHF) to increase "alignment" - basically, they paid people in Kenya $2/hr to rate responses according to set criteria, and used those ratings to improve response quality. (The word "alignment" requires a lot of unpacking - Googling "AI alignment" can get you some pretty weird places - but it broadly means making software do things you want instead of things you don't want.) This is an important part of the process, and is expensive in terms of people's time. OpenAI can make this less expensive in the future by using feedback from users, but then has to consider whether users' ratings are consistent with the brand image OpenAI wants to have (that is, whether OpenAI's users are aligned with OpenAI).
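For a feel of how human ratings can become a training signal, here is a minimal sketch of one common formulation: turn ratings into "this response beat that one" pairs, then penalize a scoring model whenever it ranks the pair the wrong way round. I'm not claiming this is OpenAI's exact pipeline; the function names, the example responses and the scores are all made up.

```python
import math
from itertools import combinations

def ratings_to_pairs(rated):
    """Turn (response, score) ratings for one prompt into (preferred, rejected) pairs."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(rated, 2):
        if score_a > score_b:
            pairs.append((resp_a, resp_b))
        elif score_b > score_a:
            pairs.append((resp_b, resp_a))
        # Equal scores carry no preference, so they are skipped.
    return pairs

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: small when the preferred response scores higher."""
    return math.log(1 + math.exp(-(reward_preferred - reward_rejected)))

# Made-up ratings from a human rater for two candidate responses.
rated = [("direct answer about owls", 5), ("forum post boilerplate", 1)]
print(ratings_to_pairs(rated))

# The loss is low if the scoring model agrees with the rater, high if it disagrees.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

A scoring model trained this way can then stand in for the human raters, which is part of why the human ratings are so valuable: they only have to be collected once.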

I'm going into so much detail here to make the point that chatbots are designed; they don't just emerge from the training data. The people building them have a lot of explicit goals for how they should answer and how they shouldn't. Choices here will make the chatbot better at some things and worse at others - better design and better implementation of the design will be a major area of competition for the foreseeable future.

Will AI increase or decrease centralization?

As I mentioned, training an LLM can be very expensive. But unlike something like Google search, which depends on petabytes of data and a tremendously powerful software stack to keep it up to date and query it efficiently, an LLM is relatively simple: just one long equation. And the equation is small enough that you can run some LLMs on your local machine, even if it's a smartphone. In the parlance of LLMs, inference (using a model) is incredibly cheap compared to training (creating a model).
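Some back-of-the-envelope arithmetic shows why "gigabytes large" can still fit on a laptop or phone. The sketch below only counts storage for the weights themselves, and assumes a 7-billion-parameter model (like the 7B LLaMA mentioned earlier) stored at different precisions. The lower-precision options are an assumption on my part about how people shrink models; they usually trade away some output quality.

```python
def model_size_gb(num_parameters, bits_per_parameter):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: {model_size_gb(7e9, bits):.1f} GB")
# prints:
# 7B parameters at 16-bit: 14.0 GB
# 7B parameters at 8-bit: 7.0 GB
# 7B parameters at 4-bit: 3.5 GB
```

Actually running the model needs some extra memory on top of this, so treat these numbers as a lower bound.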

The idea of running LLMs locally is tremendously appealing. If you're building a business, why pay for API access and risk having the price go up and wreck your economics? Why pay someone to maintain a rack of servers, employ software engineers and baristas, when you can just download a bunch of model weights and run them locally? Why watch usage quotas when you can develop on your own machine and just pay for electricity?

The fact that LLMs are relatively small and cheap to run, combined with the importance of design and fine-tuning, means that there are two scenarios for how they impact centralization (and a whole spectrum in between):

1. The magic of LLMs is in fine-tuning. A thousand flowers bloom as startups design custom LLMs for every use case under the sun, and the tech industry becomes less centralized.

2. Having up-to-date information from the Internet built into an LLM turns out to be a critical competitive advantage. Doing this means using Googlebot or similar to constantly index the web, and then applying model fine-tuning - this would be so incredibly expensive that only a tech giant could do it, but the benefits are so large it will probably happen. Everyone ends up paying an LLM tax to Google (or Microsoft). Centralization stays the same or increases.

Open-source LLMs that any developer can build on (also known as LLMs' Stable Diffusion moment) are going to unleash a lot of new stuff, some good, some bad. The bad scenarios can get panic-inducing pretty quick. In the meantime though, those of us trying to get quality results out of a local model (presumably with innocent motives) face challenges that I'll discuss in the next section.

What are they useful for?

This is the big open question. There are many, many, many examples of people doing fun things with LLMs or coaxing chatbots into weirder and weirder behavior.

However, it's less clear what the big, world-changing products will be. Programming looks to be one - Microsoft continues to invest in GitHub Copilot, and even more convincingly there are plenty of detailed personal walkthroughs of how LLMs can improve workflows for engineers. The success of LLMs in programming is sort of overdetermined: not only are programmers the best placed to integrate new tools into their workflows, but code also obeys very strict rules that make it easy for LLMs to predict and write.

Microsoft has also announced LLM-powered features to roll out throughout Office, with Google quick on their heels, as well as big players in other spaces like Adobe. LLMs as a sometimes-used feature, rather than a product, are an easy sell.

There are also a thousand and one startups offering AI chatbots trained on your company's internal data and documents, like Dashworks. In my limited experience, results here are often fine and sometimes magical, especially when the LLM is able to synthesize an answer from multiple data sources. It will also be wrong sometimes, and when it's wrong in non-obvious ways and someone doesn't have time to check the answer they're getting back, that can be dangerous. This is usually mitigated by linking back to the original sources, but it would be better to give users a sense of how confident the LLM is in its answer, and I haven't seen that yet.

The basic principle so far seems to be that anything that keeps a human in the loop tends to work well. The Copilot model for programming does this, and so do image generation AIs like Stable Diffusion. That means the AI isn't doing a ton of work independently, and its output still needs editing by an expert, but it can be a timesaver.

However, there are also startups like Tome claiming very high accuracy rates in very specific domains, without having a human in the loop. (In this case, the LLM is supposed to review certain types of contracts instead of a lawyer - so a human will look at the results, but if they’re not a lawyer, they won’t know if the LLM missed something.) It might be that if you focus on a specific enough problem and do a good enough job at fine-tuning, the human in the loop isn’t necessary.

One prediction I'll make is a lot more services feeding your life history back to you. I tried feeding ChatGPT emails I exchanged with friends over 20 years ago and asking questions about them. ChatGPT's summaries of my correspondence, written in its generic style, sometimes hit like a ton of bricks: "It appears Lukas and A were communicating about a variety of topics. They were discussing a mutual friend, B, who had attempted to commit suicide and had been diagnosed with multiple personality disorder ..."

After summer comes winter

Given all this, "thin wrapper around ChatGPT" will probably not be a winning business model long-term. I'm not convinced that most of the startups rapidly launching LLM-based apps have figured out how to build robust workflows out of unreliable LLMs. Solutions will likely involve deep workflow integration and/or a lot of fine-tuning. The trough of disillusionment will be deep.

Elsewhere

I recorded a podcast with some friends covering some of the same territory covered here.

Some caveats

This post anthropomorphizes LLMs by implying they have intentions. This is unfortunate but makes the language easier to follow.

While the general principles here should stay valid for a while, the details about what is and isn't currently possible will change in probably less than a day as nerds worldwide crank on a caffeine-fueled soft takeoff.

last modified: 17:31:33 28-Mar-2023
in categories:Tech/AI
