A language model is a probability distribution over words or word sequences. Extracting information from textual data has changed dramatically over the past decade. As the term natural language processing has overtaken text mining as the name of the field, the methodology has changed tremendously, too. One of the main drivers of this change was the emergence of language models as a basis for many applications aiming to distill valuable insights from raw text. How can a machine produce an article that mimics a journalist?
A language model uses machine learning to derive a probability distribution over words, which it uses to predict the most likely next word in a sentence based on the words that precede it. Language models learn from text and can be used for producing original text, predicting the next word in a text, speech recognition, optical character recognition and handwriting recognition.
Put differently, a language model is a probability distribution over words or word sequences. In practice, it gives the probability of a certain word sequence being “valid.” Validity in this context does not refer to grammatical validity. Instead, it means that the sequence resembles how people write, which is what the language model learns. This is an important point. There’s no magic to a language model; like other machine learning models, particularly deep neural networks, it’s just a tool to incorporate abundant information in a concise manner that’s reusable in an out-of-sample context.
The abstract understanding of natural language, which is necessary to infer word probabilities from context, can be used for a number of tasks. Lemmatization or stemming aims to reduce a word to its most basic form, thereby dramatically decreasing the number of tokens. These algorithms work better if the part-of-speech role of the word is known: a verb’s suffixes can differ from a noun’s suffixes, hence the rationale for part-of-speech tagging (or POS-tagging), a common task for a language model.
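As a quick illustration, here is a minimal sketch of POS tagging and lemmatization using the spaCy library (the example sentence and the choice of the small English model are illustrative):

```python
# A minimal sketch of POS tagging and lemmatization with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging on their feet.")

for token in doc:
    # token.pos_ is the part-of-speech tag; token.lemma_ is the base form
    print(f"{token.text:10} {token.pos_:6} {token.lemma_}")
# e.g. "bats  NOUN  bat", "were  AUX  be", "hanging  VERB  hang"
```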
With a good language model, we can perform extractive or abstractive summarization of texts. If we have models for different languages, a machine translation system can be built easily. Less straightforward use cases include answering questions (with or without context, see the example at the end of the article). Language models can also be used for speech recognition, OCR, handwriting recognition and more. There’s a whole spectrum of opportunities.
There are two types of language models:
Probabilistic language models: A simple probabilistic language model is constructed by calculating n-gram probabilities. An n-gram is a sequence of n words, n being an integer greater than zero. An n-gram’s probability is the conditional probability that the n-gram’s last word follows a particular (n-1)-gram (the n-gram leaving out its last word): the proportion of occurrences of the last word among all occurrences of that (n-1)-gram. This concept is a Markov assumption: given the (n-1)-gram (the present), the n-gram probabilities (the future) do not depend on the (n-2)-gram, (n-3)-gram and so on (the past). There are evident drawbacks to this approach. Most importantly, only the preceding n-1 words affect the probability distribution of the next word. Complicated texts have deep context that may have decisive influence on the choice of the next word, so the next word might not be evident from the previous n words, not even if n is 20 or 50. Moreover, a later word can influence an earlier word choice: the word United is much more probable if it is followed by States of America. Let’s call this the context problem. On top of that, this approach scales poorly: as n increases, the number of possible permutations skyrockets, even though most of the permutations never occur in the text, and all the occurring probabilities (or all n-gram counts) have to be calculated and stored. In addition, non-occurring n-grams create a sparsity problem: the granularity of the probability distribution can be quite low, word probabilities take few distinct values, and therefore most words end up with the same probability.
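To make this concrete, here is a minimal sketch of a bigram (n = 2) model built from raw counts; the toy corpus is illustrative, and a real model would need far more text:

```python
# A minimal bigram (n = 2) language model built from raw counts.
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count each bigram and each unigram (the (n-1)-gram for n = 2).
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])

def next_word_distribution(word):
    """P(next | word) = count(word, next) / count(word): the Markov assumption."""
    return {w2: c / unigram_counts[word]
            for (w1, w2), c in bigram_counts.items() if w1 == word}

print(next_word_distribution("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```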
Neural network-based language models: Neural network-based language models ease the sparsity problem by the way they encode inputs. Word embedding layers map each word to a vector of arbitrary, fixed size that also incorporates semantic relationships. These continuous vectors create the much-needed granularity in the probability distribution of the next word. Moreover, the language model is a function, as all neural networks are, with lots of matrix computations, so it’s not necessary to store all n-gram counts to produce the probability distribution of the next word.
Machine learning and deep learning are sub-disciplines of AI, and deep learning is a sub-discipline of machine learning.
Both machine learning and deep learning algorithms use neural networks to ‘learn’ from huge amounts of data. These neural networks are programmatic structures modeled after the decision-making processes of the human brain. They consist of layers of interconnected nodes that extract features from the data and make predictions about what the data represents.
Machine learning and deep learning differ in the types of neural networks they use, and the amount of human intervention involved. Classic machine learning algorithms use neural networks with an input layer, one or two ‘hidden’ layers, and an output layer. Typically, these algorithms are limited to supervised learning: the data needs to be structured or labeled by human experts to enable the algorithm to extract features from the data.
Deep learning algorithms use deep neural networks: networks composed of an input layer, three or more (often hundreds of) hidden layers, and an output layer. These multiple layers enable unsupervised learning: they automate the extraction of features from large, unlabeled and unstructured data sets. Because it doesn’t require human intervention, deep learning essentially enables machine learning at scale.
The primary differences between probabilistic methods and neural network-based modern language models lie in their underlying principles, complexity, capabilities, performance, and training data requirements.
In summary, probabilistic methods offer simplicity and efficiency for less complex tasks but are limited by their inability to capture long-range dependencies and deeper language structures. Neural network-based modern language models, on the other hand, provide a much richer and more nuanced understanding of language, excelling in performance and versatility but requiring significant computational resources and large datasets.
Even though neural networks solve the sparsity problem, the context problem remains. Language models were developed, first, to solve the context problem more and more efficiently, bringing more and more context words in to influence the probability distribution. Second, the goal was to create an architecture that gives the model the ability to learn which context words are more important than others.
The first model, which I outlined previously, is a dense (or hidden) layer and an output layer stacked on top of a continuous bag-of-words (CBOW) Word2Vec model. A CBOW Word2Vec model is trained to guess a word from its context; a Skip-Gram Word2Vec model does the opposite, guessing the context from the word. In practice, training a CBOW Word2Vec model requires many examples of the following structure: the inputs are the n words before and/or after a target word, and the output is that target word. We can see that the context problem is still intact.
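As a rough sketch of that training structure (the sentence and window size are illustrative), this is the kind of (context, target) pair a CBOW model consumes:

```python
# Building (context -> target) training pairs for a CBOW-style model.
# The window size and toy sentence are illustrative choices.
tokens = "language models assign probabilities to word sequences".split()
window = 2  # n words before and after the target

pairs = []
for i, target in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(context, "->", target)
# ['models', 'assign'] -> language
# ['language', 'assign', 'probabilities'] -> models
# ['language', 'models', 'probabilities', 'to'] -> assign
```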
Recurrent neural networks (RNNs) are an improvement in this regard. Whether built from long short-term memory (LSTM) or gated recurrent unit (GRU) cells, RNNs take all previous words into account when choosing the next word. AllenNLP’s ELMo takes this notion a step further, utilizing a bidirectional LSTM that takes into account the context both before and after the word.
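As a minimal sketch written in PyTorch (the vocabulary size, dimensions, and names are illustrative assumptions, not a real system), an RNN-based next-word model looks roughly like this:

```python
# A minimal sketch of an LSTM-based next-word model in PyTorch.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)  # (batch, seq, embed_dim)
        h, _ = self.lstm(x)        # the hidden state carries all previous words
        return self.out(h)         # logits over the vocabulary at each step

model = LSTMLanguageModel()
logits = model(torch.randint(0, 1000, (1, 12)))  # a batch of one 12-token sequence
next_word_probs = logits[:, -1].softmax(dim=-1)  # distribution over the next word
```

Note the sequential bottleneck the next paragraph describes: the LSTM must process the tokens one after another to build its hidden state.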
The main drawback of RNN-based architectures stems from their sequential nature. As a consequence, training times soar for long sequences because there is no possibility for parallelization. The solution for this problem is the transformer architecture.
The GPT models from OpenAI and Google’s BERT utilize the transformer architecture, as well. These models also employ a mechanism called “Attention,” by which the model can learn which inputs deserve more attention than others in certain cases.
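The core of attention can be sketched in a few lines; this is the standard scaled dot-product formulation, with illustrative shapes (real models add multiple heads, masking, and learned projections):

```python
# Scaled dot-product attention, the core of the transformer architecture.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # a weighted mix of the values

seq_len, d_model = 5, 16
Q = K = V = np.random.randn(seq_len, d_model)  # self-attention: all from one sequence
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 16)
```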
In terms of model architecture, the main quantum leaps were, first, RNNs (specifically, LSTM and GRU cells), solving the sparsity problem and reducing the disk space language models use, and subsequently the transformer architecture, making parallelization possible and introducing attention mechanisms. But architecture is not the only aspect a language model can excel in.
Finally, I’d like to review the T5 model from Google. Previously, language models were used for standard NLP tasks, like part-of-speech (POS) tagging or machine translation, with slight modifications. With a little retraining, BERT can be a POS-tagger because of its abstract ability to understand the underlying structure of natural language.
With T5, there is no need for any modifications for NLP tasks. If it gets a text with some <M> tokens in it, it knows that those tokens are gaps to fill with the appropriate words. It can also answer questions: if it receives some context after the question, it searches the context for the answer; otherwise, it answers from its own knowledge. Fun fact: it beat its own creators in a trivia quiz.
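As an illustration, here is what span filling looks like with the Hugging Face transformers implementation of T5, where the sentinel tokens referred to above as <M> are spelled <extra_id_0>, <extra_id_1>, and so on (the choice of the t5-small checkpoint is illustrative):

```python
# Span filling with T5 via the Hugging Face transformers library.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "The <extra_id_0> walks in <extra_id_1> park."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))  # the model's guesses for the two gaps
```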
LLMs are best understood as that one know-it-all friend: eager to help and with inexhaustible patience, but more likely to bullshit you than admit "I don't know" once you pass beyond the limit of their learning. Very useful if approached with skepticism; dangerous if with faith. — Edward Snowden
LLMs don’t do formal reasoning - and that is a persistent problem. — Gary Marcus
LLMs are natural language computers - trained by natural language, programmable by natural language. The best way to program LLMs may simply be to communicate clearly and precisely.
Ashton (Teradata): "I foresee a massive productivity leap forward through GenAI, especially in technology and software. It’s getting more cost-effective to get into GenAI, and there are lots more solutions available that can help improve GenAI solutions. Conversations will gravitate to GenAI, ethics, and what it means to be human."
SuperAnnotate: "As models become more powerful, it's increasingly important to ensure they behave well when deployed and their outputs align with human preferences. The true magic of RLHF lies in its potential to combine the best of both worlds: the machine's computational prowess and the human's intuitive understanding."
Cohere (Command Models): "The Command R model enhances capabilities with a focus on retrieval-augmented generation, allowing the Command R model to provide more accurate and contextually appropriate outputs by integrating retrieval mechanisms that pull information from external databases or documents."
Google (Gemini Models): "Gemini's sophisticated reasoning capabilities make it proficient at extracting insights from vast amounts of data in fields from science to finance. Gemini excels in explaining complex reasoning, especially in subjects like math and physics."
Meta AI (Llama Models): "Llama 3 models deliver superior performance across various tasks. Designed for versatility and efficiency, they offer reliable solutions for natural language processing and complex problem-solving scenarios."
Instinctools: "Generative AI can cast its nets even wider by feeding on multiple variables and local context factors to produce detailed, localized forecasting of supply chain performance in a given environment."
Zapier: "LLMs have burst out of the lab and into the real world. The first generations of large multimodal models, which handle other input and output modalities like images, audio, and video, as well as text, are starting to be widely available."
SuperAnnotate: "Multi-task fine-tuning is about training the model on a mixed dataset so that it can improve performance on multiple tasks simultaneously, thus avoiding the issue of catastrophic forgetting."
SuperAnnotate: "Transfer learning offers a higher learning rate and accuracy after training by taking a model pre-trained on general-purpose datasets and fine-tuning it on task-specific data."
AI Magazine: "Talk around customer experiences will be dominated by AI’s impact, and we’ll see less focus on business intelligence. A resurgence in IoT discussions is expected due to AI, with near-sentient robots doing things better and faster."
LLMs are robust systems trained on vast data sets, but while general-purpose LLMs excel at diverse language tasks, they often lack the specific knowledge or behavioral capacities required for enterprise applications.
For truly useful applications of generative AI, applying industry, corporate, and personal data is where value is maximized.
In specialized fields such as finance, healthcare, or customer support, developing an AI strategy means needing a model or system that is tailored to a specific purpose. However, this is highly dependent on the nature of the data an organization is working with: it all comes down to your data.
AI's effectiveness is fundamentally tied to data, and enterprises need to assess whether their data is static (unchanging over time) or dynamic (frequently updated). To simplify this, imagine for a moment that I give you a new task to learn about a new subject or skill. You have two options: you could either teach yourself the content or outsource it to others. When we talk about fine-tuning versus RAG, it’s a similar concept.
There may be foundational, domain-specific knowledge and processes that would be helpful for a model to understand, but what if the information is dynamic and evolves quickly? In that case, simply providing it to an LLM on demand, as in the retrieval sketch below, can be more effective. Let’s take a closer look at real-world examples of these approaches being used and understand when to take each approach.
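To make the retrieval side concrete, here is a minimal sketch of retrieval-augmented generation (RAG), with a toy keyword-overlap retriever standing in for a real vector store; every name in it is illustrative:

```python
# A minimal sketch of retrieval-augmented generation (RAG).
# The keyword-overlap retriever stands in for a real vector store;
# all names here are illustrative.
documents = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping: standard delivery takes 3-5 business days.",
    "Warranty: electronics carry a one-year limited warranty.",
]

def retrieve(question, docs, k=1):
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))[:k]

def build_prompt(question):
    context = "\n".join(retrieve(question, documents))
    # Dynamic knowledge is supplied on demand instead of baked in by fine-tuning.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How many days do customers have to return items?"))
# The resulting prompt would then be sent to the LLM of your choice.
```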
Ilya Sutskever has confirmed that LLM scaling has plateaued. Compute continues to scale, but data is not keeping up, and new or synthetic data hasn’t significantly improved performance. What’s next? Like the human brain, which stopped growing in size while humanity advanced through tools and knowledge, progress will come from agents and tools built on top of LLMs. Key areas of focus include sequence-to-sequence learning, agentic behavior, and teaching self-awareness. This shift is similar to the evolution of the iPhone: after hardware improvements plateaued, the focus moved to applications. 2025 will be the year of agents.