Beginner's Mind


Deducing LLM Architecture from First Principles

The goal is simple:

Given existing tokens, move the last token vector closer to the vector for the correct next token.

For example:

"the cat sat" -> likely next token: "on"

We start with two assumptions:

First, every token is represented as a vector in a high-dimensional space.

$$ \text{"the"} \to x_{\text{the}}, \quad \text{"cat"} \to x_{\text{cat}}, \quad \text{"sat"} \to x_{\text{sat}} $$

Second, prediction is done by scoring the current state against possible next-token vectors.

Each possible next token has a vector in the model’s vocabulary space. Comparing the current state with each token vector produces one logit per token.

For "the cat sat", the model transforms the vector at "sat" into a state closer to the vector for "on" than to other possible next tokens.

The simplest way to get a scalar score from two vectors is a dot product:

$$ \mathrm{score}(\text{next token}) = h_{\text{sat}} \cdot x_{\text{next token}} $$

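The dot product measures vector alignment: it grows when two vectors point in similar directions (and when they are long), so it works as a score for how close or compatible two vectors are in the learned space.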
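To make the scoring step concrete, here is a minimal sketch in plain Python. The 2-D vectors, the three-word vocabulary, and the state `h_sat` are all invented for illustration; a real model uses thousands of dimensions and learned values.

```python
# Toy vocabulary vectors (made-up numbers, for illustration only).
vocab = {
    "on": [1.0, 0.5],
    "dog": [-0.5, 1.0],
    "the": [0.0, -1.0],
}

# Hypothetical final state at "sat" after the model has transformed it.
h_sat = [0.9, 0.4]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# One logit per vocabulary token: score(token) = h_sat . x_token
logits = {token: dot(h_sat, vec) for token, vec in vocab.items()}
best = max(logits, key=logits.get)

print(logits)
print(best)
```

With these toy numbers, the state at "sat" aligns best with the vector for "on", so "on" gets the largest logit.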

So instead of saying the model “understands” language, we begin with a more mechanical goal:

Move token states around in the model’s embedding space until the final state is closer to the correct next-token vector.

1. Start With the Most Direct Transformation

Now we need a function that turns the existing token vectors into a new state for the last token.

The most direct form is:

$$ h_{\text{sat}} = F(x_{\text{the}}, x_{\text{cat}}, x_{\text{sat}}) $$

This says:

Look at the whole context, then produce a better vector for "sat".

One concrete way to implement $F$ is to concatenate the vectors and pass them through a nonlinear function:

$$ h_{\text{sat}} = \mathrm{MLP}([x_{\text{the}}; x_{\text{cat}}; x_{\text{sat}}]) $$

This function can directly learn:

the + cat + sat -> on

This makes sense. An MLP is a universal approximator: with enough width, it can approximate very general feature combinations by repeatedly folding and reshaping the vector space. For a deeper visual explanation of what an MLP is, I recommend this video.

For example, a nonlinear MLP can learn combinations like:

animal-like subject + sitting action -> likely location/surface preposition

So far, this is a reasonable first attempt. It uses all the information we have and produces the kind of vector we want.

But it has a structural problem.

It treats each input position as a fixed slot:

slot 1 = the
slot 2 = cat
slot 3 = sat

If the same phrase appears later:

yesterday the cat sat

now the pattern appears at different slots:

slot 2 = the
slot 3 = cat
slot 4 = sat

A plain whole-context MLP has to relearn the same pattern for many positions.
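The slot problem can be seen without training anything. The sketch below is only an illustration: the embeddings are made up, and the `<pad>` token and the fixed count of 4 slots are my own assumptions for giving both phrases the same input length.

```python
# Toy 2-D embeddings (invented values).
emb = {
    "yesterday": [0.3, 0.3],
    "the": [0.0, -1.0],
    "cat": [1.0, 0.2],
    "sat": [0.9, 0.4],
    "<pad>": [0.0, 0.0],  # hypothetical padding token
}


def concat(tokens, slots=4):
    """Concatenate token vectors into one flat input, padding unused slots."""
    padded = tokens + ["<pad>"] * (slots - len(tokens))
    return [value for token in padded for value in emb[token]]


a = concat(["the", "cat", "sat"])
b = concat(["yesterday", "the", "cat", "sat"])

# The features of "cat" land at different input indices in the two cases,
# so weights that learned "cat in slot 2" do not apply to "cat in slot 3".
print(a[2:4], b[4:6])
```

The same word occupies different input coordinates, so a concatenation MLP sees two unrelated inputs.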

So the first pressure is:

We need reusable computation across positions.

2. Make the Computation Reusable

A natural fix is to apply the same function at every position, like a local pattern detector:

$$ f(x_i, x_{i+1}) $$

Now the same detector can recognize "cat sat" wherever it appears.
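Here is a small sketch of that shared detector. The weights `w_left` and `w_right`, the ReLU, and the embeddings are assumptions chosen for illustration, not learned values.

```python
# Toy 2-D embeddings (invented values).
emb = {
    "yesterday": [0.3, 0.3],
    "the": [0.0, -1.0],
    "cat": [1.0, 0.2],
    "sat": [0.9, 0.4],
}

# One shared set of detector weights, reused at every position.
w_left = [1.0, 0.0]
w_right = [0.0, 1.0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def detector(x_left, x_right):
    """f(x_i, x_{i+1}): a tiny shared filter followed by a ReLU."""
    return max(0.0, dot(w_left, x_left) + dot(w_right, x_right))


def scan(tokens):
    """Apply the same detector to every adjacent pair of tokens."""
    return [detector(emb[a], emb[b]) for a, b in zip(tokens, tokens[1:])]


short = scan(["the", "cat", "sat"])
longer = scan(["yesterday", "the", "cat", "sat"])
print(short)
print(longer)
```

The response to "cat sat" is identical in both sentences, even though the pair sits at a different position in each.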

This is the same basic principle behind CNNs: use shared local filters so the same feature detector can work across different positions. In images, this is useful because many important features start as local patterns such as edges, corners, and textures.

For language, local sharing is a good step, but fixed windows create a new problem.

If the window size is 2, we miss 3-token patterns:

New York City
as soon as
not only but

If the window size is 3, we still miss longer patterns.

Language has dependencies at many scales:

  • 2-token phrase
  • 3-token idiom
  • sentence-level relation
  • paragraph-level topic
  • long-range reference

So the next pressure is:

Each token needs access to any other relevant token, not just a fixed local window.

3. Let Every Token Contribute to the Update

Return to the example:

the cat sat

We care about updating the "sat" vector so it predicts "on".

Instead of one giant function:

$$ h_{\text{sat}} = F(x_{\text{the}}, x_{\text{cat}}, x_{\text{sat}}) $$

we decompose the update:

$$ \begin{aligned} h_{\text{sat}} = x_{\text{sat}} &+ \mathrm{contribution}(\text{the} \to \text{sat}) \\ &+ \mathrm{contribution}(\text{cat} \to \text{sat}) \\ &+ \mathrm{contribution}(\text{sat} \to \text{sat}) \end{aligned} $$

More generally:

$$ h_i = x_i + \sum_j \mathrm{contribution}(j \to i) $$

This means:

Each token updates itself by receiving learned influence from other tokens.

This is the first major structural step.

4. Separate What to Read From How Much to Read

Now we need to decide what each source token contributes to the target token.

The simplest useful factorization is:

contribution(source -> target)
=
what the source sends
*
how much the target uses it

So:

$$ \mathrm{contribution}(j \to i) = r(i, j) m_j $$

where:

$$ \begin{aligned} r(i, j) &= \text{a scalar saying how much token } j \text{ contributes to token } i \\ m_j &= \text{the vector representation sent by token } j \end{aligned} $$

Then:

$$ h_i = x_i + \sum_j r(i, j)m_j $$

For "sat":

$$ \begin{aligned} h_{\text{sat}} = x_{\text{sat}} &+ r(\text{sat}, \text{the})m_{\text{the}} \\ &+ r(\text{sat}, \text{cat})m_{\text{cat}} \\ &+ r(\text{sat}, \text{sat})m_{\text{sat}} \end{aligned} $$

This is natural:

Some tokens matter more than others. The source token sends a vector of information. The target decides how much to receive.

5. Turn Scores Into Stable Weights

Now we need a stable way to choose how much each source token contributes.

If every token can add an arbitrary amount, long sequences can create noisy or unstable sums. The simplest fix is to first compute raw contribution scores:

$$ s(i, j) $$

Then normalize those scores with a softmax:

$$ r(i, j) = \mathrm{softmax}_j(s(i, j)) $$

Now the contribution weights compete:

$$ \sum_j r(i, j) = 1 $$

So each token receives a weighted mixture of source messages, with the total weight kept under control.

$$ h_i = x_i + \sum_j \mathrm{softmax}_j(s(i, j))m_j $$
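A minimal sketch of this update for the "sat" position, in plain Python. The raw scores and message vectors are made-up numbers; in the model they come from learned matrices introduced in the next sections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


x_sat = [0.9, 0.4]
messages = [[0.0, -1.0], [1.0, 0.2], [0.9, 0.4]]  # m_the, m_cat, m_sat (toy)
raw_scores = [0.1, 2.0, 1.0]  # s(sat, the), s(sat, cat), s(sat, sat) (toy)

weights = softmax(raw_scores)

# Weighted mixture of the messages, then added to the original vector.
mixture = [sum(w * m[k] for w, m in zip(weights, messages)) for k in range(2)]
h_sat = [x + u for x, u in zip(x_sat, mixture)]

print(weights, sum(weights))
print(h_sat)
```

Because the weights sum to 1, the mixture stays inside the range spanned by the messages no matter how long the sequence is.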

6. Produce the Message Vector

Now we need to define the vector that token $j$ sends into the mixture.

What vector does token $j$ send?

The most direct answer is to send its current position in the embedding space:

$$ m_j = x_j $$

But then the token has no separate message form. The same vector must be both the state it keeps and the information it sends.

The smallest natural upgrade is to produce a new vector from $x_j$ before sending it. In a neural network, the simplest way to do that is a learned matrix:

$$ m_j = Mx_j $$

Imagine all these vectors living in the same high-dimensional space. In the movement view, $M$ moves $x_j$ to a learned position used for mixing.

Within one layer, the same $M$ is reused for every source token:

$$ Mx_{\text{the}}, \quad Mx_{\text{cat}}, \quad Mx_{\text{sat}} $$

This shared matrix might learn a common feature such as “noun-like token,” “semantic object,” or “grammatical role.” That is only an example; whether the learned feature is human-interpretable is a separate question.

The matrix is not manually told what feature to produce. It learns useful movements because training minimizes next-token prediction error. Different layers can learn different movements.

$$ m_{\text{cat}} = Mx_{\text{cat}} $$

This gives:

$$ h_i = x_i + \sum_j r(i, j)Mx_j $$

7. Produce Scores Against All Source Vectors

Now we need scalar scores that say how much "sat" takes from each source token:

$$ s(\text{sat}, \text{the}), \quad s(\text{sat}, \text{cat}), \quad s(\text{sat}, \text{sat}) $$

These are raw scores. After softmax, they become the weights used in the mixture.

The natural way to get a scalar score from two vectors is a dot product. But using the original embedding-space positions directly is too static:

The same $x_{\text{sat}}$ and $x_{\text{cat}}$ would produce the same score every time. So we learn new comparison positions first.

For the "sat" target, first transform "sat":

$$ a_{\text{sat}} = Ax_{\text{sat}} $$

Then apply a second learned matrix to all source tokens:

$$ \begin{aligned} b_{\text{the}} &= Bx_{\text{the}} \\ b_{\text{cat}} &= Bx_{\text{cat}} \\ b_{\text{sat}} &= Bx_{\text{sat}} \end{aligned} $$

Now we have one transformed "sat" vector and three transformed source vectors. Dot them to get three scores:

$$ \begin{aligned} s(\text{sat}, \text{the}) &= a_{\text{sat}} \cdot b_{\text{the}} \\ s(\text{sat}, \text{cat}) &= a_{\text{sat}} \cdot b_{\text{cat}} \\ s(\text{sat}, \text{sat}) &= a_{\text{sat}} \cdot b_{\text{sat}} \end{aligned} $$

The same $B$ is reused for every source token. This keeps the comparison rule shared across positions, while the outputs still differ because $x_{\text{the}}$, $x_{\text{cat}}$, and $x_{\text{sat}}$ differ.

In the full layer, this same process happens for every target position in parallel. We focus on "sat" because that is the position used to predict the next token in this example.

After softmax:

$$ r(\text{sat}, j) = \mathrm{softmax}_j(s(\text{sat}, j)) $$

the scores become mixing weights. Then "sat" receives a weighted mixture of the message vectors:

$$ h_{\text{sat}} = x_{\text{sat}} + \sum_j r(\text{sat}, j)m_j $$
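Putting sections 3 to 7 together, a single score-and-mix step can be sketched as below. This is an illustration under assumptions: the matrices are random rather than trained, the dimension `d = 4` is arbitrary, and position information is not included yet.

```python
import math
import random

random.seed(0)
d = 4  # toy embedding size (assumption)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Random stand-ins for the learned matrices A, B, and M.
A, B, M = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]  # the, cat, sat


def score_and_mix(X, i):
    """Compute h_i = x_i + sum_j softmax_j(a_i . b_j) * M x_j."""
    a_i = matvec(A, X[i])
    scores = [dot(a_i, matvec(B, x_j)) for x_j in X]
    r = softmax(scores)
    messages = [matvec(M, x_j) for x_j in X]
    update = [sum(r[j] * messages[j][k] for j in range(len(X))) for k in range(d)]
    return [x + u for x, u in zip(X[i], update)], r


h_sat, r_sat = score_and_mix(X, 2)  # index 2 is the "sat" position
print(r_sat)
print(h_sat)
```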

So far, we need three learned matrices:

  • $A$ matrix: applied to every target position. In this example, we looked at the "sat" target.
  • $B$ matrix: applied to every source position so sources can be scored against each target.
  • $M$ matrix: transforms all source tokens into the vectors used for mixing.

$A$ and $B$ together produce the scalar scores by dot product.

8. Add Position to the Scores

Our mixing rule now knows which tokens exist, but it does not yet know where they are.

For "the cat sat", when "sat" is the target, we compare it against:

sat vs the
sat vs cat
sat vs sat

But without position, the score only sees token identity. If we keep "sat" fixed and swap the previous tokens:

the cat sat
cat the sat

the same source vectors get transformed, scored, and mixed. The order changed, but the set of vectors did not.
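This order-blindness is easy to check numerically. In the sketch below, the embeddings and the matrices `A`, `B`, `M` are random stand-ins (assumptions, not trained weights); the point is only that swapping the earlier tokens leaves the update for "sat" unchanged.

```python
import math
import random

random.seed(1)
d = 4
emb = {t: [random.gauss(0.0, 1.0) for _ in range(d)] for t in ("the", "cat", "sat")}


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


A, B, M = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)


def update_last(tokens):
    """Score-and-mix for the last token, with no position information."""
    X = [emb[t] for t in tokens]
    a = matvec(A, X[-1])
    r = softmax([dot(a, matvec(B, x_j)) for x_j in X])
    msgs = [matvec(M, x_j) for x_j in X]
    mix = [sum(r[j] * msgs[j][k] for j in range(len(X))) for k in range(d)]
    return [x + u for x, u in zip(X[-1], mix)]


h1 = update_last(["the", "cat", "sat"])
h2 = update_last(["cat", "the", "sat"])
gap = max(abs(p - q) for p, q in zip(h1, h2))
print(gap)
```

The gap is zero up to floating-point rounding: a sum over sources does not care about the order of its terms.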

The next pressure is:

The scalar scores need to depend on order, not just token identity.

The simplest first idea is absolute position information. Give each slot a learned position signal:

position 1, position 2, position 3, ...

Then the score can use both content and absolute slot:

$$ \mathrm{score}(i, j) = a_i \cdot b_j + \mathrm{bias}(i, j) $$

This works, but it is not ideal. It makes the model learn position-specific patterns:

"cat" at position 2 before "sat" at position 3
"cat" at position 50 before "sat" at position 51

These are the same relative relation, but absolute positions make them look different.

What we really want is relative position:

how far is token j from token i?

So the goal is clear:

Make the scalar score depend on content and relative position.

The constraints are also clear:

  • keep the final operation as a dot product
  • avoid learning a separate pattern for every absolute slot
  • preserve reusable structure across positions

At this point, $a_i$ and $b_j$ are just two vectors. Their dot product measures alignment:

$$ a_i \cdot b_j $$

To make that alignment position-aware while keeping the dot product, we can change the vectors before the dot product:

$$ \mathrm{score}(i, j) = (P_i a_i) \cdot (P_j b_j) $$

Now position enters through $P_i$ and $P_j$.

RoPE chooses $P$ to be a rotation. In 2D, rotating a vector counterclockwise by angle $\phi$ is:

$$ R(\phi) \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x\cos\phi - y\sin\phi \\ x\sin\phi + y\cos\phi \end{bmatrix} $$

RoPE uses the position to choose the angle. For position $i$, use angle $i\theta$:

$$ R_i = \begin{bmatrix} \cos(i\theta) & -\sin(i\theta) \\ \sin(i\theta) & \cos(i\theta) \end{bmatrix} $$

For a tiny example, if $a = [1, 0]$ and $\theta = 30^\circ$:

  • at position 0, $R_0a = [1, 0]$
  • at position 1, $R_1a = [\cos 30^\circ, \sin 30^\circ]$
  • at position 2, $R_2a = [\cos 60^\circ, \sin 60^\circ]$

The vector length stays the same; only the angle changes.
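The tiny example above can be reproduced in a few lines. This is a plain sketch of the 2D rotation formula; the helper name `rotate` is my own.

```python
import math


def rotate(v, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle in degrees."""
    phi = math.radians(degrees)
    x, y = v
    return [
        x * math.cos(phi) - y * math.sin(phi),
        x * math.sin(phi) + y * math.cos(phi),
    ]


a = [1.0, 0.0]
theta = 30.0  # degrees per position, as in the toy example

for position in range(3):
    rotated = rotate(a, position * theta)
    # The length printed last stays at 1 (up to rounding) for every position.
    print(position, rotated, math.hypot(*rotated))
```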

Now apply this before the dot product:

$$ \mathrm{score}(i, j) = (R_i a_i) \cdot (R_j b_j) $$

Then:

$$ (R_i a_i) \cdot (R_j b_j) = a_i^\top R_i^\top R_j b_j = a_i^\top R_{j-i} b_j $$

That $R_{j-i}$ term is the reason RoPE works: after rotating by absolute positions, the dot product depends on the relative difference.
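A quick numerical check of this identity, as a sketch with made-up 2D vectors `a` and `b` and the toy angle of 30 degrees per position:

```python
import math


def rotate(v, phi):
    """Rotate a 2-D vector counterclockwise by phi radians."""
    x, y = v
    return [
        x * math.cos(phi) - y * math.sin(phi),
        x * math.sin(phi) + y * math.cos(phi),
    ]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


theta = math.radians(30)
a = [0.7, -0.2]  # toy target-side vector
b = [0.1, 0.9]   # toy source-side vector


def score(i, j):
    """(R_i a) . (R_j b): rotate each side by its own absolute position."""
    return dot(rotate(a, i * theta), rotate(b, j * theta))


early = score(3, 2)    # target at position 3, source at position 2
late = score(51, 50)   # same offset j - i = -1, much later in the sequence
relative_only = dot(a, rotate(b, -1 * theta))  # a^T R_{j-i} b with j - i = -1

print(early, late, relative_only)
```

All three numbers agree up to rounding: only the offset between the two positions survives in the score.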

For a concrete example, let "sat" be the target at position $i = 3$, and "cat" be the source at position $j = 2$. Then:

$$ j - i = 2 - 3 = -1 $$

If $\theta = 30^\circ$, the relative rotation is $-30^\circ$:

$$ R_{-1} = \begin{bmatrix} \cos(-30^\circ) & -\sin(-30^\circ) \\ \sin(-30^\circ) & \cos(-30^\circ) \end{bmatrix} $$

So previous tokens naturally produce negative relative rotations under this convention.

Rotations also wrap around. With the same toy $\theta = 30^\circ$:

j - i =  -1 -> -30 degrees
j - i = -12 -> -360 degrees -> same direction as 0 degrees
j - i = -13 -> -390 degrees -> same direction as -30 degrees

So one frequency alone eventually repeats. RoPE avoids relying on one frequency by using many frequencies across different dimension pairs.

The real vector is high-dimensional, not 2D. A rotation is always easiest to describe inside a plane. In 2D, the whole space is one plane. In 3D, a rotation happens around an axis, which means points rotate inside the plane perpendicular to that axis. In higher dimensions, there is no single global angle for the whole space, so RoPE uses the clean version: split the vector into pairs of dimensions and rotate each 2D plane independently.

Each pair gets its own frequency:

$$ \theta_k = 10000^{-2k/d} $$

where $d$ is the vector dimension and $k$ indexes the dimension pair. The number $10000$ is a fixed base that spreads the frequencies across many scales. It is not learned; it is a design choice inherited from sinusoidal positional encodings. A larger base makes the slowest rotations change more gradually across long distances.

For example, if the rotated dimension is $d = 128$:

k = 0   -> theta = 1.0 radians  ~= 57.3 degrees per position
k = 16  -> theta = 0.1 radians  ~= 5.7 degrees per position
k = 32  -> theta = 0.01 radians ~= 0.57 degrees per position

So for the same offset $j-i = -1$:

k = 0   -> relative rotation is -1.0 radians  ~= -57.3 degrees
k = 16  -> relative rotation is -0.1 radians  ~= -5.7 degrees
k = 32  -> relative rotation is -0.01 radians ~= -0.57 degrees

As $k$ increases, $\theta_k$ gets smaller. That means early dimension pairs rotate quickly, while later dimension pairs rotate slowly.
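The numbers in the table follow directly from the formula. A minimal sketch (the function name `rope_theta` is hypothetical):

```python
import math


def rope_theta(k, d, base=10000.0):
    """Rotation angle per position (in radians) for dimension pair k."""
    return base ** (-2.0 * k / d)


for k in (0, 16, 32):
    theta = rope_theta(k, 128)
    print(k, theta, math.degrees(theta))
```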

For a larger rotated dimension, such as $d = 4096$, the same formula spreads the frequencies more finely:

k = 0     -> theta ~= 1.0 radians    ~= 57.3 degrees
k = 512   -> theta ~= 0.1 radians    ~= 5.7 degrees
k = 1024  -> theta ~= 0.01 radians   ~= 0.57 degrees
k = 2047  -> theta ~= 0.0001 radians ~= 0.0057 degrees

So the same relative offset leaves a different pattern across different dimension pairs. Fast rotations capture short-distance changes; slow rotations avoid wrapping too quickly and help with longer distances.
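Here is a sketch of the multi-frequency version on an 8-dimensional toy vector. Pairing adjacent dimensions (0 with 1, 2 with 3, and so on) is an assumption for readability; real implementations differ in which dimensions they pair, but the relative-offset property is the same.

```python
import math
import random


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by its own frequency."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        angle = position * base ** (-2.0 * k / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(8)]  # toy target-side vector
b = [random.gauss(0.0, 1.0) for _ in range(8)]  # toy source-side vector

near_start = dot(rope(a, 3), rope(b, 2))
far_along = dot(rope(a, 503), rope(b, 502))
print(near_start, far_along)
```

Both scores use the offset $j - i = -1$, so they match up to rounding even though the absolute positions are 500 apart.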

This also answers a natural “why not?” question.

If every dimension pair used the same $\theta$, position would be represented at only one scale, and the rotation would eventually wrap around. If the two sides used unrelated rotations, the identity that produces the clean relative-position term would no longer hold:

$$ R_i^\top R_j = R_{j-i} $$

So the design is conservative:

  • use the same rotation rule on both sides
  • use many frequencies across dimension pairs
  • keep the final score as a dot product

This rotation changes the vectors before scoring, but that is fine. The $A$ and $B$ matrices are learned together with this positional mechanism, so training learns vector positions that work well after rotation.

RoPE is not the only possible solution. Other choices include:

  • absolute position embeddings
  • learned relative position bias
  • distance-based biases such as ALiBi

RoPE is attractive because it satisfies the goal and constraints cleanly: the score remains a dot product, the position effect is relative, and the same relative offset can be reused at different absolute positions.

So "cat" immediately before "sat" can reuse the same positional structure whether it appears near the beginning or much later.

same relative offset -> same positional structure
different token states -> different scores

9. Learn Several Score-and-Mix Patterns in Parallel

So far, for each layer, we have one set of learned matrices:

$$ \begin{aligned} A &: \text{applied to every target position} \\ B &: \text{applied to every source position} \\ M &: \text{applied to every source position to produce vectors for mixing} \end{aligned} $$

RoPE modifies the score vectors before the dot product, so the scalar scores also depend on relative position.

This gives one score-and-mix pattern per layer. For a target token like "sat", the source message vectors are weighted and mixed into one new vector for that target. That mixed vector can then go through an MLP, where different dimensions can be combined, mixed, and folded into new features.

Then we can stack many layers:

score-and-mix -> MLP -> score-and-mix -> MLP -> ...

More layers give more sequential opportunities to learn new patterns.

But only scaling depth is rigid and expensive. The computation has to happen step by step through the layer stack.

That is useful, but it creates a natural question:

Why learn only one pattern per layer?

The same token may need several different kinds of information at the same time:

  • subject
  • verb
  • nearby syntax
  • formatting
  • long-range reference
  • topic

With one score-and-mix pattern, all of these compete inside the same set of scores.

If we want many independent patterns inside the same layer, we need more independent matrix sets.

One way to do that is to send the same input vector through several smaller learned transition spaces.

For example, suppose the model state has 2048 dimensions. Instead of using one big projection into another 2048-dimensional space, we can use several smaller projections:

2048-dimensional vector -> smaller transition space 1
2048-dimensional vector -> smaller transition space 2
2048-dimensional vector -> smaller transition space 3
...

Each smaller transition space has its own $A$, $B$, and $M$:

transition space 1: A_1, B_1, M_1
transition space 2: A_2, B_2, M_2
transition space 3: A_3, B_3, M_3
...

The original vector is not manually sliced apart. Each matrix projection collapses or moves the same full vector into a smaller learned space. Each space can learn a different pattern and produce its own smaller mixed vector.

Now we have several smaller vectors, not one final embedding-space vector. We could concatenate them directly, but concatenation only places them side by side. It does not let the model blend the transition spaces or decide how the pieces fit back into the normal embedding space.

So we concatenate the results and apply one more learned matrix:

$$ \mathrm{update}_i = O[\mathrm{update}_i^1; \mathrm{update}_i^2; \ldots; \mathrm{update}_i^H] $$

The output matrix $O$ projects the combined sub-vectors back into the model’s embedding space. It also lets information from different transition spaces mix after they have learned their own patterns.
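A minimal sketch of this multi-head step, under stated assumptions: random matrices instead of trained ones, a toy model size of 8 with 2 heads, and no RoPE, to keep the example short.

```python
import math
import random

random.seed(0)
d_model, num_heads = 8, 2
d_head = d_model // num_heads  # each head works in a smaller space


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One independent A/B/M set per head; each maps d_model -> d_head.
heads = [
    {
        "A": rand_matrix(d_head, d_model),
        "B": rand_matrix(d_head, d_model),
        "M": rand_matrix(d_head, d_model),
    }
    for _ in range(num_heads)
]
O = rand_matrix(d_model, num_heads * d_head)  # maps back to the model space
X = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]


def head_update(head, X, i):
    """One head's mixed vector for target position i (size d_head)."""
    a_i = matvec(head["A"], X[i])
    r = softmax([dot(a_i, matvec(head["B"], x_j)) for x_j in X])
    msgs = [matvec(head["M"], x_j) for x_j in X]
    return [sum(r[j] * msgs[j][k] for j in range(len(X))) for k in range(d_head)]


def multi_head_update(X, i):
    pieces = [head_update(head, X, i) for head in heads]
    concatenated = [value for piece in pieces for value in piece]
    return matvec(O, concatenated), concatenated


update_sat, concatenated = multi_head_update(X, 2)
h_sat = [x + u for x, u in zip(X[2], update_sat)]
print(len(concatenated), len(update_sat))
```

Each head sees the full input vector, produces its own small mixed vector, and $O$ blends the pieces back into one update of the original size.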

The point is not that each smaller transition space is more powerful. The point is that the layer gets several independent matrix sets, so it can learn several score-and-mix patterns in parallel.

For example:

100 layers, 1 pattern per layer  -> 100 sequential score-and-mix patterns
100 layers, 8 patterns per layer -> 800 parallel/sequential opportunities

Instead of making one large score-and-mix system do everything, the model uses several smaller systems to learn different useful patterns at the same time.

This is the idea called multi-head attention. One independent $A/B/M$ set is one head.

My intuition is that this became so central to modern scaled LLMs because it combines two things that rarely come together: the model gets many learned patterns in the same layer, and hardware can compute those patterns efficiently in parallel with large matrix operations.

But this is not the only possible design. From first principles, the goal is:

Add more learned patterns inside a layer without simply making everything bigger.

The constraints are:

  • keep compute manageable
  • keep parameter growth controlled
  • preserve the normal embedding-space interface

Multi-head attention is one design that satisfies these constraints.


Personal thought: other ways to add patterns

This is a deviation from the main Transformer path, but it is useful for understanding the design space.

One idea is a low-rank adapter.

A normal matrix update would require learning a full new matrix. A low-rank adapter instead adds a small learned update beside an existing matrix:

$$ W' = W + UV $$

where $U$ and $V$ are much smaller than $W$.

More precisely, their outer dimensions match $W$, but their shared inner dimension is small. If:

$$ W \in \mathbb{R}^{2048 \times 2048} $$

then a rank-8 adapter uses:

$$ U \in \mathbb{R}^{2048 \times 8}, \quad V \in \mathbb{R}^{8 \times 2048} $$

so:

$$ UV \in \mathbb{R}^{2048 \times 2048} $$

For example, if $W$ maps 2048 dimensions to 2048 dimensions, then a full extra matrix has:

$$ 2048 \times 2048 \approx 4.2\text{ million parameters} $$

A rank-8 adapter can use:

$$ 2048 \times 8 + 8 \times 2048 \approx 33\text{ thousand parameters} $$

So $UV$ adds a small learned movement to the original matrix. It can capture an extra pattern or direction without replacing the full transformation.

Using our $A/B/M$ notation, we could add a low-rank adapter to $A$:

$$ A' = A + U_AV_A $$

Then the target-side vector becomes:

$$ a_i = A'x_i = Ax_i + U_AV_Ax_i $$

So the original $A$ still does its normal transformation, while $U_AV_A$ adds a small extra learned movement.
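The parameter counts and the split $A'x_i = Ax_i + U_AV_Ax_i$ can both be checked with a short sketch. The tiny sizes (`d = 6`, `rank = 2`) and the random matrices are assumptions so the check runs instantly.

```python
import random

random.seed(0)

# Parameter counts from the text.
full_params = 2048 * 2048
adapter_params = 2048 * 8 + 8 * 2048
print(full_params, adapter_params)

d, rank = 6, 2  # tiny stand-ins for 2048 and 8


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(P, Q):
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]


A = rand_matrix(d, d)
U_A = rand_matrix(d, rank)
V_A = rand_matrix(rank, d)
x = [random.gauss(0.0, 1.0) for _ in range(d)]

# Merged form: build A' = A + U_A V_A, then apply it.
UV = matmul(U_A, V_A)
A_prime = [[A[i][j] + UV[i][j] for j in range(d)] for i in range(d)]
merged = matvec(A_prime, x)

# Split form: A x + U_A (V_A x), which never builds the full d x d update.
split = [p + q for p, q in zip(matvec(A, x), matvec(U_A, matvec(V_A, x)))]

gap = max(abs(m - s) for m, s in zip(merged, split))
print(gap)
```

The split form is why the adapter is cheap: the input passes through the small rank-sized bottleneck instead of a second full matrix.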

The same idea could apply to $B$ or $M$:

$$ \begin{aligned} B' &= B + U_BV_B \\ M' &= M + U_MV_M \end{aligned} $$

This gives the layer extra learned patterns inside the existing $A/B/M$ pathway.

The tradeoff is independence. A low-rank adapter modifies an existing pathway. A separate $A/B/M$ set creates a more independent pathway with its own scores and mixed vector.

Another linear algebra idea is to factor the matrix itself. Instead of learning one big matrix:

$$ W $$

we can force it to be built from smaller pieces:

$$ W = UDV $$

where $D$ might be diagonal or small. This says:

project into a smaller coordinate system
scale or select important directions
project back

This is close in spirit to SVD-style thinking:

$$ W \approx U\Sigma V^\top $$

The model does not need to literally compute an SVD during training. The point is architectural: we can constrain a large transformation to be made from smaller structured transformations.

For our $A/B/M$ path, that could look like:

$$ A = U_AD_AV_A $$

or:

$$ M = U_MD_MV_M $$

This could make a matrix learn a few important directions instead of an arbitrary full transformation.

Other structured choices are possible too:

  • block-diagonal matrices: let groups of dimensions transform mostly independently
  • orthogonal or rotation-like matrices: preserve vector length while changing direction
  • sparse matrices: allow only selected dimensions to interact

All of these are different answers to the same first-principles question:

How can we add more useful learned structure without paying for a completely unconstrained giant matrix?

Can we explore these combinations with multi-head attention? Or does it not really matter that much because, based on scaling laws, it is less about the architecture and more about scaling or data? I am just wondering how much of a difference combining these architectural innovations will actually make.


10. Transform the Mixed Vector With MLPs

The score-and-mix step produces one vector per position:

h_the, h_cat, h_sat

Now we need to add nonlinearity. The MLP is applied to each position vector independently, using the same weights:

$$ \begin{aligned} z_{\text{the}} &= \mathrm{MLP}(h_{\text{the}}) \\ z_{\text{cat}} &= \mathrm{MLP}(h_{\text{cat}}) \\ z_{\text{sat}} &= \mathrm{MLP}(h_{\text{sat}}) \end{aligned} $$

In matrix form, the MLP is applied row by row:

$$ Z = \mathrm{MLP}(H) $$
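A sketch of the row-by-row application. The two-layer shape, the ReLU, the missing biases, and the sizes are all simplifying assumptions; the weights are random stand-ins.

```python
import random

random.seed(0)
d, hidden = 4, 8  # toy sizes (assumption)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


W1 = rand_matrix(hidden, d)  # expand
W2 = rand_matrix(d, hidden)  # project back


def mlp(h):
    """Two-layer MLP with a ReLU in the middle (biases omitted)."""
    activated = [max(0.0, v) for v in matvec(W1, h)]
    return matvec(W2, activated)


H = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]  # h_the, h_cat, h_sat

# Same weights for every row; each position is transformed on its own.
Z = [mlp(h) for h in H]
print(Z[2])
```

No information moves between positions here. All cross-token mixing happened in the score-and-mix step.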

For next-token prediction after "the cat sat", we eventually read only the final vector at the last position, but the layer still transforms all positions in parallel.

This gives us the dense version:

every position vector uses the same MLP

Now the same goal and constraint appear again:

How can we learn more patterns without making every token use every parameter?

Different token states may need different transformations:

  • code token
  • math token
  • punctuation
  • proper noun
  • Chinese token
  • instruction text
  • JSON bracket
  • reasoning step

Making the MLP larger makes every vector pay for every parameter. The natural alternative is to create multiple expert MLPs and let each vector use only a few.

These experts are usually not tiny slices of one old MLP. They are separate MLPs, often with roughly the same shape as the dense MLP block.

$$ E_1, E_2, \ldots, E_{10} $$

To choose among 10 experts, we need 10 scalar scores. As before, the simplest way to produce a vector of scores is a matrix:

$$ \mathrm{scores}_i = Rh_i, \quad R \in \mathbb{R}^{10 \times d} $$

Each row gives one expert score:

$$ \mathrm{score}_k = r_k \cdot h_i $$

Then we select only a small number of experts for that vector:

many expert MLPs exist
each vector uses only a few

This is the core MoE idea:

increase total parameter capacity without increasing active compute per token as much.
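A minimal routing sketch, under assumptions: top-2 selection, softmax over only the selected scores as the gate, ReLU experts without biases, and random weights. Real MoE layers vary in all of these choices.

```python
import math
import random

random.seed(0)
d, hidden, num_experts, top_k = 4, 8, 10, 2


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Each expert is its own small MLP: (W1, W2).
experts = [(rand_matrix(hidden, d), rand_matrix(d, hidden)) for _ in range(num_experts)]
R = rand_matrix(num_experts, d)  # router: one row of scores per expert


def run_expert(expert, h):
    W1, W2 = expert
    return matvec(W2, [max(0.0, v) for v in matvec(W1, h)])


def moe(h):
    """Score all experts, run only the top_k, and mix their outputs."""
    scores = matvec(R, h)
    chosen = sorted(range(num_experts), key=lambda k: scores[k], reverse=True)[:top_k]
    gates = softmax([scores[k] for k in chosen])
    out = [0.0] * d
    for gate, k in zip(gates, chosen):
        y = run_expert(experts[k], h)
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen


h_sat = [random.gauss(0.0, 1.0) for _ in range(d)]
z_sat, chosen = moe(h_sat)
print(chosen)
print(z_sat)
```

Ten experts exist, but only two run for this vector, so the active compute is a fraction of the total parameter count.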

It is similar in spirit to the multi-head idea: more independent learned patterns inside a layer. The difference is where it happens:

multi-head attention -> more score-and-mix patterns
MoE MLP              -> more nonlinear transformation patterns

Once we introduce this idea, new problems appear:

  • How many experts does each vector use?
  • Do we choose top-k experts or use a learned threshold?
  • How do we keep experts balanced so one expert does not get all the traffic?
  • How do we avoid wasting compute on experts with tiny scores?

After the score-and-mix step and MLP step, the result becomes the input to the next layer:

$$ X^{(2)} = Z^{(1)} $$

So each layer repeatedly refines the vector at each position:

old position vector
-> mix information from context
-> apply nonlinear MLP transformation
-> new position vector for the next layer

11. What This Gives Us

Now we can step back.

Each position has a fixed-size vector:

$$ x_i \in \mathbb{R}^d $$

That vector has to carry whatever matters for prediction: syntax, meaning, facts, tone, intent, ambiguity, and context.

It cannot preserve everything. It only needs to preserve what changes the probability of future tokens.

That is why scaling matters. If we want better predictions:

  • more data gives more patterns to learn
  • more parameters give more capacity to store and transform those patterns
  • more compute lets us train larger models on more data

But scaling laws themselves are empirical. First principles tell us which variables matter. Experiments tell us how much they matter.

So the final thesis is:

Given previous tokens, move the final vector toward the correct next-token vector.

To do that, the model learns to:

  • represent tokens as vectors
  • score how much to take from each source vector
  • mix useful message vectors
  • add position so order matters
  • run many score-and-mix patterns in parallel
  • add nonlinearity with MLPs
  • repeat this over many layers

An LLM is a prediction machine that moves token vectors through many layers of scoring, mixing, and nonlinear transformation until the final vector points toward the next likely token.

Final Thoughts

What is striking to me is how simple the core idea is.

Once we strip away the names, attention is almost common sense:

decide how much to take from each vector
take the useful parts
mix them
transform the result
repeat

And yet this simple mechanism scales into something surprisingly powerful.

That leaves me with more questions than answers:

  • Why does a model's behavior change qualitatively past a certain scale? New capabilities seem to emerge, as in the transition from GPT-2 to GPT-3.
  • Real intelligence, like what humans can do, is also compression and prediction, but humans interact with the physical world and use different ways of gathering feedback. Are we a step closer to creating real, self-evolving intelligence?
  • Regarding The Bitter Lesson, which argues that general methods that leverage computation tend to win out over hand-built knowledge, and Kevin Kelly’s Out of Control: I am wondering how much of modern AI is rediscovering those principles through these systems. My current intuition is that they point in a similar direction. Hand-designed cleverness often loses to general methods that can absorb more compute and data. Complex systems are not usually built by controlling every part from the top down; they grow from many simple parts interacting, adapting, and feeding back on themselves. Modern AI feels like it sits at that intersection: simple local rules, massive scale, feedback from data, and emergent global behavior. Maybe intelligence is less about designing the perfect rule and more about creating the right system where useful structure can emerge.