Machine Learning for Business

I spent 1 year trying to predict company growth from SEC filings. It failed.

Eloi de Reynal — Tue, 12 May 2026 15:07:04 GMT

What about feeding whatever unstructured data you have about a company (full text of SEC filings or company website) into a custom encoder transformer, and having it output a quantitative prediction of its 5y growth rate?

That was my ultimate goal.

To get there, I had to teach a transformer how to understand numbers, train it to encode entire documents into a single embedding (and only keep what’s most interesting about said document), and use the embedding to try to predict the growth rate.

The prediction part failed. The rest is a success, and it can be used elsewhere.

Why unstructured data should carry signal

When I first checked Ferrari’s website, I was astonished. I had never seen a website so bad. I believe it would have been better for them to put a warning page reading “Dear customers, if you like cars, get lost. We, the top management and marketing team, decided that you’d rather watch strobing 480p videos of people carrying handbags with the Ferrari logo on them”.

There’s no way such a website is compatible with a healthy top management. A sane CEO would have set the marketing department on fire, sent an apology letter to every would-be customer, and gone on a pilgrimage to reflect on his sins.

How surprising. You guys actively repel customers and sell handbags.

I think the website of a company is the best public window into its inner workings, just like the way someone looks and speaks can tell you a lot about their psychology.

Also, I believe that the text of financial data gives a lot of information. If you see a footnote reading “we used mark-to-market accounting for our long term, nonfinancial contracts (but don’t worry, everything’s gonna be alright)”, you’d better be careful.

When it comes to financial analysis, there’s a no man’s land between what statistical models can do, and what humans can do. The former have been trained on 100,000s of datapoints but have a very shallow understanding of finance. The latter have been trained at most on 1,000s of datapoints (financial statements and business stories), but have a deep understanding of how things work.

LLMs have started bridging that gap, starting from the human side. Claude can help you analyze a financial statement, and would flag obviously fishy footnotes. But it hasn’t been trained to say “On average, companies with such an income statement grow 1.53% in the next year”.

This project is about trying to bridge the gap starting from the other side: a statistical model that’s more intelligent than a random forest, yet still trained on a lot of datapoints.

Thing is, there are a lot of technical hurdles to get there. First one is that text transformers are bad with numbers.

Ask an LLM: how close is 9,234 to 9,235?

Encoding every number between 9234 and 9334 with ModernBERT, and checking how close each number is to the lower (9,234) and upper (9,334) bounds. As you can see, the model considers 9,235 to be nearly as close to 9,334 as to 9,234. dist(9234, 9235) ≃ dist(9235, 9334). In case you’re asking, I just picked these bounds randomly.

I wouldn’t bet my ass on a financial model that has no clue how to sort numbers.

The problem is that ModernBERT’s [1] tokenizer treats numbers as discrete tokens, and breaks them down nonsensically. For example, 1000 gets tokenized as ‘100’, ‘0’, whereas 999 gets tokenized as ‘999’. How is the model supposed to figure out that these 2 numbers are close together, despite sharing 0 common tokens?

The training objective (predicting masked tokens with CE loss) is not really suited for this task either. It rewards exact prediction, not a “hmmmm, the number I have to predict should be around 1,000”. Sure, with enough parameters and enough training tokens, one could get a decently performing transformer. But the inductive bias of the default architecture is just not suited for the job.

There are a few existing methods to tackle these problems, but I don’t like them [2].

Step 1. Making ModernBERT good at numbers

< deep dive>

After a few weeks of experimentation, here’s the best strategy/architecture I came up with:

Encoding numbers in log-magnitude. Most of the time, in financial data, the relative uncertainty is more important than the absolute one. No one cares if Apple’s revenue is $416.161B or $416.162B, but everyone cares if their EBITDA margin is 0% or 1,000,000,000% (same absolute uncertainty). I chose to encode numbers from 1e-3 to 1e12 [3], a 16 OOM window.
Breaking up the 16 OOM window into 128 bins, with linear interpolation.
1. Encoding (embedder): An embedding dict maps each of these 128 bins to a learned 768-dim (ModernBERT’s hidden dim) vector. It’s basically a free (flop-wise) set of parameters. As nearly all numbers fall between 2 bins, a linear interpolation of the 2 closest bins is performed [4].
2. Decoding (prediction head): Classification-regression using the same 128 bins, with linear interpolation too. A simple MLP(last_hidden_states) with an output dim of 128 did okay, but I decided (maybe not the best call [5]) to use a GLU approach that performed a bit better. Smooth Cross-Entropy loss on the 2 closest bins [6].
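To make the binning concrete, here is a minimal sketch of the log-magnitude interpolation in plain Python rather than PyTorch. The bounds `LOG_LO` and `LOG_HI`, the helper names, and the expected-position decoding are assumptions made for illustration, not the actual implementation; signs and zeros are ignored.

```python
import math

N_BINS = 128
LOG_LO, LOG_HI = -3.0, 12.0  # assumed log10 bounds; adjust to your own window


def bin_weights(value, n_bins=N_BINS, log_lo=LOG_LO, log_hi=LOG_HI):
    """Return (lower_idx, upper_idx, w_lower, w_upper) for a positive number."""
    # Clamp to the window, then map log10(value) onto a fractional bin position.
    log_v = min(max(math.log10(value), log_lo), log_hi)
    pos = (log_v - log_lo) / (log_hi - log_lo) * (n_bins - 1)
    lower = min(int(pos), n_bins - 2)  # keeps upper = lower + 1 inside the table
    w_upper = pos - lower
    return lower, lower + 1, 1.0 - w_upper, w_upper


def soft_target(value, n_bins=N_BINS):
    """Soft label over the bins: all the mass sits on the 2 closest bins."""
    lower, upper, w_lower, w_upper = bin_weights(value, n_bins)
    target = [0.0] * n_bins
    target[lower], target[upper] = w_lower, w_upper
    return target


def soft_cross_entropy(log_probs, target):
    """Cross-entropy between predicted log-probabilities and a soft label."""
    return -sum(t * lp for t, lp in zip(target, log_probs) if t > 0.0)


def decode(probs, log_lo=LOG_LO, log_hi=LOG_HI):
    """Turn a distribution over bins back into a number (expected bin position)."""
    pos = sum(i * p for i, p in enumerate(probs))
    return 10.0 ** (log_lo + pos / (len(probs) - 1) * (log_hi - log_lo))
```

With this scheme, decoding the soft label of a number gives back that number, which is what makes the 2-bin target a regression target in disguise.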

I iterated quickly on this architecture by full-fine-tuning ModernBERT on a synthetic numeracy dataset that tested the model’s ability to perform simple arithmetic operations, in natural language, on numbers from 1e-4 to 1e6. Like “16 plus 24 gives [MASK]”, or “If you divide 1,234,345 by 567, you get [MASK]”, or “45 minus [MASK] gives 34”. I believe I spent around $30 of compute on these experiments.
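A toy generator for this kind of dataset could look like the sketch below. The templates copy the three examples quoted above; the value ranges, the number formatting and the function names are assumptions, and the real dataset likely covered more operations and non-integer operands.

```python
import random

# (template, operation) pairs; wording follows the examples quoted above
TEMPLATES = [
    ("{a} plus {b} gives {c}", lambda a, b: a + b),
    ("{a} minus {b} gives {c}", lambda a, b: a - b),
    ("If you divide {a} by {b}, you get {c}", lambda a, b: a / b),
]


def format_number(x):
    """Render integers with thousands separators and floats with 2 decimals."""
    return f"{x:,}" if isinstance(x, int) else f"{x:,.2f}"


def make_example(rng):
    """Return (text, target): one sentence with exactly one number masked."""
    template, op = rng.choice(TEMPLATES)
    a, b = rng.randint(1, 1_000_000), rng.randint(1, 1_000)
    values = {"a": a, "b": b, "c": op(a, b)}
    masked = rng.choice(["a", "b", "c"])  # any operand or the result can be hidden
    shown = {k: "[MASK]" if k == masked else format_number(v) for k, v in values.items()}
    return template.format(**shown), values[masked]


rng = random.Random(0)
for _ in range(3):
    print(make_example(rng))
```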

The best architecture was able to predict the results with a mean relative error (MRE) of around 10% after 30 L4-minutes on this toy dataset.

Then, I tested it on a real (though heavily formatted) dataset, that I synthesized from structured financial data that I got for another one of my projects. Results were impressive (MRE was about 5% or so, despite some numbers being quite difficult to infer from others).

So, at that point, I had a backbone model I knew would be good at numbers. But I had no data yet.

Step 1.2 Data acquisition and sanitization

Although this part is the most important of the whole blog post, it’s not the most conceptually interesting (despite a few tricks that I’m quite proud of). So, I discuss it in the annex. You can check it there if you want.

The final dataset is composed of:

All 2018 10-K filings
All AA accounts for all British companies with 50+ employees (2018)
A Wikipedia regularization dataset (100M tokens)

I appended website statistics to all company filings for which I could find the website archive as of 2018. I used statistics rather than the full website texts because I first wanted to check how the model would perform on statistics, which are arguably more information-dense than the full website.

Everything was heavily curated and sanitized. You can check the annex for more information.

316M tokens dataset in total. Full dataset available here.

Growth rate was defined as an average of multiple growth figures: Employee count, revenue, net income… You can check the weights of this average here.

I can’t stress enough that good data matters much more than a good architecture. And that spending a few hours on data sanitization yields better results than spending 100s of hours tweaking the number of attention heads of a transformer.

Step 2. Training the backbone model on real data

So, after I obtained and sanitized all the 2018 SEC and Companies House filings (>50 employees) and websites, I proceeded with the MLM training (predicting masked tokens in a sequence) of the model on that 300M-token dataset [7]. The main objectives were:

Specializing the model on financial data
Teaching it how to handle numbers with its new number embedder/head

I started off using complex LoRA setups with warmup phases where I froze everything but the number-specific part of the model. Classic “I’m gonna over-engineer it” mistake. Full fine-tuning gave much better results. I just used a decent chunk of English Wikipedia for regularization, so that the model doesn’t catastrophically forget all its pre-training.

So, decent improvement over baseline. Training took around 2h on an RTX 6000 Blackwell. You can check a full SEC filing inference example here, just to get a vibe-sense of what these numbers mean. Loss comparison is irrelevant here because of the tokenizer change [8].

The (trained) model is published on HuggingFace here. For more information on the training setup and failed experiments, please contact me if needed [9].

So, at this point, we have a model that (superficially, it only has 150M params) understands finance, and numbers. But it’s not good at distilling a sequence, let alone a whole document, into a nice vector that we can use for quantitative analysis [10].

They don’t do cruises like they used to.

Step 3. Making the backbone into a document embedding model

This is the tricky part.

When a model has been trained to predict masked tokens, its last hidden states are quite informative, but they also contain specific details about the tokens to predict that are irrelevant for sentence embedding purposes. Also, transformers tend to collapse the effective dimensionality of their outputs compared to their inputs [11].

So, I really had to fine-tune the backbone model into a sequence embedder, to make it:

Better at encoding relevant sequence information, without all the token-prediction noise.
Better at discriminating between different sequences.

Usually, what you do is take pairs of sequences that you want the model to consider “close” (positives), and pairs that are dissimilar (negatives). Like “the company reported strong earnings growth in the last quarter” and “our business grew strongly in the last quarter” for positives.

You take their CLS [12] embeddings, as well as those of a lot of negative examples, and you do some contrastive learning [13].
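The text doesn’t say which contrastive loss is meant. InfoNCE is a common choice, so the sketch below uses it as an assumption; the temperature value and the function names are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, temperature=0.05):
    """-log softmax of the positive pair's similarity against all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# The loss is near 0 when the positive is the closest candidate, large otherwise.
print(info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.2]]))
```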

Thing is, for this to work, you need a lot of positive pairs to train the model upon.

And creating positive pairs is something that comes with a certain bias: you have to decide what, in a sequence, is sufficiently irrelevant to be paraphrased [14]. I didn’t want to bake such assumptions into the model, so I went for an unsupervised method for making it learn sequence representations. 2 methods, in fact.

3.1. JEPA. It failed. But it’s mathematically elegant.

The core idea of JEPA is quite close to contrastive learning. It’s basically “let’s give the model slightly augmented versions of the same text/image (different crops, different token masks, different dropout masks…) and ask it to output similar embeddings for them. In order to prevent it from collapsing into predicting a constant (which obviously satisfies the previous objective), let’s make the output’s distribution Gaussian” [15]. I like it. I believe it’s the cleanest unsupervised way to make an output distribution both informative and locally consistent.

My “positives” were both augmented versions of the same chunk and different chunks from a given document.

It trained very well: the MSE loss term was really small and the SIGReg term was acceptable [16]. The final model’s embedding participation ratio was decent (at least 128), everything looked fine.
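For reference, the participation ratio can be computed without any eigendecomposition. The sketch below assumes the usual definition, PR = (Σλ)² / Σλ² over the eigenvalues λ of the embedding covariance matrix C, and uses the identities Σλ = tr(C) and Σλ² = ‖C‖²_F. The function name is hypothetical.

```python
def participation_ratio(embeddings):
    """Effective dimensionality of a set of embeddings (list of equal-length rows)."""
    n, d = len(embeddings), len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in embeddings]
    # Sample covariance matrix (d x d)
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))       # sum of eigenvalues
    frob_sq = sum(c * c for row in cov for c in row)  # sum of squared eigenvalues
    return trace * trace / frob_sq


# Variance spread evenly over 2 axes vs. all variance on a single direction
print(participation_ratio([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]))
print(participation_ratio([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [-1.0, -1.0]]))
```

A value close to the hidden dimension means the variance is spread over all directions; a value close to 1 means the embeddings have collapsed onto a line.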

Problem is that it just didn’t work on my custom STS and ordering benchmarks.

I can’t give a definitive explanation for that, except for the obvious “ensuring the model gives consistent representation for very close pairs wasn’t sufficient to make it consistent on less close pairs” [17]. I believe the problem comes from the fact that different crops from a given image (which JEPA was designed for) are more closely related than different chunks from a given document. And, in my setup, the no man’s land between “incredibly close pairs (augmented versions of the same chunk)” and “vaguely related pairs (different chunks from the same document)” prevented the model from learning a continuous representation of its input space.

3.2. Autoencoder setup. Worked fine. But beware of its frequency bias.

Another idea I’ve had after checking the EmbeddingGemma paper was to use the CLS token as a bottleneck carrying information from a sequence encoder (that we want to train) to a decoder (that we don’t care too much about).

Training setup was: give an unmasked chunk to the encoder, give its CLS representation to the decoder along with a heavily masked [18] version of the chunk, and have the decoder reconstruct the chunk.

Unlike the original transformer paper, I decided to go for a bidirectional decoder, for a few reasons [19]. Also, unlike that paper, only 1 token (the CLS) was attended to in the cross-attention layers, which was obviously wrong and needed to be addressed: softmax(only_one_attention_score) = 1, so every query gets the exact same cross-attention output, whatever its content.

So, I asked Claude to come up with an idea, although I had a few of my own. It chose brute force: expanding the CLS token embedding into 16 slots via a different full rank (768*16, 768) transformation for each layer. At first, I thought it was dumb: it more than doubled the param count of the decoder relative to its encoder base model. Plus, 16 full rank projections of a single embedding is as close as it can get to a complete a priori red flag.
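The single-key problem and the slot expansion can be illustrated with toy sizes. This is a sketch under assumptions: d=4 stands in for 768, a random matrix stands in for the learned (768*16, 768) projection, and the helper names are hypothetical.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expand_to_slots(cls_vec, weight, n_slots):
    """Apply a (n_slots * d, d) matrix to a d-dim vector, then split into n_slots vectors."""
    d = len(cls_vec)
    assert len(weight) == n_slots * d
    flat = [sum(w * x for w, x in zip(row, cls_vec)) for row in weight]
    return [flat[s * d:(s + 1) * d] for s in range(n_slots)]


# With a single key, the attention weight is 1 whatever the query-key score is.
print(softmax([3.7]), softmax([-42.0]))

# Toy sizes; the random matrix stands in for one layer's learned projection.
rng = random.Random(0)
d, n_slots = 4, 16
weight = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_slots * d)]
cls_vec = [rng.gauss(0.0, 1.0) for _ in range(d)]
slots = expand_to_slots(cls_vec, weight, n_slots)

# With 16 slots as keys, the attention weights now depend on the query.
query = [rng.gauss(0.0, 1.0) for _ in range(d)]
weights = softmax([sum(q * k for q, k in zip(query, slot)) for slot in slots])
print(len(slots), len(weights))
```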

But it’s the best decoder I’ve tried [20]. The condition numbers of the slot transformations are incredibly small [21], hinting that it’s a good solution.

The frequency bias of this autoencoder setup, and how to mitigate it

I liked the fact that this autoencoder setup makes the encoder only care about what can’t be inferred from the context. Seeing a few words from a CYA section is enough to guess all of it. So, the encoder doesn’t need to tell anything about it to the decoder.

But there’s a subtle trap in there. The presence of such boilerplate is in itself information, and it needs to be preserved. If we take an image analogy, the decoder tries to denoise images, and the encoder thus only has to give high-frequency information to the decoder. Not “this is a landscape picture, and the sky is blue”. This, the decoder doesn’t need to know to denoise a patch from the pixels (tokens) it sees. The decoder doesn’t need it, therefore the encoder isn’t going to encode it. But we still might need it. So, I had to find a way for the low-frequency information (“this picture is a landscape” or “this 10-K comes from an energy trading company with strange accounting practices”) to make its way into the CLS embeddings.

On the right image, nearly all high-frequency information gets lost with the masking noise. But the low frequency information (and general theme of the image) is preserved, so the decoder doesn’t need the CLS token to reconstruct it. So, the encoder doesn’t need to embed any of it.

I chose to go for contrastive training, with positives defined as different chunks from the same document. It’s basically free, as long as you ensure that there are enough positives in each batch, which is quite an interesting batching problem [22]. This contrastive term draws chunks closer if they are from the same document, and pushes them apart if they come from different docs. The effect is that the encoder needs to encode something like “this chunk looks like something you would find in that type of document”, which is an inherently low-frequency type of information. In the image analogy, it would be like ensuring that patches from the same image get encoded quite similarly. The only way to do that is for the encoder to put things like “This looks like a tree. Must be a part of a landscape” into its latent representation, so that the patch gets close to a patch that says “This looks like a sky. Must also be part of a landscape”.
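One simple way to guarantee positives in every batch is to sample chunks in same-document groups. The sketch below is an assumption about how this could be done, not the batching actually used; `positive_rich_batches` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def positive_rich_batches(chunk_doc_ids, docs_per_batch, chunks_per_doc, seed=0):
    """Group chunk indices so every chunk in a batch has same-document partners.

    chunk_doc_ids[i] is the document id of chunk i; chunks_per_doc should be >= 2.
    """
    rng = random.Random(seed)
    by_doc = defaultdict(list)
    for chunk_idx, doc_id in enumerate(chunk_doc_ids):
        by_doc[doc_id].append(chunk_idx)

    # Cut each document into groups of `chunks_per_doc` chunks; leftovers are dropped.
    groups = []
    for idxs in by_doc.values():
        rng.shuffle(idxs)
        for start in range(0, len(idxs) - chunks_per_doc + 1, chunks_per_doc):
            groups.append(idxs[start:start + chunks_per_doc])

    # Shuffle the groups, then pack `docs_per_batch` of them into each batch.
    rng.shuffle(groups)
    return [
        [idx for group in groups[i:i + docs_per_batch] for idx in group]
        for i in range(0, len(groups), docs_per_batch)
    ]


doc_ids = ["a"] * 5 + ["b"] * 4 + ["c"]
print(positive_rich_batches(doc_ids, docs_per_batch=2, chunks_per_doc=2))
```

Documents with fewer chunks than `chunks_per_doc` never appear here, so a real pipeline would need a fallback for very short filings.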

I have not tested the real-world contribution of this contrastive loss term, though. I just noticed it did improve the effective dimensionality of the CLS space.

Here are the results of the training [23]:

Giving a deceptive CLS token barely moves the decoder’s text accuracy. The vast improvement over the encoder-only baseline comes from the fact that:

The MLM pre-training of the encoder was done at a fixed 15% masking rate. So it has no idea how to handle 80% masked inputs
The CLS carries basically all the numerical information, leaving more “mental space” to the decoder for text prediction/memorization

On the other hand, the CLS carries nearly all numerical information. At 15% masking rate, the numeric reconstruction is perfect (the loss floor is 0.5, see footnotes). At 85% masking rate, it’s nearly the same as the encoder’s at 15% masking rate.

The conclusion from this auto-encoder training is that the text content of SEC filings is very light on information. And that most of it is still mostly understandable at 80% masking rate [24]. Most of the information density seems to be in the numbers, and that’s why the encoder focuses on them [25].

Checking how the embedding model performs

Nice reconstruction results don’t necessarily mean good real-world results. So, I decided to test the model on 2 custom benchmarks:

A kind of STS benchmark: 67 pairs of paraphrased financial sentences. Testing recall@1 and MRR. The code speaks for itself.
A kind of numerical ordering benchmark. Code speaks for itself. The idea is to check if the model is able to understand that “Last year revenue was $17 billion” is closer to “Last year revenue was $15 billion” than “Last year revenue was $1 billion”. 29 examples.
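For readers without access to the linked code, the two metrics can be sketched as follows. The data layout (a query-by-candidate similarity matrix whose correct match sits on the diagonal, and `(anchor, near, far)` triplets) is an assumption, not necessarily the format of the actual benchmarks.

```python
def retrieval_metrics(sim):
    """Return (recall@1, MRR); sim[i][j] scores query i against candidate j.

    The correct candidate for query i is assumed to be candidate i.
    """
    hits, reciprocal_ranks = 0, 0.0
    for i, row in enumerate(sim):
        # Rank of the correct candidate = 1 + number of candidates scored higher
        rank = 1 + sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        hits += rank == 1
        reciprocal_ranks += 1.0 / rank
    return hits / len(sim), reciprocal_ranks / len(sim)


def ordering_accuracy(triplets, sim_fn):
    """Fraction of (anchor, near, far) triplets where the anchor is closer to `near`."""
    correct = sum(1 for a, near, far in triplets if sim_fn(a, near) > sim_fn(a, far))
    return correct / len(triplets)


sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],  # the paraphrase of query 1 is only ranked second
    [0.1, 0.2, 0.7],
]
print(retrieval_metrics(sim))
```

In practice `sim_fn` would be the cosine similarity between the embeddings of two sentences such as “Last year revenue was $17 billion” and “Last year revenue was $15 billion”.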

The model performs surprisingly well despite its lack of task-specific fine-tuning.

It lags BGE-base-v1.5 on pure semantic similarity tasks, but it blasts it (and every other embedding model I’ve tried) on numeric tasks.

We could vastly improve the performance on these specific tasks by fine-tuning the last layer of the model for them [26].

Step 4. Embedding whole documents

Step 3 was the tricky one. Step 4 is the “how the heck can I train this with such a small dataset and less than $1,000 of compute?” one.

ModernBERT, which my financial models are based on, only supports a context length up to 8,192 tokens. Some SEC filings are longer than 128,000 tokens. 16x-ing a context window is possible but quite boring, and it necessitates fine-tuning. Nothing insurmountable, but yet another rabbit hole (that I preferred not) to explore [27].

So, I decided to go for another, less compute intensive method: aggregating CLS tokens, via another transformer, to be trained from scratch. The idea is very simple: take the CLS tokens from every chunk from one document, feed them into a transformer, and have it output (either via mean-pooling or attention pooling (CLS)) a single embedding.

Things get a bit fuzzy here, because there is no easy way to check whether the aggregated CLS is good. You can’t paraphrase 128K token documents as easily as you can 128 token chunks. You can’t easily decide that such documents are more alike than such others. So, the only things you’re left with are:

Completely unsupervised setups, like JEPA. That we saw doesn’t work too well unless you have a lot of data, with a certain shape.
The auto-encoder setup. That we saw was subject to the dimensionality collapse phenomenon (footnote 13). The participation ratio of the Step 3 CLS is already rather small, at 128 or so. Going for another aggregation step would likely decrease it further.
Completely supervised setup. Training the aggregator (along with a small regression head) to predict the 5y growth rate. Thing is, a small transformer with a 768 hidden dim has at least a few million params. There’s no way you can train it on a few thousand datapoints. Overfitting is a major concern here.

I’ve tried all 3 methods, and tested them all on the end-to-end task, that is, predicting 5y growth.

I was lucky enough that in-distribution prediction (training on 7000 companies, on the 2018-2021 period; testing on 700 unseen companies but on the same period) is a much easier task than out-of-distribution prediction (training on all 7700 companies on the 2018-2021 period, predicting 2021-2025 growth rates). It gave me a sense of how good the CLS representations were.

In all cases, I used a 6- or 12-layer transformer, with the exact same characteristics as ModernBERT, except for the local attention layers (all layers were global attention). Around 50M params.

The completely supervised setup [28] instantly overfit. I was expecting that, but you always have a glimmer of hope. Nope.

JEPA gave me the nicest output distribution entropy/effective dimensionality [29]. But the output distribution was really noisy, and it just wasn’t good for in-distribution predictions. The CLS representation was so bad that I had to train the aggregator with a very small learning rate (1e-6) and a regularization term (MSE with base aggregator’s CLS) to get acceptable results. I got a .15 correlation between predicted in-distribution growth rates and the real growth rates.

The AutoEncoder setup worked the best. The setup is very similar to that of step 3:

The encoder is the CLS aggregator transformer, that we’re trying to train. Its inputs are CLS representations (as computed by the Step 3. encoder) of all the chunks of each document. The Step 3. encoder is, of course, frozen, and the CLSs are pre-computed. I chunked documents differently for each epoch, and pre-computed all the CLSs beforehand.
The decoder is initialized from the Step 3. decoder. It’s trainable. Its job is to de-noise a single chunk of the encoded document. So, it’s given an aggregated CLS representation of the whole document, but its job is to only de-noise a chunk of it.

Once the CLS aggregator has been trained [30], it is frozen and a (trainable) regression head is added on top, to predict growth rates. The dimensionality collapse actually was a blessing here. When you have few datapoints, it certainly helps to fight overfitting. I got a .20 in-distribution prediction vs. actual correlation with this setup.

Unfortunately, the .2 in-distribution correlation didn’t translate to any out-of-distribution predictive power. Literally 0. You can check the much more detailed results here [31].

Is out-of-distribution prediction even possible?

When I realized that 2018-2021 and 2021-2025 actual growth rates were completely uncorrelated (0.0047) I thought that it would be difficult to predict. Unfortunately, I realized that after I had finished training the Step 4 model.

These growth scores being uncorrelated is perhaps the strongest indication that the task was simply impossible. It doesn’t simply mean that there is probably no signal to be found in the filings, but also that most explanatory variables are extrinsic to the company. If there’s something that makes companies grow for 3 years (say, a certain leadership style, the quality of its marketing department, an engineering moat), it’s statistically unlikely to make it grow for 3 more years. Maybe it’s about the 2018-2025 period specifically, but it shows how nearly impossible this task was.
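This kind of feasibility check is cheap to run before any training. A minimal sketch, assuming Pearson correlation is the statistic in question; the growth figures below are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical growth rates of the same 5 companies over two consecutive periods
growth_2018_2021 = [0.10, -0.05, 0.30, 0.02, 0.15]
growth_2021_2025 = [0.01, 0.12, 0.03, -0.08, 0.20]
print(round(pearson(growth_2018_2021, growth_2021_2025), 3))
```

If the target in one period carries no information about the target in the next, a model trained on the first period has little to transfer.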

On the other hand, I was able, with a lot more (structured) data, to actually predict some things that are quite similar to growth rates, and reported the results in a previous blog post.

So, I believe the conclusion is:

2021-2025 growth rates are inherently impossible to predict with a model trained on the 2018-2021 period
But it might still be possible to predict them on different periods, with a lot more data and compute.

Step 3 and Step 4 models work incredibly well for quantitative data extraction

You can actually use step 3 and step 4 models and training pipelines for a wide variety of quant tasks. It just happened that OOD company growth rates are impossible to predict from SEC filings.

So, I wanted to check how good the model was at extracting table data and reconstructing a clean, structured vector from the CLS representation.

The only part of the filings that is identically structured across filings is the web stats section, which consists of 2 tables whose structure you can check here.

So, I PCA-ed the website stats across all filings down to 90% variance explained, which left me with 28 components.

Then, I encoded the Website Statistics section with the Step 3 encoder, and built a small MLP to try to reconstruct the 28-dim structured vector from the 768-dim unstructured CLS embedding. Code is here (and here for the vanilla ModernBERT baseline).

Here are the results:

A 1024 -> 512 -> (28) MLP predicts a PCA-compressed 28-dim structured vector from the CLS representation of a section that contains the same information. CLS[section] -> 28-dim ground truth vector.

So, the 150M params encoder, which was never fine-tuned for this task (and was fine-tuned for a total of 10 5090-hours, or 6 H100-hours), is able to encode 54% (0.6 (R^2) * 0.9 (PCA variance explained)) of the quantitative information of a filing section. That’s really cool.

By the way, I am surprised that ModernBERT does a decent job too, despite its dubious number representation: it still encodes 42% of the information. I believe this good performance comes from the high proportion of small integer numbers in this section, that ModernBERT can tokenize and represent soundly. It is likely to be much lower for actual financial tables.

Conclusion & a few technical takeaways

Step 3 and Step 4 models are very good for quantitative feature extraction from unstructured data. Unfortunately, my end goal of predicting company growth rates from their filings wasn’t attainable.

I’ve learnt a bunch of things along the way:

If you need a transformer to understand numbers, use a custom number embedder/prediction head. It just won’t work otherwise (I tested numerous embedding models up to 8B, and they all sucked).
Clean up your data. It’s more important than everything else.
JEPA and other ‘completely unsupervised’ methods don’t work too well with text, because of how small the long-range dependency between text tokens seems to be (compared to image patches). Consequence is that, if you need to embed very long documents with a uniform information density, mean-pooling chunk embeddings might be okay [32] (my aggregator transformer might not have been that necessary after all).
Beware the frequency bias of text auto-encoders.
Beware the dimensionality collapse effect of transformers.
Test the feasibility of an idea before throwing 1,000 hours and a few hundred dollars of compute at it. Or don’t, if you want to have fun.
Use git diff and the zlib compressed : uncompressed size ratio to increase and assess the information density of website pages (see Annex - Websites).
Total compute used was around $400. Reproducing my results would cost less than $25. 16:1 ratio…

By the way, points 3 and 4 make me think that the LeJEPA way of creating views by cropping images probably puts a lot of emphasis on the lower-frequency information (where most of the information lives anyway, so it’s not too bad for standard image classification tasks). It could be problematic for image tasks where texture and other high-frequency information matter. If I have time, I will investigate a view-creation algorithm whose frequency bias aligns with the end classification goal. I was thinking that image masking via an Ising model (with a careful, task-specific temperature setting) could be nice. I’ve not given it too much thought though.

What the project looks like, in the end:

Annex: How I got the data and sanitized it

Claude Code helped a lot. I decided to only use free Anglophone companies data. Namely: SEC filings and Companies House filings. They were all available for download quite easily. I got them for 3 separate years: 2018, 2021 and 2025.

Then, I used CommonCrawl and the Wayback Machine to scrape the top 20 pages from each company website, after matching the company filings and websites using either SEC data, or gemini-flash-lite for the Companies House filings (less than $5 for 5,000 companies). I intended to use the full website text as input data, but decided to use only stats (like the ratio of text length to HTML code length, as a proxy for how bloated the page is) for a first proof of concept.

A few tricks that I used to sanitize the HTMLs (SEC and Companies House Filings, and websites were all HTML):

Websites:

Git diff on all website pages vs. the landing page. Helped remove redundant headers and footers.
Removed all boilerplate pages with a regex on the URL (like ‘legal’, ‘privacy’, ‘cookie’, ‘sitemap’, ‘login’, ‘wp-admin’…).
Filtered out pages with low information density, using the gzip [33] compressed size : uncompressed size ratio as a proxy for information density. Problem is: the longer a text, the more compressible it is. So, I built a baseline curve from Wikipedia articles: for each text length (100B to 1MB), what’s the “normal” compression ratio for natural language? Then for each page I compared its compression ratio to the baseline at that length. I was quite happy when I saw that the compression ratio very nicely aligned with my intuition of how dense a page was [34].
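The length-corrected compression ratio can be sketched with the standard library. This is an illustration under assumptions: it uses `zlib` at level 9, a piecewise-linear baseline over whatever reference texts are supplied, and hypothetical function names; the actual script may differ on all three.

```python
import bisect
import zlib


def compression_ratio(text):
    """Compressed size / raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("empty text")
    return len(zlib.compress(raw, 9)) / len(raw)


def build_baseline(reference_texts):
    """Sorted (length_in_bytes, ratio) points from reference texts of varied lengths."""
    return sorted((len(t.encode("utf-8")), compression_ratio(t)) for t in reference_texts)


def baseline_ratio(baseline, length):
    """Linearly interpolate the reference ratio at a given length (clamped at both ends)."""
    lengths = [n for n, _ in baseline]
    i = bisect.bisect_left(lengths, length)
    if i == 0:
        return baseline[0][1]
    if i == len(baseline):
        return baseline[-1][1]
    (n0, r0), (n1, r1) = baseline[i - 1], baseline[i]
    return r0 + (r1 - r0) * (length - n0) / (n1 - n0)


def density_score(text, baseline):
    """Ratio relative to the baseline at the same length; well below 1 suggests boilerplate."""
    return compression_ratio(text) / baseline_ratio(baseline, len(text.encode("utf-8")))
```

Pages whose score falls under some threshold (to be tuned by eye on a sample) would then be dropped.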

SEC filings:

Sanitized tables using custom JS injection into a headless Chromium browser. When you inspect an SEC filing table, the HTML is horrendous, but it can actually be sanitized programmatically if you take colspan and rowspan attributes into account.
HTML to text (element.textContent) for every non-table section. Easy, that’s it.
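The colspan/rowspan idea can be shown outside the browser too. The sketch below is a Python stand-in for the JS approach described above, not a port of it: it expands spanned cells into a rectangular grid using only the standard library, and it does not handle nested tables or malformed span values.

```python
from html.parser import HTMLParser


class _TableCells(HTMLParser):
    """Collect (text, colspan, rowspan) for every cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": [],
                "colspan": int(a.get("colspan") or 1),
                "rowspan": int(a.get("rowspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell["text"]).split())
            self.rows[-1].append((text, self._cell["colspan"], self._cell["rowspan"]))
            self._cell = None


def expand_table(html):
    """Return a rectangular list of rows, repeating the text of spanned cells."""
    parser = _TableCells()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:  # skip slots taken by a rowspan from a row above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```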

Companies House filings:

The HTML of these filings is fucked beyond repair. Nearly all HTML elements are absolute-positioned. I had to HTML → PDF → OCR → Markdown them. Best OCR model out there is GLM-OCR. I rented an RTX 5090 and used vLLM. It could be a good idea to build your custom SDK, because GLM’s default doesn’t use the GPU fully. My suggestion is: Layout-detect your whole heap of docs, store the bounding boxes somewhere, then get rid of the layout detection model, then use the “real” GLM OCR model on the bounding boxes, with aggressive vLLM memory settings. You’ll triple speed and GPU usage. I tried it, kind of worked, but I was so frustrated when I encountered a vLLM shutdown midway that I decided to revert to the default SDK and just wait 10h for the OCR to finish.
Most of the companies don’t report their income statements. So, I had to infer their growth rates from all the data they reported. The idea was: checking how all their reported figures grew, and taking a weighted average of them (weighted by the significance of each figure) to get a single number. E.g., employee count grew 15% and asset base grew 12% → growth rate is probably around 14% (weighing employee count a bit more). Script is here and weights used are here.
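The weighted average only runs over the figures a company actually reports. A minimal sketch; the metric names and the 2:1 weights are hypothetical and chosen only to reproduce the 15% / 12% → 14% example, the real weights being in the linked file.

```python
def composite_growth(growth_by_metric, weights):
    """Weighted average of growth rates over the metrics that are reported (not None)."""
    used = {m: w for m, w in weights.items() if growth_by_metric.get(m) is not None}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted metric is reported for this company")
    return sum(growth_by_metric[m] * w for m, w in used.items()) / total


# Hypothetical weights: employee count counts twice as much as the asset base
weights = {"employees": 2.0, "assets": 1.0, "revenue": 3.0}
reported = {"employees": 0.15, "assets": 0.12, "revenue": None}  # revenue not filed
print(composite_growth(reported, weights))
```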

After that, all numbers were wrapped in dedicated opening and closing tags (that the model’s tokenizer understands).

Footnotes

[1] ModernBERT is the backbone model I used. Multiple reasons for that:

Decent context window (thanks to Flash Attention (although SDPA is now trivial to implement in PyTorch) and Local Attention layers)
RoPE
Small (155M params).
Completely overtrained (trillions of training tokens vs. <1B params)

[2] And I incidentally tested nearly all of them in my various experiments.

[3] Fuck, for some reason, it got changed by Claude Code to -12, 12. I should have checked the default config. The model works okay anyway, but it’s definitely sub-optimal. A few lost flops. From my experience, though, changing the number of bins from 64 to 128 only yielded marginal improvements, so I believe this wasted (-12, -3) range doesn’t have too much impact.

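[4] The interpolation between the 2 closest bins:

# norm_pos: fractional bin position; lower_idx / upper_idx: the 2 surrounding bin indices
weight_upper = norm_pos - lower_idx.float()
weight_lower = 1.0 - weight_upper
# Look up the learned embeddings of both bins
emb_lower = self.magnitude_emb(lower_idx)
emb_upper = self.magnitude_emb(upper_idx)
# Blend them, broadcasting the weights over the hidden dimension
interpolated = emb_lower * weight_lower.unsqueeze(-1) + emb_upper * weight_upper.unsqueeze(-1)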

[5] I think it wasn’t the best call, because it allows a fair bit of non-linearity AFTER the last hidden state. This increased post-hidden-state expressivity decreases the relevance of simple operations on these vectors (like mean pooling or cosine sim computation). Just like averaging 2 numbers before they get exponentiated will not give you a good idea of the exponentials’ average. I just didn’t want to re-train my model after I figured it out. Plus, the expressivity of the prediction head is still much lower than that of the 22 transformer layers before.

15% mask prob. When masked: 80/10/10 strategy: 80% of the masked tokens get replaced by [MASK] (or a special magnitude bin that I reserved for this purpose, in case of numbers), 10% get assigned a random token (or number), 10% are unchanged.
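A minimal sketch of that 80/10/10 scheme for text tokens, in plain Python. The function name and the `vocab` argument are illustrative, and the number-specific case (the reserved magnitude bin) is not covered here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels); labels[i] is None wherever no prediction is required."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            # Not selected: the token is passed through and yields no loss.
            inputs.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # selected: the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)               # 80%: replaced by [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replaced by a random token
        else:
            inputs.append(tok)                # 10%: left unchanged
    return inputs, labels
```

The 10% of selected tokens that are left unchanged still count in the loss, which keeps the model from learning that every non-[MASK] token is trustworthy.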

But if you’re interested nonetheless, here they are: ModernBERT’s loss (only text tokens were masked) on text is 0.65; My model’s text loss is 0.5 and its number loss is .85. Please note that the number loss floor is around .5, considering that the irreducible entropy of the soft-label targets (due to the bin interpolation) is around .5 nats (for a uniform distribution). But, due to {2016, 2017} being a fair chunk of the numbers to predict (and falling quite exactly inside one of the 128 bins), the actual entropy is a bit lower.
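The 0.5 nats figure can be checked numerically. Assuming each number's soft label is split between two neighbouring bins with weights w and 1 - w, and that w is uniformly distributed over [0, 1], the average entropy of the target is the integral of -w ln w - (1 - w) ln(1 - w) over [0, 1], which equals 1/2. The sketch below approximates that integral with a midpoint rule.

```python
import math


def two_bin_entropy(w):
    """Entropy (in nats) of a soft label split w / (1 - w) between two neighbouring bins."""
    return -sum(p * math.log(p) for p in (w, 1.0 - w) if p > 0.0)


# Midpoint-rule average over interpolation weights spread uniformly in [0, 1].
steps = 100_000
mean_entropy = sum(two_bin_entropy((i + 0.5) / steps) for i in range(steps)) / steps
```

A number that falls exactly on a bin has w equal to 0 or 1 and therefore zero target entropy, which is why frequent values sitting right on a bin pull the real floor below 0.5.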

2 highlights, though:

I tried using 2D RoPE for tabular data (that make up a good part of financial data), because I didn’t see how 1D RoPE could handle it. I was really proud of the engineering behind my 2D RoPE implementation. But it turns out that 1D RoPE works fine, as long as cells and rows are clearly delimited. I suspect that tokens in a cell somehow get to carry the information of “how many tabs there are between them and the last line-break”.
You absolutely need to minimize padding wastes during training. Claude Code is biased towards bucketing the sequence lengths. That’s bullshit. Just order all your sequences by length, and greedily construct batches before shuffling them. SDPA makes memory usage ~ O(total_number_of_tokens_in_a_batch), so the greedy algorithm is straightforward.
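The "sort by length, then greedily fill batches" idea from the second highlight could look like this. It is a sketch under the assumption that the budget is the total number of real tokens per batch (no padding, as with variable-length attention); the function name and the `max_tokens` parameter are illustrative.

```python
import random


def greedy_batches(lengths, max_tokens, rng=None):
    """Group sequence indices into batches holding at most max_tokens tokens each."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, current_tokens = [], [], 0
    for idx in order:
        # Close the current batch when the next sequence would exceed the token budget.
        if current and current_tokens + lengths[idx] > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += lengths[idx]
    if current:
        batches.append(current)
    # Shuffle whole batches so training does not see lengths in increasing order.
    (rng or random).shuffle(batches)
    return batches
```

In this sketch a sequence longer than the budget simply ends up alone in its own batch; a real pipeline would probably truncate or chunk it first.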

In fact, an encoder transformer can serve as a sequence embedding model right after MLM pre-training (via mean-pooling or attention pooling (CLS, in case of BERT) of the last hidden states), and is usually decent at it. But it can be improved via specific fine-tuning techniques.

The [CLS] token is prepended to every sequence in BERT models. It’s supposed to be the representation of the sequence. It is not trained during MLM, so its hidden state defaults to being quite close to the mean pooling of the other tokens’. But when you train it on sequence-level tasks, it becomes much more interesting.

The mathematical concept is really simple:

Compute all pairwise cosine similarity (both positive and negative pairs) in your batch, so you get a 2D matrix
Softmax each line to make them a probability distribution
Just take the log likelihood of the positive pairs and turn it into a loss.

Result is: pull negatives apart, draw positives closer together. As a consequence, it also results in a higher effective embeddings dimensionality (measured by their participation ratio).
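The three steps above can be sketched in plain Python. Assumptions: row i of `anchors` and row i of `positives` form the positive pair, every other pairing in the batch is a negative, and the similarities are divided by a temperature before the softmax (the temperature is a common addition and not something stated above).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(anchors, positives, temperature=0.05):
    """Mean negative log-likelihood of the positive pair within each row of the batch."""
    loss = 0.0
    for i, anchor in enumerate(anchors):
        # One row of the pairwise similarity matrix.
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-softmax over the row, with the usual max-shift for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_norm
    return loss / len(anchors)
```

When every anchor is closest to its own positive the loss is near zero; when the pairs are swapped it grows roughly like 1 / temperature.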

You could argue that “To be or not to be, that is the question” and “Life sucks, why did my father die?” can be positives, but you shouldn’t expect the resulting embedding model to care much about style.

When you think about it, there’s also a bit of bias baked in the JEPA training objective. It’s “on average, you can crop an image and still preserve the subject” or “no big deal if you mask a few tokens in a sentence”. But this bias seems acceptable to me.

MSE: 0.1, SIGReg: 10. The latter depends on the number of views and random projections, so it’s completely useless as a standalone. SIGReg loss at the beginning of the training was around 150.

Just check how erratic the euclidean distances are between similar pairs: benchmark results. Said similar pairs are here.

15-85% uniform mask_prob sampling.

In favor of bidirectional decoder:

Decoder can be initialized from the encoder. Only need to add cross-attention layers.
Autoregressive decoders tend to rely too much on context (they see all of it), which sucks. The only thing the encoder needs to encode is the beginning of the sequence and what can’t be inferred from context. I don’t think such an inductive bias would yield a nice sequence representation.

In favor of causal decoder:

Autoregressive decoder’s training is more elegant. Contrary to bidirectional transformers where only the ~15% masked tokens yield gradients, 100% of causal decoder tokens give training signal.

And I’ve tried a shit-ton of them. Memory Slots Expansion (Claude’s idea) got me to 1.7 loss (text + number) approximately. Here is what my “better ideas” yielded:

Using simple MLPs and fancier setups to directly update the decoder’s tokens based on the CLS token. So, no softmax attention involved. Best loss: 1.8
Component-wise softmax attention scores and values updates. It was painful to implement, and it worked no better. Best loss: 1.8 too.
A lot more memory slots (256), but much smaller rank individual projections (64). Best loss around 1.75.
Using full-rank Memory Slots transformations, but sharing the weights across layers. Allowed to use more memory slots for cheaper. This one was actually quite good. Parameter-wise, it was better than the brute force solution. But param count was never an issue. Compute was, and despite some possible optimizations (like pre-computing all memory slots for all layers (as they were shared across layers)), the increased attention computations (more memory slots => longer sequence length) far outweighed them. So I didn’t investigate further. Best loss around 1.7 too. But with more flops.

I was revolted that Claude came up with the best solution right away, especially considering how fishy 16 full rank independent per layer projections look at first sight.

20 or so, IIRC? For a 768-rank transformation, that’s just as small as it can get.

Sort chunks by length and document id, and greedily construct batches (it’s quite similar to tiling a 2D space).

You can check all the default hyperparams and script here. In a few words: encoder & decoder are initialized from the backbone. Decoder’s cross-attention’s output proj is zero-initialized.
Decoder’s architecture is here.

~66% text accuracy at 80% masking rate is indeed telling.

When I think about it, the fact that I weighed the number loss and text loss equally during training could make this a bit of a self-fulfilling prophecy. Indeed, numbers only make up 8-9% of tokens in a 10-K filing (a bit more in Companies House data), but account for ~50% of the loss, by design. From my experience, though, changing the lambda of different loss terms only barely changes their final magnitude. So, small caveat here, but I believe the conclusion holds.

And actually, a slightly better (in terms of val loss) checkpoint of the autoencoder performed slightly worse on both benchmarks, showing that fine-tuning for the end tasks is indeed necessary.

My first concern would be whether a single RTX 6000 Pro (or an H100) would fit ModernBERT + 128K context. I’m pretty sure it would have been able to during inference (easily), but training memory usage was already quite high at 32768 tokens / batch… Nothing insurmountable either (easiest solution being using multiple GPUs, before optimizing anything).
Also, and mainly, fine-tuning at that length necessitates documents that are also coherent at that length. Which is really difficult to find/build, especially in the financial domain. Plus, the usual lack of long range dependency inside long documents (Code is really an outlier in this domain) would just teach the model to come up with a kind of sliding window attention.

Untrained transformer + regression head.
Input: Step 3 CLS tokens,
Output: mean pooling => growth rate.
Loss: MSE on growth rate.

Participation ratio was probably a bit over 100. Average cosine sim was basically 0.
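For reference, the participation ratio of a set of embeddings is (sum of covariance eigenvalues)² / (sum of squared eigenvalues). Since those two sums are the trace of the covariance matrix and its squared Frobenius norm, it can be computed without an eigendecomposition. A small pure-Python sketch (fine for toy sizes, far too slow for real 768-dimensional embeddings):

```python
def participation_ratio(embeddings):
    """Effective dimensionality: trace(C)^2 / trace(C @ C) for the covariance matrix C."""
    n, d = len(embeddings), len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in embeddings]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)] for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))
    # For a symmetric matrix, trace(C @ C) is the sum of its squared entries.
    frob_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / frob_sq
```

Embeddings spread evenly over two orthogonal directions give a ratio of 2, while embeddings collapsed onto a single line give 1.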

Just check the code for all hyperparams and training tricks (I’m rather proud of the batching strategy), and all. Final growth rate predictor training and walk-forward validation code is here.

Something that I am surprised didn’t work: As all the models (Step 2, 3 and 4) had been trained on 2018 data, but were tested on 2021 data, I thought it would be a good idea to adapt 2021 aggregated CLS embeddings to make them more “2018”. So I fit a linear regression on 2021 agg-CLS vs 2018 agg-CLS before feeding them into the regression head. It helped with dialing down the average growth rate prediction (so the model may have learned something after all) to make it closer to the actual 2021-2025 median, but it didn’t improve the correlation between predicted and actual growth at all.

The idea is that, if there is no long-range dependency (and interaction) in “normal” text (code is not “normal” in that sense), the whole is not too different from the sum of its parts. Meaning that, as long as the CLS representation space is well formed, which in itself is a very interesting question, averaging the CLSs will give a nice document-level representation.

zlib algorithm, more precisely. Just asked Claude Code what the best text compression algorithm was, and it came up with that name.

If you’re interested, a nice threshold is 0.4 times as dense as wikipedia. Below that density threshold, you get pages that are full of shit. Above, you might miss product lists or things like that, that do contain meaningful information.
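A sketch of what such a density filter could look like with Python's built-in `zlib`. The definition of "density" used here (compressed bytes per raw byte) and the helper names are assumptions; the reference text stands in for a Wikipedia sample.

```python
import zlib


def compression_density(text):
    """Compressed bytes per raw byte: repetitive boilerplate scores low, dense prose high."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def is_dense_enough(text, reference_text, threshold=0.4):
    """Keep a page only if it is at least `threshold` times as dense as the reference."""
    return compression_density(text) >= threshold * compression_density(reference_text)
```

Very short texts are dominated by zlib's fixed overhead, so the ratio is only meaningful for pages of a few hundred bytes or more.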

Teaching LLMs

Eloi de Reynal — Thu, 19 Feb 2026 17:16:08 GMT

“2025 is gonna be the year of agents”. My ass.

2025 was the year when most AI companies (including the one where I was chief AI scientist) tried to build useful systems with AI, but mostly failed.

I worked at a French legal tech and we tried to automate the tedious part of notaries’ jobs. We built a chatbot (soooo 2024, but wait, it was an agentic chatbot, and there was a fair bit of engineering in it), and gave it to notaries to test.

They kind of liked it.

But like any other chatbot, it doesn’t learn from user feedback. Which is infuriating.

When you tell it “you failed. Next time, you should do THIS instead”, it says “sure, I won’t make the same mistake twice”. But it eventually does.

Most modern chatbots have a “memory” function that’s supposed to help with that, but it mostly sucks.

Here, I’d like to present a technical idea that I haven’t had time to test, but that I think would have a decent chance of making LLMs able to learn on the fly. It’s based on privileged information, distillation and prefix tuning (one of the best papers I’ve ever read, btw).

The feedback doom loop: you don’t want to teach a student who won’t learn

Do you ever downvote/upvote a ChatGPT response? I guess not. Because it’s useless, as ChatGPT won’t take the feedback into account until it’s too late for you to care.

Big AI firms would love to have conversation data on domains where LLMs typically fail. But they don’t get much of it, because users get a better and better sense of what LLMs fail at, and don’t give them tasks related to these domains (let alone give any feedback).

So, there’s a negative feedback loop:

LLMs suck at tasks where training data is sparse. => Users don’t engage with them on these tasks, and don’t give feedback anyway. => Training data for these tasks remains sparse.

What if, instead, you knew ChatGPT would improve right away based on user feedback? And not only temporarily, but across different chats?

I guess you'd be willing to take a bit of time to tell it that “when writing an email to @big-steel.pl, always use ‘get lost’ instead of ‘kind regards’”.

So, the vicious circle would break: the ability of the LLM to learn would make giving feedback directly profitable to the user.

Okay, but how do you teach an LLM?

Here’s the complete workflow that I’d suggest:

1. When a user is dissatisfied with the LLM’s actions, they downvote its response.
2. They then give constructive feedback.
3. If the LLM gets it right with the additional feedback, the user upvotes the feedback-enhanced response. Else, back to 2.
4. The constructive feedback is distilled into a few learned tokens or prefixes at the beginning of the context window. It’s the obviously difficult part, so it’s detailed in the next section.
5. Whenever the user starts a new conversation, the learned tokens/prefixes are loaded and prepended to the system prompt.

How do you distill feedback into a few learned tokens/prefixes?

You may think that we don’t need to. Why not just update the model to increase its likelihood of outputting the right answer?

What we don’t want: Normal (LoRA) fine-tuning

When you want to fine-tune an LLM, you usually give it a bunch of pairs of (prompt, desired output). Then, it’s trained on the normal next-token prediction task: the backpropagated loss is usually -sum(log(prob)), prob being the probability that the model assigned to the ground truth token (the one that is actually part of the desired output).
Some guys observed that the difference matrices (let’s call them ΔW) between the weight matrices before and after fine-tuning usually have a high condition number (ratio between largest and smallest singular values), meaning that they can be well approximated by low rank matrices.

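So, these guys figured out that it’d be much more efficient to train ΔW as a low rank approximation:

$$\Delta W = B A$$

Where:

$B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the only trained matrices, and ΔW has the same $d \times k$ shape as the frozen weight matrix it is added to.

r is usually around 8 or 16, while d or k are closer to 8192 (Llama 3 70B for example).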

It makes training much more parameter efficient.
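To put a number on that, here is the parameter count for a single weight matrix, assuming the standard LoRA factorization where ΔW (d × k) is the product of a d × r matrix and an r × k matrix. The helper names are illustrative.

```python
def full_params(d, k):
    """Trainable parameters when Delta W is learned directly."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters when Delta W = B @ A, with B of shape (d, r) and A of shape (r, k)."""
    return r * (d + k)


d = k = 8192
ratio = full_params(d, k) / lora_params(d, k, r=16)  # 67,108,864 vs 262,144 parameters
```

With d = k = 8192 and r = 16, the adapter trains 256 times fewer parameters for that matrix. The same formula gives the 64 parameters of the rank-1 LoRA on a 32 × 32 layer used in the annex.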

Still, if you serve each user a different model, you get into some really hard infrastructure challenges. Batching user requests to process them in parallel (as is absolutely necessary to use GPUs efficiently) becomes quite difficult, because they all use different adapters (ΔW set of matrices). It’s partially solved, but it’s still a bit of a nightmare.

How about using a nice feature of transformer networks that allows them to somewhat learn without necessarily changing their weights?

What we could test: Prefix/prompt tuning

When you first discover Machine Learning, it’s common to think “wait a minute, how about we make a model that updates its weights based on its inputs. Like, on the fly, not just during training. Wouldn’t it be cool? It would make the model able to learn much deeper patterns”.

Turns out, the idea of having a network update its weights based on its inputs is called a hypernetwork, and it’s equivalent to having a network that’s highly non-linear on its inputs.

Transformers are just that.

Meaning that adding[1] a bias to the inputs (like, say, adding a few tokens before a prompt, which is exactly the purpose of a system prompt) can really change the behavior of an LLM.

So, instead of backpropagating the loss to update the weights of the model, we may get away with backpropagating it to update a few learned tokens (invisible to the user) that will be prepended to the context.

Like an invisible and learned system prompt written in the continuous embedding space rather than the discrete token space. This is the concept of prompt/prefix learning.

As for the “fine-tuning” data, it would comprise pairs of (original_user_prompt, llm_generated_answer_after_user_feedback).

Now, let’s see why this technique is likely to be quite data-efficient, and how to make it even more so.

Data efficiency is key: this is where distillation and prefix-tuning’s lack of expressivity can help.

Pretraining necessitates billions of examples. Fine-tuning necessitates hundreds, at the very least. How could we teach an LLM from only ONE human feedback?

Here’s where expressivity and distillation come into play. Let’s dive quite deep into both concepts.

Expressivity

If an LLM sucks at a certain language, no amount of prompt engineering will make it able to master it. You need at least high-rank Parameter-Efficient Fine-Tuning (through LoRA) for that. In fact, you’ll probably be better off with a full fine-tuning of the model at that point.

On the other hand, prompt engineering is sufficient to tell a model to use such and such library when coding, or to change its tone.

More rigorously, full fine-tuning is more expressive than prompt engineering. The expressivity of a “steering” method basically defines how powerful it is: how much it can change the transfer function of a Neural Net.

Expressivity of different “steering” methods for LLMs.

Whatever you can do with prompt engineering, you can also do with full fine-tuning, but you’d need a shitload of examples to constrain the search space enough.

That’s when we run into the expressivity vs. sample efficiency tradeoff, which is just a particular case of the bias-variance tradeoff. I’ve coded a toy demonstration of it that you can check in the annex.

When a training method (or a model architecture) is able to fit a very large class of functions, it needs a very large dataset to define what, precisely, the function to fit is.

It’s a bit reminiscent of the “you need as many linear equations as there are dimensions in a vector space to define a point in it”.

System prompt (or prefix) tokens can only bias the attention scores without changing their order, while LoRA can completely change the relative importance of token-to-token interaction.

Meaning that, if interaction(token1, token2) > interaction(token2, token3) in a certain attention head of the base model, this relationship will always stay true regardless of the number of prefix tokens you add. On the other hand, a LoRA fine-tuning could reverse the order.
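This can be illustrated on a single attention row. The scores below are made-up numbers; the sketch only covers one query in one layer with fixed token representations, and says nothing about deeper layers where the token representations themselves have already been altered by the prefix.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


token_scores = [2.0, 1.0, 0.5]   # query-key scores for three real tokens
prefix_scores = [3.0, 2.5]       # scores for two hypothetical prefix tokens

without_prefix = softmax(token_scores)
# Prepend the prefix scores, then keep only the weights of the real tokens.
with_prefix = softmax(prefix_scores + token_scores)[len(prefix_scores):]
```

Every real token's weight is scaled by the same factor: the prefix takes a share of the attention, but the ranking and the ratios among the real tokens stay exactly the same.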

This is why, from a mathematical perspective, prompt engineering will never replace a full fine-tuning on a few billion tokens.

But in our case, I think prefix tuning is the sweet spot: more expressive than prompt engineering (not constrained by the discrete space of real-world tokens), yet still very sample-efficient. How can we make it even more so?

Distilling user feedback into learned tokens

Usually, when you fine-tune a model on real data, the loss function compares the next token probability distribution as predicted by the model with the ground-truth distribution (which obviously is one-hot). It computes the cross-entropy of that distribution, which handily simplifies as -log(probability).

Indeed, you’ll never find a text where the author gives a quantified insight into their hesitation when choosing their words. So real-world ground truth will always be a one-hot.

When you distill a teacher model into a student model, though, you can see the full output probability distribution of the teacher. And it carries much more information than the one-hot ground truth distribution[2]. So, you train the student model on the output distribution of the teacher.

It’s much more sample-efficient, and plainly better in terms of final loss on the real-world dataset, as the teacher’s outputs are less noisy than real-world data.
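The difference between the two training signals can be shown on a toy 3-token vocabulary. The probabilities are made up, and KL divergence is used as the distillation loss here, which is a common choice rather than something prescribed above.

```python
import math


def cross_entropy(target, predicted):
    """Reduces to -log(probability of the true token) when the target is one-hot."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


def kl_divergence(target, predicted):
    """Distillation loss: how far the student's distribution is from the teacher's."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0.0)


one_hot = [1.0, 0.0, 0.0]        # real-world ground truth
teacher = [0.6, 0.3, 0.1]        # full distribution given by the teacher
student_a = [0.5, 0.25, 0.25]
student_b = [0.5, 0.4, 0.1]

hard_a = cross_entropy(one_hot, student_a)   # -log(0.5)
hard_b = cross_entropy(one_hot, student_b)   # -log(0.5) as well
soft_a = kl_divergence(teacher, student_a)
soft_b = kl_divergence(teacher, student_b)
```

The one-hot loss cannot tell the two students apart, since both put 0.5 on the true token. The distillation loss can: it also scores how each student spreads the remaining probability mass, and student_b is closer to the teacher.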

So, what about distilling the base model + user feedback into the base model + prefix tokens?

It would basically amount to compressing the user feedback. As the number of corrections grows, compression becomes increasingly interesting:

When user feedback gets a bit contradictory (like “in case A, do B, but in (the similar) case C, do D”), prefix tuning can help a lot, by basically finding the best theoretical prompt that allows this.
It helps limit the context size.

I think the fact that both prefix tuning and prompt engineering can only bias attention scores without changing their order[3] would even make it possible to use intermediate activations in the distillation’s loss function.

Like, using a kind of regularization term on the MSE of each attention layer's activations.

But prefix-tuning won’t be sufficient for all use cases

True. At some point, the distance between the base model and the feedback-enhanced one will likely become too large for prefix-tuning to handle.

But if models are made a little more able to learn, maybe users will bother to teach them.

The learning doesn’t need to be perfect, just good enough for AI labs to gather feedback and training data on domains that are at the frontier of what LLMs can do. Then, pretraining can do the job.

That’s all I had to say. Writing this blog post took longer than it would have taken me to test this idea with Gemma 270M on my Mac, but I think it was more fun.

Conclusions, TL;DR

Current LLM memory is broken: users have no incentive to give feedback, because they don’t want to waste time teaching a student that won’t learn.

I propose distilling user corrections into learned prefix tokens. More expressive than prompt engineering, more sample-efficient than LoRA, and deployable without per-user model serving.

This reverses the negative feedback loop: if the LLM visibly learns, users will actually bother to teach it.

Annex: expressivity and inductive bias vs. sample efficiency.

I trained a small 3-layer NN to model f(x) = sin(10x) between -1.5 and 1.5. Then, I tried to fine-tune it to model 2 different functions, using different fine-tuning methods. Most are self-explanatory. Here they are, in detail:

Full Retrain: None of the weights/biases were frozen.
Bias tuning: Weights frozen, biases are trained.
Input bias: main model is untouched, a learnable bias is added to the input.
LoRA L2 (rank 1). The second layer (dim=32) is fine-tuned using a rank-1 LoRA (so 64 params). Biases are frozen.

What’s next on this blog?

I’m currently working on something much more financially interesting: trying to predict the performance of companies based on the full text of their financial reports and websites. And pretty much all the text you can find about them.

I’m probably going to write a couple of blog posts on the matter in the near future, because I’ve made a few interesting advances in long context modeling for encoder transformers that I’d be happy to share.

Also, I’ve found a way to make encoder models much, much better at numeracy and tabular data reading without having to retrain them, which is necessary to accurately encode number-heavy financial reports.

Finally, I have to try a new sentence embedding training objective that I’ve been thinking about.

[1] Not a proper use of the concept of addition. Couldn’t think of anything better though.

[2] Actually, it carries -log(teacher_probability_of_actual_token) more bits of information per token. (Assuming that the absolute goal of the student model is to learn its teacher’s transfer function)

[3] Indeed, prefix tokens can only steal attention from other tokens, without changing the ordering of attention scores among non-prefix tokens. At least in the first layer.

Generative AI is not the new Internet

Eloi de Reynal — Tue, 15 Jul 2025 21:13:05 GMT

“AI might be in a bubble, but so was the internet. It didn’t stop it from becoming the most transformative technology of the 21st century.”

So people say. And hearing this over and over again makes me want to punch some faces. I have even started downvoting stuff on Reddit, something I had never done before.

I would happily break the legs of anyone showing the original graph on their blog.

The “hype cycle”, as it is called, is contaminated by survivorship bias. We tend to forget that the nominal trajectory after the “trough of disillusionment” is “the abyss of ridicule” and not “the slope of enlightenment”.

The internet is one of the few technologies that have had an impact as big as their hype, despite the bumps in the road. We are oblivious to the Segway, the IoT, cold fusion and space travel, which all greatly underdelivered, either out of technological infeasibility or poor product/market fit (no one wants a “connected dishwasher”).

Even though I am no Geoffrey Hinton, I think I have a decent enough level in Machine Learning and NLP to know what I’m talking about. This post isn’t another one of those “a machine cannot think, it just outputs an answer in its database” or “what you call AI is just an unintelligent program that predicts the next token. It’s a stochastic (whatever it means, I heard about it on instagram) parrot 😁” pieces.

Here are 5 major differences between the AI and dot-com bubbles.

1. The internet’s development was problem-driven. AI is a nice solution looking for a problem.

In the 1960s, the American government was concerned that if the Soviets destroyed a central command-and-control hub, the entire US communications network would collapse. They thought it could be a good idea to build a decentralized (and thus more resilient) network. Universities loved the idea, too.

The minimum viable product, called the ARPANET, basically did what the current version of the internet does: it let you send bytes from one computer to another, through a decentralized network. You couldn’t yet send dick pics to your hot coworkers (for some reason it was not considered a priority at the time), but you sure could share the source code of a program or “email” people in the network.

It solved the problem of long range communication at the byte level.

Of course, there have been a lot of improvements made to the original protocol (TCP/IP, WWW and so on), but the need for a common protocol to send bytes from one computer to another over a decentralized network was clear from the start. And the internet delivered.

Generative AI, on the other hand, was created by people who wanted to make something intelligent, attained some success and thought “well, what can we do about it”? It turns out GPTs were useful as chatbots, so they went for chatbots.

In a sense, it’s similar to the 1997+ part of the internet bubble, where most companies were like “We have to do something with that ‘internet’ thing. Any idea?”. But the development of the underlying technology went through a completely different process.

2. The adoption of the internet was slow because you had to sell a few kidneys to buy a computer. AI today is already dead cheap.

If you wanted to get connected to the internet in the 1990s, you had to buy an internet-able computer (the equivalent of $3000 today), and then purchase a $100 (in today’s dollars) monthly subscription. So, it was on the order of a month’s salary.

No wonder it took time to take off. First you had to hear about it from your nerdy friend, and then you had to convince your wife that getting a computer with internet access was more important than getting your septic tank drained and your garage door fixed.

In the late 1990s, only about a third of the developed world population had internet access. There was plenty of room to grow, it allowed for high expectations, and it was a good excuse for its limited economic impact.

Today, there is no way you can spend a month’s salary on generative AI without deliberately trying to do so. And nearly every working-age person has directly or indirectly used a cutting-edge AI chatbot. So the reason why AI doesn’t have a significant economic impact is not that adoption is incomplete. It’s that the technology is not yet able to. But more on that in the 5th point.

3. The internet has a positive network effect. AI has a negative one.

Cool if you sell books online or if you have a brand new ‘@hotmail.com’ address. But if no one browses the web or checks their emails, you are just a clanging cymbal that no one gives a shit about.

The internet had (and still has) a huge positive network effect, meaning that its usefulness grows with the number of users.

No such thing for AI yet. Quite the opposite in fact.

First of all, AI-generated slop tends to contaminate datasets, and it causes model collapse. On a more technical note, I am not exactly sure why it does: I once trained pre-trained CNNs on their own outputs[1], basically trying to make them more sure about their guesses, and it didn’t cause them to go astray (it was just for fun, btw). But I guess things are different for massive Transformers.

Anyway, the more AI slop on the internet, the worse the quality of the training datasets.

Second, and even worse, more people using AI brings down the value of AI. Generative AI is mainly used for creative tasks[2], where users are competing for other people’s attention. And the more said people use AI or get in contact with AI-generated stuff, the less impactful it becomes. Cool images or marketing clips lose value when everyone can generate them. Same thing goes for “personalized emails”.

Whenever I get an email from someone I don’t know who says “I’ve checked your {blog post or github repo} and found it fascinating! We are also into {vaguely related stuff}, so please join our {AI product} waiting list.”, I consider it spam by default. So does everyone. Before the advent of generative AI, I would have thought “woah, someone actually checked this repo! I’m not used to this much appreciation, should I answer by sending them a dick pic?”

If a single person had had access to GPT-4 in the 2010s, he would have made millions of dollars from it, because no one was able to spot fishy AI-generated slop at the time. Now, it’s become a sixth sense to almost everyone.

The positive network effect could justify the exponential growth of internet companies’ market valuations.

AI companies benefit from no such network effect.[3]

4. The scaling laws of the internet are linear. The scaling laws of AI are worse than logarithmic.

If you double the number of cables connecting 2 countries, you double their connection speed. If you double the number of hard drives in a server, you double its storage capacity.

Double an LLM’s training compute and you get a barely noticeable difference. But you burnt twice as much money. AI scaling laws are worse than logarithmic.[4]

GPT-4.5 was trained using 10x more compute than GPT-4, but the difference with its predecessor is marginal (and arguably in the wrong direction). The difference is nowhere near that between GPT-4 and GPT-3, despite a similar training compute factor. In all mathematical rigor, if the performance gap had been the same, the scaling laws would have been called logarithmic, which is already quite bad. But they are far worse.

When scaling laws are nice enough, some big companies or governments have a strong incentive to invest in the technology to develop it for their own use. That’s why some companies have supercomputers. That’s why Microsoft employees already had connection speeds acceptable by today’s standards, as far back as the 1990s. Then, as they work hard to make the technology less expensive, it becomes affordable to the general public. This is how we went from Enigma to the iPhone and from super expensive internet broadband to 5G.

But there is no such thing with AI. The best AIs currently being tested by Google, Anthropic and OpenAI are marginally better than those available to the general public. There is no “Spend 10x more to get a 10x better product” path that big companies can pursue.

I wonder if, in normal economic conditions, Google et al. would have any financial incentive to train big models when 90% of the maximum economic value can be obtained from 0.1% of the maximum compute. I don’t think so, though I could be wrong.

5. The internet needed few technological breakthroughs to become what it is today. AI needs major ones to take off.

I will write a full technical blog post on that matter, but for now, let’s just state the facts.

The internet has evolved quite a bit from ARPANET. But there were few technological breakthroughs needed for this evolution: TCP/IP and WWW protocols, HTML, CSS and JavaScript and fiber optics were about all that was needed. And at any point in the history of the internet, engineers and scientists knew the direction of the next step. “Problem: Disconnected networks can't talk to each other. => Solution: TCP/IP, a universal translator for data; Problem: Web pages are boring. => Solution: JavaScript, to make them come alive.”

Current LLMs have a (real but) limited economic impact. We are promised superintelligence. The thing is, no one I know about has the slightest fucking idea how to get there.

Benchmarks are getting maxed out by new reasoning models every other day, yet real world usefulness seems to be plateauing. Although I am an LLM power user, I don’t think I would lose much productivity if you forced me to use GPT-4 Turbo instead of the latest models.

Despite enormous investments and efforts, no one has been able to use LLMs for anything other than chatbots and IDEs. Current LLMs need constant guidance from humans to work.

One thing that I’ve understood only recently is that most economic value comes from navigating the messiness of the world. Very few people are paid to work a fully documented and streamlined job.

You may think that accountants just line up numbers in spreadsheets, but they constantly make important and implicit decisions about where to put those numbers. Few of these micro-decisions can be found on the internet and thus in the training data of LLMs.
Despite code being one of the most abundant data forms on the internet, I find myself not using AI too much when coding. The interesting thing is that I’m not even able to single-out cases where AI fails. There are just too many low-probability failure modes to account for.
I’m not even talking about “hardware” engineers who are closer to the material messiness of the world. You will have a hard time finding online doc on the “Shit, I need to redesign this part because our historical supplier went bankrupt and the new one can’t machine Al 2024 alloy to the required tolerances. Should I figure out if we can use 7075 instead or redesign the part altogether?” problem.

Despite acing math and code benchmarks, LLMs have made little progress in that “messiness handling” skill. That’s why you can’t trust ChatGPT’s Operator to fill your shopping cart or Claude 3.7 to run a small shop (the latter post is genuinely funny, Anthropic engineers have a lot of humor).

For some reason, Anthropic has an edge in this domain, but I’ve seen no progress since Sonnet 3.5.

Fine if Elon Musk (whom I like, out of pure provocation) calls Grok 4 a Ph.D.-level AI because it never fails math tests and trick questions. Fine if it can solve 5th order PDE and output the result in Alexandrine verses.

But no one is paid for that.

It is only marginally closer to being autonomous than GPT 3.5.

The current path of AI development will not bring economically meaningful superintelligence in the foreseeable future. I’m not saying superintelligence will never happen, just that it’s unlikely to happen by scaling current approaches. We need a few breakthroughs, and as far as I know, no one knows what they will consist of.

Conclusion

In 1969, if you had told the average American “you will never live to see human settlements on the Moon nor humans on Mars”, he would have answered “what? am I going to get cancer or something? Are you saying I have less than 10 years left to live?”. The possibility of space exploration being at its apex was unimaginable. But it was.

In 1999, if you had told him “the internet is not going to be a big deal”, he would have called you a fool. Rightly so.

Today, no one knows where AI is going, but there seem to be hard technical problems to solve. The forward trajectory looks more like that of space exploration and cold fusion than that of the internet.

I don’t remember the exact code, but the loss function was probably something like loss(logits) = -log(max(logits))

And code.

I’ve heard the argument that, as more people interact with chatbots, conversation data becomes more abundant and allows AI companies to train better models. So this would amount to a network effect. But I am not sure about it, because unlabelled conversation data is notoriously difficult to work with. So much so that most AI companies offer their models for free on https://lmarena.ai just to collect a bit of human feedback, because the poorly labelled conversation data they get there is still more valuable than the formidable amount of raw data they have at home.

It all depends on the spectral decay of the “perfect” LLM’s transfer function. Check this post if you’re interested.

Hacking Spectral Bias: Using Carrier Functions to Increase Parameter Efficiency

Eloi de Reynal — Thu, 03 Jul 2025 21:17:37 GMT

When writing about the Stateful Transformer, I came across some nice ML concepts and discovered a few things worth writing about. This post is quite technical, though I refrained from using mathematical expressions when the concept behind them was explainable with words.

The core idea is that adding a well-chosen function to the target (and subtracting it at test time) can help escape local minima in the loss landscape.

A nice French forest in the Lyon area, where I wrote the beginning of this piece.

Learning curve and learning task

This may seem a bit naive to some of you, but I was wondering if the minimum attainable loss vs. number of layers (with a fixed width) curve had a definite shape, for example if adding a transformer layer would yield a predictable loss improvement.

It turns out it doesn’t. In fact, the shape of the loss vs. number of layers depends entirely on the task to learn, and especially on the spectral decay of the function to approximate.

Let’s check that.

Random points

I tried to fit a GELU MLP1 to a function defined by random points2.

First of all, the random points are not only random on y, but also on x, giving a slowly decaying frequency spectrum (as shown in the next figure) to the function to approximate.

Indeed, if the points were evenly spaced, there would be an obvious frequency peak in the vicinity of {number of points per unit} cycles per unit. But, as the points are unevenly spaced, the frequencies are spread, roughly following a 1/(distribution of distance between two neighboring points drawn from a uniform distribution) law. Which seems quite difficult to compute.

Anyway.

What we can see is that adding layers initially decreases the loss, resulting in early quick wins, until the points to fit are too hard to reach. Then, the loss plateaus. It could go down to 0 if we had a good enough network. But such a network seems impossible to get just by increasing the number of layers: we run into training instabilities due to excessive depth before being able to model the hyper-high frequencies of the tail of the spectrum.

Left: the function to approximate. Right: its frequency spectrum.

Variable frequencies

The architecture is the same, but now the function to approximate is cos(15x³). The frequency of this function obviously grows with |x|.
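To make "the frequency grows with |x|" concrete, here is a minimal sketch that computes the local (instantaneous) frequency of cos(15x³) from the derivative of its phase. The helper name `local_frequency` and the sample points are illustrative choices, not something taken from the experiments.

```python
import numpy as np

def local_frequency(x):
    """Local frequency (cycles per unit of x) of cos(15 * x**3).

    The phase is phi(x) = 15 * x**3, so the local frequency is
    |phi'(x)| / (2 * pi) = 45 * x**2 / (2 * pi).
    """
    return 45.0 * x**2 / (2.0 * np.pi)

# Illustrative sample points (an assumption, chosen only for display)
x = np.array([0.0, 0.25, 0.5, 1.0])
freqs = local_frequency(x)

for xi, fi in zip(x, freqs):
    print(f"x = {xi:4.2f} -> about {fi:5.2f} cycles per unit")
```

The frequency is zero at the center and grows quadratically towards the edges of the [-1, 1] domain, which is why the outer regions are the hard ones to fit.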

Strong spectral bias observed.

Two things here:

The loss is decreasing exponentially (it’s about halved for each added layer until layer 5 and it’s divided by 4 by the last layer).
You can see a near-perfect illustration of the spectral bias: NNs have a strong tendency to first fit low frequencies.

Now, the first point basically proves that adding a layer doesn’t have a definite impact on loss, and that it depends on the complexity of the function to model: the loss curve here is dramatically different from that of the random points approximation problem.

The second point is more interesting. Let me digress a bit more about spectral bias first.

The spectral bias of NN has been extensively studied, and the math behind it is sound. But I wonder if real-life spectral bias also has something to do with the fact that, in some cases, the local sample rate of the dataset is not tuned to match the local frequency of the functions to approximate. Let’s take an example.

Let’s say we want to train a CNN to identify animal species based on their pictures. The function to approximate has high frequency regions (felid and bird species sometimes differ only by subtle features) and low-frequency ones (equids are very easy to tell apart: Striped => Zebra. Gray and long ears => Donkey. All the rest => Horse).

If we don’t oversample the high-frequency regions (providing a lot of pictures for each bird and felid species) in the dataset, we are likely to increase the spectral bias of our NN, as a somewhat low-frequency function will do the job of approximating the dataset. Mainly because the total loss on one epoch won’t be frequency-weighted.

This effect is quite obvious so I think it is often taken into account when designing datasets. In financial data modeling, it’s not always the case, though.

Here, the sample rate is constant, and independent of the local frequency. That explains part of the spectral bias.

In the following training runs, the initially evenly spaced datapoints have been spread out by cubic-rooting them, which allows for a sample rate proportional to the signal frequency3.
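A quick way to check that cube-rooting evenly spaced points gives a sample density proportional to x² (and hence to the local frequency of a cos(c·x³) target) is to histogram the result. This is only a sketch: the point count, the bin layout and the choice to skip the region closest to zero (where wide bins distort the ratio) are my assumptions.

```python
import numpy as np

N = 100_000
u = np.linspace(-1, 1, N)   # evenly spaced points
x = np.cbrt(u)              # spread out by cube-rooting them

# Empirical sample counts on [0.2, 1.0], in bins of width 0.1
counts, edges = np.histogram(x, bins=8, range=(0.2, 1.0))
centers = 0.5 * (edges[:-1] + edges[1:])

# If the density is proportional to x**2, counts / centers**2 is
# roughly the same in every bin. Normalize it so that 1.0 means "average".
ratio = counts / centers**2
ratio = ratio / ratio.mean()

print(np.round(ratio, 3))
```

All normalized ratios come out close to 1, which is what "sample rate proportional to the signal frequency" means here.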

Despite the fact that the sampling rate is proportional to the signal frequency, we still observe a strong spectral bias.

There is still a bit of spectral bias. Also, despite more datapoints being in the “difficult region”, the NN converges quicker.

Enter the edge bias

Next I wanted to see how a GELU network would approximate a single-frequency function (excluding border effects). Would the loss vs. parameter count curve brutally go from one to zero as soon as we reach a certain model size?

I discovered something quite interesting: when you approximate a single frequency function4, the extremities of the input distribution are approximated first. Illustration below.

When modeling a pure frequency function, NNs start approximating the function near the extremity of the input domain.

I may be anticipating a bit on the rest, but it all comes down to this: “The frequency of a function composition depends both on the outer function’s frequency and the inner function’s frequency AND amplitude.” It’s pretty obvious if you take f(x) = sin(x) as the outer function and g(x) = a*x as the inner function.
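The sin(a*x) example can be checked numerically by counting zero crossings over a fixed window: multiplying the inner function's amplitude by 2 doubles the number of oscillations of the composition. A minimal sketch, with the window [0.1, 0.1 + 2π] and the helper `count_sign_changes` being my own choices:

```python
import numpy as np

def count_sign_changes(y):
    """Number of times the sampled signal y changes sign."""
    s = np.sign(y)
    s = s[s != 0]                      # ignore exact zeros on the grid
    return int(np.sum(s[1:] != s[:-1]))

# One full period of sin(x), slightly shifted so that no zero sits on a boundary
x = np.linspace(0.1, 0.1 + 2 * np.pi, 100_001)

# Outer function: sin. Inner function: g(x) = a * x, with "amplitude" a.
crossings = {a: count_sign_changes(np.sin(a * x)) for a in (1, 2, 4, 8)}
print(crossings)   # {1: 2, 2: 4, 4: 8, 8: 16}
```

The outer function never changes, yet the composition oscillates a times faster: the amplitude of the inner function is converted into frequency.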

This amplitude/frequency coupling is one of the reasons why the Fourier transform of a composition of functions is intractable.

NNs are a composition of as many functions as they have layers, so this coupling happens a lot.

Let’s try to see why it implies that they are better able to approximate higher frequencies at the border of the input distribution.

The following is not a super rigorous mathematical proof, but I think it gives a good enough intuition of the phenomenon.

Let’s take an underparametrized MLP, trying to approximate the above constant-frequency function.

As it is underparametrized, each layer will underfit its “ideal” transfer function, that is, the transfer function that would allow the whole MLP to approximate the cosine function.

The maximum representable frequency of a NN is basically proportional to the number of “kinks” in the curve (assuming ReLU-like activations), which itself grows exponentially with the number of layers, and polynomially with their widths.
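For a 1D input, the number of slope changes of a ReLU network can be estimated by walking along a dense grid and counting how often the on/off pattern of the hidden units changes. The sketch below does that for randomly initialized networks; the initialization scheme, width and grid are assumptions, and a random network sits far below the theoretical maximum, so this shows how to measure the quantity and does not demonstrate the exponential bound.

```python
import numpy as np

def random_mlp(depth, width, rng):
    """Hidden layers of a 1D-input ReLU MLP with random weights (illustrative init)."""
    dims = [1] + [width] * depth
    weights = [rng.normal(0.0, np.sqrt(2.0 / dims[i]), (dims[i], dims[i + 1]))
               for i in range(depth)]
    biases = [rng.normal(0.0, 0.5, dims[i + 1]) for i in range(depth)]
    return weights, biases

def count_kinks(weights, biases, x):
    """Count activation-pattern changes along the 1D grid x (a proxy for slope changes)."""
    h = x.reshape(-1, 1)
    patterns = []
    for W, b in zip(weights, biases):
        pre = h @ W + b
        patterns.append(pre > 0)        # which units are active at each grid point
        h = np.maximum(pre, 0.0)        # ReLU
    pattern = np.concatenate(patterns, axis=1)
    changed = np.any(pattern[1:] != pattern[:-1], axis=1)
    return int(changed.sum())

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20_001)
width = 8
kinks = {}
for depth in (1, 2, 3):
    w, b = random_mlp(depth, width, rng)
    kinks[depth] = count_kinks(w, b, x)
print(kinks)
```

With a single hidden layer the count can never exceed the width, since each unit switches at most once along a line; deeper networks can reuse and fold the earlier breakpoints, which is where the faster growth comes from.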

So, for a given trained NN, the maximum representable frequency grows with depth, meaning that if for example we model the frequency of the norm of each hidden state, it likely grows with the depth of our NN.

This implies that the lower layers of the NN, taken together, have a lower frequency function to approximate than the whole NN.

The problem is, as we’re dealing with an underparametrized network, the lower layers have trouble fitting even this low-frequency function. As they are subject to the “normal” spectral bias, they will first try to approximate its lower frequencies.

Assuming our batch size is equal to our number of input points, the backward pass will basically say, for each point: “What is the best nth-order function that models what’s on our left and what’s on our right”. With n a number determined by the width and number of layers at that point. Near the extremities of the input domain, the optimization is less constrained: there is one direction5 in which you just have a few points to approximate, and you don’t care what’s beyond these points.

So the lower layers have an easier time approximating their target function near the extremities. Unless the target function happens to be very much biased toward having a lower norm at the extremities of the input domain, that leaves the approximated function with a somewhat higher derivative/amplitude near the extremities.

You may say that the optimization process does not happen layer by layer and that there are no intermediate target sub-functions to approximate, and you’d be right, but my point still holds: the lower layers still tend to yield functions with a higher amplitude at the frontier of the input domain6.

Now, given that, in a composition of two functions, a high inner function (ie lower layer) amplitude leads to a higher frequency of the composition, we can infer that the upper layers will have an easier time modeling higher frequencies where their input (ie the output of the lower layers) has a large amplitude. Then, the learning continues inwards to the center of the distribution, but slowly: it’s difficult for the next layer to model a complex function where the amplitude is small. The representable input domain thus slowly grows inwards7.

The next layers then inherit a high-frequency and high-amplitude input at the edge of the input distribution and a slightly greater useful input domain. They increase this edge bias even further.

It indeed seems that the edge spectral bias grows with NN depth.

If you have as much trouble understanding it as I’ve had, let me sum up this explanation:

The optimization process generally has more degrees of freedom at the extremity of the training input domain, and so for every layer.
It results in higher amplitude at the edges.
Which translates to higher representable frequencies at the edges, because of the aforementioned amplitude/frequency coupling in a composition of functions.

The edge stuff is just the primer of the phenomenon, but the real driver is the amplitude/frequency coupling.

If, for example, I force a low-frequency, high-amplitude, center-heavy component into our function, we find that the center gets fit first8:

Now, I find this phenomenon (edge bias + ampl/freq coupling) explains some surprising things in ML. Let’s dig into it.

Edge bias and dimensionality

Now, any seasoned ML scientist would rightly think “So, you’re telling me about edges and you’re using a 1D example to illustrate it? That’s borderline dishonest. Everyone knows that things totally change with dimensionality. Have you heard about the shell effect?”.

Indeed, uniformly distributed high-dimensional vectors are almost always near the edges, due to the “shell effect”. So, maybe this edge bias doesn’t matter much in practical ML applications?

Let’s check that.

Here is a visualization of 1D and 3D edge biases.

Here, I plotted the Mean Squared Error vs. the distance to the center. Function to approximate is y = cos(x**1.5 * 15). We can see that, with 2 hidden layers, the extremities and the center get fit first. As the edge bias grows with depth, the 3-layer network fits the extremities first and doesn’t care about the center. In red, you can see the number of samples for each 0.1 norm interval. It is not constant because I wanted to make sure the sample rate was proportional to the frequency. The most important thing here is that failing to fit the center leaves a lot of points behind and greatly hurts the MSE on the whole dataset.

Here, the function to approximate is y = cos(x[0] * 7) + cos(x[1] * 7) + cos(x[2] * 7). Still some kind of edge bias: the edges are better approximated than the center. This effect grows with NN depth. Still, it doesn’t matter much, as the center accounts for a negligible portion of the training samples and of the total loss. So there still is an edge bias, but it has less impact.
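Curves like these can be produced by binning the squared error by distance to the center. Here is a minimal sketch of such a helper; the name `mse_by_radius`, its signature and the tiny hand-made dataset are hypothetical and only there to show the mechanics.

```python
import numpy as np

def mse_by_radius(x, y_true, y_pred, bin_width=0.1, norm_ord=1):
    """Mean squared error and sample count per interval of the input norm."""
    r = np.linalg.norm(x, ord=norm_ord, axis=1)           # distance to the center
    sq_err = ((y_true - y_pred) ** 2).reshape(len(x), -1).mean(axis=1)
    idx = np.floor(r / bin_width).astype(int)             # bin index of each sample
    counts = np.bincount(idx)
    sums = np.bincount(idx, weights=sq_err)
    with np.errstate(invalid="ignore", divide="ignore"):
        mse = sums / counts                                # NaN where a bin is empty
    left_edges = np.arange(len(counts)) * bin_width
    return left_edges, mse, counts

# Tiny hand-made example (not real model outputs)
x = np.array([[0.05], [0.15], [0.17]])
y_true = np.zeros((3, 1))
y_pred = np.array([[1.0], [2.0], [0.0]])
edges, mse, counts = mse_by_radius(x, y_true, y_pred)
print(edges, mse, counts)
```

Plotting `mse` against `edges`, with `counts` on a secondary axis, gives the error profile together with the number of samples that each norm interval contributes to the total loss.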

By the way, the definition of “edge” depends on the norm you use. And what’s funny is that the appropriate norm depends on the interaction between the input components. If you expect no interaction at all (ie an additively-separable function), you should use the L1 norm (that’s what I did, as the 3D function to approximate is additively separable). If you expect moderate interaction, you should use an L{moderate order} norm9.

Despite the shell effect, the edge bias might in fact be a thing in high dimension:

Most real-world functions feature no input interaction or low-order ones. For example, I have heard that genomic studies rarely find interaction between genes. Meaning that if 1000 different genes code for IQ, it is not absurd to model the impact of each gene independently from the others.10 In financial modeling, which I’m more familiar with, input feature interactions are of a low order, provided you’ve already curated the data and pre-computed useful ratios.
So, in theory, the norms used to define real-world edges should be low-order ones.
Most real world data distributions are center-heavy (eg. normal). So the input vectors comprise a certain number of normally distributed independent input features. And if the features are not independent, then we just have to PCA the input space to make them more so.
Using 1. and 2, in the case where we have normally distributed input features and where the L1 norm is the most appropriate (low order interaction), the whole input space shows no shell effect at all.
Quite the opposite in fact: the majority of the input samples will be very closely distributed around the center, as their norm will be a sum of the absolute value of normally distributed features. I guess it results in a normal(-ish?11) shape for the nb of samples vs. norm curve, surely with a very low sd.
So, no shell effect at all.
If we used a very high order norm, L∞ for example, we would get a very strong shell effect, even with normally distributed input features, as the odds of a sample not being at the edge would decrease exponentially as the input dim increases.
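The shape of the "number of samples vs. norm" curve is easy to inspect empirically. The sketch below draws independent standard normal features and summarizes the L1 and L∞ norms for a few input dimensions; the dimensions, the sample count and the summary statistics are my choices, and the code makes no claim about which norm is the right one for a given problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
stats = {}

for dim in (3, 30, 300):
    x = rng.normal(size=(n_samples, dim))      # independent N(0, 1) features
    l1 = np.abs(x).sum(axis=1)                 # L1 norm of each sample
    linf = np.abs(x).max(axis=1)               # L-infinity norm of each sample
    stats[dim] = {
        "l1_mean_per_dim": l1.mean() / dim,    # expected: sqrt(2 / pi), about 0.798
        "l1_rel_spread": l1.std() / l1.mean(),
        "linf_mean": linf.mean(),
        "linf_rel_spread": linf.std() / linf.mean(),
    }

for dim, s in stats.items():
    print(dim, {k: round(v, 3) for k, v in s.items()})
```

Passing `l1` or `linf` to `np.histogram` gives the full curve rather than just its mean and spread.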

So, the edge bias might still be a thing in high dimension. Please tell me if you have come across something resembling it in your real world ML experience. I have not tested it.

Could be nice to do, but for now, I have a more interesting and practical thing to show.

Amplitude / Frequency coupling consequences: what if we added a function during training and subtracted it for inference?

The idea I want to explore here is that, as high frequencies are easier to model when the amplitude is locally high, what if we add a function whose local amplitude (meaning, its derivative) is high when the target function’s local frequency is high?

Here is an example.

Function to approximate:12

So, the frequency grows linearly with x’s norm.

What if we add this:

To get:

Or, if you prefer code, here is our function to approximate:

y = np.cos(np.pi * np.abs(x)**2 * 7) + .4 * np.abs(np.linalg.norm(x, ord=1, axis=1, keepdims=True))**2

The first term is our target function. The second one is useful only in that its derivative is proportional to our target frequency. The .4 factor was just empirically found to be quite good. It’s a bit reminiscent of a “carrier wave” in radio signals.
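Here is a 1D sketch of the add-then-subtract mechanics, assuming a scalar input (the line above handles multi-dimensional x through the L1 norm). The `stand_in_model` below is not a trained network: it is a placeholder that returns exactly what a perfectly trained network would output, so that the subtraction step can be shown end to end.

```python
import numpy as np

def target(x):
    """Function we actually care about: its local frequency is 7 * |x| cycles per unit."""
    return np.cos(np.pi * 7 * x**2)

def carrier(x, scale=0.4):
    """Carrier term: its derivative, 2 * scale * x, is proportional to the target frequency."""
    return scale * x**2

x = np.linspace(-1, 1, 501)

# Training time: the network is fit to target + carrier
y_train = target(x) + carrier(x)

# Placeholder for a trained network (assumption: a perfect fit of y_train)
def stand_in_model(x):
    return target(x) + carrier(x)

# Test time: subtract the carrier to recover a prediction of the target alone
y_pred = stand_in_model(x) - carrier(x)

# Sanity check of the design: carrier slope / target frequency is constant for x > 0
pos = x[x > 0]
slope_over_freq = (2 * 0.4 * pos) / (7 * pos)
print(np.max(np.abs(y_pred - target(x))), slope_over_freq[:3])
```

The carrier adds no information about the target; its only role is to make the function's slope large where the oscillations are fast.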

If we train the exact same model13 to approximate the target alone or the full function, here are the results:

So, a 1-layer network has a hard time approximating the full function, as it’s a bit more complex than the mere target. Things get easier for the 2-layer NN and the 3-layer one really takes advantage of the improved digestibility of the full function. So, amplitude/frequency coupling really is a thing, at least in this experimental setting.

Interestingly enough, this effect doesn’t hold with even deeper networks. The difference between ‘target alone’ and ‘full function’ losses becomes insignificant past a depth of 4 hidden layers14.

I don’t exactly know where it comes from, but I suspect 3 layers is the sweet spot between “The additional complexity of the full function makes it harder to model by a seriously underparametrized NN” and “There are enough layers for the lower ones to figure out they should have higher gradients where the expected frequency is high”.

What about carrier function learning?

When we look at the number of parameters of the models and the number of linear regions theoretically attainable by their transfer functions, we can see that they are largely sufficient for approximating our target with negligible error. Backpropagation is just doing a poor job of optimizing them.

The problem comes from the layered structure of Deep Neural Nets.
On the one hand it’s very effective: It enables complex, non-linear functions at a low computational cost. This efficiency comes from two key factors: the chain rule and the way complexity (the number of linear regions) grows exponentially with the number of layers.
So number of linear regions grows exponentially with depth, while training cost only grows linearly. Quite cool.
But on the other hand, it constrains the loss landscape, by making the tuning of each parameter dependent on the others. This results in a loss landscape with an effective dimension far lower than the number of parameters. This, in turn, makes finding local minima much more likely15.

Adding a carrier function seems to mitigate that, but we are actually cheating here, because we already know where the target function has a high frequency.

So, in real world applications, it would have to be dynamic (if not learnable).

Maybe we could imagine a kind of regularization layer that adds a carrier function to the input of the next layer. This function should not be learnt through backpropagation, but rather through some computation of a proxy for the frequency of the output of the next layer.

But I will write about this later. This blog post is already long enough.

Conclusion / Some key takeaways

The shape of the loss vs. nb of layers curve of a fixed-width MLP depends on the spectral decay of the target function. Pretty obvious when you think about it.
Real world spectral bias might be partly caused by a failure to match the local sampling rate of the input to the local frequency of the target function. I have no evidence for that though.
The observed frequency/amplitude coupling in Deep Neural Nets causes a few interesting phenomena, including:
1. A kind of edge bias, where NNs tend to better learn data at the edge of the input domain. This edge bias might keep being a thing even with high-dimensional inputs, depending on the order of the interaction between input components and the distribution of the said input components.
2. The fact that you can increase the parameter efficiency of a model just by adding a kind of “carrier function” to the targets (and subtracting it at test time).
It could be interesting to design a kind of regularization layer based on the carrier function idea (point 2 just above).

Researching this was nice. I’ve only scratched the surface and probably have reinvented the wheel a lot, but I’ve had fun doing it.

Notes:

Input dim = 1, Output dim = 1, 1 to 6 hidden layers of dim 32 with GELU activation functions, trained for 10,000 epochs on the 50 points with Adam(lr=1e-3, weight_decay=0). Took the loss of the best of 5 runs for each depth.

Code for the random points:
N = 50
x = np.random.uniform(-1, 1, (N, 1))
y = np.random.normal(0, 1, (N, 1))

N = 500
x = np.linspace(-1, 1, N).reshape(-1, 1)
x = np.cbrt(x)
y = np.cos(np.pi * x**3 * 15)

N = 500
x = np.linspace(-1, 1, N).reshape(-1, 1)
y = np.cos(np.pi * x * 15)

We are still in the 1D case

The graph (and note below) illustrate this categorical assertion.

This plot illustrates the phenomenon: the first “kink” in the curve happens closer and closer to zero (the center of the input distribution) as we move deeper into the hidden layers.

N = 500
decay_rate = -4
x = np.linspace(-1, 1, N).reshape(-1, 1)
y = np.cos(np.pi * x * 15) + 2.71828**(decay_rate*abs(x))*2
(Yes, I could have used math.exp(1))

This is definitely not rigorous. What I’m saying is that the order of the norm used to define the edges should roughly reflect the order of the interaction between input variables. But even if your target function is separable or if the input variables interaction order exactly matches that of the norm you use (eg if you wanted to model the f(x, y, z) = sqrt(x**2 + y**2 + z**2) function), you have no guarantee that the actual transfer function of your trained NN involves interaction of this exact order.

In fact, upon closer study, it seems that this lack of observed interaction comes from a lack of data: you need to have a great deal of data points to figure out the interaction between multiple input variables, and there are only so many genomes you can sequence. So, not the best example I could give actually.

Not sure of the impact of the absolute value here. Not given it too much thought.

Looks like this in 2D:

input dim = 3,
hidden dim = 32,
LeakyReLU activation,
1 to 3 hidden layers,
output dim = 1,
1e6 data points, sampling rate proportional to frequency.

Oh, and the difference with < 3 layers really is significant, it’s not just a case of “try enough things and at some point you’ll get a significant result”. I’ve done a few training runs with different hidden dims and activations and the same effect can be observed. For example, the results are more impressive with a hidden dim of 64, and seem to remain significant for deeper networks.

Indeed, there are far fewer local minima in high-dimensional spaces: the odds of having a positive second derivative and a null derivative of the loss wrt every dimension become exponentially lower as the number of dimensions grows. GIVEN ONLY THAT the partial derivatives are linearly independent from one another, which is not the case because of the deep structure of NNs.
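The "exponentially lower odds" part of this note can be illustrated with random symmetric matrices standing in for the Hessian at a critical point. This is a strong simplification (real loss Hessians do not have independent Gaussian entries, which is the whole point of the caveat above), and the matrix sizes and trial count are arbitrary.

```python
import numpy as np

def fraction_positive_definite(dim, n_trials, rng):
    """Fraction of random symmetric dim x dim matrices whose eigenvalues are all positive."""
    a = rng.normal(size=(n_trials, dim, dim))
    sym = (a + np.swapaxes(a, 1, 2)) / 2.0     # symmetrize each matrix
    eig = np.linalg.eigvalsh(sym)              # eigenvalues, shape (n_trials, dim)
    return float(np.mean(np.all(eig > 0, axis=1)))

rng = np.random.default_rng(0)
fractions = {d: fraction_positive_definite(d, 20_000, rng) for d in (1, 2, 3, 4)}
print(fractions)
```

In one dimension about half of the draws have positive curvature; by four dimensions almost none are positive in every direction, so almost every critical point of this toy model is a saddle and not a minimum.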

A (partially) failed attempt at improving the Transformer architecture.

Eloi de Reynal — Wed, 26 Mar 2025 01:07:33 GMT

This post is a bit technical. I’ve tried to make it as simple as possible and to not hide a lack of intuitive understanding behind complex mathematical expressions. However, I believe it’s still fairly complex. By carefully reviewing this post, you'll gain a deeper understanding of key ML concepts. If you’re not familiar with the Transformer architecture, I strongly recommend 3b1b’s videos on the subject 1. I made this post for people who like to dive deep into engineering problems and their (absence of) solutions. Please read the footnotes.

The final hidden state of a Transformer's channel – that is, the embedding vector after the last decoder layer – theoretically contains n_embd * 32 (or 16, depending on the configuration) bits of information. This is because it's composed of n_embd components, each encoded as a 32-bit floating-point number (fp32).

This final hidden state is then transformed into a probability distribution across all possible tokens. We sample from this distribution, resulting in a single token chosen from a vocabulary of vocab_size possibilities. The amount of information contained in this individual token can be quantified as log2(vocab_size).

The Llama 3-8B model has an embedding size of 4096 and a vocabulary size of 128,256. This means the unquantized last hidden state contains 4096 * 32 = 131,072 bits of information. However, this gets compressed down to approximately log2(128256) ≈ 17 bits when predicting the next token.
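The arithmetic can be reproduced in a few lines. The variable names are mine; the numbers are the ones quoted above.

```python
import math

n_embd = 4096              # embedding size
vocab_size = 128_256       # vocabulary size
bits_per_component = 32    # fp32

hidden_state_bits = n_embd * bits_per_component   # raw capacity of the last hidden state
token_bits = math.log2(vocab_size)                # information carried by one sampled token

print(hidden_state_bits)                          # 131072
print(round(token_bits, 2))                       # 16.97
print(round(hidden_state_bits / token_bits))      # compression factor, in the thousands
```

Note that 131,072 bits is a storage capacity, an upper bound on what the hidden state could carry, while log2(vocab_size) is likewise an upper bound on what a single token can carry.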

That's a significant reduction in information, which seems especially important considering that Transformers, as autoregressive models, rely on the generated tokens up to step n to predict the next token at step n+1.

Proposed architecture improvement: enriching token embeddings with the last hidden state of their generation step.

This idea sounded great. I was already picturing myself getting the Turing Award, making the Times cover and receiving a lot of emails from Altman, Amodei & al begging me to honor them by joining their companies. The Stateful GPT would give me fame, money and girls.

It turns out this wasn't as great an idea as I initially thought, for several reasons. The most obvious reasons are not the most critical ones.

A respectable being that doesn’t care about enriching token embeddings

1st Challenge: this is basically an RNN, so what about training parallelization?

When I first posted my idea on Reddit to get some feedback from people smarter than me, they all said “you’re basically re-inventing Recurrent Neural Networks. The exact architecture that was made obsolete by Transformers. Not good. A Transformer’s training can be parallelized, your architecture’s cannot.” It turns out they’re partially right on that, but only partially.

First, I’d like to take a step back and explain why RNN training is difficult to parallelize and why it matters. It is only loosely related to the main subject of this post but I find it interesting enough.

Why the training of RNNs is difficult to parallelize

I won’t dive too deep into how RNNs work, and as most blog posts do a very poor job at explaining them2, I’ll link a prompt I gave Grok. It did great, you can trust the answer. The core concept here is that in an RNN, to predict token n, you first have to predict all tokens up to n-1. So training an RNN on a text sequence of length n involves at least n sequential steps, each dependent on the completion of the last one. This is the exact opposite of parallelized training.

By contrast, a Transformer can be fed an n-length sequence and be trained to predict each token k from 2 to n in parallel. If we take the training sequence “the cat sat on the mat”, we get (6-1 =) 5 examples.

1. "the" -> "cat"
2. "the cat" -> "sat"
3. "the cat sat" -> "on"
4. "the cat sat on" -> "the"
5. "the cat sat on the" -> "mat"

These predictions can be performed in parallel thanks to the attention mechanism’s ability to handle variable context sizes and process each position's prediction independently (given the shared input). This ability is slightly hampered by my proposed architecture update, but we’ll see that later.
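The five examples above, and the causal mask that lets a Transformer compute them in a single pass, can be sketched as follows. Whitespace splitting stands in for a real tokenizer, and `next_token_examples` is an illustrative helper.

```python
import numpy as np

def next_token_examples(tokens):
    """All (prefix, next token) pairs contained in one training sequence."""
    return [(tokens[:k], tokens[k]) for k in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()   # toy tokenization (assumption)
examples = next_token_examples(tokens)

for prefix, nxt in examples:
    print(" ".join(prefix), "->", nxt)

# Causal mask: row i is True only for positions 0..i, so position i can
# only attend to its own prefix. All rows are processed in the same pass.
n = len(tokens)
causal_mask = np.tril(np.ones((n, n), dtype=bool))
print(causal_mask.astype(int))
```

An RNN would have to produce these prefixes one after the other; here each row of the mask defines one prefix, and all of them are handled by the same matrix operations.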

Now, you can actually parallelize the training of RNNs by using a large batch size: you can backpropagate through multiple full-sequences in parallel. The first problem is that the memory overhead is substantial: for each sequence, you have to store the whole, sequential, computation graph3 to then backprop through it. The second one, even worse, is that each sequence will still be processed sequentially.

Transformers’ computation graph, on the other hand, is quite light thanks to the absence of recurrence. And the only sequential thing about them is the layer-by-layer processing.

All in all, it’s more accurate to state that the training of RNNs is not very -yet still somewhat- parallelizable.

Why it matters

Although RNNs like LSTMs or GRUs are competitive with Transformers in terms of end loss (log-likelihood) vs. training compute (in flops), they are impractical due to their inability to make full use of GPUs.

In fact, if money could buy clock frequency (instead of more GPUs), Transformers wouldn’t have been needed: an RNN would train just as well as (maybe even better than?) a Transformer on a single-core, 100THz CPU. But such a high frequency can’t be attained, for a bunch of reasons4. On the other hand, doubling compute power by doubling the number of transistors and logical cores is quite straightforward. Hence the need for a highly parallelized architecture.


Why the Stateful GPT’s training is still quite parallelizable

First, I’d like to give you a nice picture of Venice. It’s an incredible city and I’d like to visit it some day. I hitchhiked there once but I was in a hurry and had to go to Albania.

A lot of beautiful things happen when you’re looking for trouble. “Hey, let’s build a city on a big mudflat by the sea” => Most beautiful city in the world.

Back to the point.

Of course, the Stateful GPT behaves like a normal RNN during inference. As a consequence, its training should be difficult to parallelize.

Except if you accept a little discrepancy between what you train the model for and how you use it.

Here’s how the training goes, why it’s different from inference and why it doesn’t matter much.

Training:

1) Parallel Training Step: A standard, fully parallelized training step is performed on the Transformer. No information flows through the recurrent connection at this point. Cross-entropy loss (CE-loss) is computed, and normal backpropagation is performed. The last hidden states are stored.
2) Recurrent Training Step: The last hidden states stored in Step 1 are fed back into the model to enrich the token embeddings. Backpropagation is performed as usual, and new hidden states are stored.
3) Iterative Recurrence: Step 2 can be repeated multiple times, using the hidden states from the previous iteration, at the cost of additional training steps.

Here’s how training & inference differ:

Training: The recurrence depth is set to a fixed and finite number.
Inference: The effective recurrence depth is generated_sequence_length - 1. The first token is generated without recurrence. Token 2 uses Token 1's hidden state (1 degree of recurrence), Token 3 uses Token 2's hidden state (2 degrees of recurrence), and so on.

The discrepancy between training and inference is addressed by ensuring stability during training. If the CE loss remains controlled with increased recurrence depth during training, the model should be stable even with theoretically infinite recurrence depth during inference.

More formally, training ensures that CEloss(Tn), the CE loss measured at recurrence depth n, stays controlled as n increases.

Empirically again, I observed that a single recurrence step during training gives 95% of the gains I would get by using, say, 10 recurrence steps.

I even made a little test, by encoding the training step’s recurrence depth and giving this information to the recurrent layer, so as to track the “optimal” norm of the token enrichment term vs. the recurrence depth.

The norm indeed grew with depth. This suggests that the Stateful GPT incorporates progressively more information into the token prediction as recurrence depth increases. The effect was very small, though: a 1% increase in average norm going from depth = 1 to depth = 3.

Of course, the compute cost of each training epoch depends on the recurrence depth: if we set depth = 2, for example, we’ll have to make 3 forward/backward passes for each batch: 1 for the standard Transformer training + 1 for each recurrent step. Each epoch thus costs (recurrence_depth + 1) times as much as for a standard Transformer.
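To make the bookkeeping concrete, here is a minimal sketch of the pass schedule described above. It only counts passes and tracks which stored hidden states each pass reads; it does not train anything. The function names are hypothetical, not taken from the original code.

```python
def passes_per_batch(recurrence_depth: int) -> int:
    """One parallel pass plus one pass per recurrent step."""
    return 1 + recurrence_depth


def training_schedule(recurrence_depth: int):
    """List (pass_index, source_pass) pairs for one batch.

    source_pass is the pass whose stored last hidden states are fed back.
    None means the recurrent connection is off (the parallel step).
    """
    schedule = [(0, None)]
    for k in range(1, recurrence_depth + 1):
        schedule.append((k, k - 1))
    return schedule


def inference_depths(generated_sequence_length: int):
    """Effective recurrence depth of each generated token at inference."""
    return list(range(generated_sequence_length))


print(passes_per_batch(2))    # 3 forward/backward passes per batch
print(training_schedule(2))   # [(0, None), (1, 0), (2, 1)]
print(inference_depths(5))    # [0, 1, 2, 3, 4]
```

The last line shows the training/inference gap: training stops at a fixed depth, while at inference the depth keeps growing up to generated_sequence_length - 1.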

We’ll see if that’s a problem.

For now, let’s check the architecture of the Stateful GPT.

A deep dive into the Stateful GPT architecture

Weather forecasts for Verkhoyansk as of March 10, 12am UTC. Look how fast it warms up in this period: we’re near the max derivative of daylight hours (happening circa 3/21), and the thermal inertia of land is low.

Now it gets a bit technical.

The naive ways of enriching token embeddings with the last hidden state before their generation are either to:

1) Concatenate the standard token embedding and the last hidden state, and down-project / transform them back to a dim_embedding-sized vector.
2) Sum the standard token embedding and some transformation of the hidden state.

(1) First technique: quite simple. Wait, is this image related to the post?!

These two options are valid. The first one is more compute intensive than the second, as it deals with vectors of dimension 2*dim_embedding instead of just dim_embedding. On the other hand, it allows for a real interaction between the last hidden state and the token embedding. It’s not just blindly adding information.

I decided to merge both approaches with a custom architecture halfway between the input gates of LSTMs and the attention mechanism of Transformers. The idea is to compute a component-wise attention score between the token embedding and the hidden state and update the former accordingly. From now on, I will refer to the token embedding as “x” and to the hidden state as “h”.

The high-level idea of this technique is to:

1) Check what information the hidden state can enrich the token embedding with. Compute an element-wise “attention” score that basically says, for each component, how much the token vector should be updated.
2) Project the hidden state into a space where it can be summed with the token embedding vector.
3) Add the projected hidden state to the token embedding in order to enrich it. The sum is conditioned by the attention scores, meaning that the update’s magnitude will differ for each of the vector’s components.

So it goes like this:

Proposed enrichment technique. Explanations below.

More formally,

x = x + ReLU((h@W_k) * (x@W_q)) * (h@W_v)

Let me explain this equation carefully.

So, each token embedding gets enriched (x = x + something) this way:

First, a learned matrix projects the hidden state into a nice (key) space where we can compute its element-wise affinity with the original token embedding. That’s the h@Wk part (in PyTorch, @ refers to matrix multiplication).
Second, a different learned matrix likewise projects the token embedding into a “query” space. That’s the x@Wq part.
Then component-wise pseudo-attention scores are computed by multiplying each component of the projected token embedding by the corresponding component of the projected hidden state. Let’s imagine x is the embedding of the word “car”. During training, the query projection matrix has learned that tokens of this kind (nouns) are always willing to get their physical characteristics refined, for example by color or shape adjectives. So there will be high values for the components coding for “I want an update on my color, I have no idea what it is”. Meanwhile, the key matrix, through which the hidden state is projected, has learned to project the hidden state so that there is a clear “color” component. When you make a component-wise multiplication between the two vectors, you actually enable a dialogue like “Token embedding: I want my color updated, please.
Hidden State: I can definitely provide this information => high attention score for component ‘color’.
Token embedding: I’d like to know how fast I am.
Hidden state: sorry, I don’t know about that => low attention score for component ‘speed’.
Token Embedding: tbh, I don’t quite care about [something].
Hidden state: Too bad, I could have told you => low score.
Token Embedding: I don’t care about [some other thing].
Hidden state: Neither can I tell you about it => low score.”
In the equation, that’s the whole (h@W_k) * (x@W_q) term.
The negative scores are set to 0 by the ReLU activation function. This is not strictly necessary, but it doesn’t hurt performance and I like the increased interpretability it brings. Please note this example is anthropocentric and it doesn’t literally reflect what actually happens.
So now, we have a component-wise affinity between the token embedding and the hidden state. We have to bring the information where it’s due. The hidden state is first projected by the Wv matrix into a value space (compatible with the token embedding), meaning that this projection tries to present the hidden state’s information in such a way that it can be added to the token embedding. If, for example, the color components of token embeddings are typically in position 29-31, but are typically in position 4, 8, 12 in the hidden_state vector, the Wv matrix will soon learn to have a 1 in positions (29,4), (30,8) and (31,12). Now that we have a nice projected hidden state vector, we just have to multiply it component-wise by the attention scores to get the enrichment term.
We just add this enrichment term to our initial token embedding.

That’s it. The rest of the Transformer is kept identical.
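The enrichment step described above can be sketched in a few lines. This is a forward-only NumPy version written for illustration, not the original PyTorch code: the function name, shapes, and initialization scale are assumptions, and h is assumed to be already aligned so that row n holds the hidden state that predicted token n.

```python
import numpy as np


def enrich(x, h, W_q, W_k, W_v):
    """Component-wise gated enrichment of token embeddings.

    x: (seq_len, dim) token embeddings
    h: (seq_len, dim) last hidden states, aligned with x
    W_q, W_k, W_v: (dim, dim) learned projections (random here)
    """
    # Element-wise affinity between projected h and projected x, clipped by ReLU
    scores = np.maximum((h @ W_k) * (x @ W_q), 0.0)
    # Scale the value projection of h by the scores and add it to x
    return x + scores * (h @ W_v)


rng = np.random.default_rng(0)
seq_len, dim = 4, 8
x = rng.normal(size=(seq_len, dim))
h = rng.normal(size=(seq_len, dim))
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

enriched = enrich(x, h, W_q, W_k, W_v)
no_recurrence = enrich(x, np.zeros_like(h), W_q, W_k, W_v)

print(enriched.shape)                  # (4, 8)
print(np.allclose(no_recurrence, x))   # True
```

Feeding a zero hidden state leaves x untouched, which is one way to see why this variant can run with the feedback loop turned off.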

If you are familiar with attention, you may have noticed that this architecture is very similar to cross-attention, except that it’s element-wise, and not token-wise.

Standard Transformer on the left, Stateful Transformer on the right. Please note that the hidden state feedback loop can be turned off.

Let’s see how it all performs.

Empirical results

“The ReLU non linearity is not necessary, but it improves interpretability”

As my goal was to test the idea as fast as possible, I chose to go with a character-level Transformer, to save a layer of complexity.

I trained all the Transformers on a Gutenberg 10MB custom dataset, comprising a few books stitched together. This dataset is deeply flawed: test and val splits are qualitatively different, as they likely don’t even come from the same book/author. But I decided not to care.

Different flavors of Stateful Transformer

Among the different possible designs for the token enrichment mechanism, the component-wise attention performs best, with the lowest parameter count.

The concatenation approach (shown in the first real figure) works nicely too, in terms of minimum loss vs. number of params, but its drawback is that you can’t easily turn off the recurrence: as you’re dealing with an MLP that takes both a hidden state and a token embedding (and outputs an enriched embedding), you can’t decide to just feed it a token embedding (and zero-pad the hidden state placeholder) and expect it to output a coherent token embedding.

Standard vs. Stateful Transformer

I first trained a very small and especially shallow Transformer5 for 40 epochs. Here is how the stateful Transformer (with recurrence depth = 1) compares to the standard one.

The Stateful Transformer’s run looks much better. And indeed it is: the Standard Transformer’s train loss after 40 epochs is reached only after 16 training epochs by the Stateful Transformer. It’s more than twice as fast, which means that, even accounting for compute overhead, it outperforms the Standard Transformer.

Interestingly, the stateful Transformer seems to be a bit more prone to overfitting. Indeed:

$$\frac{\text{val\_loss(Stateful Transformer)}}{\text{val\_loss(Standard Transformer)}} > \frac{\text{train\_loss(Stateful Transformer)}}{\text{train\_loss(Standard Transformer)}}$$

Meaning, in plain English, that the Stateful Transformer does not generalize as well as the Standard one. If you have any idea why, please tell me in the comments. I have a few hypotheses, but none of them is fully satisfactory. By the way, the Stateful Transformer has 140,000 params and the Standard Transformer only has 128,000. That’s a difference of 8.6%, which is significant but unlikely to explain the difference in performance between the two architectures.

When I saw this kind of results, I thought the Stateful Transformer would work great. I believed it might even scale well, meaning that the performance upgrade would be even greater for bigger Transformers (see next section for more details).

In fact, it’s the exact opposite. The following is another training run, comparing loss vs. training epoch for larger Transformers (4 layers instead of 2, 950,000 params).

The stateful GPT here performs marginally better than the standard one. But it uses twice as much compute during training. In a word, it’s not worth it. When I figured this out, I used all available resources of intellectual dishonesty to make my stateful GPT work better. Maybe if I train it first without recurrence and just fine-tune it on a few recurrence epochs? Nope, doesn’t work. Well, it does, up to a certain point, but it’s not scalable. Maybe if I freeze all the weights except for the enrichment mechanism? Nope, doesn’t work either. I tested a lot of things but, at the end of the day, the compute overhead was just not worth it.

Is the train/val loss a valid metric?

By the way, you may be wondering if the train/val loss really is a good metric for performance, considering that the inference process is slightly different from the training process. To figure it out, I trained a “big” model (5M params) on the same dataset to assess the output of the two small models.

Doing so isn’t as straightforward as it seems, as merely using the big model’s loss on text generated by the small models can be insufficient. Indeed, a string like “things of the street of the things of the street of the things…” is technically sensible (big model’s loss on this string is low), but its information content is very low.

So I wanted to assess both the information content of a text and its “sensibleness”, measured respectively by the compressed size of the text and the loss of the big model on it.

The randomness of a Transformer’s output grows with the temperature of its softmax.
Also, the compressibility of a string decreases with its randomness (a perfectly random string is impossible to compress).
Finally, the loss of a bigger model on text generated by smaller models grows with these smaller models’ inference temperature.
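As a rough illustration of the "compressed size as information content" idea, here is a small sketch using zlib from the standard library. The choice of zlib and the helper name are assumptions; the post does not say which compressor was used.

```python
import random
import string
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size (lower means more redundant text)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Low-information text: sensible-looking but endlessly repetitive
repetitive = "things of the street of the " * 40

# High-randomness text of the same length, as a high-temperature stand-in
random.seed(0)
alphabet = string.ascii_lowercase + " "
noisy = "".join(random.choice(alphabet) for _ in range(len(repetitive)))

print(round(compression_ratio(repetitive), 3))
print(round(compression_ratio(noisy), 3))
```

The repetitive string compresses to a small fraction of its size while the random one barely shrinks, which is why compressed size has to be read together with the big model's loss rather than on its own.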

Below are some graphs showing some empirical results on this matter. I compared two models, a Stateful GPT and a standard Transformer, trained until they had the exact same validation loss.

What we can see here is that, for a given validation loss, the Stateful Transformer performs exactly the same as the Standard one, meaning that the validation loss is a valid proxy for the stateful model’s inference performance.

I think it’s time for another unrelated picture.

The Pillars of Creation. Original here (and here). Why do we even bother with math and ML when we know that such majestic stuff floats in the ether? Incredible to know the Pluto-Sun distance would easily fit inside one pixel of this image (even at full res).

Why it only works for shallow Transformers

A delusional hope

At first, I thought my architecture tweak would work better for bigger Transformers.

The amount of information carried by an input vector depends on the transfer function of the network it is fed into. Quite obviously, if you take a dead neural net whose output is constant, you won’t be able to infer anything about the input. So, whatever the amount of information theoretically present in the input vector (32 bits * dim), you won’t be able to extract any.

Conversely, a big, well-trained neural network has a lot of decision boundaries, meaning that a small variation in the input vector will have a big effect on the output. I think the amount of information you can read from a vector is approximately equal to the number of decision boundaries of the NN, which is dependent on the number of parameters and the shape of the activation functions6.

Let’s keep this idea in mind: the bigger a neural net, the more sensitive its output is to small input variations.

Now, let’s look at something else: in a Transformer, the vocab size is finite, and so is the embedding dimension. When you embed tokens (the first thing you do in the forward pass of the Transformer), you look up the token id in a dictionary and get its associated vector. So, in the continuous vector space of dimension n_embedding, you only have a discrete set of vocab_size vectors you can feed the Transformer with. And I thought that the Stateful Transformer’s usefulness was in padding this discrete set to make it more continuous.

And I thought that this padding effect would be especially beneficial if the subsequent Transformer has the complexity to make use of the additional precision. Namely, if it’s bigger.

Also, if you look closely, the size of the gaps in the token embeddings space usually grows with the size of the model: the average distance between closest tokens grows like7 8

vocab_size^(-1/n_embedding) (relative to the maximum norm of the embeddings)
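A quick numeric sketch of the chessboard heuristic from the footnotes, assuming that embeddings spread evenly on a grid so that neighbor spacing is roughly vocab_size^(-1/dim) of the maximum norm. That scaling is an assumption on my part, and the model sizes below are illustrative only.

```python
def neighbor_spacing(vocab_size: int, dim: int) -> float:
    """Grid heuristic: spacing between neighbors as a fraction of the max norm."""
    return vocab_size ** (-1.0 / dim)


# (vocab_size, embedding dim): chessboard case, a small char-level model,
# and a larger token-level model (hypothetical sizes)
for vocab_size, dim in [(64, 2), (65, 64), (50_000, 768)]:
    print(vocab_size, dim, round(neighbor_spacing(vocab_size, dim), 3))
```

The chessboard case gives one eighth, and the spacing creeps toward 1 as the embedding dimension grows, which matches the remark that neighbors end up about as far apart as they can be.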

Anyway, two good reasons to believe, at first sight, that the Stateful Transformer architecture would perform better on big Transformers.

It doesn’t.

Reality strikes back

Let me show the architecture again:

Blue: parallel processing layers, Pink: token mixing layers.

All the blue steps here are steps where token embeddings are processed in parallel, with no communication between them.

The pink steps, conversely, are steps where each token vector gets updated based on the other tokens. Best example for that is attention.

I decided to give the enrichment mechanism a light pink tint, as the n-th token’s embedding is updated based on the (n-1)-th last hidden state (which was actually the one used to predict the n-th token). So, technically speaking, there is some token-to-token interaction happening, but it’s both local and unidirectional.

What’s interesting is that token mixing happens in each decoder layer. Let’s take a 50 layer-deep Transformer, trying to predict the n-th token of a sequence. As they flow through the Transformer, all the previous tokens’ hidden vectors provide information to the n-th token’s vector, through the attention mechanism. So, the (n-1)-th token’s penultimate hidden state freely provides information to the n-th token. As for the very last couple of hidden states (after the last attention layer), they don’t, because there are no token-mixing steps left.

What it all means is that the Stateful Transformer’s architecture update is interesting only in that it allows the very last hidden state of token n-1 to provide information to token n’s embedding, instead of the penultimate hidden state. Of course, the communication medium between vectors is completely different for these two cases9, but the end result is the same: tokens communicate up to the last layer in the standard Transformer, while the Stateful Transformer goes the extra mile and allows the tokens to communicate after the Feed Forward Network part of the last decoder layer. This update is really just about not discarding the work done by the last FFN on the previous token embeddings: the work done by all previous layers is still available thanks to the attention mechanism.

And this extra mile is not that valuable.

Intuitively, the last hidden state doesn’t carry much more information than the penultimate hidden state if there are a lot of decoder layers: each layer (including the last one) loses importance when there are a lot of them10.

So, this is the reason why the stateful Transformer architecture tweak only works for shallow models.

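Cependant s’élançant de la flèche gothique / un son religieux se répand dans les airs. / Le voyageur s’arrête et la cloche rustique / aux derniers bruits du jour mêle de saints concerts. (“Meanwhile, springing from the Gothic spire, / a religious sound spreads through the air. / The traveler stops, and the rustic bell / mingles holy concerts with the last sounds of the day.”) The painting is L’Angélus, by Millet. The text is from Lamartine.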

Conclusion:

This architecture tweak sucks, like nearly every other Transformer “improvement”.

Maybe you’re not convinced that “the last layer of a Transformer isn’t that important if there are a lot of them”. I wasn’t either. But on closer look, if it is indeed important, why not just add one more layer? It will do the same job as the recurrent architecture trick, while being more straightforward to train and use.

My next post will either be about some things I learned when researching this idea11, or a credit scoring model I’d like to train.

Only one little error here: Attention head results are concatenated, not summed. But technically, you could argue that a concatenation of vectors is equivalent to the sum of their expanded (ie. padded with 0s) version, so it’s just a slight imprecision.

In what fucking world do their authors live to believe that their readers will enjoy a bunch of formulas full of unspecified terms at the beginning of a “RNNs simply explained article”? Most of them go from “Here is the very high level idea (so high-level it doesn’t tell anything)” to “So, h_t = f(W * h_{t-1} + U * x_t + b) and y_t = g(V * h_t + c)”.

Backprop is basically about moving the parameters of a model in the direction that makes them decrease the loss. So, first, you have to compute the gradient of the loss w.r.t the parameters. You could use the finite difference method (the (f(x+h) - f(x)) / h stuff) along with the chain rule, but it’s very inefficient. So all ML frameworks store the computation graph of the network to be able to tell things like “the input was transformed in such and such way when going through the neural net, so the gradient of the loss w.r.t each layer’s params is such and such”.
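To see the two ways of getting a gradient side by side, here is a toy sketch on a one-parameter "network" (prediction = w * x with a squared-error loss). The setup is hypothetical and only meant to show that the chain rule and finite differences agree, with the former needing no extra loss evaluations.

```python
def loss(w: float, x: float = 3.0, target: float = 1.0) -> float:
    """Squared error of a one-parameter model: prediction = w * x."""
    return (w * x - target) ** 2


def grad_chain_rule(w: float, x: float = 3.0, target: float = 1.0) -> float:
    """dL/dw = dL/dpred * dpred/dw = 2 * (w * x - target) * x."""
    return 2.0 * (w * x - target) * x


def grad_finite_difference(w: float, h: float = 1e-6) -> float:
    """Forward difference: needs one extra loss evaluation per parameter."""
    return (loss(w + h) - loss(w)) / h


w = 0.5
print(grad_chain_rule(w))          # 3.0
print(grad_finite_difference(w))   # close to 3.0, up to an error of order h
```

With millions of parameters, the finite-difference route would need millions of forward passes per step, which is why frameworks keep the computation graph instead.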

First, the Field Effect Transistors used in integrated circuits act like capacitors. Meaning that switching them on and off involves a transfer of electrons whose potential energy is proportional to their capacitance multiplied by the voltage squared. This is just lost energy and heat. This lost power/heat is proportional to the rate at which you turn these transistors on and off, aka the clock speed. Heat management is one of the hardest problems of chips. The second limit is just the speed of electricity, which is approx 10⁸ m/s in silicon. A 100THz chip would need the max deviation of its signal path lengths to be less than 1 micrometer to ensure correct computations.
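The 1 micrometer figure follows from dividing the signal speed by the clock frequency. A back-of-the-envelope check, using the approximate 10⁸ m/s quoted above (the 3 GHz comparison point is my own addition):

```python
signal_speed_m_per_s = 1e8   # approximate figure quoted in the text
clock_hz = 100e12            # hypothetical 100 THz chip
typical_clock_hz = 3e9       # assumed present-day CPU clock, for comparison

distance_per_cycle_m = signal_speed_m_per_s / clock_hz
typical_distance_m = signal_speed_m_per_s / typical_clock_hz

print(f"{distance_per_cycle_m * 1e6:.1f} micrometers per cycle at 100 THz")
print(f"{typical_distance_m * 100:.1f} centimeters per cycle at 3 GHz")
```

At 3 GHz a signal can cross a whole chip within one cycle; at 100 THz it covers about a micrometer, so path lengths would have to match to within that margin.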

n_embd = 64, n_heads = 8, n_layers = 2, block_size = 33, batch_size = 2048

If I were to make a guess, I’d say that, for an MLP with ReLU activation, the number of decision boundaries is at most (2*dim)^depth, assuming that the dim is constant.

Think of a chess board: the closest distance between two neighboring cells is side_length / √64 = side_length / 8.

Meaning that if you had 2-dimensional embeddings with a vocab size of 64, the average distance between two neighboring embeddings would be about one eighth of the maximum norm of these embeddings.

In fact, for practical model sizes, the average min distance between neighboring embeddings is terribly close to their norm. Meaning that they are as distant as can be. A surprising consequence of the curse of dimensionality.

Global attention between two tokens’ equally high-level hidden states vs. enrichment of a low-level token embedding with a high-level last hidden state.

Especially when you account for the “curse of depth” in LLMs

Namely:
1) how the Fourier transform of the activation function used in an NN relates to the Fourier transform of the NN’s prediction surface. Or, put more practically, why you shouldn’t use ReLUs when approximating a low-frequency function and conversely shouldn’t use GeLUs when modeling highly discontinuous phenomena.
2) When a higher hidden dim is needed vs. when a deeper network works best
3) Why and how the function to model conditions the shape of the best_attainable_loss vs. number of layers curve.

Boring stocks vs Growth stocks

Eloi de Reynal — Wed, 05 Feb 2025 23:21:44 GMT

Everything is priced in, right?

Well, let’s see.

Using data from Financial Modeling Prep’s API, I have compiled statistics on more than 8000 companies across the US, EUR and AU markets to investigate potential systematic biases in business valuation.

These results are based on data from 1980 to 2023.

Ratio vs. Annualized returns

The following table shows the correlations between various Valuation Ratios and Annualized Returns (with reinvested dividends) across all companies -regardless of industry.

Usually, price is in the numerator of valuation ratios but I reversed them, because it makes more sense for correlation computation1.

So far, the positive correlations support the value investing approach.

The P/B ratio has the best correlation with annualized returns, despite being less popular than P/E or P/FCF.

Let’s break it down by industry. Doing so decreases sample sizes and results in correlations that are less reliable. So take these figures with a grain of salt, especially when the sample size is below 100.

Here, a negative correlation suggests investors overemphasize current valuations, potentially overlooking negative future developments (or conversely, dismissing positive forecasts). A positive correlation suggests investors tend to be overconfident in their ability to predict the future.

It looks like a win for value investors. Most of the time, undervalued companies (in terms of valuation ratios) tend to outperform the ones with better “potential”.

But let’s take a step back.

What about 5-year returns?

Average(Annualized return) ≠ Annualized Average Return

When I averaged the annual returns in my dataset, I was surprised to get a negative number. I assumed I had made a mistake because everyone knows that, on average and in the long run, investing in stocks yields a positive return. So I checked every single line of code. There was no obvious blunder…

In fact, I realized that gains compound, but losses don't offset them linearly. For example, consider two investments held equally: one yielding a 50% annual return over five years and the other yielding -50% per year. The overall return would be 281%, not zero2. When I compounded each annual return over five years and then averaged the results, I obtained a 4% positive average annual return. Given that this is an unweighted average, it seems reasonable.
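The ±50% example can be checked in a few lines. This is a minimal sketch assuming equal initial stakes and no rebalancing, as in the footnote; the helper name is hypothetical.

```python
def total_multiplier(annual_return: float, years: int) -> float:
    """Wealth multiplier after compounding a constant annual return."""
    return (1.0 + annual_return) ** years


years = 5
winner = total_multiplier(0.50, years)    # 1.5 ** 5 = 7.59375
loser = total_multiplier(-0.50, years)    # 0.5 ** 5 = 0.03125
portfolio = (winner + loser) / 2          # equal initial stakes, no rebalancing

print(f"average annual return: {(0.50 - 0.50) / 2:.0%}")   # 0%
print(f"portfolio multiplier: {portfolio:.4f}")            # 3.8125
print(f"total return: {portfolio - 1:.0%}")                # 281%
```

The average of the two annual returns is zero, yet the portfolio nearly quadruples, because the loser can lose at most its stake while the winner compounds without a ceiling.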

The preceding correlations don't account for the asymmetry between positive and negative annualized returns. They treat both equally3. However, growth stocks are considered high-risk, high-reward investments. This implies that, for a similar average annualized return, they should outperform value stocks.

To illustrate this phenomenon, the following graph shows the returns of 2 hypothetical portfolios of 2 stocks.

Light blue: Average(annualized_returns) = (1.4+0.4)/2 = 0.9.
Dark blue: Average(annualized_returns) = (1.01+1.1)/2 = 1.055.

Despite a lower Average(annualized_returns), the high risk/high reward portfolio outperforms the other on a 5-year period.
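The two hypothetical portfolios from the graph can be reproduced numerically. A small sketch, assuming equal weights at the start and no rebalancing; the function name is made up for the example.

```python
def portfolio_multiplier(annual_multipliers, years=5):
    """Equal-weight, never-rebalanced portfolio: mean of each stock's compounded multiplier."""
    return sum(m ** years for m in annual_multipliers) / len(annual_multipliers)


high_risk = [1.4, 0.4]    # light blue: average annual multiplier 0.9
low_risk = [1.01, 1.1]    # dark blue: average annual multiplier 1.055

high = portfolio_multiplier(high_risk)
low = portfolio_multiplier(low_risk)

print(f"high risk / high reward: x{high:.3f} after 5 years")
print(f"low risk: x{low:.3f} after 5 years")
```

The light blue portfolio ends up at roughly 2.7 times the initial stake, against roughly 1.3 times for the dark blue one, despite its lower average annual multiplier.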

So, let’s check the correlations between valuation ratios and 5-year returns.

These correlations remain significant, but they are much lower than for annualized returns.

A caveat: Calculating 5-year returns by exponentiating values, as done above, increases sensitivity to data inaccuracies. Correlations with 10-year returns are excluded, largely due to this issue. These correlations were all positive but insignificant, with the exception of Book Value / Price.

Valuation Ratios vs. Stock Volatility

We've established that 'transversal variance' is a good thing. A portfolio with some strong winners and some weak performers is preferable to a uniformly 'meh' one. In the long run, that unevenness pays off.

But volatility (internal variance) is terrible. Kelly's criterion demonstrates how damaging it is to ROI. You can't risk everything on volatile stocks (especially with debt). The money sitting on the sidelines won't generate returns.
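The following is not Kelly's formula itself, just a sketch of the volatility drag that motivates it: two return series with the same arithmetic mean but very different compounded outcomes. The numbers are made up for the illustration.

```python
import math


def compounded_growth(yearly_returns):
    """Total wealth multiplier from a sequence of yearly returns."""
    return math.prod(1.0 + r for r in yearly_returns)


steady = [0.05] * 10            # +5% every year
volatile = [0.45, -0.35] * 5    # same 5% arithmetic mean, large swings

print(f"steady:   x{compounded_growth(steady):.3f}")
print(f"volatile: x{compounded_growth(volatile):.3f}")
```

The steady series grows by about 63% over ten years, while the volatile one loses money even though both average +5% per year.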

So here are the correlations between valuation ratios and the standard deviation of year-over-year returns.

I find it a bit odd: I thought undervalued companies were considered riskier, and that their risk was a reason for their undervaluation. But they are actually less volatile.

Banks and asset management firms are exceptions. Investors seem especially good at assessing and pricing in their risks. These are the only industries where undervaluation correlates positively with stock volatility, and reliably so (i.e., with a large sample size).

Still, the STD of yearly returns gives an incomplete picture. Using the beta coefficient instead would have been a better option. Please contact me if you want me to add the Ratios vs Beta correlations to this analysis; I haven't computed those yet.

Conclusion

Investors seem to exhibit a slight, systematic bias toward believing they can predict the future and capitalize on it. Value investing appears somewhat more effective than growth investing.

Appendix: methodology

I used FMP’s API to gather financial data, focusing on US, European, and Australian public companies due to their more complete and reliable information. I screened approximately 15,000 businesses, initially filtering for:

Companies listed for over 5 years
Data accuracy (excluding outliers or inconsistencies like incorrect financial figures or improperly adjusted share counts for stock splits)

This process narrowed the dataset to 8,725 companies.

To calculate returns, I excluded the first two years of each company's public data and then computed returns with reinvested dividends, up to the last available data points.

Ratios were calculated using each company's third public financial statement, along with the stock price following its release.

Github repo

P/E ratios of -500 and +500 have basically the same “meaning”, even though they are numerically polar opposites. Reversing the ratios gives -0.002 and +0.002, which are close together. It effectively reflects that you’re paying a lot for current profits. The 1/x function is not monotonic around zero, which is bad for correlation computation.

In the no-rebalancing case the “average” overall multiplier (when combining the two separate outcomes) would be the mean of (1.5⁵) and (0.5⁵); that is, (7.59 + 0.03)/2 ≈ 3.81, corresponding to a total return of about +281%

You can get a sense of it by looking at the formula.

[Tech report] How I analyzed 337,576 financial statements

Eloi de Reynal — Fri, 17 Jan 2025 23:12:06 GMT

This post is for those who like Machine Learning and Math. The next post will be about business/finance so feel free to skip this one! Or else, just jump to the “What about politics” section.

When I had the idea of completing the financial statements analysis, I genuinely believed it would take no more than 8 hours. I ended up spending over 40.

Data acquisition

Everything begins with data. I got in touch with Financial Modeling Prep to see if they could grant me full, free access to their API in exchange for a bit of promotion. They agreed.

Downloading the data was quite straightforward. The only challenge was dumping JSON objects larger than 1GB. To avoid overloading the RAM of my old laptop, I split the data into multiple files. In total, I obtained financial reports from 25,866 companies, with up to 30 years of historical data each.

It's important to note that a value of 0 for a line item can have multiple meanings. It could indicate that the line item is not applicable to the specific financial statement (e.g., Cost of Revenue or Inventory for a Bank). Alternatively, it might mean the data scraper at Financial Modeling Prep was unable to locate the data, or it could truly represent a value of 0. So, a 0 in the data does not necessarily signify a value between -0.1 and 0.1, but rather signifies one of the following: a non-applicable item, a value that was not found, or an actual value of 0.

Data cleaning

Data cleaning proved to be more difficult than anticipated. I initially assumed the data would be clean. But when I realized that the average Revenue in my whole dataset was 1e28 dollars and the average net profit margin was 5000% (net income 50 times higher than revenue on average) I suspected that maybe something was wrong. So I asked Claude to come up with upper and lower bounds for each of the report lines. After a few back and forth passes, it gave me sensible limits.

By the way, since a good portion of the statements were reported in a currency other than the USD, I had to convert the numbers before applying the limits. And as some of the line items were ratios, I had to exclude them from the exchange-rate conversion process.

I tried two approaches to handling invalid statements:

1) Discarding the statement and all other reports from the same company
2) Replacing the invalid value with an arbitrary value, such as 0.01, unlikely to exist elsewhere in the dataset. (Remember this value, as I will refer to it later.)

The first one resulted in over 80% of the statements being discarded, which made the dataset too small: my model was immediately overfitting and the val loss was terrible.

The second one resulted in a very sparse dataset. On average, approximately 40% of the inputs were either zeros (i.e. not applicable, such as Cost of Revenue in banks’ statements) or 0.01 (i.e. invalid data).

Which is not ideal, because these are false zeros. They don’t carry the same information as a real value of 0. They are qualitatively different from “something small, between -0.1 and 0.1”, and they should be treated accordingly.

On the one hand, I want my model to be quite linear because my problem is also largely linear, and I don't want the model to be so complex that it's prone to overfitting. On the other hand, 40% of my values need to be treated in a completely non-linear fashion. There needs to be a clear distinction between a 0 (not applicable), a 0.01 (invalid data, which should be disregarded), and a 0.02 (a normal, genuine near-zero value). This behavior is not typical for MLPs, as it's inherently non-linear. Hence, a slightly fancier architecture can be useful.

Now, you can’t just feed a model with raw financial data: having some inputs with mean and sd below 1 (like ratios) alongside Revenue figures in the billions is far from best practice. So one of the best things you can do is normalize the data so that mean = 0 and sd = 1, for all input features. For example, if the average revenue across the whole dataset is 1 billion, and its standard deviation is also about 1 billion, we should convert each revenue value using the formula:

normalized_revenue = (revenue - 1 billion) / 1 billion

Or, more generally:

normalized_value = (value - mean) / sd

Let’s imagine we exclude the original invalid values from normalization. That’s to say we normalize everything except for 0 and 0.01, which we keep as they are. By doing this, we are effectively ensuring that these unknown or not-applicable values are treated as average values. Indeed, 0 is now the exact average (by construction) of our new value distribution, and 0.01 is close to that average, being just 0.01 standard deviations above it. This approach is beneficial as it aligns with the natural heuristic: "If I don't know the exact figure, I assume it's not too far from average," which we humans use when guess-timating.
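Here is a minimal sketch of that normalization, assuming the mean and sd of each feature are computed on the genuine values only (the function name and the toy data are made up for the example):

```python
import numpy as np

INVALID = 0.01  # sentinel for invalid data, as in the post

def normalize_keep_sentinels(X):
    """Z-score each column on its genuine values; leave 0 and INVALID as they are."""
    X = np.asarray(X, dtype=float)
    special = (X == 0.0) | (X == INVALID)
    genuine = np.where(special, np.nan, X)
    mean = np.nanmean(genuine, axis=0)
    sd = np.nanstd(genuine, axis=0)
    sd = np.where(sd == 0.0, 1.0, sd)  # avoid dividing by zero on constant columns
    return np.where(special, X, (X - mean) / sd)

# Toy data: column 0 is revenue, column 1 is a ratio.
X = np.array([
    [1.0e9, 0.2],
    [4.0e9, 0.0],      # ratio not applicable
    [INVALID, 0.3],    # revenue flagged as invalid
    [1.0e9, 0.7],
])
Z = normalize_keep_sentinels(X)
print(Z)
```

One thing to watch for with this kind of scheme: a genuine value that happens to sit exactly on the mean also normalizes to 0, so any mask of "not applicable" or "invalid" positions is safer to build from the raw data, before normalizing.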

Now that we have done everything possible on the dataset side to mitigate the problems caused by these strange values, let’s see how architecture comes into play.

Over engineering a NN architecture: the ultimate guide

Here are the general features of the model:

Input dim: 101 features/year * 3 years = 303 features
Output dim: 8 features for the next year [1].
Training data:
- inputs: financial statements for years n-3, n-2 and n-1 (every possible one until last_year-2)
- outputs: 8 features from year n’s financial statement (until last_year-1)
Val data:
- inputs: financial statements for last_year-3, last_year-2 and last_year-1
- outputs: 8 features from last year’s financial statements
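The split above can be sketched as follows for a single company. This is only an illustration: `make_windows` and the toy data are hypothetical, and the target here is the whole statement vector rather than the 8 selected features.

```python
def make_windows(statements_by_year, last_year):
    """Split one company's yearly statements into train and validation examples.

    statements_by_year maps a year to that year's feature vector (list of floats).
    Each example is (target_year, inputs, target), where inputs concatenates
    the statements for years n-3, n-2 and n-1.
    """
    train, val = [], []
    for n in sorted(statements_by_year):
        history = [n - 3, n - 2, n - 1]
        if n > last_year or not all(y in statements_by_year for y in history):
            continue
        inputs = [v for y in history for v in statements_by_year[y]]
        target = statements_by_year[n]  # in the post: only 8 of the 101 features
        # Only the most recent target year goes to validation.
        (val if n == last_year else train).append((n, inputs, target))
    return train, val

# Toy company with 2 features per year instead of 101.
company = {year: [float(year), year / 1000.0] for year in range(2018, 2024)}
train, val = make_windows(company, last_year=2023)

print([n for n, _, _ in train])  # target years used for training
print([n for n, _, _ in val])    # target year used for validation
```

The validation target year never appears as a training target, but the validation inputs do overlap with training inputs and targets, which is the concern discussed in the Appendix.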

I can hear people thinking “there is massive data leakage, beginner’s mistake, haha”. I kindly suggest the people thinking that should have their kneecaps split (link is safe, no worries).

If you're curious about that concern, please refer to the Appendix section. For now, let's move on to the network's architecture.

Network architecture choice

When predicting sequence data, RNNs and Transformers are often considered the go-to models. In our case, they would be unsuitable.
As we saw in the previous post, a persistence model is very effective and it doesn’t get more linear than that. When a simple linear regression does an acceptable job, there is usually no need for a fancier architecture than a simple MLP.

As explained in the Data cleaning section, the values 0 and 0.01 should ideally be treated differently from typical near-zero numbers. So I tried to design an architecture that allows for that without departing too much from a standard FFN [2].

I first had an idea that was quite mathematically sound but turned out to be a parameter-inefficient, hardly convergent, overfitting-prone mess [3].

Then, a second approach, simpler and quite similar to the former one from a mathematical standpoint, avoided matrix computations. It proved to be a partial failure: it was slightly less parameter efficient than a simple MLP, while being better at handling very sparse data [4].

Finally, a third approach worked perfectly. It achieved a lower test loss while using fewer parameters than an MLP and resulted in cleaner gradients which were easier to interpret.

The basic, non-optimized version is incredibly simple.

First, for each input vector (a flat vector with 303 features – 101 for each year over 3 years), a `mask_invalid` vector is created. This is a 303-dimensional vector where each position has a 1 if the corresponding value in the input vector is 0.01 (the code for 'off-bounds'), and a 0 otherwise.
It’s really a kind of grid that, overlaid on the original input vector, would show where the invalid values are.

Next, a `mask_zero` vector is defined in the same way. This vector indicates the positions of the 0 values in the input vector.

Then, these three vectors are concatenated and given to a simple Linear layer + LeakyReLU. The first third of the output vector is considered the next layer’s input_vector, and the other two thirds are the mask_invalid and mask_zero vectors. It’s equivalent to a simple MLP whose input vector is concat(values_input_vector, mask_invalid, mask_zero) and whose output is an 8-dimensional prediction vector.

So this un-optimized network consisted of two Masked Layers (it’s the name I used for them) and a linear output layer.
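A minimal NumPy sketch of this un-optimized version is given below. It assumes each Masked Layer keeps the full concatenated width (909 in, 909 out) and that the output layer reads the whole 909-dimensional vector; neither detail is spelled out above, and the weights are random rather than trained.

```python
import numpy as np

INVALID = 0.01
rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def build_input(values):
    """Concatenate the values with their two masks (invalid, then zero)."""
    mask_invalid = (values == INVALID).astype(float)
    mask_zero = (values == 0.0).astype(float)
    return np.concatenate([values, mask_invalid, mask_zero])

def init_linear(n_in, n_out):
    """Random weights and zero bias for a Linear layer (illustrative init)."""
    return rng.normal(0.0, n_in ** -0.5, size=(n_out, n_in)), np.zeros(n_out)

n_features = 303           # 101 features per year, 3 years
width = 3 * n_features     # values + mask_invalid + mask_zero

masked_layers = [init_linear(width, width), init_linear(width, width)]
W_out, b_out = init_linear(width, 8)

def forward(values):
    h = build_input(values)
    for W, b in masked_layers:   # two Masked Layers: Linear + LeakyReLU
        h = leaky_relu(W @ h + b)
    return W_out @ h + b_out     # linear output layer, 8 predictions

# Toy input: 40 "not applicable" entries and 20 invalid ones.
values = rng.normal(size=n_features)
values[:40] = 0.0
values[40:60] = INVALID

x = build_input(values)
prediction = forward(values)
print(x.shape, prediction.shape)
```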

If we look at the parameter count of one Masked Layer, we have:

The term before the “+” refers to the size of the transformation matrix and the term after refers to the bias of our layer.
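As a worked example, and assuming the layer maps the 909-dimensional concatenated vector to a 909-dimensional output (which the description above suggests but does not state outright), the count works out as follows:

```python
n_features = 303                 # 101 features per year, 3 years
width = 3 * n_features           # values + mask_invalid + mask_zero = 909

weights = width * width          # transformation matrix
biases = width                   # one bias per output unit

print(weights, "+", biases, "=", weights + biases)  # 826281 + 909 = 827190
```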

This is quite a lot. Of course, it’s easily handled by my laptop, but let’s remember that parameter count directly influences overfitting proneness on small datasets (like ours). The more parameters a model has, the more it will be able to learn the noise of the training dataset and struggle with test loss.

Moreover, we can definitely sense that such an architecture is not optimal. Two thirds of the concatenated input vector are very light on information (a mere 303*2 = 606 bits, compared to 303*32 = 9,696 bits for the first third [5]) and two-thirds of the matrix’s parameters are wasted just to handle this small amount of information.

The first thing that could be done to decrease the number of parameters is to project each of the input_values and masks into a smaller sub-space before concatenating them.

At first I tried projecting them into different sub-spaces through different projection matrices, and it yielded good results.

Then I tried using the same projection matrix for all three projections, on the very scientific ground of “Hell yeah, I bet it works” and you bet it worked fine, achieving even lower test loss (while increasing train loss), which is perfect.

So the general architecture of one layer looked like that:

I then computed the condition number of each matrix and got quite a high number (~1e3) for each layer’s result_proj matrix. It was a clear indication that I could LoRA [6] them. I did so with a compression ratio of 3.
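To make the idea concrete, here is a small NumPy sketch on a synthetic matrix. It uses a truncated SVD as the low-rank approximation and reads "compression ratio of 3" as "rank = dimension / 3". Both are assumptions: whether the factors were obtained after training or trained directly is not stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained result_proj: a 60x60 matrix whose singular
# values decay from 1 down to 1e-3 (so its condition number is ~1e3).
U, _ = np.linalg.qr(rng.normal(size=(60, 60)))
V, _ = np.linalg.qr(rng.normal(size=(60, 60)))
W = U @ np.diag(np.logspace(0, -3, 60)) @ V.T

condition_number = np.linalg.cond(W)  # largest / smallest singular value

def low_rank_factors(W, compression_ratio):
    """Split W into a down projection and an up projection via truncated SVD."""
    rank = min(W.shape) // compression_ratio
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    down = Vt[:rank]               # (rank, n_in)
    up = U[:, :rank] * s[:rank]    # (n_out, rank)
    return down, up

down, up = low_rank_factors(W, compression_ratio=3)
rel_error = np.linalg.norm(W - up @ down) / np.linalg.norm(W)

print(f"condition number: {condition_number:.0f}")
print(f"parameters: {W.size} -> {down.size + up.size}")
print(f"relative error of the rank-{down.shape[0]} approximation: {rel_error:.3f}")
```

A high condition number means some directions of the matrix barely matter compared to others, which is why dropping the smallest singular values costs little.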

Then we have the final architecture of our layers, that looks like the following:

All in all, the model has 24,625 parameters and achieves 3.5% better test loss compared to the best MLP I came up with, which has 49,459 parameters.
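For readers who want something to experiment with, here is a rough NumPy sketch of one such layer, pieced together from the description above. Only the names main_proj and result_proj come from the post. The hidden width of 32, the plain concatenation of the three projected parts, and the way result_proj is split into `down` and `up` are assumptions, so the parameter count below is not expected to match the 24,625 of the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
INVALID = 0.01

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

class MaskedLayerSketch:
    """Shared projection of values and masks, then a low-rank result_proj."""

    def __init__(self, in_features, out_features, compression_ratio=3):
        rank = max(1, out_features // compression_ratio)
        # One projection matrix shared by the values and both masks.
        self.main_proj = rng.normal(0.0, in_features ** -0.5, (out_features, in_features))
        # result_proj factorized as up @ down (low-rank approximation).
        self.down = rng.normal(0.0, (3 * out_features) ** -0.5, (rank, 3 * out_features))
        self.up = rng.normal(0.0, rank ** -0.5, (out_features, rank))
        self.bias = np.zeros(out_features)

    def __call__(self, values, mask_invalid, mask_zero):
        parts = [self.main_proj @ p for p in (values, mask_invalid, mask_zero)]
        h = np.concatenate(parts)  # 3 * out_features
        return leaky_relu(self.up @ (self.down @ h) + self.bias)

    def n_params(self):
        return self.main_proj.size + self.down.size + self.up.size + self.bias.size

values = rng.normal(size=303)
values[:40] = 0.0
values[40:60] = INVALID
mask_invalid = (values == INVALID).astype(float)
mask_zero = (values == 0.0).astype(float)

layer = MaskedLayerSketch(in_features=303, out_features=32)
hidden = layer(values, mask_invalid, mask_zero)
print(hidden.shape, layer.n_params())
```

Most of the parameters sit in the shared main_proj, which is used three times per forward pass. That reuse is where the savings over the naive 909-wide layer come from.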

Of Convos and LoRAs: why this architecture works so well.

The first part of the forward pass is equivalent to a 1D convolution with out_features filters, whose stride and filter size are both equal to in_features, followed by a flattening step that interleaves each filter’s results.

What this means is that the projection matrix main_proj gets optimized to select filters that maximize information retained from each of the input vector’s 3 parts. Just like a normal learned projection matrix does, you may say, but the difference is that it has to make concessions to find the best filters for all three sub parts. This is analogous to a CNN learning the best filters for processing images.

And in our case, it works fine: the similarity in data presentation means that a single filter is likely to effectively process both the main values and their corresponding masks.
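The equivalence can be checked numerically. The sketch below uses tiny, made-up dimensions and a hand-written `conv1d` helper (a cross-correlation, as in deep learning libraries, with no kernel flipping):

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 5, 2

main_proj = rng.normal(size=(out_features, in_features))
values = rng.normal(size=in_features)
mask_invalid = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
mask_zero = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

# View 1: apply the same projection matrix to each of the three parts.
per_part = np.stack([main_proj @ p for p in (values, mask_invalid, mask_zero)])

# View 2: slide out_features filters over the concatenated vector,
# with kernel size = stride = in_features.
def conv1d(signal, filters, stride):
    kernel = filters.shape[1]
    starts = range(0, len(signal) - kernel + 1, stride)
    return np.array([[f @ signal[s:s + kernel] for s in starts] for f in filters])

signal = np.concatenate([values, mask_invalid, mask_zero])
conv_out = conv1d(signal, main_proj, stride=in_features)  # (out_features, 3)

# Flatten with interleaving: all filters for part 0, then part 1, then part 2.
flat_interleaved = conv_out.T.reshape(-1)

print(np.allclose(flat_interleaved, per_part.reshape(-1)))  # True
```

Each row of main_proj plays the role of one filter, and the three window positions correspond to the values, mask_invalid and mask_zero parts.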

The contrast between the two stages of the forward pass is particularly interesting. The first stage achieves parameter efficiency through weight-sharing and convolutions, while the second achieves it by LoRA-ing the transformation matrix.

Here is why it works:

Convolutions aim to find the (usually) full-rank down projection that best captures local features across the image, using a fixed number of parameters. The only global interaction between features in convolutions comes from the competition to change the filters' weights during back-propagation.
LoRAs are about finding the down projection that best captures global features/interaction, and expanding the result in a meaningful way. It results in a low-rank transformation.

Basically, the convolution finds a common ground for features interaction, and the LoRA projection enforces this interaction, all in a parameter efficient manner.

Training

Here is a comparison between the best MLP and the best Masked Net (the above mentioned architecture) I could design.

We can see that there is likely some room for improvement in weight initialization. Furthermore, the training loss is higher for the masked network compared to the MLP, while the validation loss is lower. This suggests reduced risk of overfitting and better predictive potential.

Conclusion/Executive summary

Data Preprocessing Matters. Careful data cleaning and normalization, including separate handling of invalid data and genuine zeros, was critical.
Masked Nets perform great. Link to github repo is in the comments section if you need to use them.
Parameter efficiency is key. A parameter-inefficient net is not only compute intensive to train, but also at risk of overfitting on small datasets.

Appendix

Data leakage concerns

Let’s focus on a single company. During training, the model will be presented with data from, say, 2009, 2010 and 2011, and asked to predict the 8 desired features for year 2012. It will also see, either before or after that training example (training examples are randomly presented), another one built from 2008, 2009 and 2010, where it is asked about 2011. At that point the model could think “Haha, I know what I have to predict because I have seen the data sometime before in my inputs”. So it will just parrot back the 2011 report as “stored” in its weights. The model is just learning to parrot things back, not to predict them.

The problem is especially noticeable in the case of transformers. If you don’t mask the upper part of the attention matrix, the transformer will be able to peek into future tokens and output them without trying to predict them [7]. Even if you properly mask it, presenting a transformer with overlapping sequences will favor such behavior: it will learn sequences rather than language rules.
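For reference, masking the upper part of the attention matrix looks roughly like this. It is a generic NumPy illustration with a hypothetical function name, not code from the model discussed here:

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax after hiding the strictly upper triangle (future tokens)."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

scores = np.zeros((4, 4))  # toy attention scores for a 4-token sequence
weights = causal_attention_weights(scores)
print(weights)  # row i only puts weight on tokens 0..i
```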

So this concern is theoretically valid. But it doesn’t mean that ML zealots have the right to live.

Why it isn’t a big deal.

Three reasons:

In order to learn sequences, a model must have a lot of “storage space”, meaning a lot of parameters relative to the size of the training dataset. That is not the case for our ~24k-parameter model, trained on a 130MB dataset.
Models prone to data leakage are those whose behavior is very non-linear, such as transformers and RNNs. Our prediction model is quite linear, using only two low-dimensional LeakyReLU layers, which is nowhere near the 3rd degree polynomial attention seen in transformers (not even accounting for the softmax).
Predicting financial statements is fundamentally linear. One of the best naive prediction models is just a persistence model (i.e. predicting the numbers will stay the same as before): an identity matrix with no bias. It doesn’t get more linear than that.
So it’s unreasonable to think that the Adam optimizer will tune weights to encourage sequence memorization while both the data and the network’s architecture favor a quasi-linear behavior. ML fanatics should be piledriven (link is safe).

A real world, political analogy

Let’s imagine three people who need to talk to smooth over some minor disagreement.
The first is a heavyweight boxer with a fair-sized chip on his shoulder.
The second one is a non-violent communicator who holds little grudge against the two others because he has learned to process his ~~bad~~ ~~negative~~ painful feelings.
The third is a social justice warrior, full of indignation against the world, including the other two.
One of their common friends, who’s called Convolution (a name he didn't choose), tries to find the best place for all of them to “talk”. The boxer immediately calls for a boxing ring. The non-violent guy wants to talk through the disagreement in a nice, calm place and suggests the others should come over to his place. The SJW doesn’t want to talk at all (sorry guys I tried to keep the post non political until now but it’s impossible due to the deeply controversial nature of the topic).
Convolution then says, “Well, SJW doesn't want to communicate; the Non-Violent one wants a calm place but doesn't hold a grudge, so his opinion doesn't matter as much; and the Boxer wants a ring, so we'll go with the ring.”
Convolution just found the best place for each of the three to talk.

I skip the part where Non-Violent submits a kind and empathetic disagreement to Convolution on that matter and SJW calls out the systemic violence of the capitalist system that oppresses the weak. They all meet at the local gym, in the boxing ring.

At first, Non-Violent and SJW are a bit hesitant and cling to the ropes. Then the referee, whose name is LoRA Down Projection (he obviously didn’t choose it either), pulls them off the ropes and shoves them into a smaller space with Boxer. He then says “I want the disagreement settled in ten minutes”. Boxer needs less time than that to explain with loads of compassion that he was hurt by the two others and that he’s happy to have a thoughtful conversation to figure out what went wrong. And you know, he likes to talk with his hands.
Seven minutes later, they all come to the common conclusion that everything is fine, it’s never been finer, boxer, “you’re my best friend we’re brothers in arms forever (but please release mine, I want to use them again some time)”. A lawyer passing by, whose name is LoRA Up (you bet) decides to translate this informal agreement into legal terms. So, he expands the “everything’s fine” conclusion to 35 pages and sends them to the judge.

Blah blah ever after.

[1] ["revenue", "netIncome", "eps", "epsdiluted", "freeCashFlow", "totalStockholdersEquity", "operatingCashFlow", "dividendsPaid"]

[2] FFN (feed forward network) and MLP (multi-layer perceptron) can be used interchangeably as they basically mean the same thing: linear layers with non-linearity between.

[3] It was about interpolating missing values using a regression matrix (that basically said “Here is my best guess for metric X based on the other 302 metrics”, like “Oh, the Revenue line is missing but I will be able to infer it from COGS, Operating expenses and EBITDA”), with each “guess” weight-averaged by a Softmaxed learned covariance-like matrix, parameterized by the mask of missing/invalid values. Mathematically speaking, it was the “soundest” architecture, but it had too many parameters, involved complex matrix computations and was an absolute mess to train. I couldn’t use LayerNorms because they messed with the regression nature of the task. This architecture was an utter failure. It didn’t converge most of the time and when it did, there were obvious signs of overfitting.

[4] Before going through the main neural net, every missing metric in the input was inferred from the others using a simple Linear layer. The result was multiplied by a mask vector (1 where data is missing/invalid, 0 otherwise), then added to the main input vector, resulting in a full input vector with each 0 replaced by a nice inferred value. This architecture kind of worked: replacing Linear layers with these “masked interpolation layers” resulted in a net improvement, but it was less parameter-efficient than an MLP due to a much higher parameter count in each layer. It was a partial failure, but I used this architecture for my previous post, out of pride to use MY architecture instead of a basic one.

[5] This is just an estimate, as the “real” information stored in the vectors is also dependent on their data distribution and the architecture they are fed into.

[6] Here, LoRA is the acronym for Low Rank Approximation and not Low Rank Adaptation.

[7] You can check my French posts on how ChatGPT works. Maybe I’ll translate them but it might be hard as the examples I used don’t work in English.

I analyzed 337,576 Financial Statements

Eloi de Reynal — Fri, 10 Jan 2025 23:36:44 GMT

This post has been made possible by Financial Modeling Prep, which gave me a full and free API access to their excellent database. This post is more about Business Analysis than Machine Learning, but I’ll write a technical one soon.

Drawing conclusions from Income/Cash Flow/Balance Sheet statements requires two things:

A deep understanding of finance and how different items in financial statements reflect real-world business operations
Good intuition, developed through experience

While machine learning models struggle with the first aspect, they excel at the second. That's why I decided to train a neural network on virtually every public financial statement issued since 2000 to discover what statistical and financial patterns a model could learn from experience.

When analyzing financial statements for investment decisions, I typically ask questions like:

If a company's revenue grew strongly from year n-3 to year n-1, should I expect continued growth or regression to the mean?
Is long-term debt an indicator of healthy investment and long-term planning, or poor financial management?
When a company performs well, is it more likely to issue new stocks (increasing capital) or buy them back (returning capital to shareholders)? In other words, should I expect dilution?
More broadly, is studying financial statements worth my time before investing?

The model I have trained provides some clues. It has been trained to predict 8 metrics [1] for next year’s financial statements based on the last three years of history, each comprising 101 components [2].

Is it possible to predict the future by looking at financial statements alone?

To some extent.

A naive, yet surprisingly effective, approach is to simply carry over figures from year n-1 to year n. A small corrective factor can be applied to account for year-over-year average growth.

Any improvement in accuracy beyond this baseline can be considered successful. Below, I compare the model's performance to a basic persistence model. The metrics listed are those I consider most relevant for investors and stockholders.
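A persistence baseline of this kind fits in a few lines. In the sketch below, the multiplicative form of the corrective factor, the use of a mean absolute error and the toy numbers are all assumptions made for the example:

```python
import numpy as np

def persistence_forecast(prev_year, growth=0.0):
    """Carry last year's figures over, optionally scaled by an average growth rate."""
    return np.asarray(prev_year, dtype=float) * (1.0 + growth)

def mean_abs_error(pred, actual):
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))

# Toy example: one metric for three companies.
year_n_minus_1 = np.array([100.0, 200.0, 400.0])
year_n = np.array([105.0, 212.0, 416.0])

plain = persistence_forecast(year_n_minus_1)
corrected = persistence_forecast(year_n_minus_1, growth=0.05)

print(mean_abs_error(plain, year_n))      # error of plain persistence
print(mean_abs_error(corrected, year_n))  # smaller once average growth is included
```

Any learned model has to beat this error to be worth the trouble.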

Here, the “loss” is the opposite of accuracy. It is a common machine learning term, which, in this case, is the average prediction error, in standard deviations [3].

On average, my model is 16% more precise than a persistence model for these metrics (0.1341 vs 0.1596 average loss). It performs best on the first five metrics but shows poor performance on the last three.

This discrepancy exists because the heuristics the model learned to predict dividends, revenue, and stockholders' equity don't effectively apply to future predictions. While these patterns were learned from historical data (2000 to 2022/2023), they prove counterproductive when forecasting future performance [4].

It looks like there is no way to reliably predict the Revenue of a company just by looking at its past financial statements. The same holds true for Dividends Paid and Stockholders’ Equity.

The narratives “revenue should keep on growing like the last years” and “revenue will have to regress to the mean” are undecidable, and mostly cancel out. If anything, the latter is slightly more accurate (see footnote 5). This answers my first question.

In contrast, metrics related to net income show much higher predictability.

Interpreting predictability

First of all, it should be noted that the least predictable features are also the least volatile. On average, in the ~25,000 public companies I analyzed, the y.o.y changes in Revenue accounted for only 7% of the standard deviation of revenue across companies. To put it plainly: revenue varies much more across companies than it does y.o.y within the same company. The same can be said of Stockholders Equity.

It is surprising, though, that these metrics can’t be predicted at all. After reflection, I think it boils down to this:

We have to know the initial state of a system to predict its evolution

Financial statements are made to reflect the financial state of a company. The metrics they display have survived a selective pressure from investors and regulators to convey the most important and useful financial information.

But they don’t say much about the operational state of a business (which predicts revenue) or the emotional state of board members (which predicts Dividends and Stockholders’ Equity). The initial state of a system is necessary to predict its evolution, and failing to capture it impairs prediction. Because it lacks the information to make a decent prediction, the model learns its noise and makes predictions based on nonsensical metrics it has constructed.

It's like a gambler thinking, "Each time red came up twice in a row, green followed." He has learned a rule that was true for some past examples but is no more likely than random to hold true in the future. The same gambler, if he has lots of experience but little mathematical ability, will also have observed (rather than calculated) that black and red have about the same probability to come up. As for this rule, it is valid.

My model has learned the same kind of simple (and valid) rule: the best bet on Dividends, Revenue, and Stockholders' Equity evolution is that they remain the same year over year. It has just "improved" this somewhat, lowering the error it yields on past data at the expense of its real prediction ability (just like devising complex roulette rules beyond "black and red are roughly 50-50" is harmful to one's wallet).

In the coming weeks, I'm going to try to build a model capable of digesting operational and unstructured data, like the text of SEC filings and company websites.

What Has the Model Learned? (And How Useful Is It?)

Having a black box that makes predictions can be useful, but understanding its reasoning is even better. With a bit of math, we can peek inside.

How does it predict EPS?

EPS is arguably the most important metric for a shareholder to follow. Let's examine how the model predicts it [6].

The table below shows the relative importance of each metric in predicting EPS. Negative numbers indicate that the factor has a negative influence on EPS. However, "influence" should be understood more as correlation. Since the model doesn't fully grasp how different numbers interact, it sometimes makes counterintuitive connections. For example, it considers Cost of Revenue a positive predictor of future EPS, simply because it tends to grow at about the same rate as Revenue itself. Therefore, please interpret the following tables with caution.

The numbers listed below are the gradients of EPS with regard to the different inputs. They give an idea of the relative weights of the input metrics for EPS forecasting. They have been normalized so that the most important gradient is 1.
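The sketch below shows one way such a table could be produced: compute the gradient of the prediction with respect to each input, average it over a batch, and rescale so that the largest entry is 1 in absolute value. It uses a tiny random network with a single LeakyReLU layer as a stand-in, so the sizes, the weights and the function names are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
SLOPE = 0.01  # LeakyReLU negative slope

# Toy stand-in for the trained model: 6 inputs, 4 hidden units, 1 output (say, EPS).
n_in, n_hidden = 6, 4
W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
w2 = rng.normal(size=n_hidden)

def predict(x):
    z = W1 @ x + b1
    return w2 @ np.where(z > 0, z, SLOPE * z)

def input_gradient(x):
    """Gradient of the prediction with respect to each input, via the chain rule."""
    z = W1 @ x + b1
    d_act = np.where(z > 0, 1.0, SLOPE)  # derivative of LeakyReLU at z
    return (w2 * d_act) @ W1

batch = rng.normal(size=(256, n_in))
mean_grad = np.mean([input_gradient(x) for x in batch], axis=0)

# Normalize so that the most important gradient is 1 in absolute value.
relative_importance = mean_grad / np.max(np.abs(mean_grad))
print(relative_importance)
```

Because the network is close to linear, the per-example gradients do not vary much, which is what makes the batch average meaningful.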

A few observations stand out: naturally, EPS and Diluted EPS are the best predictors of themselves, following the "persistence-is-the-best-default-model" rule. Beyond that, the results become harder to interpret. At first glance, they seem counterintuitive:

Free Cash Flow, CapEx, and Net Cash Provided By Operating Activities show positive gradients, while Operating Cash Flow doesn't—despite being closely related.
Net Income has the most negative gradient, while Operating Income shows a reasonably positive one.
Total Debt and Net Debt display opposite gradients across all years.

What's happening here is that the model is creating its own formulas and constructing metrics it needs to predict future EPS. It essentially arrives at the realization: "Hmm, future EPS seems to be related to the previous year's Operating Cash Flow minus previous year's CapEx – there might be something here. And interestingly, most of the time, the previous year's FCF is nearly identical to this subtraction. However, when there's a discrepancy, I'd rather trust FCF, so I'm going to assign a large positive weight to the previous year's FCF, a positive weight to CapEx, and a negative weight to Operating CF." [7]

It kind of hedges its bets.

What about Free Cash Flow?

The model is still hedging, but the results are far more explainable. It appears that the best predictor of Free Cash Flow (FCF) – aside from FCF itself – is the "net change from year n-3 to year n-1 in things that are, represent, or consume cash (and in that last case, are not strictly mandatory)." This includes investments, equity, retained earnings… essentially, everything that represents the company's ability to generate cash, even if it doesn't appear as FCF.

Interestingly, historical data is much more important here than it is for predicting EPS. EPS can be reliably predicted [8] using only year n-1 data, or very nearly so. In contrast, when predicting FCF, data from years n-2 and n-3 combined have nearly the same significance as data from year n-1.

Now for Net Income, then I’ll go to bed

That's quite interesting. The model almost didn't hedge here. Two main takeaways:

Revenue shows a positive gradient in year n-1 but a negative one in year n-3. This suggests that, regarding revenue's impact on future Net Income, growth is more crucial than absolute figures.
The best negative predictor is dividends paid. So, the notion of board members milking a company dry before things go south might not be just a legend after all. On a personal note, I once received a 28% dividend yield from a Polish company before things took a turn for the worse (the company isn't defunct yet, but it's definitely past its prime). This illustrates that while predicting dividend payments might not be reliable, the payment itself can carry significant predictive weight.

Conclusion / Executive Summary

Forecasting company financials based on past financials does work, but only for metrics loosely connected to operations and board members’ sentiment.

Free Cash Flow, EPS and Net Income are quite predictable while Revenue and Dividends are not. This last observation suggests Financial Statements alone don’t say anything valuable about the mindset of board members and the operational state of a business.

Model hedging makes interpreting EPS prediction heuristics difficult. However the prediction heuristics for Cash Flow and Net Income are easier to grasp. You can sort and search the gradient tables to have a better sense of the influence of each financial statement metric on predictions.

Using only financial statements to predict EPS, the prior year's data provides 70% of the predictive value. This figure is down to 55% for Free Cash Flow and 57% for Net Income.

Further Work

It is surprising that a prediction model could work at all; I wasn’t expecting such significant results. Still, much more can be done, by including unstructured/text data. For example, I’ve always wondered if a business’ performance could be predicted from the content of its website.

Also, it would be a good idea to divide this analysis by sector. Banks and manufacturing companies don’t include the same items in their financial statements. And even though it doesn’t matter much (the model is able to say “Oh, cost of Revenue is 0, it’s more likely to be a bank than something else”), it will definitely help with interpreting gradients.

Unfortunately, I might need to sell one or two kidneys to afford enough compute to train a model for that...

Another post will follow soon, discussing the technical aspects of this project, including some interesting math and a novel (yet simple) architecture I developed to handle the data's sparsity.

[1] ["revenue", "netIncome", "eps", "epsdiluted", "freeCashFlow", "totalStockholdersEquity", "operatingCashFlow", "dividendsPaid"]

[2] So the input dim is 303 + 1 for currency embedding (Currency symbol, as a string, is difficult to feed into a NN. Representing it as a float would likely be too compressive, so I chose a 2-dimensional embedding to represent it)

[3] The data was normalized beforehand.

[4] In fact, if we dive even deeper on the last three metrics, we can see something interesting: the performance of the average of persistence model + my model is better than both. What it means is that my model has learned something valuable from the past data, but is over-applying it to the future. By a factor of 2 approximately. Meaning that (Model + PersistenceModel*2)/3 is the best prediction model for these metrics.

[5] This regression-to-the-mean narrative is the one that prevails in past data, as it is what the model has learned. The gradients of the predicted revenue wrt revenue for years n-2 and n-3 are positive, which means that the model has learned that past glory is a better indicator of future revenue than strong growth. The effect is so small that this nitpicking only deserves a footnote.

[6] As this is a business post, I won’t go too deep into the math, but the general idea is that, as the model is nearly linear (it involves only two low-dimensional LeakyReLU non-linearities), its gradients are not too far from constant. So the local sensitivity of my model wrt its inputs is pretty much constant and I can average it over a batch to get a good idea of the input’s influence over the predicted output.

[7] FCF ≈ Operating Cash Flow - CapEx, but with low reliability (the input data wasn’t always clean), and it looks like lambda * FCF - (1-lambda) * (Operating Cash Flow - CapEx) is a better proxy for FCF than FCF itself. The gradients reflect that, giving CapEx a positive influence.

[8] If a 30% improvement over persistence model is deemed reliable.

The law of hidden costs

Eloi de Reynal — Sun, 22 Dec 2024 20:19:06 GMT

"It's stupid not to wear a helmet when cycling: it doesn't cost anything and it can save a life." This is an increasingly common opinion on the subject, and it's defensible.

Here's another one: "Since we're all going to die one day, it might as well be while cycling rather than shitting myself in a hospital bed." It's also defensible, but it's less convincing.

And here's another one, even less convincing but true in my opinion:

"Wearing a helmet when cycling can save a life, but it's not worth it. There are too many hidden costs associated with it."

Here are the hidden costs I'm thinking of:

Time spent putting on and taking off the helmet.
The maintenance, through a physical habit, of the feeling of external danger linked to cycling.

The bulkiness of the helmet.
The mental load associated with carrying a helmet around: the fear of forgetting it, etc.
The decline in interest in cycling due to the inconvenience of wearing a helmet.

In the debates I've had, these arguments failed to convince my friends. Indeed, the aforementioned costs are difficult to measure, while the benefit (a life saved) is measurable.

The debate becomes interesting when we actually bring in measurements and statistics. Some studies seem to indicate that wearing a helmet does not significantly reduce mortality, but like Rousseau, we're going to temporarily disregard the facts to do some back-of-the-envelope calculations. Indeed, it's very probable, even obvious, that in the event of an accident, wearing a helmet reduces mortality. Too many factors come into play in more general studies, which explains their counterintuitive conclusions.

While sparing you the various hypotheses mentioned in this spreadsheet, it would seem that, all things being equal, you gain about 16 seconds of life per trip by wearing a helmet. Everyone can draw their own conclusions, but for my part, I am willing to lose 30 seconds of my life to live more freely and not bother with a helmet. Of course, it would take me more time to buy it, put it on and take it off than it would save me, which in itself is already a deal-breaker, but that's not even the heart of the problem: I find it hard to combine a form of carefree attitude, necessary for happiness, with a mindset so cautious that it pushes me to wear a helmet to save 16 seconds of life. For me, that's the overriding hidden cost.

A similar calculation can be made for the 80 km/h speed limit on roads, the covid lockdowns, etc...

Nowadays, we try to rationalize everything, to reduce measurable risks, without ever taking into account those that are not. However, pusillanimity is a real and serious risk, which is favored by wearing a helmet on a bicycle, refraining from talking to strangers (even if they offer candy), putting on a sweater when you go outside and it's cold... But it's not measurable, so it's always neglected.

Some examples in business

This phenomenon is very similar to the streetlight effect. It is widespread in almost all dimensions of a business: let's imagine, to talk about marketing, that to increase the visit rate of my company's website (a measurable metric), I change the tone of our newsletter to make it more artificially personal, more sales-oriented, etc... We might manage to improve this metric, but probably at the cost of a legitimate decline in trust from readers and a subsequent drop in sales, which are much more obscure.

Now, consider the management of “human resources”. It always seems good to "process" everything, in order to reduce the error rate and depend less on the individual performance of employees. But we rarely think about the effects of such a company policy on employee engagement and subsequent turnover. If employees feel that we want to make them replaceable and that we have limited confidence in their ability to do their job, they are unlikely to stay for long or to be effective.

In engineering school, a mechanics professor who had worked for 20 years at Renault told me, about the lean manufacturing we were studying: "I'm not at all convinced by this practice. Maybe the sub-optimal layout of the production lines allows operators to move around the factory a bit and take advantage of it to rest. We don't know whether optimizing everything is actually good for a company, all things considered."

In management information systems

In IT, we often see similar paradoxes. Complete digitization seems to improve efficiency, but it can damage employee communication. With efficient messaging systems and easy access to data, employees have fewer reasons to talk in person. So much communication is lost in the process!

The perverse effects of "rights" in software

The most insidious problem, I think, often lies in a seemingly small detail of ERP systems: access rights.

The typical thinking goes like this: "Employee X is in role Y, they don't need to access menu Z, which is for employees in role W. They might break things or steal data. And anyway, it's not their job."

That's understandable, and sometimes it's even necessary. For instance, it's perfectly reasonable that an employee shouldn't see their colleagues' pay slips.

However, more often than not, this restrictive approach causes more harm than good.

Firstly, if the ERP system is built well, and the support team is on the ball, it's virtually impossible to "break everything." Regular backups ensure that, in a worst-case scenario, we'd only lose a few hours' worth of data entry. That's a cost of maybe tens of thousands of euros at most, nowhere near the millions in damage an employee could cause by physically sabotaging production. It's actually kind of strange that the physical machinery is often left more accessible than the ERP.

Secondly, there's a real disconnect between the message leaders often send and the way they act. On one hand, they'll give these empowering, motivational speeches, saying "Be proactive, give us your ideas, I trust you, drive the company forward!" On the other, these very same leaders implement restrictions that shout: "Stay in your lane, don't touch that, I don't trust you enough to give you this access." When words and actions clash like that, people will believe the actions. If ERP access is being used as a safety net, then either there’s a problem of trust from the top, or you have a weak team. Either way, it's a human or organizational issue, not an IT one.

Driving drunk: an act of citizenship

I'm reaching the heights of hypocrisy here, because I've never really drunk alcohol and don't plan to start anytime soon.

However, we can imagine the following situation: you are at a small party with your friends and the atmosphere is great. It would be even better with two or three more drinks. Only, you are afraid of having an accident on the way home if you drink too much. So you stop there. The obvious cost (dying in a car accident) makes you forget the hidden benefit (having a very good time with your friends).

This last example was perhaps not necessary, but it justifies the subtitle of this post, which I like.