<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Machine Learning for Business]]></title><description><![CDATA[I try to make sense of finance using machine learning]]></description><link>https://www.eloidereynal.com</link><image><url>https://substackcdn.com/image/fetch/$s_!p71o!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79f19e5-065a-4513-bad7-d6a011bd8f74_600x600.png</url><title>Machine Learning for Business</title><link>https://www.eloidereynal.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 09 Apr 2026 00:40:47 GMT</lastBuildDate><atom:link href="https://www.eloidereynal.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Eloi de Reynal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[eligius@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[eligius@substack.com]]></itunes:email><itunes:name><![CDATA[Eloi de Reynal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Eloi de Reynal]]></itunes:author><googleplay:owner><![CDATA[eligius@substack.com]]></googleplay:owner><googleplay:email><![CDATA[eligius@substack.com]]></googleplay:email><googleplay:author><![CDATA[Eloi de Reynal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Teaching LLMs]]></title><description><![CDATA[Wouldn't it be great if LLMs learned from user feedback?]]></description><link>https://www.eloidereynal.com/p/teaching-llms</link><guid isPermaLink="false">https://www.eloidereynal.com/p/teaching-llms</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Thu, 19 Feb 2026 17:16:08 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bX73!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#8220;2025 is gonna be the year of agents&#8221;. My ass.</p><p>2025 was the year when most AI companies (including the one where I was chief AI scientist) tried to build useful <em>systems</em> with AI, but mostly failed.</p><p>I worked at a French legal tech and we tried to automate the tedious part of notaries&#8217; jobs. We built a chatbot (soooo 2024, but wait, it was an <em>agentic</em> chatbot, and there was a fair bit of engineering in it), and gave it to notaries to test. </p><p>They kind of liked it.</p><p>But like any other chatbot, it doesn&#8217;t learn from user feedback. Which is infuriating.</p><p>When you tell it &#8220;you failed. Next time, you should do THIS instead&#8221;, it says &#8220;sure, I won&#8217;t make the same mistake twice&#8221;. But it eventually does. </p><p>Most modern chatbots have a &#8220;memory&#8221; function that&#8217;s supposed to help with that, but it mostly sucks.</p><p><strong>Here, I&#8217;d like to present a technical idea that I haven&#8217;t had time to test, but that I think would have a decent chance of making LLMs able to learn on the fly. </strong>It&#8217;s based on privileged information, distillation and <a href="https://openreview.net/pdf?id=JewzobRhay">prefix tuning</a> (one of the best papers I&#8217;ve ever read, btw).</p><h2>The feedback doom loop: you don&#8217;t want to teach a student who won&#8217;t learn</h2><p>Do you ever downvote/upvote a ChatGPT response? I guess not. Because it&#8217;s useless, as ChatGPT won&#8217;t take the feedback into account until it&#8217;s too late for you to care.</p><p>Big AI firms would love to have conversation data on domains where LLMs typically fail. 
But they don&#8217;t get much of it, because users get a better and better sense of what LLMs fail at, and don&#8217;t give them tasks related to these domains (let alone give any feedback).</p><p>So, there&#8217;s a negative feedback loop:</p><p>LLMs suck at tasks where training data is sparse. =&gt; Users don&#8217;t engage with them on these tasks, and don&#8217;t give feedback anyway. =&gt; Training data for these tasks remains sparse.</p><p>What if, instead, you knew ChatGPT would improve right away based on user feedback? And not only temporarily, but across different chats?</p><p>I guess you'd be willing to take a bit of time to tell it that &#8220;when writing an email to &lt;Polish customer who gets offended by elementary politeness&gt;@big-steel.pl, always use &#8216;get lost&#8217; instead of &#8216;kind regards&#8217;&#8221;.</p><p>So, the vicious circle would break: the ability of the LLM to learn would make giving feedback directly profitable to the user.</p><h2>Okay, but how do you teach an LLM?</h2><p>Here&#8217;s the complete workflow that I&#8217;d suggest:</p><ol><li><p>When a user is dissatisfied with the LLM&#8217;s actions, they downvote its response.</p></li><li><p>They then give constructive feedback.</p></li><li><p>If the LLM gets it right with the additional feedback, the user upvotes the feedback-enhanced response. Else, back to 2.</p></li><li><p>The constructive feedback is distilled into a few learned tokens or prefixes at the beginning of the context window. It&#8217;s the obviously difficult part, so it&#8217;s detailed in the next section.</p></li><li><p>Whenever the user starts a new conversation, the learned tokens/prefixes are loaded and prepended to the system prompt.</p></li></ol><h2>How do you distill feedback into a few learned tokens/prefixes?</h2><p>You may think that we don&#8217;t need to. 
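</p><p>Before digging into the hard part (step 4), here is a toy sketch of the whole loop above in plain Python. Every name in it (<code>llm</code>, <code>FeedbackStore</code>, <code>distill</code>, <code>handle_turn</code>) is hypothetical, and the &#8220;learned prefix&#8221; is faked as plain text rather than embedding-space vectors:</p>

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackStore:
    """Per-user store of upvoted corrections, plus the 'learned' prefix."""
    pairs: list = field(default_factory=list)   # (prompt, feedback) pairs
    prefix: str = ""                            # stand-in for learned prefix tokens


def llm(prompt: str, prefix: str = "", feedback: str = "") -> str:
    """Hypothetical model call; a real one would be a transformer forward pass."""
    if "get lost" in prefix or "get lost" in feedback:
        return "Dear customer, ... Get lost"
    return "Dear customer, ... Kind regards"


def distill(store: FeedbackStore) -> None:
    """Step 4, faked: 'compress' all stored feedback into the prefix by
    concatenation. The real version would backprop a distillation loss
    into continuous prefix embeddings."""
    store.prefix = " ".join(fb for _, fb in store.pairs)


def handle_turn(store: FeedbackStore, prompt: str, is_good, feedback: str = "") -> str:
    answer = llm(prompt, prefix=store.prefix)        # step 5: prefix preloaded
    if is_good(answer):                              # no downvote, nothing to learn
        return answer
    retry = llm(prompt, prefix=store.prefix, feedback=feedback)  # steps 1-2
    if is_good(retry):                               # step 3: upvoted retry
        store.pairs.append((prompt, feedback))
        distill(store)                               # step 4
    return retry
```

<p>In the real system, the only non-trivial piece is <code>distill</code>; everything else is plumbing.</p><p>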
Why not just update the model to increase its likelihood of outputting the right answer?</p><h4>What we don&#8217;t want: Normal (LoRA) fine-tuning</h4><p>When you want to fine-tune an LLM, you usually give it a bunch of pairs of (prompt, desired output). Then, it&#8217;s trained on the normal next-token prediction task: the backpropagated loss is usually -sum(log(<em>prob</em>)), <em>prob</em> being the probability that the model assigned to the ground-truth token (the one that is actually part of the desired output). <br>The LoRA authors observed that the difference matrices (let&#8217;s call them &#916;W) between the weight matrices before and after fine-tuning usually have a high condition number (a large ratio between their largest and smallest singular values), meaning that they can be well approximated by low-rank matrices. </p><p>So, they figured out that it&#8217;d be much more efficient to train &#916;W directly as a low-rank approximation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Delta W = \\Delta W_1 \\Delta W_2&quot;,&quot;id&quot;:&quot;YPGIOPJWAL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Delta W \\in \\mathbb{R}^{d \\times k} \\text{ is the weight update matrix}&quot;,&quot;id&quot;:&quot;YHHMITOLKE&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Delta W_1 \\in \\mathbb{R}^{d \\times r} \\text{ is a tall matrix,}&quot;,&quot;id&quot;:&quot;YJHUHDMBGV&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Delta W_2 \\in \\mathbb{R}^{r \\times k} \\text{ is a wide matrix, and}&quot;,&quot;id&quot;:&quot;ILLUUSKWEY&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r \\ll \\min(d, k) 
\\text{ is the low rank.}&quot;,&quot;id&quot;:&quot;PTZTSJKUKX&quot;}" data-component-name="LatexBlockToDOM"></div><p>r is usually around 8 or 16, while d or k are closer to 8192 (Llama 3 70B for example).</p><p>It makes training much more parameter efficient.</p><p>Still, if you serve each user a different model, you get into some really hard infrastructure challenges. Batching user requests to process them in parallel (as is absolutely necessary to use GPUs efficiently) becomes quite difficult, because they all use different adapters (&#916;W set of matrices). It&#8217;s partially solved, but it&#8217;s still a bit of a nightmare.</p><p>How about using a nice feature of transformer networks that allows them to somewhat learn without necessarily changing their weights?</p><h4>What we could test: Prefix/prompt tuning</h4><p>When you first discover Machine Learning, it&#8217;s common to think &#8220;wait a minute, how about we make a model that updates its weights based on its inputs. Like, on the fly, not just during training. Wouldn&#8217;t it be cool? 
It would make the model able to learn much deeper patterns&#8221;.</p><p>Turns out, the idea of having a network update its weights based on its inputs is called a hypernetwork, and it&#8217;s equivalent to having a network that&#8217;s highly non-linear in its inputs.</p><p>Transformers are just that.</p><p>Meaning that adding<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> a bias to the inputs (like, say, adding a few tokens before a prompt, which is exactly the purpose of a system prompt) can really change the behavior of an LLM.</p><p>So, instead of backpropagating the loss to update the weights of the model, we may get away with backpropagating it to update a few learned tokens (invisible to the user) that will be prepended to the context.</p><p>Like an invisible, learned system prompt written in the continuous embedding space rather than the discrete token space. This is the concept of prompt/prefix tuning.</p><p>As for the &#8220;fine-tuning&#8221; data, it would comprise pairs of (original_user_prompt, llm_generated_answer_after_user_feedback).</p><p>Now, let&#8217;s see why this technique is likely to be quite data-efficient, and how to make it even more so.</p><h2>Data efficiency is key: this is where distillation and prefix-tuning&#8217;s lack of expressivity can help</h2><p>Pretraining requires billions of examples. Fine-tuning requires hundreds, at the very least. How could we teach an LLM from just ONE piece of human feedback?</p><p>Here&#8217;s where expressivity and distillation come into play. Let&#8217;s dive quite deep into both concepts.</p><h4>Expressivity</h4><p>If an LLM sucks at a certain language, no amount of prompt engineering will make it master that language. You need at least high-rank <strong>P</strong>arameter-<strong>E</strong>fficient <strong>F</strong>ine-<strong>T</strong>uning (through LoRA) for that. 
In fact, you&#8217;ll probably be better off with a full fine-tuning of the model at that point.</p><p>On the other hand, prompt engineering is sufficient to tell a model to use such and such library when coding, or to change its tone.</p><p>More rigorously, full fine-tuning is more <strong>expressive</strong> than prompt engineering. The <strong>expressivity</strong> of a &#8220;steering&#8221; method basically defines how powerful it is: how much it can change the transfer function of a Neural Net.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bX73!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bX73!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 424w, https://substackcdn.com/image/fetch/$s_!bX73!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 848w, https://substackcdn.com/image/fetch/$s_!bX73!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 1272w, https://substackcdn.com/image/fetch/$s_!bX73!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!bX73!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png" width="1456" height="1385" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1385,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1037550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/186750836?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bX73!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 424w, https://substackcdn.com/image/fetch/$s_!bX73!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 848w, https://substackcdn.com/image/fetch/$s_!bX73!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 1272w, https://substackcdn.com/image/fetch/$s_!bX73!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269fc4f7-6858-4a45-8f0d-2f52c4acbc65_3123x2970.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Expressivity of different &#8220;steering&#8221; methods for LLMs.</figcaption></figure></div><p>Whatever you can do with prompt engineering, you can also do with full fine-tuning, but you&#8217;d need a shitload of examples to constrain the search space enough.</p><p>That&#8217;s when we run into the expressivity vs. sample efficiency tradeoff, which is just a particular case of the <a href="https://en.wikipedia.org/wiki/Bias&#8211;variance_tradeoff">bias-variance</a> tradeoff. 
I&#8217;ve coded a toy demonstration of it that you can check in the annex.</p><p>When a training method (or a model architecture) is able to fit a very large class of functions, it needs a very large dataset to define what, precisely, the function to fit is.</p><p>It&#8217;s a bit reminiscent of &#8220;you need as many linear equations as there are dimensions in a vector space to define a point in it&#8221;.</p><p>System prompt (or prefix) tokens can only bias the attention scores without changing their order, while LoRA can completely change the relative importance of token-to-token interactions.</p><p>Meaning that, if <strong>interaction(token1, token2) &gt; interaction(token2, token3)</strong> in a certain attention head of the base model, this relationship will always stay true regardless of the number of prefix tokens you add. On the other hand, a LoRA fine-tuning could reverse the order.</p><p>This is why, from a mathematical perspective, prompt engineering will never replace a full fine-tuning on a few billion tokens.</p><p>But in our case, I think prefix tuning is the sweet spot: more expressive than prompt engineering (not constrained by the discrete space of real-world tokens), yet still very sample-efficient. How can we make it even more so?</p><h4>Distilling user feedback into learned tokens</h4><p>Usually, when you fine-tune a model on real data, the loss function compares the next-token probability distribution predicted by the model with the ground-truth distribution (which is obviously one-hot). It computes the cross-entropy between the two, which handily simplifies to -log(<em>prob</em>).</p><p>Indeed, you&#8217;ll never find a text where the author gives a quantified insight into their hesitation when choosing their words. So real-world ground truth will always be a one-hot.</p><p>When you distill a teacher model into a student model, though, you can see the full output probability distribution of the teacher. 
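</p><p>To make the hard-target vs. soft-target difference concrete, here is the cross-entropy computation in plain Python (the 3-token vocabulary and all probabilities are made up for illustration):</p>

```python
import math


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Student's predicted next-token distribution over a tiny 3-token vocabulary.
student = [0.6, 0.3, 0.1]

# Fine-tuning on real text: the ground truth is one-hot,
# so the loss collapses to -log(prob of the actual token).
one_hot = [1.0, 0.0, 0.0]
hard_loss = cross_entropy(one_hot, student)
assert abs(hard_loss - (-math.log(0.6))) < 1e-12

# Distillation: the teacher exposes its full distribution, so every
# vocabulary entry carries training signal, not just the "correct" one.
teacher = [0.7, 0.2, 0.1]
soft_loss = cross_entropy(teacher, student)
print(f"hard: {hard_loss:.4f}, soft: {soft_loss:.4f}")
```

<p>Minimizing the soft loss matches the student to the teacher&#8217;s full distribution at every position, not just at the one sampled token.</p><p>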
And it carries much more information than the one-hot ground truth distribution<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. So, you train the student model on the output distribution of the teacher. </p><p>It&#8217;s much more sample-efficient, and plainly better in terms of final loss on the real-world dataset, as the teacher&#8217;s outputs are less noisy than real-world data.</p><p>So, what about distilling the base model + user feedback into the base model + prefix tokens? </p><p>It would basically amount to compressing the user feedback. As the number of corrections grows, compression becomes increasingly interesting:</p><ol><li><p>When user feedback gets a bit contradictory (like &#8220;in case A, do B, but in (the similar) case C, do D&#8221;), prefix tuning can help a lot, by basically finding the best theoretical prompt that allows this.</p></li><li><p>It helps limit the context size.</p></li></ol><p>I think the fact that both prefix tuning and prompt engineering can only bias attention scores without changing their order<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> would even make it possible to use intermediate activations in the distillation&#8217;s loss function.</p><p>Like, using a kind of regularization term on the MSE of each attention layer's activations.</p><h2>But prefix-tuning won&#8217;t be sufficient for all use cases</h2><p>True. At some point, the distance between the base model and the feedback-enhanced one will likely become too large for prefix-tuning to handle.</p><p>But if models are made a little more able to learn, maybe users will bother to teach them. </p><p>The learning doesn&#8217;t need to be perfect, just good enough for AI labs to gather feedback and training data on domains that are at the frontier of what LLMs can do. 
Then, pretraining can do the job.</p><p>That&#8217;s all I had to say. Writing this blog post took longer than it would have taken me to test this idea with Gemma 270M on my Mac, but I think it was more fun.</p><h2>Conclusions, TL;DR</h2><p>Current LLM memory is broken: users have no incentive to give feedback, because they don&#8217;t want to waste time teaching a student that won&#8217;t learn. </p><p>I propose distilling user corrections into learned prefix tokens. More expressive than prompt engineering, more sample-efficient than LoRA, and deployable without per-user model serving. </p><p>This reverses the negative feedback loop: if the LLM visibly learns, users will actually bother to teach it.</p><p></p><h2>Annex: Expressivity and inductive bias vs. sample efficiency</h2><p>I trained a small 3-layer NN to model f(x) = sin(10x) between -1.5 and 1.5. Then, I tried to fine-tune it to model 2 different functions, using different fine-tuning methods. Most are self-explanatory, but here they are in detail:</p><ul><li><p>Full Retrain: none of the weights/biases were frozen.</p></li><li><p>Bias tuning: weights are frozen, biases are trained.</p></li><li><p>Input bias: the main model is untouched, a learnable bias is added to the input.</p></li><li><p>LoRA L2 (rank 1): the second layer (dim=32) is fine-tuned using a rank-1 LoRA (so 64 params). 
Biases are frozen.</p></li></ul><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8tkJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8tkJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 424w, https://substackcdn.com/image/fetch/$s_!8tkJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 848w, https://substackcdn.com/image/fetch/$s_!8tkJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!8tkJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8tkJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/186750836?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8tkJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 424w, https://substackcdn.com/image/fetch/$s_!8tkJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 848w, https://substackcdn.com/image/fetch/$s_!8tkJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!8tkJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e8c63aa-1f8c-40f0-b9d7-019589216c52_1500x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ev6D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ev6D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 424w, 
https://substackcdn.com/image/fetch/$s_!ev6D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 848w, https://substackcdn.com/image/fetch/$s_!ev6D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!ev6D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ev6D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:235372,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/186750836?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ev6D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 424w, https://substackcdn.com/image/fetch/$s_!ev6D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 848w, https://substackcdn.com/image/fetch/$s_!ev6D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!ev6D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271a41c1-7d88-4fe5-a058-a24c463ddff3_1500x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>What&#8217;s next on this blog?</h2><p>I&#8217;m currently working on something much more financially interesting: trying to predict the performance of companies based on the full text of their financial reports and websites. And pretty much all the text you can find about them.</p><p>I&#8217;m probably going to write a couple of blog posts on the matter in the near future, because I&#8217;ve made a few interesting advances in long context modeling for encoder transformers that I&#8217;d be happy to share. </p><p>Also, I&#8217;ve found a way to make encoder models much, much better at numeracy and tabular data reading without having to retrain them, which is necessary to accurately encode number-heavy financial reports.</p><p>Finally, I have to try a new sentence embedding training objective that I&#8217;ve been thinking about.</p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Not a proper use of the concept of addition. Couldn&#8217;t think of anything better though.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Actually, it carries -log(teacher_probability_of_actual_token) more bits of information per token. 
(Assuming that the absolute goal of the student model is to learn its teacher transfer function)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Indeed, prefix tokens can only steal attention from other tokens, without changing the ordering of attention scores among non-prefix tokens. <strong>At least in the first layer.</strong></p></div></div>]]></content:encoded></item><item><title><![CDATA[Generative AI is not the new Internet]]></title><description><![CDATA[Investors often draw the parallel between the two. That's a mistake.]]></description><link>https://www.eloidereynal.com/p/generative-ai-is-not-the-new-internet</link><guid isPermaLink="false">https://www.eloidereynal.com/p/generative-ai-is-not-the-new-internet</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Tue, 15 Jul 2025 21:13:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RZce!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#8220;AI might be in a bubble, but so was the internet. It didn&#8217;t stop it from becoming the most transformative technology of the 21st century.&#8221;</p><p>So people say. And hearing this over and over again makes me want to punch some faces. 
I have even started downvoting stuff on reddit, something I had never done before.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RZce!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RZce!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 424w, https://substackcdn.com/image/fetch/$s_!RZce!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 848w, https://substackcdn.com/image/fetch/$s_!RZce!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 1272w, https://substackcdn.com/image/fetch/$s_!RZce!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RZce!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png" width="2880" height="1874" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1874,&quot;width&quot;:2880,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315350,&quot;alt&quot;:&quot;undefined&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="undefined" title="undefined" srcset="https://substackcdn.com/image/fetch/$s_!RZce!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 424w, https://substackcdn.com/image/fetch/$s_!RZce!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 848w, https://substackcdn.com/image/fetch/$s_!RZce!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 1272w, https://substackcdn.com/image/fetch/$s_!RZce!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31693dc-e255-4058-9e6d-0c58fdf1c0ab_2880x1874.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">I would happily break the legs of anyone showing the original graph on their blog.</figcaption></figure></div><p> The &#8220;hype cycle&#8221;, as it is called, is contaminated by survivorship bias. We tend to forget that the nominal trajectory after the &#8220;trough of disillusionment&#8221; is &#8220;the abyss of ridicule&#8221; and not &#8220;the slope of enlightenment&#8221;.</p><p>The internet is one of a few technologies that has had an impact as big as its hype, despite the bumps in the road. 
We are oblivious to the Segway, the IoT, cold fusion, space travel, which all greatly underdelivered, either out of technological infeasibility or poor product/market fit (no one wants a &#8220;connected dishwasher&#8221;).</p><p>Even though I am no Geoffrey Hinton, I think I have a <a href="https://www.eloidereynal.com/p/hacking-spectral-bias-using-carrier">decent</a> <a href="https://www.eloidereynal.com/p/tech-report-how-i-analyzed-337576">enough</a> <a href="https://www.eloidereynal.com/p/a-partially-failed-attempt-at-improving">level</a> in Machine Learning and NLP to know what I&#8217;m talking about. This post isn&#8217;t another one of those &#8220;a machine cannot think, it just outputs an answer in its database&#8221; or &#8220;what you call AI is just an unintelligent program that predicts the next token. It&#8217;s a stochastic (whatever it means, I heard about it on Instagram) parrot &#128513;&#8221; pieces.</p><p>Here are 5 major differences between the AI and dot-com bubbles.</p><h2>1. The internet&#8217;s development was problem-driven. AI is a nice solution looking for a problem.</h2><p>In the 1960s, the American government was concerned that if the Soviets destroyed a central command-and-control hub, the entire US communications network would collapse. They thought it could be a good idea to build a decentralized (and thus more resilient) network. Universities loved the idea, too.</p><p>The minimum viable product, called the ARPANET, basically did what the current version of the internet does: it let you send bytes from one computer to another through a decentralized network. 
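</p><p>That byte pipe is still the primitive everything else is built on. Here is a minimal sketch using Python&#8217;s socket API, with localhost standing in for the network and a toy echo server standing in for the remote machine (the server, names and payload are made up for the demo):</p>

```python
import socket
import threading

def echo_server(server_sock):
    # Accept a single connection and send every received byte straight back.
    conn, _ = server_sock.accept()
    with conn:
        while chunk := conn.recv(1024):
            conn.sendall(chunk)

# Listen on an ephemeral localhost port (a stand-in for a remote host).
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

# Any bytes travel the same way: prose, source code, "emails"...
payload = b"hello from node A"
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(payload)
    reply = b""
    while len(reply) < len(payload):
        reply += client.recv(1024)
```

<p>Everything above TCP (the web, email, streaming) is just more protocol layered on top of exactly this kind of byte exchange.</p><p>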
You couldn&#8217;t yet send dick pics to your hot coworkers (for some reason it was not considered a priority at the time), but you sure could share the source code of a program or &#8220;email&#8221; people in the network.</p><p>It solved the problem of long-range communication at the byte level.</p><p>Of course, there have been a lot of improvements made to the original protocol (TCP/IP, WWW and so on), but the need for a common protocol to send bytes from one computer to another over a decentralized network was clear from the start. And the internet delivered.</p><p>Generative AI, on the other hand, was created by people who wanted to make something intelligent, attained some success and thought &#8220;well, what can we do with it?&#8221; It turns out GPTs were useful as chatbots, so they went for chatbots.</p><p>In a sense, it&#8217;s similar to the 1997+ part of the internet bubble, where most companies were like &#8220;We have to do something with that &#8216;internet&#8217; thing. Any idea?&#8221;. But the development of the underlying technology went through a completely different process.</p><h2>2. The adoption of the internet was slow because you had to sell a few kidneys to buy a computer. AI today is already dead cheap.</h2><p>If you wanted to get connected to the internet in the 1990s, you had to buy an internet-capable computer (the equivalent of $3000 today), and then purchase a $100 (in today&#8217;s dollars) monthly subscription. So, it was on the order of a month&#8217;s salary.</p><p>No wonder it took time to take off. First you had to hear about it from your nerdy friend, and then you had to convince your wife that getting a computer with internet access was more important than getting your septic tank drained and your garage door fixed.</p><p>In the late 1990s, only about a third of the developed world population had internet access. 
There was plenty of room to grow, which allowed for high expectations and excused the internet&#8217;s limited economic impact.</p><p>Today, there is no way you can spend a month&#8217;s salary on generative AI without deliberately trying to do so. And nearly every working-age person has directly or indirectly used a cutting-edge AI chatbot. So the reason why AI doesn&#8217;t have a significant economic impact is not that adoption is incomplete. It&#8217;s that the technology is not yet up to the task. But more on that in the 5th point.</p><h2>3. The internet has a positive network effect. AI has a negative one.</h2><p>Cool if you sell books online or if you have a brand new &#8216;@hotmail.com&#8217; address. But if no one browses the web or checks their emails, you are just a clanging cymbal that no one gives a shit about.</p><p>The internet had (and still has) a huge positive network effect, meaning that its usefulness grows with the number of users.</p><p>No such thing for AI yet. Quite the opposite, in fact. </p><p>First of all, AI-generated slop tends to contaminate datasets and cause <a href="https://futurism.com/ai-models-falling-apart">model collapse</a>. On a more technical note, I am not exactly sure why it does: I once trained pre-trained CNNs on their own outputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, basically trying to make them more sure about their guesses, and it didn&#8217;t cause them to go astray (it was just for fun, btw). But I guess things are different for massive Transformers.</p><p>Anyway, the more AI slop on the internet, the worse the quality of the training datasets.</p><p>Second, and even worse, more people using AI brings down the value of AI. 
Generative AI is mainly used for creative tasks<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, where users are competing for other people&#8217;s attention. And the more said people use AI or come into contact with AI-generated stuff, the less impactful it becomes. Cool images or marketing clips lose value when everyone can generate them. Same thing goes for &#8220;personalized emails&#8221;. </p><p>Whenever I get an email from someone I don&#8217;t know who says &#8220;I&#8217;ve checked your {blog post or github repo} and found it fascinating!  We are also into {vaguely related stuff}, so please join our {AI product} waiting list.&#8221;, I treat it as spam by default. So does everyone. Before the advent of generative AI, I would have thought &#8220;woah, someone actually checked this repo! I&#8217;m not used to this much appreciation, should I answer by sending them a dick pic?&#8221;</p><p>If a single person had had access to GPT-4 in the 2010s, he would have made millions of dollars from it, because no one was able to spot fishy AI-generated slop at the time. Now, it&#8217;s become a sixth sense to almost everyone.</p><p>The positive network effect could justify the exponential growth of internet companies&#8217; market valuations. </p><p>AI companies benefit from no such network effect.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><h2>4. The scaling laws of the internet are linear. The scaling laws of AI are worse than logarithmic.</h2><p>If you double the number of cables connecting two countries, you double their connection speed. If you double the number of hard drives in a server, you double its storage capacity.</p><p>Double an LLM&#8217;s training compute and you get a barely noticeable difference. But you burnt twice as much money. 
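</p><p>To put a number on &#8220;barely noticeable&#8221;, here is a toy power law in Python. The constants are invented for illustration (they are not fitted to any real training run), but published scaling laws have the same qualitative shape: loss falls as a small negative power of compute.</p>

```python
def loss(compute, a=2.0, alpha=0.05):
    # Hypothetical scaling law: loss ~ a * compute^(-alpha).
    # a and alpha are made-up constants, chosen only for illustration.
    return a * compute ** -alpha

c = 1e24  # training compute in FLOPs (arbitrary scale)
improvement = 1 - loss(2 * c) / loss(c)  # relative gain from doubling compute
print(f"2x the compute buys a {improvement:.1%} lower loss")
```

<p>With these made-up constants, doubling the compute shaves about 3.4% off the loss, for twice the bill. Doubling the cables, meanwhile, doubles the bandwidth.</p><p>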
AI scaling laws are worse than logarithmic.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>GPT-4.5 was trained using 10x more compute than GPT-4, but the difference with its predecessor is marginal (and <a href="https://x.com/karpathy/status/1895213020982472863">arguably in the wrong direction</a>). The difference is nowhere near that between GPT-4 and GPT-3, despite a similar training compute factor. Strictly speaking, if the performance gap had been the same, the scaling laws would have been logarithmic, which is already quite bad. But they are far worse.</p><p>When scaling laws are nice enough, big companies or governments have a strong incentive to invest in the technology to develop it for their own use. That&#8217;s why some companies have supercomputers. That&#8217;s why Microsoft employees already had connection speeds acceptable by today&#8217;s standards, as far back as the 1990s. Then, as they work hard to make the technology less expensive, it becomes affordable to the general public. This is how we went from Enigma to the iPhone and from super expensive internet broadband to 5G.</p><p>But there is no such thing with AI. The best AIs currently being tested by Google, Anthropic and OpenAI are marginally better than those available to the general public. There is no &#8220;Spend 10x more to get a 10x better product&#8221; path that big companies can pursue.</p><p>I wonder if, in normal economic conditions, Google et al. would have any financial incentive to train big models when 90% of the maximum economic value can be obtained from 0.1% of the maximum compute. I don&#8217;t think so, though I could be wrong.</p><h2>5. The internet needed few technological breakthroughs to become what it is today. 
AI needs major ones to take off.</h2><p>I will write a full technical blog post on that matter, but for now, let&#8217;s just state the facts.</p><p>The internet has evolved quite a bit from ARPANET. But few technological breakthroughs were needed for this evolution: the TCP/IP and WWW protocols, HTML, CSS, JavaScript and fiber optics were about all that was needed. And at any point in the history of the internet, engineers and scientists knew the direction of the next step. &#8220;Problem: Disconnected networks can't talk to each other. =&gt; Solution: TCP/IP, a universal translator for data; Problem: Web pages are boring. =&gt; Solution: JavaScript, to make them come alive.&#8221;</p><p>Current LLMs have a (real but) limited economic impact. We are promised superintelligence. The thing is, no one I know of has the slightest fucking idea how to get there.</p><p>Benchmarks are getting maxed out by new reasoning models every other day, yet real-world usefulness seems to be plateauing. Although I am an LLM power user, I don&#8217;t think I would lose much productivity if you forced me to use GPT-4 Turbo instead of the latest models.</p><p>Despite enormous investments and efforts, no one has been able to use LLMs for anything other than chatbots and IDEs. Current LLMs need constant guidance from humans to work. </p><p>One thing that I&#8217;ve understood only recently is that most economic value comes from navigating the messiness of the world. Very few people are paid to work a fully documented and streamlined job. </p><ul><li><p>You may think that accountants just line up numbers in spreadsheets, but they constantly make important and implicit decisions about where to put those numbers. Few of these micro-decisions can be found on the internet and thus in the training data of LLMs.</p></li><li><p>Despite code being one of the most abundant data forms on the internet, I find myself not using AI that much when coding. 
The interesting thing is that I&#8217;m not even able to single out cases where AI fails. There are just too many low-probability failure modes to account for.</p></li><li><p>I&#8217;m not even talking about &#8220;hardware&#8221; engineers, who are closer to the material messiness of the world. You will have a hard time finding online documentation on the &#8220;Shit, I need to redesign this part because our long-standing supplier went bankrupt and the new one can&#8217;t machine Al 2024 alloy to the required tolerances. Should I figure out if we can use 7075 instead or redesign the part altogether?&#8221; problem.</p></li></ul><p>Despite acing math and code benchmarks, LLMs have made little progress in that &#8220;messiness handling&#8221; skill. That&#8217;s why you can&#8217;t trust ChatGPT&#8217;s Operator to <a href="https://www.understandingai.org/p/computer-use-agents-seem-like-a-dead">fill your shopping cart</a> or Claude 3.7 to <a href="https://www.anthropic.com/research/project-vend-1">run a small shop</a> (the latter post is genuinely funny, Anthropic engineers have a great sense of humor).</p><p>For some reason, Anthropic has an edge in this domain, but I&#8217;ve seen no progress since Sonnet 3.5.</p><p>Fine if Elon Musk (whom I like, out of pure provocation) calls Grok 4 a Ph.D.-level AI because it never fails math tests and trick questions. Fine if it can solve a 5th-order PDE and output the result in Alexandrine verses. </p><p>But no one is paid for that. </p><p>It is only marginally closer to being autonomous than GPT-3.5.</p><p>The current path of AI development will not bring economically meaningful superintelligence in the foreseeable future. I&#8217;m not saying superintelligence will never happen, just that it&#8217;s unlikely to happen by scaling current approaches. 
We need a few breakthroughs, and as far as I know, no one knows what they will consist of.</p><h2>Conclusion</h2><p>In 1969, if you had told the average American &#8220;you will never live to see human settlements on the Moon or humans on Mars&#8221;, he would have answered &#8220;What? Am I going to get cancer or something? Are you saying I have less than 10 years left to live?&#8221;. The possibility of space exploration being at its apex was unimaginable. But it was.</p><p>In 1999, if you had told him &#8220;the internet is not going to be a big deal&#8221;, he would have called you a fool. Rightly so.</p><p>Today, no one knows where AI is going, but there seem to be hard technical problems to solve. The forward trajectory looks more like that of space exploration and cold fusion than that of the internet.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I don&#8217;t remember the exact code, but the loss function was probably something like loss(logits) = -log(max(softmax(logits)))</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>And code.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I&#8217;ve heard the argument that, as more people interact with chatbots, conversation data becomes more abundant and allows AI companies to train better models. So this would amount to a network effect. But I am not sure about it, because unlabelled conversation data is notoriously difficult to work with. 
So much so that most AI companies offer their models for free on https://lmarena.ai just to collect a bit of human feedback, because the poorly labelled conversation data they get there is still more valuable than the formidable amount of raw data they have at home.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>It all depends on the spectral decay of the &#8220;perfect&#8221; LLM&#8217;s transfer function. Check <a href="https://www.eloidereynal.com/p/hacking-spectral-bias-using-carrier">this post</a> if you&#8217;re interested.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Hacking Spectral Bias: Using Carrier Functions to Increase Parameter Efficiency]]></title><description><![CDATA[This title is clickbait, but only to a small portion of the population.]]></description><link>https://www.eloidereynal.com/p/hacking-spectral-bias-using-carrier</link><guid isPermaLink="false">https://www.eloidereynal.com/p/hacking-spectral-bias-using-carrier</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Thu, 03 Jul 2025 21:17:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cZVx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>When writing about the Stateful Transformer, I came across some nice ML concepts and discovered a few things worth writing about. 
This post is quite technical, though I refrained from using mathematical expressions when the concept behind them was explainable with words.</em></p><p><em>The core idea is that adding a well-chosen function to the target (and subtracting it at test time) can help escape local minima in the loss landscape.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cZVx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cZVx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cZVx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cZVx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cZVx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cZVx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1831854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cZVx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cZVx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cZVx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cZVx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02757d0d-2140-4759-bd62-f3b0db8d817e_2048x1536.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A nice French forest in the Lyon area, where I wrote the beginning of this piece.</figcaption></figure></div><h1>Learning curve and learning task</h1><p>This may seem a bit naive to some of you, but I was wondering if the minimum attainable loss vs. number of layers (with a fixed width) curve had a definite shape, for example if adding a transformer layer would yield a predictable loss improvement.</p><p>It turns out it doesn&#8217;t. In fact, the shape of the loss vs. 
number of layers depends entirely on the task to learn, and especially on the spectral decay of the function to approximate.</p><p>Let&#8217;s check that.</p><h4>Random points</h4><p>I tried to fit a GELU MLP<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> to a function defined by random points<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t_9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t_9A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 424w, https://substackcdn.com/image/fetch/$s_!t_9A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 848w, https://substackcdn.com/image/fetch/$s_!t_9A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 1272w, https://substackcdn.com/image/fetch/$s_!t_9A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!t_9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/968aa632-478f-42da-b105-baf89d94c497_3352x1920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2399419,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!t_9A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 424w, https://substackcdn.com/image/fetch/$s_!t_9A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 848w, https://substackcdn.com/image/fetch/$s_!t_9A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 1272w, 
https://substackcdn.com/image/fetch/$s_!t_9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F968aa632-478f-42da-b105-baf89d94c497_3352x1920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" 
data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/Pajob/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15604090-1204-4b88-9bbe-47ffe6161764_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:449,&quot;title&quot;:&quot;Loss vs. Number of Parameters (best of 5 runs) (Random points)&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/Pajob/1/" width="730" height="449" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><p>First of all, the random points are not only random on y, but also on x, giving a slowly decaying frequency spectrum (as shown in the next figure) to the function to approximate. </p><p>Indeed, if the points were evenly spaced, there would be an obvious frequency peak in the vicinity of {number of points per unit} cycles per unit. But, as the points are unevenly spaced, the frequencies are spread, roughly following a 1/(distribution of distance between two neighboring points drawn from a uniform distribution) law. <a href="https://math.stackexchange.com/questions/2994197/minimum-distance-from-uniformly-distributed-points">Which seems quite difficult to compute</a>.</p><p>Anyway.</p><p>What we can see is that adding layers initially decreases the loss, resulting in early quick wins, until the points to fit are too hard to reach. Then, the loss plateaus. 
It could go down to 0 if we had a good enough network. But such network seems impossible to get just by increasing the number of layers: we run into training instabilities due to excessive depth before being able to model the hyper-high frequencies of the tail of the spectrum.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PuP9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PuP9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 424w, https://substackcdn.com/image/fetch/$s_!PuP9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 848w, https://substackcdn.com/image/fetch/$s_!PuP9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 1272w, https://substackcdn.com/image/fetch/$s_!PuP9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PuP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png" width="1456" height="429" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:546760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!PuP9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 424w, https://substackcdn.com/image/fetch/$s_!PuP9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 848w, https://substackcdn.com/image/fetch/$s_!PuP9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 1272w, https://substackcdn.com/image/fetch/$s_!PuP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7ed9f-04b5-429d-b07f-e781d2bcf221_3288x968.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Left: the function to approximate. Right: its frequency spectrum.</figcaption></figure></div><h4>Variable frequencies</h4><p>The architecture is the same, but now the function to approximate is cos(15x&#179;). 
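</p><p>As a quick numerical sanity check on this target (the interval choices are mine, not from the original experiment), we can count its zero crossings near the center and near the edge of the domain:</p>

```python
import numpy as np

def zero_crossings(f, lo, hi, n=20001):
    """Count sign changes of f on [lo, hi] over a dense grid."""
    s = np.sign(f(np.linspace(lo, hi, n)))
    return int(np.sum(s[:-1] * s[1:] < 0))

target = lambda x: np.cos(15 * x**3)

# The phase 15x^3 sweeps only 0.12 rad on [0, 0.2], but 7.3 rad on [0.8, 1.0],
# so the local frequency (45x^2 / 2pi) is far higher near the edge.
center = zero_crossings(target, 0.0, 0.2)  # 0 crossings
edge = zero_crossings(target, 0.8, 1.0)    # 3 crossings
print(center, edge)
```

<p>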
The frequency of this function obviously grows with |x|.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V0SU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V0SU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 424w, https://substackcdn.com/image/fetch/$s_!V0SU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 848w, https://substackcdn.com/image/fetch/$s_!V0SU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 1272w, https://substackcdn.com/image/fetch/$s_!V0SU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V0SU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png" width="1456" height="847" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2571034,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!V0SU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 424w, https://substackcdn.com/image/fetch/$s_!V0SU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 848w, https://substackcdn.com/image/fetch/$s_!V0SU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 1272w, https://substackcdn.com/image/fetch/$s_!V0SU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570607da-fd2f-478a-b16a-3e5795ca6abe_3366x1958.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Strong spectral bias observed.</figcaption></figure></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/fmXOS/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c95aa5b-d8dc-45b5-925d-0180943a52c4_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:472,&quot;title&quot;:&quot;Loss vs. 
Number of Parameters (best of 5 runs, Variable frequencies)&quot;,&quot;description&quot;:&quot;Constant sampling rate.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/fmXOS/1/" width="730" height="472" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Two things here: </p><ol><li><p>The loss decreases exponentially: it is roughly halved by each added layer up to layer 5, and the last layer divides it by 4.</p></li><li><p>You can see a near-perfect illustration of the spectral bias: NNs have a strong tendency to fit low frequencies first.</p></li></ol><p>Now, the first point shows that adding a layer has no fixed, task-independent impact on the loss: the effect depends on the complexity of the function to model. The loss curve here is dramatically different from that of the random points approximation problem.</p><p>The second point is more interesting. Let me digress a bit more about spectral bias first.</p><p>The spectral bias of NNs <a href="https://arxiv.org/pdf/1806.08734">has been extensively studied</a>, and the math behind it is sound. But I wonder if real-life spectral bias also has something to do with the fact that, in some cases, the local sample rate of the dataset is not tuned to match the local frequency of the functions to approximate. Let&#8217;s take an example.</p><p>Say we want to train a CNN to identify animal species from their pictures. 
The function to approximate has high frequency regions (felid and bird species sometimes differ only by subtle features) and low-frequency ones (equids are very easy to tell apart: Striped =&gt; Zebra. Gray and long ears =&gt; Donkey. All the rest =&gt; Horse). </p><p>If we don&#8217;t oversample the high-frequency regions (providing a lot of pictures for each bird and felid species) in the dataset, we are likely to increase the spectral bias of our NN, as a somewhat low-frequency function will do the job of approximating the dataset. Mainly because the total loss on one epoch won&#8217;t be frequency-weighted.   </p><p>This effect is quite obvious so I think it is often taken into account when designing datasets. In financial data modeling, it&#8217;s not always the case, though.</p><p>Here, the sample rate is constant, and independent of the local frequency. That explains part of the spectral bias. </p><p>In the following training runs, the initially evenly spaced datapoints have been spread out by cubic-rooting them, which allows for a sample rate proportional to the signal frequency<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VFRj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VFRj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 424w, 
https://substackcdn.com/image/fetch/$s_!VFRj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 848w, https://substackcdn.com/image/fetch/$s_!VFRj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 1272w, https://substackcdn.com/image/fetch/$s_!VFRj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VFRj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png" width="1456" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2722261,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!VFRj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 424w, https://substackcdn.com/image/fetch/$s_!VFRj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 848w, https://substackcdn.com/image/fetch/$s_!VFRj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 1272w, https://substackcdn.com/image/fetch/$s_!VFRj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b625a2-0a4e-4884-a7ef-11046d2648fb_3360x1922.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Despite the fact that the sampling rate is proportional to the signal frequency, we still observe a strong spectral bias.</figcaption></figure></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/cyfrh/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdf2901a-5a2f-4a1a-9bbc-a94e679215ec_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:470,&quot;title&quot;:&quot;Loss vs. Number of Parameters (best of 5 runs, Variable frequencies)&quot;,&quot;description&quot;:&quot;Data sampling rate proportional to signal frequency.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/cyfrh/2/" width="730" height="470" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>There is still a bit of spectral bias. Also, despite more datapoints being in the &#8220;difficult region&#8221;, the NN converges quicker.</p><h2>Enter the edge bias</h2><p>Next I wanted to see how a GELU network would approximate a single-frequency function (excluding border effects). Would the loss vs. 
parameter count curve abruptly go from one to zero as soon as we reach a certain model size?</p><p>I discovered something quite interesting: when you approximate a single-frequency function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, the extremities of the input distribution are approximated first. Illustration below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h3RJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h3RJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 424w, https://substackcdn.com/image/fetch/$s_!h3RJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 848w, https://substackcdn.com/image/fetch/$s_!h3RJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 1272w, https://substackcdn.com/image/fetch/$s_!h3RJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h3RJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png" 
width="1456" height="383" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:383,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:590003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h3RJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 424w, https://substackcdn.com/image/fetch/$s_!h3RJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 848w, https://substackcdn.com/image/fetch/$s_!h3RJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 1272w, https://substackcdn.com/image/fetch/$s_!h3RJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeae087b-dfcd-4b85-a238-1eea079d1461_2218x584.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">When modeling a pure frequency function, NNs start approximating the function near the extremity of the input domain.</figcaption></figure></div><p><br>I may be anticipating a bit on the rest, but it all comes down to this: &#8220;The frequency of a function composition depends both on the outer function&#8217;s frequency and the inner function&#8217;s <strong>frequency AND amplitude.&#8221; </strong>It&#8217;s pretty obvious if you take f(x) = sin(x) as the outer function and g(x) = a*x as the inner function.</p><p>This amplitude/frequency coupling is one of the reasons why the Fourier transform of a composition of functions is intractable.</p><p>NNs are a composition of as many functions as they have layers, so this coupling happens a lot.</p><p>Let&#8217;s try to see why it implies that they 
are better able to approximate higher frequencies at the border of the input distribution.</p><p>The following is not a super rigorous mathematical proof, but I think it gives a good enough intuition of the phenomenon.</p><p>Let&#8217;s take an underparametrized MLP trying to approximate the above constant-frequency function.</p><p>As it is underparametrized, each layer will underfit its &#8220;ideal&#8221; transfer function, that is, the transfer function that would allow the whole MLP to approximate the cosine function.</p><p>The maximum representable frequency of an NN is basically proportional to the number of &#8220;kinks&#8221; in the curve (assuming ReLU-like activations), which itself grows exponentially with the number of layers, and polynomially with their widths.</p><p>So, for a given trained NN, the maximum representable frequency grows with depth: if, for example, we track the frequency content of each hidden state, it likely grows with the depth of our NN.</p><p>This implies that the lower layers of the NN, taken together, have a lower-frequency function to approximate than the whole NN. </p><p>The problem is that, as we&#8217;re dealing with an underparametrized network, the lower layers have trouble fitting even this low-frequency function. As they are subject to the &#8220;normal&#8221; spectral bias, they will first try to approximate its lower frequencies. </p><p>Assuming our batch size is equal to our number of input points, the backward pass will basically ask, for each point: &#8220;What is the best nth-order function that models what&#8217;s on our left and what&#8217;s on our right?&#8221;, with n determined by the width and the number of layers at that point. 
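</p><p>The claim that representable frequency scales with the number of kinks, and that kinks multiply with depth, can be illustrated with the classic tent-map construction (a hand-built ReLU network, not a trained one):</p>

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def tent(x):
    # Tent map on [0, 1] written as a one-hidden-layer ReLU network:
    # 2x on [0, 0.5], 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def net(x, depth):
    # Composing the tent map `depth` times = a depth-layer ReLU network.
    for _ in range(depth):
        x = tent(x)
    return x

def count_peaks(y):
    # Each local maximum marks a pair of "kinks" (slope sign changes).
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

x = np.linspace(0.0, 1.0, 4097)  # fine enough to resolve peaks at odd/2^depth
peaks = [count_peaks(net(x, d)) for d in range(1, 5)]
print(peaks)  # [1, 2, 4, 8]: the oscillation count doubles with every layer
```

<p>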
Near the extremities of the input domain, the optimization is less constrained: there is one direction<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> in which you only have a few points to approximate, and you don&#8217;t care what&#8217;s beyond these points.</p><p>So the lower layers have an easier time approximating their target function near the extremities. Unless the target function happens to be strongly biased toward a lower norm at the extremities of the input domain, this leaves the approximated function with a somewhat higher derivative/amplitude near the extremities.</p><p>You may object that the optimization process does not happen layer by layer and that there are no intermediate target sub-functions to approximate, and you&#8217;d be right, but my point still holds: the lower layers still tend to yield functions with a higher amplitude at the frontier of the input domain<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>Now, given that, in a composition of two functions, a high <em>inner function</em> (i.e. <em>lower layer</em>) amplitude leads to a higher frequency of the composition, we can infer that the upper layers will have an easier time modeling higher frequencies where their input (i.e. the output of the lower layers) has a large amplitude. The learning then continues inwards, toward the center of the distribution, but slowly: it&#8217;s difficult for the next layer to model a complex function where the amplitude is small. The representable input domain thus slowly grows inwards<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>The next layers then inherit a high-frequency, high-amplitude input at the edge of the input distribution and a slightly larger useful input domain. 
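</p><p>The amplitude/frequency coupling itself is easy to verify numerically with the sine example from earlier: scaling the inner function g(x) = a*x scales the oscillation count of the composition sin(g(x)) by the same factor (a minimal sketch):</p>

```python
import numpy as np

def crossings(a, n=10001):
    """Zero crossings of sin(a*x) on (0, 1]; grows linearly with a."""
    x = np.linspace(1e-6, 1.0, n)
    s = np.sign(np.sin(a * x))
    return int(np.sum(s[:-1] * s[1:] < 0))

# Quadrupling the inner amplitude a quadruples the composition's frequency.
low, high = crossings(10.0), crossings(40.0)
print(low, high)  # 3 12
```

<p>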
These layers increase this edge bias even further.</p><p><strong>It indeed seems that the edge spectral bias grows with NN depth.</strong></p><p>If you have as much trouble understanding it as I&#8217;ve had, let me sum up this explanation:</p><ol><li><p>The optimization process generally has more degrees of freedom at the extremities of the training input domain, and this holds for every layer.</p></li><li><p>This results in higher amplitude at the edges.</p></li><li><p>Which translates to higher representable frequencies at the edges, because of the aforementioned effect of inner-function amplitude on the Fourier spectrum of a composition.</p></li></ol><p>The edge stuff is just the primer of the phenomenon; the real driver is the amplitude/frequency coupling.</p><p>If, for example, I force a low-frequency, high-amplitude, center-heavy component into our function, we find that the center gets fit first<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Slv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Slv5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 424w, https://substackcdn.com/image/fetch/$s_!Slv5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 848w, 
https://substackcdn.com/image/fetch/$s_!Slv5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 1272w, https://substackcdn.com/image/fetch/$s_!Slv5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Slv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png" width="1456" height="395" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:395,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1136998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Slv5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 424w, 
https://substackcdn.com/image/fetch/$s_!Slv5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 848w, https://substackcdn.com/image/fetch/$s_!Slv5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 1272w, https://substackcdn.com/image/fetch/$s_!Slv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1df0e6ad-53e2-44a8-a260-6d27eb30e7c6_3366x914.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, I find this phenomenon (edge bias + ampl/freq coupling) explains some surprising things in ML. Let&#8217;s dig into it.</p><p></p><h2>Edge bias and dimensionality</h2><p>Now, any seasoned ML scientist would rightly think &#8220;So, you&#8217;re telling me about edges and you&#8217;re using a 1D example to illustrate it? That&#8217;s borderline dishonest. Everyone knows that things totally change with dimensionality. Have you heard about the shell effect?&#8221;.</p><p>Indeed, uniformly distributed high-dimensional vectors are almost always near the edges, due to the &#8220;shell effect&#8221;. So, maybe this edge bias doesn&#8217;t matter much in practical ML applications?</p><p>Let&#8217;s check that.</p><p>Here is a visualization of 1D and 3D edge biases.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QpI2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QpI2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 424w, https://substackcdn.com/image/fetch/$s_!QpI2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 848w, https://substackcdn.com/image/fetch/$s_!QpI2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QpI2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QpI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png" width="1456" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:771958,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QpI2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 424w, https://substackcdn.com/image/fetch/$s_!QpI2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 848w, 
https://substackcdn.com/image/fetch/$s_!QpI2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 1272w, https://substackcdn.com/image/fetch/$s_!QpI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20276116-1368-4cd2-a25a-e478e4f9160f_3314x1070.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Here, I plotted the Mean Squared Error vs. the distance to the center. 
Function to approximate is y = cos(x**1.5 * 15). We can see that, with 2 hidden layers, the extremities and the center get fit first. As the edge bias grows with depth, the 3-layer network fits the extremities first and doesn&#8217;t care about the center. In red, you can see the number of samples for each 0.1 norm interval. It is not constant because I wanted to make sure the sample rate was proportional to the frequency. The most important thing here is that failing to fit the center leaves a lot of points behind and greatly hurts the MSE on the whole dataset.</figcaption></figure></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bfi2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bfi2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 424w, https://substackcdn.com/image/fetch/$s_!Bfi2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 848w, https://substackcdn.com/image/fetch/$s_!Bfi2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 1272w, https://substackcdn.com/image/fetch/$s_!Bfi2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Bfi2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png" width="1456" height="879" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:879,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2028403,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bfi2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 424w, https://substackcdn.com/image/fetch/$s_!Bfi2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 848w, https://substackcdn.com/image/fetch/$s_!Bfi2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 1272w, https://substackcdn.com/image/fetch/$s_!Bfi2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b9ca71f-5039-4b9b-88f6-1a9cd7804fa2_3364x2030.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Here, the function to approximate is <strong>y = cos(x[0]  * 7) + cos(x[1]  * 7) + cos(x[2] * 7)</strong>. Still some kind of edge bias: the edges are better approximated than the center. This effect grows with NN depth. Still, it doesn&#8217;t matter much, as the center accounts for a negligible portion of the training samples and of the total loss. So there still is an edge bias, but it has less impact. </figcaption></figure></div><p>By the way, the definition of &#8220;edge&#8221; depends on the norm you use. 
And what&#8217;s funny is that the appropriate norm depends on the interaction between the input components. If you expect no interaction at all (ie an additively-separable function), you should use the L1 norm (that&#8217;s what I did, as the 3D function to approximate is additively separable). If you expect moderate interaction, you should use an L{moderate order} norm<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>.</p><p>Despite the shell effect, the edge bias might in fact be a thing in high dimension:</p><ol><li><p>Most real-world functions feature no input interaction or low-order ones. For example, I have heard that genomic studies rarely find interaction between genes. Meaning that if 1000 different genes code for IQ, it is not absurd to model the impact of each gene independently from the others.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> In financial modeling, which I&#8217;m more familiar with, input feature interactions are of a low order, provided you&#8217;ve already curated the data and pre-computed useful ratios. <br><strong>So, in theory, the norms used to define real-world edges should be low-order ones.</strong></p></li><li><p>Most real world data distributions are center-heavy (eg. normal). So the input vectors comprise a certain number of <strong>normally distributed independent input features.</strong> And if the features are not independent, then we just have to <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> the input space to make them more so.</p></li><li><p>Using 1. and 2, in the case where we have normally distributed input features and where the L1 norm is the most appropriate (low order interaction), the whole input space shows no shell effect at all. 
<br>Quite the opposite in fact: the majority of the input samples will be very closely distributed around the center, as their norm will be a sum of the absolute values of normally distributed features. I guess it results in a normal(-ish?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>) shape for the nb of samples vs. norm curve, probably with a very low sd. <br>So, no shell effect at all. <br>If we used a very high order norm, L&#8734; for example, we would get a very strong shell effect, even with normally distributed input features, as the odds of a sample not being at the edge would decrease exponentially as the input dimension grows.</p></li></ol><p>So, the edge bias might still be a thing in high dimension. Please tell me if you have come across something resembling it in your real-world ML experience. I have not tested it. </p><p>Could be nice to do, but for now, I have a more interesting and practical thing to show.</p><h2>Amplitude / Frequency coupling consequences: what if we added a function during training and subtracted it for inference?</h2><p>The idea I want to explore here is that, as high frequencies are easier to model when the amplitude is locally high, what if we add a function whose local amplitude (meaning, its derivative) is high where the target function&#8217;s local frequency is high?</p><p>Here is an example.</p><p>Function to approximate:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = \\sum_{i=1}^{3} \\cos(7\\pi x_i^2)&quot;,&quot;id&quot;:&quot;MDTDBNXGSF&quot;}" data-component-name="LatexBlockToDOM"></div><p>So, the frequency grows linearly with x&#8217;s norm.</p><p>What if we add this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = 0.4 
\\left( \\sum_{i=1}^{3} |x_i| \\right)^2&quot;,&quot;id&quot;:&quot;IOERUZBDEB&quot;}" data-component-name="LatexBlockToDOM"></div><p>To get:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = \\sum_{i=1}^{3} \\cos(7\\pi x_i^2) + 0.4 \\left( \\sum_{i=1}^{3} |x_i| \\right)^2&quot;,&quot;id&quot;:&quot;BWQMOVLVVY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Or, if you prefer code, here is our function to approximate:</p><p>y = np.cos(np.pi * np.abs(x)**2 * 7) + .4 * np.abs(np.linalg.norm(x, ord=1, axis=1, keepdims=True))**2</p><p>The first term is our target function. The second one is useful only in that its derivative is proportional to our target frequency. The .4 factor was just empirically found to be quite good. It&#8217;s a bit reminiscent of a &#8220;carrier wave&#8221; in radio signals.</p><p>If we train the exact same model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> to approximate the target alone or the full function, here are the results:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/U98Jv/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a93fd1b6-7210-4130-b1d6-d5d0d7e00793_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:294,&quot;title&quot;:&quot;Matching amplitude to frequency makes it easier for a NN to learn.&quot;,&quot;description&quot;:&quot;AVG loss has been computed as the average of 4 best of 4 training runs, on each target function.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/U98Jv/1/" width="730" height="294" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use 
strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><p>So, a 1-layer network has a hard time approximating the full function, as it&#8217;s a bit more complex than the mere target. Things get easier for the 2-layer NN, and the 3-layer one really takes advantage of the improved digestibility of the full function. So, amplitude/frequency coupling really is a thing, at least in this experimental setting.</p><p>Interestingly enough, this effect doesn&#8217;t hold with even deeper networks. The difference between &#8216;target alone&#8217; and &#8216;full function&#8217; losses becomes insignificant past a depth of 4 hidden layers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>.</p><p>I don&#8217;t know exactly where this comes from, but I suspect a depth of 3 is the sweet spot between &#8220;The additional complexity of the full function makes it harder to model by a seriously underparametrized NN&#8221; and &#8220;There are enough layers for the lower ones to figure out they should have higher gradients where the expected frequency is high&#8221;.</p><h2>What about carrier function learning?</h2><p>When we look at the number of parameters of the models and the number of linear regions theoretically attainable by their transfer functions, we can see that they are largely sufficient for approximating our target with negligible error. Backpropagation is just doing a poor job of optimizing them.</p><p>The problem comes from the layered structure of Deep Neural Nets.<br>On the one hand, it&#8217;s very effective: it enables complex, non-linear functions at a low computational cost. 
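</p><p>To make that efficiency claim concrete, here is a back-of-the-envelope sketch (the piece-count formula is a deliberate simplification of known compositional upper bounds, not an exact result): for a 1D ReLU MLP of width 32, the parameter count grows linearly with depth while the attainable number of linear pieces grows geometrically.</p>

```python
def max_pieces(w, depth):
    # Simplified upper bound on the linear pieces of a 1D -> 1D ReLU MLP:
    # one hidden layer of width w gives at most w + 1 pieces, and each
    # additional layer can multiply the count by about w.
    return (w + 1) * w ** (depth - 1)

def n_params(w, depth):
    # Weights + biases of a [1, w, ..., w, 1] fully connected network.
    return (1 * w + w) + (depth - 1) * (w * w + w) + (w * 1 + 1)

for depth in (1, 2, 3, 4):
    print(depth, n_params(32, depth), max_pieces(32, depth))
```
<p>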
This efficiency comes from two key factors: the chain rule and the way complexity (the number of linear regions) grows exponentially with the number of layers. <br>So the number of linear regions grows exponentially with depth, while training cost only grows linearly. Quite cool.<br>But on the other hand, it constrains the loss landscape by making the tuning of each parameter dependent on the others. This results in a loss landscape with an effective dimension far lower than the number of parameters. This, in turn, makes the optimizer much more likely to get stuck in local minima<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>.</p><p>Adding a carrier function seems to mitigate that, but we are actually cheating here, because we already know where the target function has a high frequency.</p><p>So, in real-world applications, it would have to be dynamic (if not learnable).</p><p>Maybe we could imagine a kind of regularization layer that adds a carrier function to the input of the next layer. This function should not be learnt through backpropagation, but rather through some computation of a proxy for the frequency of the output of the next layer.</p><p>But I will write about this later. This blog post is already long enough.</p><h2>Conclusion / Some key takeaways</h2><ol><li><p>The shape of the loss vs. nb of layers curve of a fixed-width MLP depends on the spectral decay of the target function. Pretty obvious when you think about it.</p></li><li><p>Real-world spectral bias might be partly caused by a failure to match the local sampling rate of the input to the local frequency of the target function. I have no evidence for that though.</p></li><li><p>The observed frequency/amplitude coupling in Deep Neural Nets causes a few interesting phenomena, including:</p><ol><li><p>A kind of edge bias, where NNs tend to better learn data at the edge of the input domain. 
This edge bias might keep being a thing even with high-dimensional inputs, depending on the order of the interaction between input components and the distribution of said input components.</p></li><li><p>The fact that you can increase the parameter efficiency of a model just by adding a kind of &#8220;carrier function&#8221; to the targets (and subtracting it at test time).</p></li></ol></li><li><p>It could be interesting to design a kind of regularization layer based on 3.b.</p></li></ol><p>Researching this was nice. I&#8217;ve only scratched the surface and probably have reinvented the wheel a lot, but I&#8217;ve had fun doing it.</p><h2>Notes:</h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Input dim = 1, Output dim = 1, 1 to 6 hidden layers of dim 32 with GELU activation functions, trained for 10,000 epochs on the 50 points with Adam(lr=1e-3, weight_decay=0). 
Took the loss of the best of 5 runs for each depth.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Code for the random points: <br>N = 50<br>x = np.random.uniform(-1, 1, (N, 1))<br>y = np.random.normal(0, 1, (N, 1))</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>N = 500<br>x = np.linspace(-1, 1, N).reshape(-1, 1)<br>x = np.cbrt(x)<br>y = np.cos(np.pi * x**3 * 15)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>N = 500<br>x = np.linspace(-1, 1, N).reshape(-1, 1)<br>y = np.cos(np.pi * x * 15)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>We are still in the 1D case</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The graph (and note below) illustrate this categorical assertion.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This plot illustrates the phenomenon: the first &#8220;kick&#8221; in the curve happens closer and closer to zero (the center of the input distribution) as we move deeper into the hidden layers.</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!juAg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!juAg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 424w, https://substackcdn.com/image/fetch/$s_!juAg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 848w, https://substackcdn.com/image/fetch/$s_!juAg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!juAg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!juAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png" width="467" height="322.98695054945057" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1007,&quot;width&quot;:1456,&quot;resizeWidth&quot;:467,&quot;bytes&quot;:401699,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!juAg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 424w, https://substackcdn.com/image/fetch/$s_!juAg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 848w, https://substackcdn.com/image/fetch/$s_!juAg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!juAg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ee89749-7915-44f7-af58-18ecbd159e12_2158x1492.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>import numpy as np<br>N = 500<br>decay_rate = -4<br>x = np.linspace(-1, 1, N).reshape(-1, 1)<br>y = np.cos(np.pi * x * 15) + 2.71828**(decay_rate*abs(x))*2<br>(Yes, I could have used math.exp(1))</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>This is definitely not rigorous. What I&#8217;m saying is that the order of the norm used to define the edges should roughly reflect the order of the interaction between input variables.
But even if your target function is separable or if the input variables&#8217; interaction order exactly matches that of the norm you use (e.g. if you wanted to model f(x, y, z) = sqrt(x**2 + y**2 + z**2)), you have no guarantee that the <em><strong>actual</strong></em> transfer function of your trained NN involves interactions of this exact order.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>In fact, upon closer study, it seems that this lack of observed interaction comes from a lack of data: you need a great many data points to figure out the interaction between multiple input variables, and there are only so many genomes you can sequence. So, not the best example I could give, actually.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Not sure of the impact of the absolute value here.
Not given it too much thought.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Looks like this in 2D:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N0FC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N0FC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 424w, https://substackcdn.com/image/fetch/$s_!N0FC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 848w, https://substackcdn.com/image/fetch/$s_!N0FC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 1272w, https://substackcdn.com/image/fetch/$s_!N0FC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N0FC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png" width="326" height="341.0925925925926" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:904,&quot;width&quot;:864,&quot;resizeWidth&quot;:326,&quot;bytes&quot;:719019,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/161097453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N0FC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 424w, https://substackcdn.com/image/fetch/$s_!N0FC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 848w, https://substackcdn.com/image/fetch/$s_!N0FC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 1272w, https://substackcdn.com/image/fetch/$s_!N0FC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12b608-9f12-4a9c-8de6-d880a16028b6_864x904.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>input dim = 3, <br>hidden dim = 32, <br>LeakyReLU activation, <br>1 to 3 hidden layers, <br>output dim = 1,<br>1e6 data points, sampling rate proportional to frequency.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Oh, and the difference with  &lt; 3 layers really is significant, it&#8217;s not just a case of &#8220;try enough things and at some point you&#8217;ll get a significant result&#8221;. 
I&#8217;ve done a few training runs with different hidden dims and activations, and the same effect can be observed. For example, the results are more impressive with a hidden dim of 64, and seem to remain significant for deeper networks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Indeed, there are far fewer local minima in high-dimensional spaces: the odds of having a positive second derivative and a null derivative of the loss wrt every dimension become exponentially lower as the number of dimensions grows. This holds ONLY IF the partial derivatives are linearly independent from one another, which is not the case because of the deep structure of NNs.</p></div></div>]]></content:encoded></item><item><title><![CDATA[A (partially) failed attempt at improving the Transformer architecture.]]></title><description><![CDATA[Why it failed, explained at length. Nice cattle pictures as a bonus.]]></description><link>https://www.eloidereynal.com/p/a-partially-failed-attempt-at-improving</link><guid isPermaLink="false">https://www.eloidereynal.com/p/a-partially-failed-attempt-at-improving</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Wed, 26 Mar 2025 01:07:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!agNy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is a bit technical. I&#8217;ve tried to make it as simple as possible and not to hide a lack of intuitive understanding behind complex mathematical expressions. However, I believe it&#8217;s still fairly complex. By carefully reviewing this post, you'll gain a deeper understanding of key ML concepts.
If you&#8217;re not familiar with the Transformer architecture, I strongly recommend 3b1b&#8217;s <a href="https://www.youtube.com/watch?v=wjZofJX0v4M&amp;t=183s&amp;pp=ygUQM2IxYiB0cmFuc2Zvcm1lcg%3D%3D">videos</a> <a href="https://www.youtube.com/watch?v=eMlx5fFNoYc">on</a> <a href="https://www.youtube.com/watch?v=9-Jl0dxWQs8">the subject</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. I made this post for people who like to dive deep into engineering problems and their (absence of) solutions. Please read the footnotes.</em></p><p>The final hidden state of a Transformer's channel &#8211; that is, the embedding vector after the last decoder layer &#8211; theoretically contains n_embd * 32 (or 16, depending on the floating-point precision) bits of information. This is because it's composed of n_embd components, each encoded as a 32-bit floating-point number (fp32).</p><p>This final hidden state is then transformed into a probability distribution over all possible tokens. We sample from this distribution, resulting in a single token chosen from a vocabulary of vocab_size possibilities. The amount of information contained in this individual token can be quantified as log2(vocab_size).</p><p>The Llama 3-8B model has an embedding size of 4096 and a vocabulary size of 128,256. This means the unquantized last hidden state contains 4096 * 32 = 131,072 bits of information. However, this gets compressed down to approximately log2(128256) &#8776; 17 bits when predicting the next token.
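</p><p>As a quick back-of-the-envelope check of these numbers (my own sketch, not code from the original post):</p>

```python
import math

n_embd = 4096          # Llama 3-8B embedding size
bits_per_float = 32    # fp32
vocab_size = 128_256   # Llama 3 vocabulary size

hidden_state_bits = n_embd * bits_per_float   # capacity of the last hidden state
token_bits = math.log2(vocab_size)            # information in one sampled token

print(hidden_state_bits)                      # 131072
print(round(token_bits, 2))                   # 16.97
print(round(hidden_state_bits / token_bits))  # ~7724-fold reduction
```

<p>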
</p><p>That's a significant reduction in information, which seems especially important considering that Transformers, as autoregressive models, rely on the generated tokens up to step n to predict the next token at step n+1.</p><h4>Proposed architecture improvement: enriching token embeddings with the last hidden state of their generation step.</h4><p>This idea sounded great: I was already picturing myself getting the Turing Award, making the Times cover and receiving a lot of emails from Altman, Amodei et al. begging me to honor them by joining their companies. The <strong>Stateful GPT</strong> would give me fame, money and girls.</p><p>It turns out this wasn't as great an idea as I initially thought, for several reasons. The most obvious reasons are not the most critical ones.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!agNy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!agNy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!agNy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!agNy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 1272w,
https://substackcdn.com/image/fetch/$s_!agNy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!agNy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg" width="800" height="536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Cattle - Wikipedia&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cattle - Wikipedia" title="Cattle - Wikipedia" srcset="https://substackcdn.com/image/fetch/$s_!agNy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!agNy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!agNy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!agNy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F014abdd4-2a33-44f2-8ef4-9fbf69103579_800x536.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A respectable being that doesn&#8217;t care about enriching token embeddings</figcaption></figure></div><h2>1st Challenge: this is basically an RNN, so what about training parallelization?</h2><p>When I first posted my idea on Reddit to get some feedback on this idea from people smarter than me, they all said &#8220;you&#8217;re basically 
re-inventing Recurrent Neural Networks. The exact architecture that was made obsolete by Transformers. Not good. A Transformer&#8217;s training can be parallelized, your architecture&#8217;s cannot.&#8221; It turns out they&#8217;re partially right on that, but only partially.</p><p>First, I&#8217;d like to take a step back and explain why RNN training is difficult to parallelize and why it matters. It is only loosely related to the main subject of this post, but I find it interesting enough.</p><h4>Why the training of RNNs is difficult to parallelize</h4><p>I won&#8217;t dive too deep into how RNNs work, and as most blog posts do a very poor job of explaining them<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, I&#8217;ll link <a href="https://grok.com/share/bGVnYWN5_b32a3c6e-6790-43c7-abd4-1abc5f6f9606">a prompt I gave Grok</a>. It did great, you can trust the answer. The core concept here is that in an RNN, to predict token n, you first have to predict all tokens up to n-1. So training an RNN on a text sequence of length n involves at least n sequential steps, each dependent on the completion of the last one. This is the exact opposite of parallelized training.</p><p>By contrast, a Transformer can be fed an n-length sequence and be trained to predict each token k from 2 to n+1 in parallel. If we take the training sequence &#8220;the cat sat on the mat&#8221;, we get (6-1 =) 5 examples.</p><p>1.  "the" -&gt; "cat"<br>2.  "the cat" -&gt; "sat"<br>3.  "the cat sat" -&gt; "on"<br>4.  "the cat sat on" -&gt; "the"<br>5.  "the cat sat on the" -&gt; "mat"</p><p>These predictions can be performed in parallel thanks to the attention mechanism&#8217;s ability to handle variable context sizes and process each position's prediction independently (given the shared input).
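</p><p>Concretely, these shifted (context -&gt; next token) pairs are just offset views of the same sequence; a toy sketch (mine, not the post&#8217;s code):</p>

```python
# One training sequence yields len(tokens) - 1 prediction examples,
# all of which a causal Transformer can learn from in a single parallel pass.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

inputs = tokens[:-1]    # what the model reads
targets = tokens[1:]    # what it must predict at each position

pairs = [(" ".join(inputs[:k + 1]), targets[k]) for k in range(len(targets))]
for context, nxt in pairs:
    print(context, "->", nxt)
# the -> cat
# the cat -> sat
# ...
# the cat sat on the -> mat
```

<p>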
This ability is slightly hampered by my proposed architecture update, but we&#8217;ll see that later.</p><p>Now, you can actually parallelize the training of RNNs by using a large batch size: you can backpropagate through multiple full sequences in parallel. The first problem is that the memory overhead is substantial: for each sequence, you have to store the whole sequential computation graph<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> to then backprop through it. The second one, even worse, is that each sequence will still be processed sequentially.</p><p>Transformers&#8217; computation graph, on the other hand, is quite light thanks to the absence of recurrence. And the only sequential thing about them is the layer-by-layer processing.</p><p>All in all, it&#8217;s more accurate to state that the training of RNNs is <em>not very</em> (yet still somewhat) parallelizable.</p><h4>Why it matters</h4><p>Although RNNs like LSTMs or GRUs are competitive with Transformers in terms of end loss (log-likelihood) vs. training compute (in flops), they are impractical due to their inability to make full use of GPUs.</p><p>In fact, if money could buy clock frequency (instead of more GPUs), Transformers wouldn&#8217;t have been needed: an RNN would train just as well as (maybe even better than) a Transformer on a single-core, 100THz CPU. But such a high frequency can&#8217;t be attained, for a bunch of reasons<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. On the other hand, doubling compute power by doubling the number of transistors and logical cores is quite straightforward. Hence the need for a highly parallelized architecture.</p><h4>Why the Stateful GPT&#8217;s training is still quite parallelizable</h4><p>First, I&#8217;d like to give you a nice picture of Venice.
It&#8217;s an incredible city and I&#8217;d like to visit it some day. I hitchhiked there once but I was in a hurry and had to go to Albania.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pcdM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pcdM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pcdM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pcdM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pcdM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pcdM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg" width="900" height="550" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Venice Districts | visitingvenice.net&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Venice Districts | visitingvenice.net" title="Venice Districts | visitingvenice.net" srcset="https://substackcdn.com/image/fetch/$s_!pcdM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pcdM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pcdM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pcdM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5e6f2-ce9b-4a7b-9ba8-1dae5c93ec4c_900x550.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A lot of beautiful things happen when you&#8217;re looking for trouble. &#8220;Hey, let&#8217;s build a city on a big mudflat by the sea&#8221; =&gt; Most beautiful city in the world.</figcaption></figure></div><p>Back to the point.</p><p>Of course, the Stateful GPT behaves like a normal RNN during inference. As a consequence, its training should be difficult to parallelize.</p><p>Except if you accept a little discrepancy between what you train the model for and how you use it.</p><p>Here&#8217;s how the training goes, why it&#8217;s different from inference and why it doesn&#8217;t matter much.</p><h5>Training:</h5><ol><li><p><strong>Parallel Training Step:</strong> A standard, fully parallelized training step is performed on the Transformer. No information flows through the recurrent connection at this point. 
Cross-entropy loss (CE-loss) is computed, and normal backpropagation is performed. The last hidden states are stored.</p></li><li><p><strong>Recurrent Training Step:</strong> The last hidden states stored in Step 1 are fed back into the model to enrich the token embeddings. Backpropagation is performed as usual, and new hidden states are stored.</p></li><li><p><strong>Iterative Recurrence:</strong> Step 2 can be repeated multiple times, using the hidden states from the <em>previous</em> iteration, at the cost of additional training steps.</p></li></ol><p>Here&#8217;s how training &amp; inference differ:</p><ul><li><p><strong>Training:</strong> The recurrence depth is set to a fixed and finite number.</p></li><li><p><strong>Inference:</strong> The effective recurrence depth is <em>generated_sequence_length - 1</em>. The first token is generated without recurrence. Token 2 uses Token 1's hidden state (1 degree of recurrence), Token 3 uses Token 2's hidden state (2 degrees of recurrence), and so on.</p></li></ul><p>The discrepancy between training and inference is addressed by ensuring stability during training. If the <a href="https://grok.com/share/bGVnYWN5_1f76e105-be62-42bd-94cf-5b4e2322e5e4">CE loss</a> remains controlled with increased recurrence depth during training, the model should be stable even with theoretically infinite recurrence depth during inference.</p><p>More formally, training ensures that CEloss(T_n) &lt; CEloss(T_0) for every recurrence depth n trained for, where T_n denotes a training pass at recurrence depth n. Otherwise, the model would just learn to ignore the hidden state information.
Although this doesn't formally imply that the inequality also holds for recurrence depths not seen during training, empirically it does.</p><p>Empirically again, I could observe that a single recurrence step during training gives 95% of the gains I would get by using, say, 10 recurrence steps.</p><p>I even ran a little test: I encoded the training step&#8217;s recurrence depth and gave this information to the recurrent layer, so as to track the &#8220;optimal&#8221; norm of the token enrichment term vs. the recurrence depth.</p><p>The norm indeed grew with depth. This suggests that the Stateful GPT incorporates progressively more information into the token prediction as recurrence depth increases. The effect was very small, though: a 1% increase in average norm going from depth = 1 to depth = 3.</p><p>Of course, the compute-intensiveness of each training epoch depends on the recurrence depth: if we set depth = 2, for example, we&#8217;ll have to make 3 forward/backward passes for each batch: 1 for the standard Transformer training + 1 for each recurrent step.
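</p><p>The epoch structure described above boils down to the following control flow; a toy sketch with stand-in <em>model</em> and <em>backprop</em> functions (my names and signatures, not actual code from this project):</p>

```python
calls = {"fwd": 0}  # count forward passes to illustrate the epoch cost

def model(batch, prev_hidden=None):
    """Stand-in for the Transformer; returns (logits, last hidden states)."""
    calls["fwd"] += 1
    return "logits", "hidden"

def backprop(logits, batch):
    """Stand-in for CE-loss computation plus the backward pass."""
    pass

def train_epoch(batches, depth=2):
    for batch in batches:
        # Step 1: standard, fully parallel Transformer step (no recurrence).
        logits, hidden = model(batch, prev_hidden=None)
        backprop(logits, batch)
        # Steps 2..depth+1: feed the stored hidden states back in
        # to enrich the token embeddings.
        for _ in range(depth):
            logits, hidden = model(batch, prev_hidden=hidden)
            backprop(logits, batch)

train_epoch(batches=["batch_0", "batch_1"], depth=2)
print(calls["fwd"])  # 2 batches * (depth + 1) = 6 passes
```

<p>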
Each epoch thus costs (recurrence_depth + 1) times more than for a standard Transformer.</p><p>We&#8217;ll see if that&#8217;s a problem.</p><p>For now, let&#8217;s check the architecture of the Stateful GPT.</p><h2>A deep dive into the Stateful GPT architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ng5H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ng5H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 424w, https://substackcdn.com/image/fetch/$s_!Ng5H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 848w, https://substackcdn.com/image/fetch/$s_!Ng5H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 1272w, https://substackcdn.com/image/fetch/$s_!Ng5H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ng5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png" width="687" height="454" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:687,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47050,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/158656884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ng5H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 424w, https://substackcdn.com/image/fetch/$s_!Ng5H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 848w, https://substackcdn.com/image/fetch/$s_!Ng5H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 1272w, https://substackcdn.com/image/fetch/$s_!Ng5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3ee5fd-c9ae-44a2-a92b-44baff046714_687x454.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Weather forecasts for Verkhoyansk as of March 10, 12am UTC. 
Look how fast it warms up in this period: we&#8217;re near the max derivative of daylight hours (happening circa 3/21), and the thermal inertia of land is low.</figcaption></figure></div><p>Now it gets a bit technical.</p><p>The naive ways of enriching token embeddings with the last hidden state before their generation are either to:</p><ol><li><p>Concatenate the standard token embedding and the last hidden state, and down project / transform them back to a dim_embedding-sized vector.</p></li><li><p>Sum the standard token embedding and some transformation of the hidden state.</p><p></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wuli!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wuli!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Wuli!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Wuli!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Wuli!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Wuli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png" width="348" height="507.1423149905123" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1054,&quot;resizeWidth&quot;:348,&quot;bytes&quot;:84003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/158656884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wuli!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Wuli!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Wuli!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Wuli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff932cd62-39f7-4ea0-9ae8-f176a092ac61_1054x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(1) First technique: quite simple. Wait, is this image related to the post?!</figcaption></figure></div><p></p><p>These two options are valid. The first one is more compute intensive than the second, as it deals with vectors of dimension 2*<em>dim_embedding</em> instead of just <em>dim_embedding</em>. 
On the other hand, it allows for a real interaction between the last hidden state and the token embedding. It&#8217;s not just blindly adding information.</p><p>I decided to merge both approaches with a custom architecture halfway between the input gates of LSTMs and the attention mechanism of Transformers. The idea is to compute a component-wise <strong>attention</strong> score between the token embedding and the hidden state and <strong>update</strong> the former accordingly. From now on, I will refer to the token embedding as &#8220;x&#8221; and to the hidden state as &#8220;h&#8221;.</p><p>The high-level idea of this technique is to:</p><ol><li><p>Check what information the hidden state can enrich the token embedding with. Compute an element-wise &#8220;attention&#8221; score that basically says, for each component, how much the token vector should be updated.</p></li><li><p>Project the hidden state into a space where it can be summed with the token embedding vector.</p></li><li><p>Add the projected hidden state to the token embedding in order to enrich it. 
The sum is conditioned by the attention scores, meaning that the update&#8217;s magnitude will differ for each of the vector&#8217;s components.</p></li></ol><p>So it goes like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PFcA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PFcA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 424w, https://substackcdn.com/image/fetch/$s_!PFcA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 848w, https://substackcdn.com/image/fetch/$s_!PFcA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 1272w, https://substackcdn.com/image/fetch/$s_!PFcA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PFcA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png" width="324" height="813.7241379310345" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1748,&quot;width&quot;:696,&quot;resizeWidth&quot;:324,&quot;bytes&quot;:93392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/158656884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PFcA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 424w, https://substackcdn.com/image/fetch/$s_!PFcA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 848w, https://substackcdn.com/image/fetch/$s_!PFcA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 1272w, https://substackcdn.com/image/fetch/$s_!PFcA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e39cad-2aae-4ced-9ef1-128c8101fde1_696x1748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Proposed enrichment technique. Explanations below.</figcaption></figure></div><p>More formally,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{x} = \\mathbf{x} + \\text{ReLU}((\\mathbf{h} \\mathbf{W}_k) \\odot (\\mathbf{x} \\mathbf{W}_q)) \\odot (\\mathbf{h} \\mathbf{W}_v)&quot;,&quot;id&quot;:&quot;SPHSNCMOXP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let me explain this equation carefully.</p><p>So, each token embedding gets enriched (x = x + something) this way:</p><ol><li><p>First, a learned matrix projects the hidden state into a nice (key) space where we can compute its element-wise affinity with the original token embedding. 
That&#8217;s the h@Wk part (in PyTorch, @ refers to matrix multiplication).</p></li><li><p>Second, a different learned matrix projects the token embedding, in the same way, into a &#8220;query&#8221; space. That&#8217;s the x@Wq part.</p></li><li><p>Then component-wise pseudo-attention scores are computed by multiplying each component of the projected token embedding by those of the projected hidden state. Let&#8217;s imagine x is the embedding of the word &#8220;car&#8221;. During learning, the query projection matrix has learned that tokens of this kind (nouns) are always eager to have their physical characteristics refined, for example by color or shape adjectives. So there will be high values for the components coding for &#8220;I want an update on my color, I have no idea what it is&#8221;. Meanwhile, the key matrix, through which the hidden state is projected, has learned to project the hidden state so that there is a clear &#8220;color&#8221; component. When you make a component-wise multiplication between the two vectors, you actually enable a dialogue like &#8220;Token embedding: I want my color updated, please.<br>Hidden state: I can definitely provide this information =&gt; high attention score for component &#8216;color&#8217;.<br>Token embedding: I&#8217;d like to know how fast I am.<br>Hidden state: sorry, I don&#8217;t know about that =&gt; low attention score for component &#8216;speed&#8217;.<br>Token embedding: tbh, I don&#8217;t quite care about [something].<br>Hidden state: Too bad, I could have told you =&gt; low score.<br>Token embedding: I don&#8217;t care about [some other thing].<br>Hidden state: Neither can I tell you about it =&gt; low score.&#8221;<br>In the equation, that&#8217;s the whole (h@W_k) * (x@W_q) term.<br>The negative scores are set to 0 by the ReLU activation function. This is not strictly necessary, but it doesn&#8217;t hurt performance and I like the increased interpretability it brings. 
Please note this example is anthropocentric and doesn&#8217;t literally reflect what actually happens.</p></li><li><p>So now, we have a component-wise affinity between the token embedding and the hidden state. We have to bring the information where it&#8217;s due. The hidden state is first projected by the Wv matrix into a value space (compatible with the token embedding), meaning that this projection tries to present the hidden state&#8217;s information in such a way that it can be added to the token embedding. If, for example, the color components of token embeddings typically sit in positions 29-31, but in positions 4, 8 and 12 of the hidden state vector, the Wv matrix will soon learn to have a 1 at entries (4, 29), (8, 30) and (12, 31), since the enrichment computes h@Wv. Now that we have a nice projected hidden state vector, we just have to multiply it component-wise by the attention scores to get the enrichment term.</p></li><li><p>We just add this enrichment term to our initial token embedding.</p></li></ol><p>That&#8217;s it. 
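</p><p>In PyTorch, the whole enrichment step condenses to a few lines. This is a minimal sketch: the dimensions, batch shapes and bias-free projections are my assumptions; only the x = x + ReLU((h@Wk) * (x@Wq)) * (h@Wv) update itself comes from the equation above:</p>

```python
import torch
import torch.nn as nn

class TokenEnrichment(nn.Module):
    """x = x + ReLU((h @ Wk) * (x @ Wq)) * (h @ Wv): component-wise
    pseudo-attention between token embedding x and hidden state h."""
    def __init__(self, dim_embedding, dim_hidden):
        super().__init__()
        self.Wk = nn.Linear(dim_hidden, dim_embedding, bias=False)     # key space
        self.Wq = nn.Linear(dim_embedding, dim_embedding, bias=False)  # query space
        self.Wv = nn.Linear(dim_hidden, dim_embedding, bias=False)     # value space

    def forward(self, x, h):
        # element-wise affinity scores, negatives zeroed by ReLU
        scores = torch.relu(self.Wk(h) * self.Wq(x))
        # gated enrichment term added to the original embedding
        return x + scores * self.Wv(h)

enrich = TokenEnrichment(dim_embedding=32, dim_hidden=64)
x = torch.randn(4, 32)  # batch of token embeddings
h = torch.randn(4, 64)  # batch of hidden states
out = enrich(x, h)
print(out.shape)  # torch.Size([4, 32])
```

<p>One side effect of bias-free projections: a zero hidden state produces a zero enrichment term and leaves x untouched, which is one way to get the &#8220;feedback loop turned off&#8221; behavior mentioned later.</p><p>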
The rest of the Transformer is kept identical.</p><p>If you are familiar with attention, you may have noticed that this architecture is very similar to cross-attention, except that it&#8217;s element-wise, and not token-wise.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cWuf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cWuf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 424w, https://substackcdn.com/image/fetch/$s_!cWuf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 848w, https://substackcdn.com/image/fetch/$s_!cWuf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 1272w, https://substackcdn.com/image/fetch/$s_!cWuf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cWuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png" width="578" height="670.8928571428571" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1690,&quot;width&quot;:1456,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:400656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/158656884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cWuf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 424w, https://substackcdn.com/image/fetch/$s_!cWuf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 848w, https://substackcdn.com/image/fetch/$s_!cWuf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 1272w, https://substackcdn.com/image/fetch/$s_!cWuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c5c34cd-1e2e-418b-88a5-188c694ff4f0_2664x3092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard Transformer on the left, Stateful Transformer on the right. 
Please note that the hidden state feedback loop can be turned off.</figcaption></figure></div><p>Let&#8217;s see how it all performs.</p><h2>Empirical results</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sig-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sig-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Sig-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Sig-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Sig-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sig-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg" width="1280" height="975" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:975,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Vache en estive pyr&#233;n&#233;enne&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vache en estive pyr&#233;n&#233;enne" title="Vache en estive pyr&#233;n&#233;enne" srcset="https://substackcdn.com/image/fetch/$s_!Sig-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Sig-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Sig-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Sig-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b42a5dd-d381-420a-8d3c-3d0450ab9c2c_1280x975.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">&#8220;The ReLU non-linearity is not necessary, but it improves interpretability&#8221;</figcaption></figure></div><p>As my goal was to test the idea as fast as possible, I chose to go with a character-level Transformer, to save a layer of complexity.</p><p>I trained all the Transformers on a custom 10 MB Gutenberg dataset, comprising a few books stitched together. This dataset is deeply flawed: test and val splits are qualitatively different, as they likely don&#8217;t even come from the same book/author. But I decided not to care.</p><h4>Different flavors of Stateful Transformer</h4><p>Among the different possible designs for the token enrichment mechanism, the component-wise attention performs best, with the lowest parameter count. </p><p>The concatenation approach (shown in the first real figure) works nicely too, in terms of minimum loss vs. 
number of params, but its drawback is that you can&#8217;t easily turn off the recurrence: as you&#8217;re dealing with an MLP that takes both a hidden state and a token embedding (and outputs an enriched embedding), you can&#8217;t decide to just feed it a token embedding (and zero-pad the hidden state placeholder) and expect it to output a coherent token embedding.</p><h4>Standard vs. Stateful Transformer</h4><p>I first trained a very small and especially shallow Transformer<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> for 40 epochs. Here is how the stateful Transformer (with recurrence depth = 1) compares to the standard one.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/iMc54/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5cccf50e-19ef-4048-9743-188d4606bb57_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:613,&quot;title&quot;:&quot;Standard vs. Stateful GPT: training run&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/iMc54/1/" width="730" height="613" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The Stateful Transformer&#8217;s run looks much better. 
And indeed it is: the Standard Transformer&#8217;s train loss after 40 epochs is reached by the Stateful Transformer after only 16 training epochs. It&#8217;s more than twice as fast, which means that, even accounting for compute overhead, it outperforms the Standard Transformer.</p><p>Interestingly, the stateful Transformer seems to be a bit more prone to overfitting. Indeed:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\text{val_loss(Stateful Transformer)}}{\\text{val_loss(Standard Transformer)}} > \\frac{\\text{train_loss(Stateful Transformer)}}{\\text{train_loss(Standard Transformer)}}&quot;,&quot;id&quot;:&quot;BBUHEEMGMK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Meaning, in plain English, that the Stateful Transformer does not generalize as well as the Standard one. If you have any idea why, please tell me in the comments. I have a few hypotheses, but none of them is fully satisfactory. By the way, the Stateful Transformer has 140,000 params and the Standard Transformer only has 128,000. That&#8217;s a difference of 8.6%, which is significant but unlikely to explain the difference in performance between the two architectures.</p><p>When I saw these results, I thought the Stateful Transformer would work great. I believed it might even scale well, meaning that the performance upgrade would be even greater for bigger Transformers (see next section for more details).</p><p>In fact, it&#8217;s the exact opposite. The following is another training run, comparing loss vs. 
training epoch for larger Transformers (4 layers instead of 2, 950,000 params).</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/8bwyp/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59a3b0d8-9ab7-453d-88f0-b98d74e48d6c_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:495,&quot;title&quot;:&quot;Standard vs. Stateful GPT: BIG training run&quot;,&quot;description&quot;:&quot;BIGGER (0.47M params) Standard vs. Stateful Transformer: training run&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/8bwyp/1/" width="730" height="495" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The stateful GPT here performs marginally better than the standard one. But it uses twice as much compute during training. In one word, it&#8217;s not worth it. When I figured this out, I used all available resources of intellectual dishonesty to make my stateful GPT work better. Maybe if I train it first without recursion and just fine-tune it on a few recursion epochs? Nope, doesn&#8217;t work. Well, it does, up to a certain point, but it&#8217;s not scalable. Maybe if I freeze all the weights except for the enrichment mechanism? Nope, doesn&#8217;t work either. I tested a lot of things but, at the end of the day, the compute overhead was just not worth it. 
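To make the &#8220;worth it&#8221; arithmetic concrete, here is a back-of-the-envelope sketch of the compute accounting, assuming (as noted above) that one stateful epoch costs roughly twice a standard epoch; the epoch counts are the ones from the two runs above.

```python
# Back-of-the-envelope check of the "worth it?" verdicts, using the numbers
# from the two training runs. Assumption: one stateful epoch costs roughly
# 2x a standard epoch, since the recurrent enrichment pass about doubles
# training compute.

OVERHEAD = 2.0  # cost of one stateful epoch, in standard-epoch units

def cost(epochs: float, overhead: float = 1.0) -> float:
    """Total training cost, expressed in standard-epoch equivalents."""
    return epochs * overhead

# Shallow (2-layer) run: the stateful model matches in 16 epochs the train
# loss the standard model needs 40 epochs to reach.
shallow_standard = cost(40)            # 40 epoch-equivalents
shallow_stateful = cost(16, OVERHEAD)  # 32 epoch-equivalents
assert shallow_stateful < shallow_standard  # net win despite the overhead

# Deeper (4-layer) run: the stateful model is only marginally better at
# equal epoch counts, so doubling the per-epoch cost erases the gain.
deep_standard = cost(40)
deep_stateful = cost(40, OVERHEAD)  # about the same loss, twice the compute
assert deep_stateful > deep_standard  # not worth it
```

Under this accounting, the shallow stateful model wins (32 vs. 40 epoch-equivalents) while the deeper one loses, which matches the two training runs.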
</p><h4>Is the train/val loss a valid metric?</h4><p>By the way, you may be wondering if the train/val loss really is a good metric for performance, considering that the inference process is slightly different from the training process. To figure it out, I trained a &#8220;big&#8221; model (5M params) on the same dataset to assess the output of these two small models. </p><p>Doing so isn&#8217;t as straightforward as it seems, as merely using the big model&#8217;s loss on text generated by the small models can be insufficient. Indeed, a string like &#8220;things of the street of the things of the street of the things&#8230;&#8221; is technically sensible (the big model&#8217;s loss on this string is low), but its information content is very low. </p><p>So I wanted to assess both the information content of a text and its &#8220;sensibleness&#8221;, measured respectively by the compressed size of the text and the loss of the big model on it.</p><p>The randomness of a Transformer&#8217;s output grows with the temperature of its softmax. <br>Also, the compressibility of a string decreases with its randomness (a perfectly random string is impossible to compress).<br>Finally, the loss of a bigger model on text generated by smaller models grows with these smaller models&#8217; inference temperature.</p><p>Below are some graphs showing empirical results on this matter. I compared two models, a Stateful GPT and a standard Transformer, trained until they had the exact same validation loss.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/keLNl/3/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74761063-61ad-4dcd-bf87-c4f85790541b_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:657,&quot;title&quot;:&quot;Entropy of Generated Text vs. 
Temperature&quot;,&quot;description&quot;:&quot;Horizontal axis is Temperature, vertical axis is ratio of LZMA-compressed to raw text size. The models here shown are not the same as the ones of the training run.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/keLNl/3/" width="730" height="657" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/Qr9tA/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66b5562e-ebce-4182-a414-f781c424405f_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:666,&quot;title&quot;:&quot;Bigger model's loss on text generated by the smaller models vs. 
temperature&quot;,&quot;description&quot;:&quot;(Cross Entropy Loss)&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/Qr9tA/1/" width="730" height="666" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/WAaiJ/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2d37862-d53e-41bc-ac02-8e8c4a77af1b_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:674,&quot;title&quot;:&quot;Text Entropy vs Big model's Loss&quot;,&quot;description&quot;:&quot;x axis: Big model's CE loss on Base and Stateful models generated texts.   
y axis: Uncompressed to compressed size ratio (LZMA algorithm)    This graph was obtained by tuning the softmax temperature of the generative models.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/WAaiJ/1/" width="730" height="674" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><p>What we can see here is that, for a given validation loss, the Stateful Transformer performs exactly the same as the Standard one, meaning that the validation loss is a valid proxy for the stateful model&#8217;s inference performance.</p><p>I think it&#8217;s time for another unrelated picture.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V0_9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V0_9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 424w, https://substackcdn.com/image/fetch/$s_!V0_9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!V0_9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!V0_9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V0_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg" width="1280" height="1198" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1198,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:558856,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V0_9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 424w, https://substackcdn.com/image/fetch/$s_!V0_9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!V0_9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!V0_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9cdddc-7238-41ba-9e40-9b7d5875c313_1280x1198.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The pillars of creation. 
Original <a href="https://esawebb.org/images/pillarsofcreation_composite/">here</a> (and <a href="https://www.nasa.gov/wp-content/uploads/2022/10/stsci-01gfnn3pwjmy4rqxkz585bc4qh.png">here</a>). Why do we even bother with math and ML when we know that such majestic stuff floats in the ether? Incredible to think that the Pluto-Sun distance would easily fit inside one pixel of this image (even at full res).</figcaption></figure></div><h2>Why it only works for shallow Transformers</h2><h4>A delusional hope</h4><p>At first, I thought my architecture tweak would work better for bigger Transformers.</p><p>The amount of information carried by an input vector depends on the transfer function of the network it is fed into. Quite obviously, if you take a dead neural net whose output is constant, you won&#8217;t be able to infer anything about the input. So, whatever the amount of information theoretically present in the input vector (32 bits * dim), you won&#8217;t be able to extract any. </p><p>Conversely, a big, well-trained neural network has a lot of decision boundaries, meaning that a small variation in the input vector will have a big effect on the output. I think the amount of information you can read from a vector is approximately equal to the number of decision boundaries of the NN, which depends on the number of parameters and the shape of the activation functions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>Let&#8217;s keep this idea in mind: the bigger a neural net, the more sensitive its output is to small input variations.</p><p>Now, let&#8217;s look at something else: in a Transformer, the vocab size is finite, and so is the embedding dimension. When you embed tokens (the first thing you do in the forward pass of the Transformer), you look up the token id in a dictionary and get its associated vector. 
So, in the continuous vector space of dimension n_embedding, you only have a discrete set of vocab_size vectors you can feed the Transformer with. And I thought that the Stateful Transformer&#8217;s usefulness was in padding this discrete set to make it more continuous.</p><p>And I thought that this padding effect would be especially beneficial if the subsequent Transformer has the complexity to make use of the additional precision. Namely, if it&#8217;s bigger.</p><p>Also, if you look closely, the size of the gaps in the token embeddings space usually grows with the size of the model: the average distance between closest tokens grows like<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;average\\_min\\_distance \\propto \\frac{1}{\\sqrt[embedding\\_dim]{vocab\\_size}}&quot;,&quot;id&quot;:&quot;CFYSAROGOP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Anyway, two good reasons to believe, at first sight, that the Stateful Transformer architecture would perform better on big Transformers.</p><p>It doesn&#8217;t.</p><h4>Reality strikes back</h4><p>Let me show the architecture again:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5F-j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!5F-j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 424w, https://substackcdn.com/image/fetch/$s_!5F-j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 848w, https://substackcdn.com/image/fetch/$s_!5F-j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 1272w, https://substackcdn.com/image/fetch/$s_!5F-j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5F-j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png" width="572" height="663.9285714285714" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1690,&quot;width&quot;:1456,&quot;resizeWidth&quot;:572,&quot;bytes&quot;:400754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.eloidereynal.com/i/158656884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5F-j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 424w, https://substackcdn.com/image/fetch/$s_!5F-j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 848w, https://substackcdn.com/image/fetch/$s_!5F-j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 1272w, https://substackcdn.com/image/fetch/$s_!5F-j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30f080c8-a57a-46eb-8f01-3c5d3a3130fb_2664x3092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Blue: parallel processing layers, Pink: token mixing layers.</figcaption></figure></div><p>All the blue steps here are steps where token embeddings are processed in parallel, with no communication between them. </p><p>The pink steps, conversely, are steps where each token vector gets updated based on the other tokens. Best example for that is attention. </p><p>I decided to give the enrichment mechanism a light pink tint, as the n-th token&#8217;s embedding is updated based on the (n-1)-th last hidden state (which was actually the one used to predict the n-th token). So, technically speaking, there is <em>some </em>token-to-token interaction happening, but it&#8217;s both local and unidirectional.</p><p>What&#8217;s interesting is that token mixing happens in each decoder layer. Let&#8217;s take a 50 layer-deep Transformer, trying to predict the n-th token of a sequence. As they flow through the Transformer, all the previous tokens&#8217; hidden vectors provide information to the n-th token&#8217;s vector, through the attention mechanism. So, the (n-1)-th token&#8217;s penultimate hidden state freely provides information to the n-th token. As for the very last couple of hidden states (after the last attention layer), they don&#8217;t, because there are no token-mixing steps left.</p><p>What it all means is that the Stateful Transformer&#8217;s architecture update is interesting only in that it allows the <em>very</em> <em>last</em> hidden state of token n-1 to provide information to token n&#8217;s embedding, instead of the <em>penultimate</em> hidden state. 
Of course, the communication medium between vectors is completely different for these two cases<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, but the end result is the same: <strong>tokens communicate up to the last layer in the standard Transformer, while the Stateful Transformer goes the extra mile and allows the tokens to communicate after the Feed Forward Network part of the last decoder layer. This update is really just about not discarding the work done by the last FFN on the previous token embeddings: the work done by all previous layers is still available thanks to the attention mechanism.</strong></p><p>And this extra mile is not that valuable.</p><p>Intuitively, the last hidden state doesn&#8217;t carry much more information than the penultimate hidden state if there are a lot of decoder layers: each layer (including the last one) loses importance when there are a lot of them<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>.</p><p>So, this is the reason why the stateful Transformer architecture tweak only works for shallow models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vgqD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vgqD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!vgqD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vgqD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vgqD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vgqD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg" width="1456" height="1211" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1211,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;undefined&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="undefined" title="undefined" srcset="https://substackcdn.com/image/fetch/$s_!vgqD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!vgqD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vgqD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vgqD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a466bf-b7cc-432f-8e18-5e6abaecbbc8_2560x2129.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Meanwhile, springing from the Gothic spire, / a religious sound spreads through the air. / The traveler stops, and the rustic bell / mingles holy harmonies with the last sounds of day. The painting is <a href="https://en.wikipedia.org/wiki/The_Angelus_%28painting%29">L&#8217;Ang&#233;lus</a>, by Millet. The <a href="https://www.poetica.fr/poeme-482/alphonse-de-lamartine-isolement/">text</a> is from Lamartine.</figcaption></figure></div><h2>Conclusion:</h2><p>This architecture tweak sucks, like nearly every other Transformer &#8220;improvement&#8221;.</p><p>Maybe you&#8217;re not convinced that &#8220;the last layer of a Transformer isn&#8217;t that important if there are a lot of them&#8221;; I wasn&#8217;t either. But on closer inspection, if it&#8217;s indeed important, why not just add one more layer? It will do the same job as the recurrent architecture trick, while being more straightforward to train and use.</p><p>My next post will either be about some things I learned while researching this idea<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>, or a credit scoring model I&#8217;d like to train.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Only one little error <a href="https://youtu.be/eMlx5fFNoYc?t=1294">here</a>: attention head results are concatenated, not summed. But technically, you could argue that a concatenation of vectors is equivalent to the sum of their expanded (i.e. 
padded with 0s) version, so it&#8217;s just a slight imprecision.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In what fucking world do their authors live to believe that their readers will enjoy a bunch of formulas full of unspecified terms at the beginning of an &#8220;RNNs simply explained&#8221; article? Most of them go from &#8220;Here is the very high level idea (so high-level it tells you nothing)&#8221; to &#8220;So, h_t = f(W * h_{t-1} + U * x_t + b) and y_t = g(V * h_t + c)&#8221;. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Backprop is basically about moving the parameters of a model in the direction that decreases the loss. So, first, you have to compute the gradient of the loss w.r.t. the parameters. You could use the finite difference method (the (f(x+h) - f(x)) / h stuff) along with the chain rule, but it&#8217;s very inefficient. So all ML frameworks store the computation graph of the network to be able to tell things like &#8220;the input was transformed in such and such a way when going through the neural net, so the gradient of the loss w.r.t. each layer&#8217;s params is such and such&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>First, the Field Effect Transistors used in integrated circuits act like capacitors, meaning that switching them on and off involves a transfer of electrons whose potential energy is equal to their capacitance multiplied by the voltage squared. This is just lost energy and heat. 
This lost power/heat is proportional to the rate at which you turn these transistors on and off, aka the clock speed. Heat management is one of the hardest problems of chips. Second limit is just the speed of electricity, which is approx 10&#8312; m/s in silicon. A 100THz chip would need the signal path lengths max deviation to be less than 1 micrometer to ensure correct computations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>n_embd = 64, n_heads = 8, n_layers = 2, block_size = 33, batch_size = 2048</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>If I were to make a guess, I&#8217;d say that, for a MLP with ReLU activation the number of decision boundaries is basically lower than or equal to (2*dim)^depth, assuming that the dim is constant.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Think of a chess board: the closest distance between two neighboring cells is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{1}{\\sqrt[2]{64}} = 1/8&quot;,&quot;id&quot;:&quot;YUOSIDVZRJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Meaning that if you had 2-dimensional embeddings with a vocab size of 64, the average distance between two neighboring embeddings would be about one eighth of the maximum norm of these embeddings.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div 
class="footnote-content"><p>In fact, for practical model sizes, the average min distance between neighboring embeddings is terribly close to their norm. Meaning that they are as distant as can be. A surprising consequence of the curse of dimensionality.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Global attention between two tokens&#8217; equally high-level hidden states vs. enrichment of a low-level token embedding with a high-level last hidden state.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Especially when you account for the &#8220;<a href="https://arxiv.org/pdf/2502.05795">curse of depth</a>&#8221; in LLMs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Namely:<br>1) how the Fourier transform of the activation function used in an NN relates to the Fourier transform of the NN&#8217;s prediction surface. Or, put more practically, why you shouldn&#8217;t use ReLUs when approximating a low-frequency function and conversely shouldn&#8217;t use GeLUs when modeling highly discontinuous phenomena. <br>2) When a higher hidden dim is needed vs. when a deeper network works best<br>3) Why and how the function to model conditions the shape of the best_attainable_loss vs. 
number of layers curve.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Boring stocks vs Growth stocks]]></title><description><![CDATA[A data-driven attempt to settle the debate.]]></description><link>https://www.eloidereynal.com/p/growth-vs-value-investing</link><guid isPermaLink="false">https://www.eloidereynal.com/p/growth-vs-value-investing</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Wed, 05 Feb 2025 23:21:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/82671072-6ec8-4598-8ad3-ab3e5627f807_520x396.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Everything is priced in, right? </p><p>Well, let&#8217;s see.</p><p>Using data from <a href="https://site.financialmodelingprep.com/?utm_source=substack&amp;utm_medium=substack&amp;utm_campaign=eloi1">Financial Modeling Prep</a>&#8217;s API, I have compiled statistics on more than 8000 companies across the US, EUR and AU markets to investigate potential systematic biases in business valuation.</p><p><strong>These results are based on data from 1980 to 2023.</strong></p><h2>Ratio vs. 
Annualized returns</h2><p>The following table shows the correlations between various Valuation Ratios and Annualized Returns (with reinvested dividends) across all companies, regardless of industry.</p><p>Usually, price is the numerator of valuation ratios, but I reversed them because it makes more sense when computing correlations<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/lbHcm/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39e998cd-9885-4106-b80b-61a8528d8c8b_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:362,&quot;title&quot;:&quot;Valuation ratios vs Annualized Returns&quot;,&quot;description&quot;:&quot;N Samples = 14134&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/lbHcm/1/" width="730" height="362" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><p>So far, the positive correlations support the value investing approach.</p><p>The P/B ratio has the best correlation with annualized returns, despite being less popular than P/E or P/FCF.</p><p>Let&#8217;s break it down by industry. Doing so decreases sample sizes and results in correlations that are less reliable. 
So take these figures with a grain of salt, especially when the sample size is below 100.</p><p>Here, a negative correlation suggests investors overemphasize current valuations, potentially overlooking negative future developments (or conversely, dismissing positive forecasts). A positive correlation suggests investors tend to be overconfident in their ability to predict the future.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/JbUr1/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e61fb32-092f-4e87-bc3d-4c5291624c13_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:1348,&quot;title&quot;:&quot;Correlations between Valuation Ratios and Annualized Return  Across Industries&quot;,&quot;description&quot;:&quot;Each value represents the correlation between a specific ratio and annualized returns within a certain industry.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/JbUr1/1/" width="730" height="1348" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>It looks like a win for value investors. 
Most of the time, undervalued companies (in terms of valuation ratios) tend to outperform the ones with better &#8220;potential&#8221;.</p><p>But let&#8217;s take a step back.</p><h2>What about 5-year returns?</h2><p><em>Average(Annualized return) &#8800; Annualized Average Return</em></p><p>When I averaged the annual returns in my dataset, I was surprised to get a negative number. I assumed I had made a mistake because everyone knows that, on average and in the long run, investing in stocks yields a positive return. So I checked every single line of code. There was no obvious blunder&#8230;</p><p>In fact, I realized that gains compound, but losses don't offset them linearly. For example, consider two investments held equally: one yielding a 50% annual return over five years and the other yielding -50% per year. The overall return would be 281%, not zero<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. When I compounded each annual return over five years and then averaged the results, I obtained a 4% positive average annual return. Given that this is an unweighted average, it seems reasonable.</p><p>The preceding correlations don't account for the asymmetry between positive and negative annualized returns. They treat both equally<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. However, growth stocks are considered high-risk, high-reward investments. 
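</p><p>The compounding asymmetry is easy to check numerically. Here is a minimal sketch of the earlier two-stock example (hypothetical returns):</p>

```python
# Two hypothetical stocks held for 5 years: +50%/year vs -50%/year.
returns = [0.50, -0.50]
years = 5

# Averaging the annualized returns makes the pair look like a wash.
avg_annualized = sum(returns) / len(returns)

# Compounding each stock first, then averaging, reveals the asymmetry.
avg_multiple = sum((1 + r) ** years for r in returns) / len(returns)

print(f"average annualized return: {avg_annualized:+.0%}")  # +0%
print(f"average 5-year multiple: {avg_multiple:.2f}x")      # 3.81x, i.e. about +281%
```

<p>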
This implies that, for a similar average annualized return, they should outperform value stocks.</p><p>To illustrate this phenomenon, the following graph shows the returns of 2 hypothetical portfolios of 2 stocks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ock_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ock_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 424w, https://substackcdn.com/image/fetch/$s_!ock_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 848w, https://substackcdn.com/image/fetch/$s_!ock_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 1272w, https://substackcdn.com/image/fetch/$s_!ock_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ock_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png" width="925" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3541755-aa53-4360-a5f4-351099e47d21_925x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:925,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52376,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ock_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 424w, https://substackcdn.com/image/fetch/$s_!ock_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 848w, https://substackcdn.com/image/fetch/$s_!ock_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 1272w, https://substackcdn.com/image/fetch/$s_!ock_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3541755-aa53-4360-a5f4-351099e47d21_925x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Light blue: Average(annualized_returns) = (1.4+0.4)/2 = 0.9   Dark blue: Average(annualized_returns) = (1.01+1.1)/2 = 1.055.</figcaption></figure></div><p>Despite a lower <em>Average(annualized_returns)</em>, the high risk/high reward portfolio outperforms the other on a 5-year period.</p><p>So, let&#8217;s check the correlations between valuation ratios and 5-year returns.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/L1o9o/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf6c8844-533b-47e0-b5db-e9b46b2d9334_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:362,&quot;title&quot;:&quot;Valuation ratios vs 5y Returns&quot;,&quot;description&quot;:&quot;N Samples = 8725&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" 
src="https://datawrapper.dwcdn.net/L1o9o/2/" width="730" height="362" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/wAVM0/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bd892eb-7262-4b7a-9fed-d2e99db3c677_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:693,&quot;title&quot;:&quot;Correlations between Valuation Ratios and 5y Return  Across Industries&quot;,&quot;description&quot;:&quot;Each value represents the correlation between a specific ratio and 5y return within a certain industry.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/wAVM0/1/" width="730" height="693" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><p>These correlations remain significant, but they are much lower than for <em>annualized returns</em>. </p><p>A caveat: Calculating 5-year returns by exponentiating values, as done above, increases sensitivity to data inaccuracies. Correlations with 10-year returns are excluded, largely due to this issue. 
These correlations were all positive but insignificant, with the exception of Book Value / Price.</p><h2>Valuation Ratios vs. Stock Volatility</h2><p>We've established that 'transversal variance' is a good thing. A portfolio with some strong winners and some weak performers is preferable to a uniformly 'meh' one. In the long run, that unevenness pays off.</p><p>But volatility (internal variance) is terrible. <a href="https://en.wikipedia.org/wiki/Kelly_criterion">Kelly's criterion</a> demonstrates how damaging it is to ROI. You can't risk everything on volatile stocks (especially debt). The money sitting on the sidelines won't generate returns.</p><p>So here are the correlations between valuation ratios and the standard deviation of year-over-year returns. </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/mHOTv/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d86e5602-3337-4baa-9b5a-dcbef5a3ba04_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:379,&quot;title&quot;:&quot;Valuation ratios vs 5y Returns, adjusted for risk&quot;,&quot;description&quot;:&quot;N Samples = 8725   Investments size have been adjusted to take risk into account using Kelly's coefficient.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/mHOTv/1/" width="730" height="379" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><div id="datawrapper-iframe" class="datawrapper-wrap outer" 
data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/BtnIq/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a8bf829-2358-43d8-8154-7a594702b635_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:783,&quot;title&quot;:&quot;Correlations between Valuation Ratios and 5y Return  Across Industries , adjusted for risk&quot;,&quot;description&quot;:&quot;Each value represents the correlation between a specific ratio and 5y return within a certain industry.   Investments size have been adjusted to take risk into account using Kelly's coefficient.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/BtnIq/1/" width="730" height="783" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>I find it a bit odd: I thought undervalued companies were considered riskier, and that their risk was a reason for their undervaluation. But they are actually <strong>less</strong> volatile.</p><p>Banks and asset management firms are exceptions. Investors seem especially good at assessing and pricing in their risks. These are the only industries where undervaluation reliably correlates <strong>positively</strong> (and reliably, i.e, with a large sample size) with stock volatility.</p><p>Still, the STD of yearly returns gives an incomplete picture. Using the <a href="https://en.wikipedia.org/wiki/Beta_(finance)">beta coefficient</a> instead would have been a better option. 
Please contact me if you want me to add the Ratios vs Beta correlations to this analysis; I haven't computed those yet.</p><h2>Conclusion</h2><p>Investors seem to exhibit a slight, systematic bias toward believing they can predict the future and capitalize on it. Value investing appears somewhat more effective than growth investing.</p><h2>Appendix: methodology</h2><p>I used <a href="https://site.financialmodelingprep.com/?utm_source=substack&amp;utm_medium=substack&amp;utm_campaign=eloi1">FMP</a>&#8217;s API to gather financial data, focusing on US, European, and Australian public companies due to their more complete and reliable information. I screened approximately 15,000 businesses, initially filtering for:</p><ul><li><p>Companies listed for over 5 years</p></li><li><p>Data accuracy (excluding outliers or inconsistencies like incorrect financial figures or improperly adjusted share counts for stock splits)</p></li></ul><p>This process narrowed the dataset to 8,725 companies. </p><p>To calculate returns, I excluded the first two years of each company's public data and then computed returns with reinvested dividends, up to the last available data points.</p><p>Ratios were calculated using each company's third public financial statement, along with the stock price following its release.</p><p><a href="https://github.com/edereynaldesaintmichel/value-investing">Github repo</a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>P/E ratios of -500 or +500 have basically the same &#8220;meaning&#8221;, while they are polar opposites. Reversing the ratios gives -0.002 and +0.002, which are close together. It effectively reflects that you&#8217;re paying a lot for current profits. 
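</p><p>A quick numeric illustration (hypothetical figures):</p>

```python
# Two hypothetical companies priced at 1000 with tiny earnings of opposite sign.
price, eps_a, eps_b = 1000.0, -2.0, 2.0

pe_a, pe_b = price / eps_a, price / eps_b  # -500 vs +500: numerically opposite
ep_a, ep_b = eps_a / price, eps_b / price  # -0.002 vs +0.002: almost identical

print(pe_a, pe_b)  # -500.0 500.0
print(ep_a, ep_b)  # -0.002 0.002
```

<p>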
The 1/x function is not monotonic, which is bad for correlation computation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In the no-rebalancing case the &#8220;average&#8221; overall multiplier (when combining the two separate outcomes) would be the mean of (1.5&#8309;) and (0.5&#8309;); that is, (7.59 + 0.03)/2 &#8776; 3.81, corresponding to a total return of about +281%</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>You can get a sense of it by looking at the formula.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2xBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2xBV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 424w, https://substackcdn.com/image/fetch/$s_!2xBV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 848w, https://substackcdn.com/image/fetch/$s_!2xBV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2xBV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2xBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png" width="483" height="142" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:142,&quot;width&quot;:483,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2xBV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 424w, https://substackcdn.com/image/fetch/$s_!2xBV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 848w, https://substackcdn.com/image/fetch/$s_!2xBV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2xBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c347caf-ca3b-4ba1-b0d5-9909c6990276_483x142.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p> </p></div></div>]]></content:encoded></item><item><title><![CDATA[[Tech report] How I analyzed 337,576 financial statements]]></title><description><![CDATA[Some fun math and a bit of right-wing politics.]]></description><link>https://www.eloidereynal.com/p/tech-report-how-i-analyzed-337576</link><guid isPermaLink="false">https://www.eloidereynal.com/p/tech-report-how-i-analyzed-337576</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Fri, 17 Jan 2025 23:12:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6378557c-7b39-43cd-b1e0-6ac9d24936ad_1820x1023.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is for those who like Machine Learning and Math. The next post will be about business/finance so feel free to skip this one! Or else, just jump to the &#8220;What about politics&#8221; section.</em></p><p>When I had the idea of completing the financial statements analysis, I genuinely believed it would take no more than 8 hours. I ended up spending over 40.</p><h2>Data acquisition</h2><p>Everything begins with data. I got in touch with <a href="https://site.financialmodelingprep.com/?utm_source=substack&amp;utm_medium=substack&amp;utm_campaign=eloi">Financial Modeling Prep</a> to see if they could grant me a full, free access to their API in exchange for a bit of promotion. They agreed.</p><p>Downloading the data was quite straightforward. The only challenge was dumping JSON objects larger than 1GB. To avoid overloading the RAM of my old laptop, I split the data into multiple files. 
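</p><p>The chunked dump is straightforward; here is a minimal sketch (hypothetical file layout, not the exact script I used):</p>

```python
import json

def dump_in_chunks(records: dict, chunk_size: int, prefix: str) -> list:
    """Write a big dict of per-company reports as several smaller JSON files,
    so no single json.dump call has to hold gigabytes in memory at once.
    Hypothetical layout -- not the exact script used for the dataset."""
    paths = []
    items = list(records.items())
    for i in range(0, len(items), chunk_size):
        path = f"{prefix}_{i // chunk_size:04d}.json"
        with open(path, "w") as f:
            json.dump(dict(items[i:i + chunk_size]), f)
        paths.append(path)
    return paths

# Usage: dump_in_chunks(all_statements, chunk_size=1000, prefix="statements")
```

<p>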
In total, I obtained financial reports from 25,866 companies, with up to 30 years of historical data each.</p><p>It's important to note that a value of 0 for a line item can have multiple meanings. It could indicate that the line item is not applicable to the specific financial statement (e.g., Cost of Revenue or Inventory for a Bank). Alternatively, it might mean the data scraper at Financial Modeling Prep was unable to locate the data, or it could truly represent a value of 0. <strong>So, a 0 in the data does not necessarily signify a genuinely small value (something between -0.1 and 0.1), but rather one of the following: a non-applicable item, a value that was not found, or an actual value of 0.</strong></p><h2>Data cleaning</h2><p>Data cleaning proved to be more difficult than anticipated. I initially assumed the data would be clean. But when I realized that the average Revenue in my whole dataset was 1e28 dollars and the average net profit margin was 5000% (net income 50 times higher than revenue on average), I suspected that maybe something was wrong. So I asked Claude to come up with upper and lower bounds for each of the report lines. 
After a few back and forth passes, it gave me sensible limits.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I6Zc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I6Zc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 424w, https://substackcdn.com/image/fetch/$s_!I6Zc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 848w, https://substackcdn.com/image/fetch/$s_!I6Zc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 1272w, https://substackcdn.com/image/fetch/$s_!I6Zc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I6Zc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png" width="715" height="151" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/666840c9-fea0-472a-9bc8-79381beeb259_715x151.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:151,&quot;width&quot;:715,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48657,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I6Zc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 424w, https://substackcdn.com/image/fetch/$s_!I6Zc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 848w, https://substackcdn.com/image/fetch/$s_!I6Zc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 1272w, https://substackcdn.com/image/fetch/$s_!I6Zc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666840c9-fea0-472a-9bc8-79381beeb259_715x151.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>By the way, since a good portion of the statements were reported in a currency other than the USD, I had to convert the numbers before applying the limits. And as some of the line items were ratios, I had to exclude them from the exchange-rate conversion process. 
</p><p>I tried two different approaches to handling invalid statements:</p><ul><li><p>Discarding the statement and all other reports from the same company</p></li><li><p>Replacing the invalid value with an arbitrary value, such as <strong>0.01</strong>, unlikely to exist elsewhere in the dataset. (Remember this value, as I will refer to it later.)</p></li></ul><p>The first one resulted in over 80% of the statements being discarded, which made the dataset too small: my model was immediately overfitting and the val loss was terrible.</p><p>The second one resulted in a very sparse dataset. On average, approximately 40% of the inputs were either zeros (i.e., not applicable, such as Cost of Revenue in banks&#8217; statements) or <strong>0.01</strong> (i.e., invalid data).</p><p>Which is not ideal, because these are false zeros. They don&#8217;t carry the same information as a real value of 0. They are qualitatively different from &#8220;something small, between -0.1 and 0.1&#8221;, and they should be treated accordingly.</p><p>On the one hand, I want my model to be quite linear because my problem is also largely linear, and I don't want the model to be so complex that it's prone to overfitting. On the other hand, 40% of my values need to be treated in a completely non-linear fashion. There needs to be a clear distinction between a 0 (not applicable), a 0.01 (invalid data, which should be disregarded), and a 0.02 (a normal, genuine near-zero value). This behavior is not typical for MLPs, as it's inherently non-linear. Hence, a slightly fancier architecture can be useful.</p><p>Now, you can&#8217;t just feed a model raw financial data: having some inputs with mean and sd below 1 (like ratios) alongside Revenue figures in the billions is far from best practice. So one of the best things you can do is normalize the data so that mean = 0 and sd = 1 for all input features. 
For example, if the average revenue across the whole dataset is 1 billion, and its standard deviation is also about 1 billion, we would convert each revenue value using the formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;normalized\\_revenue = \\frac{revenue - 10^9}{10^9}&quot;,&quot;id&quot;:&quot;DZKFUODYAY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Or, more generally:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;normalized\\_value = \\frac{value - mean}{standard\\_deviation}&quot;,&quot;id&quot;:&quot;SMZWJANNQG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Let&#8217;s imagine we exclude the original invalid values from normalization. That is to say, we normalize everything except for 0 and 0.01, which we keep as they are. By doing this, we effectively ensure that these unknown or not-applicable values are treated as average values. Indeed, 0 is now the exact average (by construction) of our new value distribution, and 0.01 is close to that average, being just 0.01 standard deviations above it. 
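</p><p>In code, this sentinel-aware normalization might look like the following sketch (variable names are mine; <code>0.01</code> is the invalid-data code from above):</p>

```python
import numpy as np

INVALID = 0.01  # sentinel for invalid/off-bounds data; 0.0 means "not applicable"

def normalize(X: np.ndarray) -> np.ndarray:
    """Standardize each feature to mean 0 / sd 1, leaving sentinels untouched."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        real = (col != 0.0) & (col != INVALID)   # genuine values only
        mean, sd = col[real].mean(), col[real].std()
        col[real] = (col[real] - mean) / sd      # sentinels stay at 0 and 0.01
    return X
```
<p>This is a sketch, not production code: it does not guard against a zero standard deviation or a column with no genuine values.</p><p>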
This approach is beneficial as it aligns with the natural heuristic we humans use when guesstimating: "If I don't know the exact figure, I assume it's not too far from average."</p><p>Now that we have made everything possible on the dataset side to mitigate the problems caused by these strange values, let&#8217;s see how architecture comes into play.</p><h2>Over-engineering a NN architecture: the ultimate guide</h2><p>Here are the general features of the model:</p><ul><li><p>Input dim: 101 features/year * 3 years = 303 features</p></li><li><p>Output dim: 8 features for next year <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p></li><li><p>Training data: </p><ul><li><p>inputs: financial statements for years n-3, n-2 and n-1 (every possible one until last_year-2)</p></li><li><p>outputs: 8 features from year n&#8217;s financial statement (until last_year-1)</p></li></ul></li><li><p>Val data:</p><ul><li><p>inputs: financial statements for last_year-3, last_year-2 and last_year-1</p></li><li><p>outputs: 8 features from last year&#8217;s financial statements</p></li></ul></li></ul><p>I can hear people thinking &#8220;there is massive data leakage, beginner&#8217;s mistake, haha&#8221;. I kindly suggest the people thinking that <a href="https://www.youtube.com/watch?v=jYFefppqEtE">should have their kneecaps split</a> (link is safe, no worries). </p><p>If you're curious about that concern, please refer to the Appendix section. For now, let's move on to the network's architecture.</p><h3>Network architecture choice</h3><p>When predicting sequence data, RNNs and Transformers are often considered the go-to models. In our case, they would be unsuitable.<br>As we saw in the <a href="https://eligius.substack.com/p/i-analyzed-337576-income-statements?r=4ksqg3">previous post</a>, a persistence model is very effective, and it doesn&#8217;t get more linear than that. 
<strong>When a simple linear regression does an acceptable job, there is usually no need for a fancier architecture than a simple <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">MLP</a></strong>. </p><p>As explained in the Data cleaning section, the values 0 and 0.01 should ideally be treated differently from typical near-zero numbers. So I tried to design an architecture that allows for that without departing too much from a standard <a href="https://www.geeksforgeeks.org/feedforward-neural-network/">FFN</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><p>I first had an idea that was quite mathematically sound but turned out to be a parameter-inefficient, hardly convergent, overfitting-prone mess<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>Then a second approach, simpler and mathematically quite similar to the first, avoided matrix computations. It proved to be a partial failure: it was slightly less parameter-efficient than a simple MLP, while being better at handling very sparse data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>Finally, a third approach worked perfectly. It achieved a lower test loss while using fewer parameters than an MLP, and resulted in cleaner gradients that were easier to interpret.</p><p>The basic, non-optimized version is incredibly simple.</p><p>First, for each input vector (a flat vector with 303 features &#8211; 101 for each year over 3 years), a <em>`mask_invalid`</em> vector is created. 
This is a 303-dimensional vector where each position holds a 1 if the corresponding value in the input vector is 0.01 (the code for 'off-bounds'), and a 0 otherwise.<br> It&#8217;s really a kind of grid that, overlaid on the original input vector, would show where the invalid values are.</p><p>Next, a <em>`mask_zero`</em> vector is defined in the same way. This vector indicates the positions of the 0 values in the input vector.</p><p>Then, these three vectors are concatenated and fed to a simple Linear layer + LeakyReLU. The first third of the output vector is considered the next layer&#8217;s <em>input_vector</em>, and the other two thirds are its <em>mask_invalid</em> and <em>mask_zero</em> vectors. It&#8217;s equivalent to a simple MLP whose input vector is <br><em>concat(values_input_vector, mask_invalid, mask_zero)</em> and whose output is an 8-dimensional prediction vector.</p><p>So this un-optimized network consisted of two Masked Layers (the name I gave them) and a linear output layer.</p><p>If we look at the parameter count of one Masked Layer, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N = (3 \\cdot input\\_size) \\cdot (3 \\cdot output\\_size) + 3 \\cdot output\\_size&quot;,&quot;id&quot;:&quot;KWNJKRONUV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The term before the &#8220;+&#8221; is the size of the transformation matrix, and the term after is the bias of our layer.</p><p>This is quite a lot. Of course, it&#8217;s easily handled by my laptop, but remember that parameter count directly influences proneness to overfitting on small datasets (like ours). The more parameters a model has, the better it can learn the noise of the training dataset, and the more it will struggle with test loss.</p><p>Moreover, we can definitely sense that such an architecture is not optimal. 
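</p><p>A minimal NumPy sketch of the first such Masked Layer (function and variable names are mine, not from my actual code; subsequent layers would instead take the three thirds of the previous output directly):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_layer(values, W, b, alpha=0.01):
    """Concatenate values with both indicator masks, then Linear + LeakyReLU."""
    mask_invalid = (values == 0.01).astype(float)  # 'off-bounds' code positions
    mask_zero = (values == 0.0).astype(float)      # 'not applicable' positions
    x = np.concatenate([values, mask_invalid, mask_zero], axis=-1)
    z = x @ W + b
    return np.where(z > 0, z, alpha * z)           # LeakyReLU

in_f, out_f = 303, 64
# Parameter count matches the formula: (3*in_f)*(3*out_f) + 3*out_f
W = rng.normal(size=(3 * in_f, 3 * out_f)) * 0.01
b = np.zeros(3 * out_f)
out = masked_layer(rng.normal(size=(8, in_f)), W, b)  # shape (8, 3 * out_f)
```
<p>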
Two thirds of the concatenated input vector are very light on information (a mere 303*2 = 606 bits, compared to 303*32 = 9,696 bits for the first third<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>) and two-thirds of the matrix&#8217;s parameters are wasted just to handle this small amount of information.</p><p>The first thing that could be done to decrease the number of parameters is to project each of the <em>input_values</em> and <em>masks</em> into a smaller sub-space before concatenating them.</p><p>At first I tried projecting them into different sub-spaces through different projection matrices, and it yielded good results.</p><p>Then I tried using the same projection matrix for all three projections, on the very scientific ground of &#8220;Hell yeah, I bet it works&#8221; and you bet it worked fine, achieving even lower test loss (while increasing train loss), which is perfect.</p><p>So the general architecture of one layer looked like that:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NyA6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NyA6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 424w, https://substackcdn.com/image/fetch/$s_!NyA6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 848w, 
https://substackcdn.com/image/fetch/$s_!NyA6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 1272w, https://substackcdn.com/image/fetch/$s_!NyA6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NyA6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png" width="1209" height="487" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:487,&quot;width&quot;:1209,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NyA6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 424w, https://substackcdn.com/image/fetch/$s_!NyA6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 848w, 
https://substackcdn.com/image/fetch/$s_!NyA6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 1272w, https://substackcdn.com/image/fetch/$s_!NyA6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4be0a-b4f9-43ce-b0d4-9a530657a773_1209x487.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I then computed the condition number of each matrix and got quite a high number (~1e3) for each layer&#8217;s <em>result_proj</em> 
matrix. It was a clear indication that I could LoRA<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> them. I did so with a compression ratio of 3.</p><p>Then we have the final architecture of our layers, that looks like the following: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_zHY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_zHY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 424w, https://substackcdn.com/image/fetch/$s_!_zHY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 848w, https://substackcdn.com/image/fetch/$s_!_zHY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 1272w, https://substackcdn.com/image/fetch/$s_!_zHY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_zHY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png" width="1291" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1291,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163283,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_zHY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 424w, https://substackcdn.com/image/fetch/$s_!_zHY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 848w, https://substackcdn.com/image/fetch/$s_!_zHY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 1272w, https://substackcdn.com/image/fetch/$s_!_zHY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fac7ce-26f3-45fb-93f0-bfde97d1195f_1291x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>All in all, the model has 24,625 parameters and achieves a 3.5% better test loss than the best MLP I came up with, which has 49,459 parameters. </p><p></p><h2>Of Convos and LoRAs: why this architecture works so well</h2><p>The first part of the forward pass is equivalent to a 1D convolution with <em>out_features</em> filters of size <em>in_features</em> and stride <em>in_features</em>, followed by a flattening that interleaves each filter&#8217;s result.</p><p>What this means is that the projection matrix <em>main_proj</em> gets optimized to select filters that maximize the information retained from each of the input vector&#8217;s 3 parts. Just like a normal learned projection matrix does, you may say, but the difference is that it has to make concessions to find the best filters for all three sub-parts. 
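</p><p>This convolution equivalence is easy to check numerically; here is a tiny NumPy demonstration with shrunken dimensions (the real layer works on three 303-sized parts):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f = 5, 4  # toy sizes for illustration

P = rng.normal(size=(in_f, out_f))   # shared projection matrix
x = rng.normal(size=3 * in_f)        # concat(values, mask_invalid, mask_zero)

# Applying the same projection to each third of the concatenated vector...
shared = np.stack([x[i * in_f:(i + 1) * in_f] @ P for i in range(3)])

# ...equals a 1D convolution with out_f filters of length in_f and stride in_f.
conv = np.stack([
    np.array([x[s:s + in_f] @ P[:, k] for k in range(out_f)])
    for s in range(0, 3 * in_f, in_f)
])
assert np.allclose(shared, conv)
```
<p>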
This is analogous to a CNN learning the best filters for processing images.</p><p>And in our case, it works fine: the similarity in data presentation means that a single filter is likely to effectively process both the main values and their corresponding masks.</p><p>The contrast between the two stages of the forward pass is particularly interesting. The first stage achieves parameter efficiency through weight-sharing and convolutions, while the second achieves it by LoRA-ing the transformation matrix.</p><p>Here is why it works:</p><ul><li><p>Convolutions aim to find the (usually) <strong>full-rank</strong> down projection that best captures <strong>local features</strong> across the image, using a fixed number of parameters. The only global interaction between features in convolutions comes from the competition to change the filters' weights during back-propagation.</p></li><li><p>LoRAs are about finding the down projection that best captures <strong>global features/interaction</strong>, and expanding the result in a meaningful way. 
It results in a <strong>low-rank</strong> transformation.</p></li></ul><p>Basically, the convolution finds a common ground for features interaction, and the LoRA projection enforces this interaction, all in a parameter efficient manner.</p><h2>Training</h2><p>Here is a comparison between the best MLP and the best Masked Net (the above mentioned architecture) I could design.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/u27UP/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65904f6c-eee5-4774-be90-68d4dc6d55e1_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:554,&quot;title&quot;:&quot;MLP vs Masked Net: training run&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/u27UP/1/" width="730" height="554" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>We can see that there is likely some room for improvement in weight initialization. Furthermore, the training loss is higher for the masked network compared to the MLP, while the validation loss is lower. This suggests reduced risk of overfitting and better predictive potential.<br></p><h2>Conclusion/Executive summary</h2><ul><li><p>Data Preprocessing Matters. Careful data cleaning and normalization, including separate handling of invalid data and genuine zeros, was critical.</p></li><li><p>Masked Nets perform great. 
A link to the GitHub repo is in the comments section if you need to use them.</p></li><li><p>Parameter efficiency is key. A parameter-inefficient net is not only compute-intensive to train, but also at risk of overfitting on small datasets.</p><p></p></li></ul><p></p><h2>Appendix</h2><h3>Data leakage concerns</h3><p>Let&#8217;s focus on a single company. During training, the model will be presented with data from, say, 2009, 2010 and 2011, and asked to predict the 8 desired features for 2012. It will also see, either before or after the aforementioned training example (training examples are randomly presented), another one that looks like 2008, 2009, 2010, and be asked about 2011. At that point, the model could think &#8220;Haha, I know what I have to predict because I have seen the data sometime before in my inputs&#8221;. So it will just parrot back the 2011 report as &#8220;stored&#8221; in its weights. The model is then learning to parrot things back, not to predict them. </p><p>The problem is especially noticeable in the case of transformers. If you don&#8217;t mask the upper part of the attention matrix, the transformer will be able to peek into future tokens and output them without trying to predict them<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Even if you properly mask it, presenting a transformer with overlapping sequences will favor such behavior: it will learn sequences rather than language rules.</p><p>So this concern is theoretically valid. But it doesn&#8217;t mean that ML zealots have the right to live.</p><h3>Why it isn&#8217;t a big deal</h3><p>Three reasons:</p><ul><li><p>In order to learn sequences, a model must have a lot of &#8220;storage space&#8221;, meaning a lot of parameters relative to the size of the training dataset. 
Which is not the case for our ~24k-parameter model, trained on a 130MB dataset.</p></li><li><p>Models prone to data leakage are those whose behavior is very non-linear, such as transformers and RNNs. Our prediction model is quite linear, using only two low-dimensional LeakyReLU layers, which is nowhere near the 3rd-degree polynomial attention seen in transformers (not even accounting for the softmax).</p></li><li><p>Predicting financial statements is fundamentally linear. One of the best naive prediction models is just a persistence model (i.e. predicting that the numbers will stay the same as before): an identity matrix with no bias, and it doesn&#8217;t get more linear than that.<br>So it&#8217;s unreasonable to think that the Adam optimizer will tune weights to encourage sequence memorizing while both the data and the network&#8217;s architecture favor quasi-linear behavior. <a href="https://ludic.mataroa.blog/blog/i-will-fucking-piledrive-you-if-you-mention-ai-again/">ML fanatics should be piledriven</a> (link is safe).</p></li></ul><h3>A real-world, political analogy</h3><p>Let&#8217;s imagine three people who need to talk to smooth over some minor disagreement. <br>The first is a heavyweight boxer with a fair-sized chip on his shoulder.<br>The second is a non-violent communicator who holds little grudge against the other two because he has learned to process his <s>bad</s> <s>negative</s> painful feelings. <br>The third is a social justice warrior, full of indignation against the world, including the other two.<br>One of their common friends, who&#8217;s called <em><strong>Convolution</strong></em> (a name he didn't choose), tries to find the best place for all of them to &#8220;talk&#8221;. The boxer immediately calls for a boxing ring. The non-violent guy wants to talk through the disagreement in a nice, calm place and suggests the others come over to his place. 
The SJW doesn&#8217;t want to talk at all (sorry guys, I tried to keep the post non-political until now, but it&#8217;s impossible given the deeply controversial nature of the topic). <br><em><strong>Convolution</strong></em> then says, 'Well, SJW doesn't want to communicate; the Non-Violent one wants a calm place but doesn't hold a grudge, so his opinion doesn't matter as much; and the Boxer wants a ring, so we'll go with the ring.'<br>Convolution just found the best place for each of the three to talk. <br><br>I skip the part where Non-Violent submits a kind and empathetic disagreement to Convolution on that matter and SJW calls out the systemic violence of the capitalist system that oppresses the weak. They all meet at the local gym, in the boxing ring.</p><p>At first, Non-Violent and SJW are a bit hesitant and cling to the ropes. Then the referee, whose name is <em><strong>LoRA Down Projection</strong></em> (a name he obviously didn&#8217;t choose either), pulls them off the ropes and shoves them into a smaller space with Boxer. He then says, &#8220;I want the disagreement settled in ten minutes&#8221;. Boxer needs less time than that to explain, with loads of compassion, that he was hurt by the two others and that he&#8217;s happy to have a thoughtful conversation to figure out what went wrong. And you know, he likes to talk with his hands. <br>Seven minutes later, they all come to the common conclusion that everything is fine, it&#8217;s never been finer; boxer, &#8220;you&#8217;re my best friend, we&#8217;re brothers in arms forever (but please release mine, I want to use them again some time)&#8221;. A lawyer passing by, whose name is <em><strong>LoRA Up</strong></em> (you bet), decides to translate this informal agreement into legal terms. 
So, he expands the &#8220;everything&#8217;s fine&#8221; conclusion to 35 pages and sends them to the judge.</p><p>Blah blah ever after.</p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>["revenue", "netIncome", "eps", "epsdiluted", "freeCashFlow", "totalStockholdersEquity", "operatingCashFlow", "dividendsPaid"]</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>FFN (feed forward network) and MLP (multi-layer perceptron) can be used interchangeably as they basically mean the same thing: linear layers with non-linearity between.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>It was about interpolating missing values using a regression matrix (that basically said &#8220;Here is my best guess for metric X based on the other 302 metrics&#8221;, like &#8220;Oh, the Revenue line is missing but I will be able to infer it from COGS, Operating expenses and EBITDA&#8221;), with each &#8220;guess&#8221; weight-averaged by a Softmaxed learned covariance-like matrix, parameterized by the mask of missing/invalid values. Mathematically speaking, it was the &#8220;soundest&#8221; architecture, but it had too many parameters, involved complex matrix computations and was an absolute mess to train. I couldn&#8217;t use LayerNorms because they messed with the regression nature of the task. This architecture was an utter failure. 
It didn&#8217;t converge most of the time and when it did, there were obvious signs of overfitting.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Before going through the main neural net, every missing metric in the input was inferred from the others using a simple Linear layer, the result was multiplied by a mask vector (1 where data is missing/invalid, 0 otherwise), then added to the main input vector, resulting in a full input vector with each 0 replaced by a nice inferred value. This architecture kind of worked: replacing Linear layers by these &#8220;masked interpolation layers&#8221; resulted in a net improvement, but it was less parameter-efficient than a MLP due to a much higher parameter count in each layer. It was a partial failure, but I used this architecture for my previous post, out of pride to use MY architecture instead of a basic one.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This is just an estimate, as the &#8220;real&#8221; information stored in the vectors is also dependent on their data distribution and the architecture they are fed into.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Here, LoRA is the acronym for Low Rank <strong>Approximation</strong> and not Low Rank <strong>Adaptation.</strong></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>You can check my 
french posts on how ChatGPT works. Maybe I&#8217;ll translate them but it might be hard as the examples I used don&#8217;t work in English.</p></div></div>]]></content:encoded></item><item><title><![CDATA[I analyzed 337,576 Financial Statements]]></title><description><![CDATA[Here are the results.]]></description><link>https://www.eloidereynal.com/p/i-analyzed-337576-income-statements</link><guid isPermaLink="false">https://www.eloidereynal.com/p/i-analyzed-337576-income-statements</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Fri, 10 Jan 2025 23:36:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zbkx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cdbd9a3-d155-4490-ba13-629039246cea_1260x660.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post has been made possible by <a href="https://site.financialmodelingprep.com/?utm_source=substack&amp;utm_medium=substack&amp;utm_campaign=eloi">Financial Modeling Prep</a>, which gave me a full and free API access to their excellent database. This post is more about Business Analysis than Machine Learning, but I&#8217;ll write a technical one soon.</em></p><p>Drawing conclusions from Income/Cash Flow/Balance Sheet statements requires two things:</p><ol><li><p>A deep understanding of finance and how different items in financial statements reflect real-world business operations</p></li><li><p>Good intuition, developed through experience</p></li></ol><p>While machine learning models struggle with the first aspect, they excel at the second. 
That's why I decided to train a neural network on virtually every public financial statement issued since 2000 to discover what statistical and financial patterns a model could learn from experience.</p><p>When analyzing financial statements for investment decisions, I typically ask questions like:</p><ul><li><p>If a company's revenue grew strongly from year n-3 to year n-1, should I expect continued growth or regression to the mean?</p></li><li><p>Is long-term debt an indicator of healthy investment and long-term planning, or poor financial management?</p></li><li><p>When a company performs well, is it more likely to issue new stocks (increasing capital) or buy them back (returning capital to shareholders)? In other words, should I expect dilution?</p></li><li><p>More broadly, is studying financial statements worth my time before investing?</p></li></ul><p>The model I have trained provides some clues. It has been trained to predict 8 metrics<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> for the next year&#8217;s financial statements based on the last three years of history, each year comprising 101 components<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><h2>Is it possible to predict the future by looking at financial statements alone?</h2><p>To some extent.</p><p>A naive (yet surprisingly effective) approach is to simply carry over figures from year n-1 to year n. A small corrective factor can be applied to account for year-over-year average growth.</p><p>Any improvement in accuracy beyond this baseline can be considered successful. Below, I compare the model's performance to a basic persistence model. The metrics listed are those I consider most relevant for investors and stockholders.</p><p>Here, the &#8220;loss&#8221; is the opposite of accuracy. 
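</p><p>As a concrete sketch, the persistence baseline and this notion of loss might look like the following (illustrative Python; the function names and the mean-absolute-error reading of the loss are my assumptions, not the code actually used for this post):</p>

```python
import numpy as np

# Illustrative sketch, NOT the author's code. Metrics are assumed to be
# z-normalized per metric, so an error of 1.0 equals one standard
# deviation of that metric across the whole dataset.

def persistence_forecast(prev_year, growth_factor=1.0):
    """Carry year n-1 figures over to year n; `growth_factor` is an
    optional correction for average year-over-year growth."""
    return prev_year * growth_factor

def loss(pred, actual):
    """Average prediction error, in standard deviations."""
    return np.mean(np.abs(pred - actual))

# Toy example: one normalized metric for two companies.
prev = np.array([0.5, -1.2])
actual = np.array([0.6, -1.0])
print(loss(persistence_forecast(prev), actual))  # mean of |-0.1| and |-0.2|
```

<p>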
It is a common machine learning term, which, in this case, is the average prediction error, in standard deviations<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/qpMWD/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cdbd9a3-d155-4490-ba13-629039246cea_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:779,&quot;title&quot;:&quot;Predictibility of Financial metrics&quot;,&quot;description&quot;:&quot;Persistence Model vs ML model loss (Unit: standard deviation of the data in the whole dataset)&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/qpMWD/1/" width="730" height="779" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/zh38i/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24587cca-475d-4155-a096-8a37d130c0e8_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:525,&quot;title&quot;:&quot;Models predictive performances (Copy)&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/zh38i/1/" 
width="730" height="525" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>On average, my model is 16% more precise than a persistence model for these metrics (0.1341 vs 0.1596 average loss). It performs best on the first five metrics but shows poor performance on the last three.</p><p>This discrepancy exists because the heuristics the model learned to predict dividends, revenue, and stockholders' equity don't effectively apply to future predictions. While these patterns were learned from historical data (2000 to 2022/2023), they prove counterproductive when forecasting future performance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>It looks like <strong>there is no way to reliably predict the </strong><em><strong>Revenue</strong></em><strong> of a company just by looking at its past financial statements. The same holds true for </strong><em><strong>Dividends Paid</strong></em><strong> and </strong><em><strong>Stockholders&#8217; Equity</strong></em><strong>.</strong></p><p>The narratives &#8220;revenue should keep on growing as it has in recent years&#8221; and &#8220;revenue will have to regress to the mean&#8221; are undecidable, and mostly cancel out. If anything, the latter is slightly more accurate (see footnote <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>). 
This answers my first question.</p><p>In contrast, metrics related to net income show much higher predictability.</p><h3>Interpreting predictability</h3><p>First of all, it should be noted that the least predictable features are also the least volatile. On average, in the ~25,000 public companies I analyzed, the y.o.y changes in <em>Revenue</em> accounted for only 7% of the standard deviation of revenue across companies. To put it plainly: revenue varies much more across companies than it does y.o.y within the same company. The same can be said of <em>Stockholders&#8217; Equity</em>.</p><p>It is surprising, though, that these metrics can&#8217;t be predicted <strong>at all</strong>. After reflection, I think it boils down to this:</p><p><strong>We have to know the initial state of a system to predict its evolution</strong></p><p>Financial statements are made to reflect the financial state of a company. The metrics they display have survived selective pressure from investors and regulators to convey the most important and useful financial information. </p><p>But they don&#8217;t say much about the operational state of a business (which predicts revenue) or the emotional state of board members (which predicts Dividends and Stockholders&#8217; Equity). The initial state of a system is necessary to predict its evolution, and failing to capture it impairs prediction. Because it lacks the information to make a decent prediction, the model learns the noise and makes predictions based on nonsensical metrics it has constructed. </p><p>It's like a gambler thinking, "Each time red came up twice in a row, green followed." He has learned a rule that was true for some past examples but is no more likely than random to hold true in the future. The same gambler, if he has lots of experience but little mathematical ability, will also have observed (rather than calculated) that black and red have about the same probability of coming up. 
That rule, at least, is valid.</p><p>My model has learned the same kind of simple (and valid) rule: the best bet on Dividends, Revenue, and Stockholders' Equity evolution is that they remain the same year over year. It has just "improved" this somewhat, lowering the error it yields on past data at the expense of its real prediction ability (just like devising complex roulette rules beyond "black and red are roughly 50-50" is harmful to one's wallet).</p><p>In the coming weeks, I'm going to try to build a model capable of digesting operational and unstructured data, like the text of SEC filings and company websites.</p><h2>What Has the Model Learned? (And How Useful Is It?)</h2><p>Having a black box that makes predictions can be useful, but understanding its reasoning is even better. With a bit of math, we can peek inside.</p><h3>How does it predict EPS?</h3><p>EPS is arguably the most important metric for a shareholder to follow. Let's examine how the model predicts it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>The table below shows the relative importance of each metric in predicting EPS. Negative numbers indicate that the factor has a negative influence on EPS. However, "influence" should be understood more as correlation. Since the model doesn't fully grasp how different numbers interact, it sometimes makes counterintuitive connections. For example, it considers Cost of Revenue a positive predictor of future EPS, simply because it tends to grow at about the same rate as Revenue itself. Therefore, please interpret the following tables with caution.</p><p>The numbers listed below are the gradients of EPS with respect to the different inputs. They give an idea of the relative weight of each input metric in EPS forecasting. 
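</p><p>For readers who want the mechanics: one way to produce such a table is to average the model's input gradients over a batch and normalize them. Below is a sketch under assumptions of mine: a fixed random linear map stands in for the trained network, and finite differences replace backprop (all names are hypothetical, not the author's code):</p>

```python
import numpy as np

# Sketch of a gradient-based sensitivity table (my reconstruction, NOT the
# author's code). A fixed random linear map stands in for the trained net:
# 3 years x 101 components = 303 inputs, 8 predicted metrics.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 303))

def model(x):
    return W @ x

def avg_input_sensitivity(model, batch, out_idx, eps=1e-4):
    """Central-difference gradient of output `out_idx` w.r.t. each input,
    averaged over the batch, normalized so the largest |value| is 1."""
    n_inputs = batch.shape[1]
    grads = np.zeros(n_inputs)
    for x in batch:
        for j in range(n_inputs):
            dx = np.zeros(n_inputs)
            dx[j] = eps
            grads[j] += (model(x + dx)[out_idx] - model(x - dx)[out_idx]) / (2 * eps)
    grads /= len(batch)
    return grads / np.abs(grads).max()

batch = rng.normal(size=(16, 303))  # 16 random "companies"
importance = avg_input_sensitivity(model, batch, out_idx=2)
```

<p>On the stand-in linear model, averaging over the batch changes nothing; on the real, nearly linear network, it smooths out the small local variations of the gradients.</p><p>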
They have been normalized so that the most important gradient is 1.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/1EyPG/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3697e82-f4e5-4626-af3f-034a662b27e2_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:915,&quot;title&quot;:&quot;Most important metrics to predict EPS&quot;,&quot;description&quot;:&quot;Searchable &amp; Sortable table&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/1EyPG/1/" width="730" height="915" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>A few observations stand out: naturally, EPS and Diluted EPS are the best predictors of themselves, following the "persistence-is-the-best-default-model" rule. Beyond that, the results become harder to interpret. At first glance, they seem counterintuitive:</p><ul><li><p> Free Cash Flow, CapEx, and Net Cash Provided By Operating Activities show positive gradients, while Operating Cash Flow doesn't&#8212;despite being closely related.</p></li><li><p>Net Income has the most negative gradient, while Operating Income shows a reasonably positive one.</p></li><li><p>Total Debt and Net Debt display opposite gradients across all years.</p></li></ul><p>What's happening here is that the model is creating its own formulas and constructing metrics it needs to predict future EPS. 
It essentially arrives at the realization: "Hmm, future EPS seems to be related to the previous year's Operating Cash Flow minus previous year's CapEx &#8211; there might be something here. And interestingly, most of the time, the previous year's FCF is nearly identical to this subtraction. However, when there's a discrepancy, I'd rather trust FCF, so I'm going to assign a large positive weight to the previous year's FCF, a positive weight to CapEx, and a negative weight to Operating CF."<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>It kind of hedges its bets.</p><h3>What about Free Cash Flow?</h3><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/74fMF/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fd9b80a-2723-47e9-b3e0-c6831bebc58f_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:915,&quot;title&quot;:&quot;Most important metrics to predict Free Cash Flow&quot;,&quot;description&quot;:&quot;Searchable &amp; Sortable table&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/74fMF/1/" width="730" height="915" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The model is still hedging, but the results are far more explainable. 
It appears that the best predictor of Free Cash Flow (FCF) &#8211; aside from FCF itself &#8211; is the "net change from year n-3 to year n-1 in things that are, represent, or consume cash (and in that last case, are not strictly mandatory)." This includes investments, equity, retained earnings&#8230; essentially, everything that represents the company's ability to generate cash, even if it doesn't appear as FCF.</p><p>Interestingly, historical data is much more important here than it is for predicting EPS. EPS can be reliably predicted<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> using only year n-1 data, or very nearly so. In contrast, when predicting FCF, data from years n-2 and n-3 combined have nearly the same significance as data from year n-1. </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/uzgHw/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f26e0487-d9e7-49aa-9602-719338a83796_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:228,&quot;title&quot;:&quot;Relative importance of year n-3, n-2 and n-1&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/uzgHw/1/" width="730" height="228" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><h3>Now for Net Income, then I&#8217;ll go to bed</h3><div id="datawrapper-iframe" 
class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/ULnxP/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f589cb3b-ab50-4a57-a2dc-4ad2add9e21b_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:915,&quot;title&quot;:&quot;Most important metrics for Net Income forecasting&quot;,&quot;description&quot;:&quot;Searchable &amp; Sortable table&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/ULnxP/1/" width="730" height="915" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>That's quite interesting. The model almost didn't hedge here. Two main takeaways:</p><ol><li><p>Revenue shows a positive gradient in year <em>n</em>-1 but a negative one in year <em>n</em>-3. This suggests that, <strong>regarding revenue's impact on future Net Income, growth is more crucial than absolute figures.</strong></p></li><li><p>The best negative predictor is dividends paid. So, the notion of board members milking a company dry before things go south might not be just a legend after all. On a personal note, I once received a 28% dividend yield from a Polish company before things took a turn for the worse (the company isn't defunct yet, but it's definitely past its prime). 
This illustrates that while predicting dividend payments might not be reliable, the payment itself can carry significant predictive weight.</p></li></ol><p></p><h2>Conclusion / Executive Summary</h2><p>Forecasting company financials based on past financials does work, but only for metrics loosely connected to operations and board members&#8217; sentiment. </p><p>Free Cash Flow, EPS, and Net Income are quite predictable, while Revenue and Dividends are not. This last observation suggests Financial Statements alone don&#8217;t say anything valuable about the mindset of board members and the operational state of a business.</p><p>Model hedging makes interpreting EPS prediction heuristics difficult. However, the prediction heuristics for Cash Flow and Net Income are easier to grasp. You can sort and search the gradient tables to get a better sense of the influence of each financial statement metric on predictions.</p><p>Using only financial statements to predict EPS, the prior year's data provides 70% of the predictive value. This figure drops to 55% for Free Cash Flow and 57% for Net Income.</p><p></p><h2>Further Work</h2><p>It is surprising that a prediction model could work at all; I wasn&#8217;t expecting such significant results. Still, much more can be done by including unstructured/text data. For example, I&#8217;ve always wondered if a business&#8217;s performance could be predicted from the content of its website.</p><p>Also, it would be a good idea to divide this analysis by sector. Banks and manufacturing companies don&#8217;t include the same items in their financial statements. 
And even though it doesn&#8217;t matter much (the model is able to say &#8220;Oh, cost of Revenue is 0, it&#8217;s more likely to be a bank than something else&#8221;), it will definitely help with interpreting gradients.</p><p>Unfortunately, I might need to sell one or two kidneys to afford enough compute to train a model for that...</p><p>Another post will follow soon, discussing the technical aspects of this project, including some interesting math and a novel (yet simple) architecture I developed to handle the data's sparsity.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.eloidereynal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">You can subscribe if you want! Or else, just come back here some time.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>["revenue", "netIncome", "eps", "epsdiluted", "freeCashFlow", "totalStockholdersEquity", "operatingCashFlow", "dividendsPaid"]</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>So the input dim is 303 + 1 for currency embedding (Currency symbol, as a string, is difficult to feed into a NN. 
Representing it as a float would likely be too compressive, so I chose a 2-dimensional embedding to represent it)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The data was normalized beforehand.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In fact, if we dive even deeper into the last three metrics, we can see something interesting: averaging the persistence model and my model performs better than either alone. This means that my model has learned something valuable from the past data, but is over-applying it to the future, by a factor of approximately 2, meaning that (Model + PersistenceModel*2)/3 is the best prediction model for these metrics.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This regression-to-the-mean narrative is the one that prevails in past data, as it is what the model has learned. The gradients of the predicted revenue wrt revenue for years n-2 and n-3 are positive, which means that the model has learned that past glory is a better indicator of future revenue than strong growth. 
The effect is so small that this nitpicking only deserves a footnote.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>As this is a business post, I won&#8217;t go too deep into the math, but the general idea is that, as the model is nearly linear (it involves only two low-dimensional LeakyReLU non-linearities), its gradients are not too far from constant. So the local sensitivity of my model wrt its inputs is pretty much constant, and I can average it over a batch to get a good idea of each input&#8217;s influence on the predicted output.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>FCF &#8776; Operating Cash Flow - CapEx, but with low reliability (the input data wasn&#8217;t always clean), and it looks like lambda * FCF - (1-lambda) * (Operating Cash Flow - CapEx) is a better proxy for FCF than FCF itself. 
The gradients reflect that, giving CapEx a positive influence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>If a 30% improvement over the persistence model is deemed reliable</p></div></div>]]></content:encoded></item><item><title><![CDATA[The law of hidden costs]]></title><description><![CDATA[When driving drunk is an act of citizenship]]></description><link>https://www.eloidereynal.com/p/the-law-of-hidden-costs</link><guid isPermaLink="false">https://www.eloidereynal.com/p/the-law-of-hidden-costs</guid><dc:creator><![CDATA[Eloi de Reynal]]></dc:creator><pubDate>Sun, 22 Dec 2024 20:19:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5ecf8df9-0ee5-46fd-8aa0-afe9561b9bd2_1312x736.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>"It's stupid not to wear a helmet when cycling: it doesn't cost anything and it can save a life." This is an increasingly common opinion on the subject, and it's defensible.</p><p>Here's another one: "Since we're all going to die one day, it might as well be while cycling rather than shitting myself in a hospital bed." It's also defensible, but it's less convincing.</p><p>And here's another one, even less convincing but true in my opinion:</p><p>"Wearing a helmet when cycling can save a life, but it's not worth it. 
There are too many hidden costs associated with it."</p><p>Here are the hidden costs I'm thinking of:</p><ul><li><p>Time spent putting on and taking off the helmet.</p></li><li><p>The maintenance, through a physical habit, of the feeling of external danger linked to cycling.</p></li><li><p>The bulkiness of the helmet.</p></li><li><p>The mental load associated with carrying a helmet around: the fear of forgetting it, etc.</p></li><li><p>The decline in interest in cycling due to the inconvenience of wearing a helmet.</p></li></ul><p>In the debates I've had, these arguments failed to convince my friends. Indeed, the aforementioned costs are difficult to measure, while the benefit (a life saved) is measurable.</p><p>The debate becomes interesting when we actually bring in measurements and statistics. Some studies seem to indicate that wearing a helmet <a href="https://www.cyclehelmets.org/1012.html">does not significantly reduce mortality</a>, but like Rousseau, we're going to <a href="http://classiques.uqac.ca/classiques/Rousseau_jj/discours_origine_inegalite/origine_inegalite_intro.html#:~:text=Commen%C3%A7ons%20donc%20par%20%C3%A9carter%20tous%20les%20faits%2C%20car%20ils%20ne%20touchent%20point%20%C3%A0%20la%20question.">temporarily disregard the facts</a> to do some back-of-the-envelope calculations. Indeed, it's very probable, even obvious, that in the event of an accident, wearing a helmet reduces mortality. Too many factors come into play in more general studies, which explains their counterintuitive conclusions.</p><p>While sparing you the various hypotheses mentioned in <a href="https://docs.google.com/spreadsheets/d/1DF0sfWvc77g8VrHcOQ_D5Zqi0Gwjqvuofn_ZIQJaXjQ/edit?gid=0#gid=0">this spreadsheet</a>, it would seem that, all things being equal, you gain about 16 seconds of life per trip by wearing a helmet. 
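</p><p>The structure of that back-of-the-envelope calculation is simple: expected seconds gained = (probability of a fatal crash per trip) x (share of those deaths a helmet prevents) x (remaining life expectancy, in seconds). With made-up placeholder numbers (deliberately NOT the spreadsheet's inputs):</p>

```python
# All inputs below are HYPOTHETICAL placeholders, not the spreadsheet's values.
fatal_crash_per_trip = 5e-8        # assumed probability of a fatal crash on one trip
helmet_saves_fraction = 0.25       # assumed share of those deaths a helmet prevents
remaining_life_years = 40          # assumed remaining life expectancy

seconds_per_year = 365.25 * 24 * 3600
expected_gain_seconds = (fatal_crash_per_trip
                         * helmet_saves_fraction
                         * remaining_life_years
                         * seconds_per_year)
print(expected_gain_seconds)       # on the order of 16 seconds with these inputs
```

<p>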
Everyone can draw their own conclusions, but for my part, I am willing to lose 30 seconds of my life to live more freely and not bother with a helmet. Of course, it would take me more time to buy it, put it on and take it off than it would save me, which in itself is already a deal-breaker, but that's not even the heart of the problem: I find it hard to combine a form of carefree attitude, necessary for happiness, with a mindset so cautious that it pushes me to wear a helmet to save 16 seconds of life. For me, that's the overriding hidden cost.</p><p>A similar calculation can be made for the 80 km/h speed limit on roads, the COVID lockdowns, and so on.</p><p>Nowadays, we try to rationalize everything, to reduce measurable risks, without ever taking into account those that are not. However, pusillanimity is a real and serious risk, one fostered by wearing a helmet on a bicycle, by refraining from talking to strangers (even if they offer candy), by putting on a sweater whenever you go outside and it's cold... But it's not measurable, so it's always neglected.</p><h4>Some examples in business</h4><p>This phenomenon is very similar to the <a href="https://en.wikipedia.org/wiki/Streetlight_effect">streetlight effect</a>. It is widespread in almost every dimension of a business. To take marketing: imagine that, to increase the visit rate of my company's website (a measurable metric), I change the tone of our newsletter to make it artificially personal and more sales-oriented. We might manage to improve this metric, but probably at the cost of a legitimate decline in trust from readers and a subsequent drop in sales, effects that are much harder to measure.</p><p>Now take the management of &#8220;human resources&#8221;. It always seems good to "process" everything, in order to reduce the error rate and depend less on the individual performance of employees. But we rarely think about the effects of such a company policy on employee engagement and subsequent turnover. 
If an employee feels that we want to make them replaceable and that we have limited confidence in their ability to do their job, they are unlikely to stay long or to stay effective.</p><p>In engineering school, a mechanics professor who had worked for 20 years at Renault told me, about the lean manufacturing methods we were studying: "I'm not at all convinced by this practice. Maybe the sub-optimal layout of the production lines allows operators to move around the factory a bit and take advantage of it to rest. All things considered, we don't know whether optimizing everything is actually good for a company."</p><h4>In management information systems</h4><p>In IT, we often see similar paradoxes. Complete digitization seems to improve efficiency, but it can damage employee communication. With efficient messaging systems and easy access to data, employees have fewer reasons to talk in person. So much communication is lost in the process!</p><h4>The perverse effects of "rights" in software</h4><p>The most insidious problem, I think, often lies in a seemingly small detail of ERP systems: access rights.</p><p>The typical thinking goes like this: "Employee X is in role Y, they don't need to access menu Z, which is for employees in role W. They might break things or steal data. And anyway, it's not their job."</p><p>That's understandable, and sometimes it's even necessary. For instance, it's perfectly reasonable that an employee shouldn't see their colleagues' pay slips.</p><p>However, more often than not, this restrictive approach causes more harm than good.</p><p>Firstly, if the ERP system is built well, and the support team is on the ball, it's virtually impossible to "break everything." Regular backups ensure that, in a worst-case scenario, we'd only lose a few hours' worth of data entry. That's a cost of maybe tens of thousands of euros at most, nowhere near the millions an employee could cause by physically sabotaging production. 
It's actually kind of strange that the physical machinery is often left more accessible than the tightly locked-down ERP.</p><p>Secondly, there's a real disconnect between the message leaders often send and the way they act. On one hand, they'll give these empowering, motivational speeches, saying "Be proactive, give us your ideas, I trust you, drive the company forward!" On the other, these very same leaders implement restrictions that shout: "Stay in your lane, don't touch that, I don't trust you enough to give you these accesses." When words and actions clash like that, people will believe the actions. If ERP access is being used as a safety net, then either there&#8217;s a problem of trust from the top, or you have a weak team. Either way, it's a human or organizational issue, not an IT one.</p><h4>Driving drunk: an act of citizenship</h4><p>I'm reaching the heights of hypocrisy here, because I've never really drunk alcohol and don't plan to start anytime soon.</p><p>However, we can imagine the following situation: you're at a small party with your friends and the atmosphere is great. It would be even better with 2-3 more drinks. But you're afraid of having an accident on the way home if you drink too much, so you stop there. The obvious cost (dying in a car accident) makes you forget the hidden benefit (having a very good time with your friends).</p><p>This last example was perhaps not necessary, but it justifies the subtitle of this post, which I like.</p>]]></content:encoded></item></channel></rss>