MIME-Version: 1.0
Date: Thu, 29 Jan 2026 09:33:41 -0500
Subject: Re: Quanta discovery and Hessian eigenspectrum
From: Andy Arditi
To: Eric Michaud

Thanks for sharing. Seems great that they're trying to scale the susceptibilities stuff.

I can't say I understand all the theory fully yet. But the idea in 2.3 of doing an SVD on q(y|x), where q is the true distribution of language, is nice; I don't think I've seen that view made so explicit. But I'm not really sure these "modes" of the true data distribution are what we want? (E.g., I think we're more interested in the mechanisms that the model learned / the structure that the model identified, rather than some "true" underlying structure in the data. This reminds me of the epiplexity idea - a computationally weaker model might learn more interesting structure than a computationally stronger model; e.g., a neural net chess bot must learn patterns/structure while a brute-force tree-search bot does not. It is the neural net that learns non-trivial patterns and insights, and those are the things we're interested in uncovering.)

I'm also not really clear about the relation between "modes"/"patterns" of the data and mechanisms in the model. For example, would mechanisms such as binding of entities be considered modes? AFAIK SAEs also don't capture binding mechanisms. But binding mechanisms, for example, would definitely be considered a "quanta" in your framework (I think? It's a ~discrete mechanism that the model learns, and it helps it solve a bunch of next-token predictions). I'm curious what you think of their discussion of modes vs quanta.
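(To make that 2.3 picture concrete for myself, here's a minimal numpy sketch of taking an SVD of a conditional distribution q(y|x). The tiny matrix below is a made-up stand-in for the true distribution of language, purely for illustration.)

```python
import numpy as np

# Toy conditional distribution q(y|x): 4 contexts x, 3 next-tokens y.
# Row i is the distribution over y given context i (rows sum to 1).
Q = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# SVD of the conditional-distribution matrix: each right singular vector
# is a direction in next-token space, each left singular vector a
# direction in context space, and the singular values rank these "modes"
# of the data distribution by how much of q they explain.
U, s, Vt = np.linalg.svd(Q, full_matrices=False)
print("singular values:", np.round(s, 3))

# Rank-1 reconstruction from the top mode: captures the dominant
# structure (contexts 0 and 1 mostly predicting token 0).
Q1 = s[0] * np.outer(U[:, 0], Vt[0])
print("rank-1 reconstruction error:", np.round(np.linalg.norm(Q - Q1), 3))
```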
I personally feel pretty confused about the concepts of "modes" vs "quanta" vs "SAE features" (or rather, I feel like the paper doesn't give adequate clarification on the difference between these ideas). To me, all three seem quite distinct (mode = pattern in the data distribution; quanta = unit of computation/skill learned by the model; SAE feature = unit of representation learned by the model); but in the paper, they seem to imply that all three are sort of the same? (Sure, they are certainly all related; but they seem like different things to me.)

(I need to study their theory more; I reserve the right to erase all the above messy ramblings lol.)

Andy

On Sat, Jan 24, 2026 at 7:40 PM Eric Michaud wrote:

> Just started reading but seems good so far:
> https://arxiv.org/abs/2601.12703
>
>
> On Thu, Jan 15, 2026 at 7:09 PM, Eric Michaud wrote:
>
>> Glad you liked the post! You're a crazy person for reading the whole
>> thing in that much detail. It's long. I'm surprised by how viral it's gone
>> so far.
>>
>> Sounds like a good plan!
>>
>>
>> On Wed, Jan 14, 2026 at 11:01 AM, Andy Arditi wrote:
>>
>>> Hey Eric,
>>>
>>> Congrats!! Adam and Paul are great (I know Paul from MATS 5.0, and I've
>>> met Adam a couple times around Berkeley); I don't know much about Astera,
>>> but it seems like there are a lot of great minds there. Seems like a great
>>> setup for you!
>>>
>>> I also just finished reading your whole post; loved it, thanks for
>>> sharing :). (Aside from the content, I love footnote 2, and also the 100
>>> page pdf :P.) I took a bit of a break over the past month, but am back to
>>> thinking about the loss landscape <> mechanisms stuff, and potentially
>>> building off of your quanta hypothesis. Maybe we can plan to catch up in a
>>> few weeks, hopefully I'll have some ideas to chat about.
>>>
>>> Congrats again on the new job! I'm sure it'll be fun!
>>>
>>> Andy
>>>
>>> On Tue, Jan 13, 2026 at 4:50 PM Eric Michaud wrote:
>>>
>>>> Hey Andy,
>>>>
>>>> For the moment, I've accepted a job with Adam Shai at the Astera
>>>> Institute / Simplex, so expect to have some freedom for miscellaneous
>>>> research projects. I'm not sure yet what I'll be wanting to prioritize, but
>>>> happy to stay in touch about any ideas.
>>>>
>>>> Also, I've written up a big blog post reflecting on the quantization
>>>> model paper and its relationship to interp, which you might enjoy:
>>>> ericjmichaud.com/quanta
>>>>
>>>> Eric
>>>>
>>>>
>>>> On Wed, Dec 10, 2025 at 2:32 PM, Eric Michaud wrote:
>>>>
>>>>> Depending on where I'm working next, collaborating a bit could be
>>>>> sweet. Let's keep each other posted :)
>>>>>
>>>>> Ah, I had forgotten the rainbow serpent paper's methodology; it indeed
>>>>> does seem related, but I agree one can do much more here.
>>>>>
>>>>> Eric
>>>>>
>>>>>
>>>>> On Wed, Dec 10, 2025 at 2:02 PM, Andy Arditi wrote:
>>>>>
>>>>>> It was great meeting you as well!!
>>>>>>
>>>>>> Thanks for writing and sharing these notes. I'm definitely interested
>>>>>> in thinking about this direction more - would be great to stay in touch and
>>>>>> potentially collaborate on something (David's also interested in these
>>>>>> ideas; he's been dreaming about unsupervised methods for mechanism
>>>>>> discovery).
>>>>>>
>>>>>> Re using the loss kernel for quanta discovery: check out the "rainbow
>>>>>> serpent" paper if you haven't already; it's similar to idea 2, although
>>>>>> I think there's still a lot of stuff to push on there.
>>>>>> They work with a 2-layer attention-only
>>>>>> transformer, and have S=16 perturbed networks (each perturbed network
>>>>>> corresponds to just knocking out a single attention head, iiuc), and then
>>>>>> visualize these 16-dimensional vectors via UMAP; even with this simple
>>>>>> method they find some interesting structures, including an induction
>>>>>> cluster.
>>>>>>
>>>>>> I'll probably do some thinking around this stuff over the next month
>>>>>> - will keep you posted!
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 4:54 PM Eric Michaud wrote:
>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> Was really great meeting and chatting last week. I had a couple of
>>>>>>> ideas I wanted to send your way. I forget if we talked about these exact
>>>>>>> ideas during our chat. If not, they are certainly closely related.
>>>>>>>
>>>>>>> *1. The Hessian eigenspectrum may be of interest.*
>>>>>>>
>>>>>>> We might be able to measure the "quanta" distribution from the
>>>>>>> Hessian eigenspectrum. Let's assume that the overall language modeling loss
>>>>>>> decomposes into subtasks:
>>>>>>> L = \sum_i p_i E_{x ~ subtask_i} L(x)
>>>>>>> Let's assume that on each subtask, the Hessian eigenvalues are
>>>>>>> almost all zero. This makes sense since each subtask is simple, and can be
>>>>>>> solved with some solution which has low complexity. For the sake of
>>>>>>> argument, let's say that the Hessian on subtask i is H_i = v_i v_i^T, a
>>>>>>> rank-1 matrix. Let's assume that the v_i are orthogonal for different
>>>>>>> subtasks and have unit norm. In this setup, the Hessian eigenspectrum is
>>>>>>> exactly the subtask distribution p_i: since differentiation is linear and
>>>>>>> the overall loss is a sum of per-subtask losses, the overall Hessian is
>>>>>>> H = \sum_i p_i v_i v_i^T, whose nonzero eigenvalues are exactly the p_i.
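This rank-1 toy setup is easy to sanity-check numerically. A minimal numpy sketch, where the weights p_i and the orthonormal directions v_i are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n_sub subtasks in a d-dimensional parameter space.
n_sub, d = 5, 50
p = np.sort(rng.dirichlet(np.ones(n_sub)))[::-1]  # subtask weights p_i

# Orthonormal v_i as the columns of V (via QR of a random matrix).
V, _ = np.linalg.qr(rng.normal(size=(d, n_sub)))

# Per-subtask rank-1 Hessians H_i = v_i v_i^T; overall H = sum_i p_i H_i.
H = sum(p[i] * np.outer(V[:, i], V[:, i]) for i in range(n_sub))

# The top eigenvalues of H recover the subtask distribution p_i exactly;
# the remaining d - n_sub eigenvalues are zero.
eigs = np.linalg.eigvalsh(H)[::-1]
print("top eigenvalues:", np.round(eigs[:n_sub], 4))
print("subtask weights:", np.round(p, 4))
```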
>>>>>>>
>>>>>>> While in practice the per-subtask/"quanta" Hessians won't be rank-1
>>>>>>> and won't have exactly orthogonal high-curvature subspaces, in the
>>>>>>> aggregate I'd still expect the Hessian eigenvalues to follow the p_i
>>>>>>> distribution, at least for the top eigenvalues. A paper from this year
>>>>>>> measured the Hessian eigenspectrum in GPT-2 and also in vision models, and
>>>>>>> found power laws: https://openreview.net/pdf?id=o62ZzfCEwZ (see esp.
>>>>>>> Figure 12b).
>>>>>>>
>>>>>>> I wonder what a more thorough and scaled-up analysis of the Hessian
>>>>>>> eigenvalues in LLMs would yield. Doing this would be one way of empirically
>>>>>>> getting at the quanta hypothesis, and one could write a paper with some
>>>>>>> theory and toy experiments justifying the analysis.
>>>>>>>
>>>>>>> *2. The SLT "loss kernel" may enable much more efficient quanta
>>>>>>> discovery*
>>>>>>>
>>>>>>> The "loss kernel" (https://arxiv.org/abs/2509.26537) may actually
>>>>>>> be much more tractable to scale than our quanta discovery method using
>>>>>>> model gradient similarity. The issue with using gradient similarity is that
>>>>>>> one has to compute forward and backward passes separately on each
>>>>>>> token/sample one wants to cluster over. The situation is even worse than
>>>>>>> this, though, since in practice it is impossible to store the gradients for
>>>>>>> all such tokens/samples simultaneously, so one has to re-compute the
>>>>>>> gradients for each sample multiple times as one computes the similarity
>>>>>>> matrix block by block.
>>>>>>>
>>>>>>> However, Jesse et al.'s method avoids this, and I think is extremely
>>>>>>> amenable to batched computation. For each noised parameter vector w_k in
>>>>>>> the basin, one can just do forward passes across the whole corpus one cares
>>>>>>> about, and store the per-token losses across the corpus.
>>>>>>> If one does this
>>>>>>> for S steps of SGLD, and there are D tokens in the corpus one wants to
>>>>>>> analyze, then one gets an (S, D) matrix "L". One can compute the pairwise
>>>>>>> covariances by mean-centering the columns of "L" and then computing
>>>>>>> C \propto L^T L. Hopefully this matrix converges for relatively small
>>>>>>> S << N, the number of network parameters. Furthermore, one can do
>>>>>>> interesting things with the matrix L without computing the full pairwise
>>>>>>> similarities, like sparse dictionary learning or other sorts of matrix
>>>>>>> factorizations, or clustering the columns of L with k-means.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eric
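The batched covariance computation described above can be sketched in a few lines of numpy. Everything here is a toy stand-in: the (S, D) loss matrix is fabricated from two synthetic "mechanisms" rather than produced by actual SGLD samples and forward passes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-token losses under S noised parameter vectors: in the
# real method, row k holds the per-token losses from forward passes at
# SGLD sample w_k. Here we fabricate an (S, D) matrix where two groups
# of tokens have co-varying losses, mimicking two shared "quanta".
S, D = 64, 10
base = rng.normal(size=(S, 2))        # two latent mechanisms
assign = np.array([0] * 5 + [1] * 5)  # token -> mechanism assignment
L = base[:, assign] + 0.1 * rng.normal(size=(S, D))

# Mean-center each column (token), then C \propto L^T L gives the
# pairwise covariances of per-token losses across the sampled basin.
Lc = L - L.mean(axis=0, keepdims=True)
C = Lc.T @ Lc / (S - 1)

# Tokens driven by the same mechanism should co-vary strongly; tokens
# from different mechanisms should have near-zero covariance.
print("same-mechanism cov:", round(C[0, 1], 2))
print("cross-mechanism cov:", round(C[0, 7], 2))
```

One could then run k-means (or a sparse matrix factorization) on the columns of L directly, without ever materializing the full D x D covariance matrix.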