MIME-Version: 1.0
Date: Thu, 29 Jan 2026 09:33:41 -0500
Subject: Re: Quanta discovery and Hessian eigenspectrum
From: Andy Arditi
To: Eric Michaud

Thanks for sharing. Seems great that they're trying to scale the susceptibilities stuff.

I can't say I understand all the theory fully yet. But the idea in 2.3 of doing an SVD on q(y|x), where q is the true distribution of language, is nice; I don't think I've seen that view made so explicit. But I'm not really sure these "modes" of the true data distribution are what we want? (E.g., I think we're more interested in the mechanisms that the model learned / the structure that the model identified, rather than some "true" underlying structure in the data. This reminds me of the epiplexity idea - a computationally weaker model might learn more interesting structure than a computationally stronger model; e.g., a neural net chess bot must learn patterns/structure while a brute-force tree-search bot does not. It is the neural net that learns non-trivial patterns and insights, and those are the things we're interested in uncovering.)

I'm also not really clear about the relation between "modes"/"patterns" of the data and mechanisms in the model. For example, would mechanisms such as binding of entities be considered modes? AFAIK SAEs also don't capture binding mechanisms. But binding mechanisms, for example, would definitely be considered a "quanta" in your framework (I think? It's a ~discrete mechanism that the model learns, and it helps it solve a bunch of next-token predictions). I'm curious what you think of their discussion of modes vs quanta.
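(To make that 2.3 picture concrete for myself, here's a minimal numpy sketch of taking an SVD of a conditional distribution q(y|x). The tiny matrix below is a made-up stand-in for the true distribution of language, purely for illustration.)

```python
import numpy as np

# Toy conditional distribution q(y|x): 4 contexts x, 3 next-tokens y.
# Row i is the distribution over y given context i (rows sum to 1).
Q = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# SVD of the conditional-distribution matrix: each right singular vector
# is a direction in next-token space, each left singular vector a
# direction in context space, and the singular values rank these "modes"
# of the data distribution by how much of q they explain.
U, s, Vt = np.linalg.svd(Q, full_matrices=False)
print("singular values:", np.round(s, 3))

# Rank-1 reconstruction from the top mode: captures the dominant
# structure (contexts 0 and 1 mostly predicting token 0).
Q1 = s[0] * np.outer(U[:, 0], Vt[0])
print("rank-1 reconstruction error:", np.round(np.linalg.norm(Q - Q1), 3))
```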
I personally feel pretty confused about the concepts of "modes" vs "quanta" vs "SAE features" (or rather, I feel like the paper doesn't give adequate clarification on the difference between these ideas). To me, all three seem quite distinct (mode = pattern in the data distribution; quanta = unit of computation/skill learned by the model; SAE feature = unit of representation learned by the model); but in the paper, they seem to imply that all three are sort of the same? (Sure, they are certainly all related; but they seem like different things to me.)

(I need to study their theory more; I reserve the right to erase all the above messy ramblings lol.)

Andy

On Sat, Jan 24, 2026 at 7:40 PM Eric Michaud wrote:

> Just started reading but seems good so far:
> https://arxiv.org/abs/2601.12703
>
>
> On Thu, Jan 15, 2026 at 7:09 PM, Eric Michaud wrote:
>
>> Glad you liked the post! You're a crazy person for reading the whole
>> thing in that much detail. It's long. I'm surprised by how viral it's gone
>> so far.
>>
>> Sounds like a good plan!
>>
>>
>> On Wed, Jan 14, 2026 at 11:01 AM, Andy Arditi wrote:
>>
>>> Hey Eric,
>>>
>>> Congrats!! Adam and Paul are great (I know Paul from MATS 5.0, and I've
>>> met Adam a couple times around Berkeley); I don't know much about Astera,
>>> but it seems like there are a lot of great minds there. Seems like a great
>>> setup for you!
>>>
>>> I also just finished reading your whole post; loved it, thanks for
>>> sharing :). (Aside from the content, I love footnote 2, and also the 100
>>> page pdf :P.) I took a bit of a break over the past month, but am back to
>>> thinking about the loss landscape <> mechanisms stuff, and potentially
>>> building off of your quanta hypothesis. Maybe we can plan to catch up in a
>>> few weeks, hopefully I'll have some ideas to chat about.
>>>
>>> Congrats again on the new job! I'm sure it'll be fun!
>>>
>>> Andy
>>>
>>> On Tue, Jan 13, 2026 at 4:50 PM Eric Michaud wrote:
>>>
>>>> Hey Andy,
>>>>
>>>> For the moment, I've accepted a job with Adam Shai at the Astera
>>>> Institute / Simplex, so expect to have some freedom for miscellaneous
>>>> research projects. I'm not sure yet what I'll be wanting to prioritize, but
>>>> happy to stay in touch about any ideas.
>>>>
>>>> Also, I've written up a big blog post reflecting on the quantization
>>>> model paper and its relationship to interp, which you might enjoy:
>>>> ericjmichaud.com/quanta
>>>>
>>>> Eric
>>>>
>>>>
>>>> On Wed, Dec 10, 2025 at 2:32 PM, Eric Michaud wrote:
>>>>
>>>>> Depending on where I'm working next, collaborating a bit could be
>>>>> sweet. Let's keep each other posted :)
>>>>>
>>>>> Ah, I had forgotten the rainbow serpent paper's methodology; it indeed
>>>>> does seem related, but I agree one can do much more here.
>>>>>
>>>>> Eric
>>>>>
>>>>>
>>>>> On Wed, Dec 10, 2025 at 2:02 PM, Andy Arditi wrote:
>>>>>
>>>>>> It was great meeting you as well!!
>>>>>>
>>>>>> Thanks for writing and sharing these notes. I'm definitely interested
>>>>>> in thinking about this direction more - would be great to stay in touch and
>>>>>> potentially collaborate on something (David's also interested in these
>>>>>> ideas; he's been dreaming about unsupervised methods for mechanism
>>>>>> discovery).
>>>>>>
>>>>>> Re using the loss kernel for quanta discovery: check out the "rainbow
>>>>>> serpent" paper if you haven't already; it's similar to idea 2, although
>>>>>> I think there's still a lot of stuff to push on there.
>>>>>> They work with a 2-layer attention-only
>>>>>> transformer, and have S=16 perturbed networks (each perturbed network
>>>>>> corresponds to just knocking out a single attention head, iiuc), and then
>>>>>> visualize these 16-dimensional vectors via UMAP; even with this simple
>>>>>> method they find some interesting structures, including an induction
>>>>>> cluster.
>>>>>>
>>>>>> I'll probably do some thinking around this stuff over the next month
>>>>>> - will keep you posted!
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 4:54 PM Eric Michaud wrote:
>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> Was really great meeting and chatting last week. I had a couple of
>>>>>>> ideas I wanted to send your way. I forget if we talked about these exact
>>>>>>> ideas during our chat. If not, they are certainly closely related.
>>>>>>>
>>>>>>> *1. The Hessian eigenspectrum may be of interest.*
>>>>>>>
>>>>>>> We might be able to measure the "quanta" distribution from the
>>>>>>> Hessian eigenspectrum. Let's assume that the overall language modeling loss
>>>>>>> decomposes into subtasks:
>>>>>>> L = \sum_i p_i E_{x ~ subtask_i} L(x)
>>>>>>> Let's assume that on each subtask, the Hessian eigenvalues are
>>>>>>> almost all zero. This makes sense since each subtask is simple, and can be
>>>>>>> solved with some solution which has low complexity. For the sake of
>>>>>>> argument, let's say that the Hessian on subtask i is H_i = v_i v_i^T, a
>>>>>>> rank-1 matrix. Let's assume that the v_i are orthogonal for different
>>>>>>> subtasks and have unit norm. In this setup, the Hessian eigenspectrum is
>>>>>>> exactly the subtask distribution p_i: since differentiation is linear and
>>>>>>> the overall loss is a sum of per-subtask losses, the overall Hessian is
>>>>>>> H = \sum_i p_i v_i v_i^T, whose nonzero eigenvalues are exactly the p_i.
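This rank-1 toy setup is easy to sanity-check numerically. A minimal numpy sketch, where the weights p_i and the orthonormal directions v_i are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n_sub subtasks in a d-dimensional parameter space.
n_sub, d = 5, 50
p = np.sort(rng.dirichlet(np.ones(n_sub)))[::-1]  # subtask weights p_i

# Orthonormal v_i as the columns of V (via QR of a random matrix).
V, _ = np.linalg.qr(rng.normal(size=(d, n_sub)))

# Per-subtask rank-1 Hessians H_i = v_i v_i^T; overall H = sum_i p_i H_i.
H = sum(p[i] * np.outer(V[:, i], V[:, i]) for i in range(n_sub))

# The top eigenvalues of H recover the subtask distribution p_i exactly;
# the remaining d - n_sub eigenvalues are zero.
eigs = np.linalg.eigvalsh(H)[::-1]
print("top eigenvalues:", np.round(eigs[:n_sub], 4))
print("subtask weights:", np.round(p, 4))
```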
>>>>>>>
>>>>>>> While in practice the per-subtask/"quanta" Hessians won't be rank-1
>>>>>>> and won't have exactly orthogonal high-curvature subspaces, in the
>>>>>>> aggregate I'd still expect the Hessian eigenvalues to follow the p_i
>>>>>>> distribution, at least for the top eigenvalues. A paper from this year
>>>>>>> measured the Hessian eigenspectrum in GPT-2 and also in vision models, and
>>>>>>> found power laws: https://openreview.net/pdf?id=o62ZzfCEwZ (see esp.
>>>>>>> Figure 12b).
>>>>>>>
>>>>>>> I wonder what a more thorough and scaled-up analysis of the Hessian
>>>>>>> eigenvalues in LLMs would yield. Doing this would be one way of empirically
>>>>>>> getting at the quanta hypothesis, and one could write a paper with some
>>>>>>> theory and toy experiments justifying the analysis.
>>>>>>>
>>>>>>> *2. The SLT "loss kernel" may enable much more efficient quanta
>>>>>>> discovery*
>>>>>>>
>>>>>>> The "loss kernel" (https://arxiv.org/abs/2509.26537) may actually
>>>>>>> be much more tractable to scale than our quanta discovery method using
>>>>>>> model gradient similarity. The issue with using gradient similarity is that
>>>>>>> one has to compute forward and backward passes separately on each
>>>>>>> token/sample one wants to cluster over. The situation is even worse than
>>>>>>> this, though, since in practice it is impossible to store the gradients for
>>>>>>> all such tokens/samples simultaneously, so one has to re-compute the
>>>>>>> gradients for each sample multiple times as one computes the similarity
>>>>>>> matrix block by block.
>>>>>>>
>>>>>>> However, Jesse et al.'s method avoids this, and I think is extremely
>>>>>>> amenable to batched computation. For each noised parameter vector w_k in
>>>>>>> the basin, one can just do forward passes across the whole corpus one cares
>>>>>>> about, and store the per-token losses across the corpus.
>>>>>>> If one does this
>>>>>>> for S steps of SGLD, and there are D tokens in the corpus one wants to
>>>>>>> analyze, then one gets an (S, D) matrix "L". One can compute the pairwise
>>>>>>> covariances by mean-centering the columns of "L" and then computing
>>>>>>> C \propto L^T L. Hopefully this matrix converges for relatively small
>>>>>>> S << N, the number of network parameters. Furthermore, one can do
>>>>>>> interesting things with the matrix L without computing the full pairwise
>>>>>>> similarities, like sparse dictionary learning or other sorts of matrix
>>>>>>> factorizations, or clustering the columns of L with k-means.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eric
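The batched covariance computation described above can be sketched in a few lines of numpy. Everything here is a toy stand-in: the (S, D) loss matrix is fabricated from two synthetic "mechanisms" rather than produced by actual SGLD samples and forward passes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-token losses under S noised parameter vectors: in the
# real method, row k holds the per-token losses from forward passes at
# SGLD sample w_k. Here we fabricate an (S, D) matrix where two groups
# of tokens have co-varying losses, mimicking two shared "quanta".
S, D = 64, 10
base = rng.normal(size=(S, 2))        # two latent mechanisms
assign = np.array([0] * 5 + [1] * 5)  # token -> mechanism assignment
L = base[:, assign] + 0.1 * rng.normal(size=(S, D))

# Mean-center each column (token), then C \propto L^T L gives the
# pairwise covariances of per-token losses across the sampled basin.
Lc = L - L.mean(axis=0, keepdims=True)
C = Lc.T @ Lc / (S - 1)

# Tokens driven by the same mechanism should co-vary strongly; tokens
# from different mechanisms should have near-zero covariance.
print("same-mechanism cov:", round(C[0, 1], 2))
print("cross-mechanism cov:", round(C[0, 7], 2))
```

One could then run k-means (or a sparse matrix factorization) on the columns of L directly, without ever materializing the full D x D covariance matrix.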