Want to take part in these discussions? Sign in if you have an account, or apply for one below
Vanilla 1.1.10 is a product of Lussumo. More Information: Documentation, Community Support.
1 to 24 of 24
At last it has happened - Huawei Research collaboration is starting to publish their groundbreaking work on the formalization and theory of deep neural networks and deep learning. https://arxiv.org/abs/2106.14587v1 is the first published preprint (more than 100 pages already) and it references to-be-published preprints and also whole book of papers.
I can handle the theory as a beginner, but this preprint is full of musings and is not formatted in the form of theorems, lemmas and proofs. etc.
So - can we have some discussions about it? Some elaboration? Reformulation?
I have asked the first question in Stackexchange https://math.stackexchange.com/questions/4189699/topos-and-stacks-of-neural-networks-how-to-understand-sentence-values-in-t
But there will be more and more such questions. I hope that preprint will have next versions and I hope to read especailly forthcoming reference: [BBG20] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical informa-tion cells. Internal technical report, Huawei, 2020.
But DNN theory (especially used for guidance in the selection of DNN architecture and discovery of deep learning algroithms) is so hot topic, that it is impossible to wait.
Actually - this is the moment that Gates and Allen experienced some 40 years ago - trillion$ business opportunity. And huge step for civilization as well, not just in Science.
I have scanned through the article arXiv:2106.14587 now.
The proposed translation from neural networks to toposes seems to happen in section 3.2 (p. 7), where the “possible activities of the population of neurons” arranged in a graph is claimed to be encoded by a (co-)presheaf of sets on (the category generated by) .
This statement is vaguely plausible, but rather unspecific. Where in the article is this perspective put to use for DNNs?
Probably Section 3.4, where back-propagation is meant to be interpreted topos-theoretically. But is it? Equations (2) - (11) recall basics formulas of the standard theory (though it remains unclear what their import is here). Surprisingly, after (11) we see a Theorem 1 on p. 11 suddenly state that “Backpropagation is a flow of natural transformations”. There seems to be no connect between these words and what happened before (or after) this statement.
Yet, this Theorem 1 appears to be the last attempt in the article to relate toposes to DNNs. What follows is a kind of review of a plethora of concepts from category theory and homotopy theory.
For example, on p. 91 we suddenly read “The map ” is an example of a derivator and that: “A derivator generalizes the passage from a category to its topos of presheaves”. But the latter is not a characterization of the concept of derivators, and the former remains unspecified: There is no “map ” around p. 91. Probably this refers way back to section 5, where denoted model categories of -valued presheaves on .
A few pages later the article ends with “The characterization of fibrant and cofibrant objects in was the main result of section 5” (looking through section 5, this claim seems to refer to Theorem 4 (p. 32) which mentions ’type theory structure’ and claims to prove that something is a fibration, essentially by saying that this has been argued elsewhere.)
The last sentence of the article then is:
How to use this fact for functioning networks?
That, indeed, seems to be the general open question here.
NB: There is now a v2 of arXiv:2106.14587, and this comment has been edited to reflect the new pagination and numbering in brackets, while retaining the old ones for the convenience of those who have the older version.
Re. #1:
their groundbreaking work on the formalization and theory of deep neural networks and deep learning
There’s been plenty of work done in this area1 since 2017, when Fong et al. (“FST”) posted Backprop as Functor 2 on the arXiv. Earlier attempts at applying category theory to the study of neural networks go back at least to 1999.3
What the authors (“BB”) claim to have done in arXiv:2106.14587 (“the BB preprint”) is to apply topos theory to the study of equivariance and invariance in deep neural networks (DNNs).
this preprint is full of musings and is not formatted in the form of theorems, lemmas and proofs. etc. […C]an we have some discussions about it? Some elaboration? Reformulation?
The videos of the talks given by the two authors4 at Toposes Online are now available, so you can watch them elaborate on their own work. I’ll just give some motivating remarks and comment on what I think BB are claiming in this preprint.
DISCLAIMER: The following is only my imperfect understanding of the claims in the preprint. Whether or not anything claimed in the preprint is true is outside of the scope of this comment.
A major problem with DNNs now is that our ability to reason about them is inadequate. One possible solution is to see if we can get a topos out of a suitably defined category of DNNs. Since a topos is a locally Cartesian closed category, this allows us to use Martin-Löf dependent type theory and higher-order logic.
The structure of a DNN can be modelled as a directed graph, possibly with cycles, that has labelled edges and vertices. The edge labels are usually called weights, while the vertex labels are called biases by FST and activities in the BB preprint. Graphs are not directly considered in the FST paper, because I think they wanted a more prosaic approach and people didn’t seem to know at the time whether a category of DNNs even makes sense. The point of the FST paper was to show that a category of DNNs exist, but it also pointed out that backprop is needed for composition in the category to work.
The BB preprint comes after all that, and its focus is on analysing the properties of a specific DNN, so a graph is fixed from the start. It also considers functor categories (more precisely, (pre-)sheaves) directly, so it models the edge and vertex labels as functors, unlike the FST approach.5 Since backprop is now modelled as a morphism of functors, rather than a morphism of objects, it is thus a natural transformation rather than a functor.
Note that since BB considered the simplest case of a “chain” (a directed graph without branches) in section 3.2 (v2: 1.2), they needed to deal with branching in section 3.3 (v2: 1.3), something which FST did not have to do. This is done in the BB preprint in a way which I don’t quite understand: the reasoning is given in a “Remarks” paragraph on p. 14 (v2: 16) in section 3.5 (v2: 1.5), and seems to involve the avoidance of some difficulty with constructing a covering.
In any case, section 3.3 (v2: 1.3) also claims to show how the sheaves associated with a DNN can be endowed with a Grothendieck topology, and thus become toposes. For the FST approach, it has only been in the past year or so that preprints have emerged that claim to have constructed a topos for DNNs based on the FST construction.6
Since BB are also interested in the invariance structure of certain DNN architectures, they move on to consider the classifying topos construction, stacks over the DNN topos and model categories. They also claim to have adapted a homological method, which was developed by the second author and Baudot7 to study entropy, to this setting in section 5.4 (v2: 3.4), and to have developed a preliminary theory of “semantic information” in an attempt to explain the efficacy of some DNN architectures in handling natural language processing tasks.
See arXiv:2106.07032 for a review by Shiebler et al., especially section 2. ↩
Michael J Healy, in a paper at IJCNN’99, used category theory as a tool to model concept formation and recall in neural networks. ↩
Links to the first author’s talk and the second author’s talk. ↩
An updated FST approach is given in arXiv:2103.01931, which constructs a bicategory of DNNs. ↩
This work is mostly due to Spivak, see arXiv:2003.04827, arXiv:2005.0189 and arXiv:2103.01189 ↩
Pierre Baudot and Daniel Bennequin, The homological nature of entropy, Entropy 2015, 17, 3253-3318; doi:10.3390/e17053253 ↩
re #3
One possible solution is to see if we can get a topos out of a suitably defined category of DNNs. If we can get a Grothendieck topos out of it, that would be even better, because every such topos is a locally Cartesian closed category (see Proposition 3.1 of this nLab entry), which supports Martin-Löf dependent type theory and higher-order logic.
Huh? I thought any topos is locally Cartesian closed because all of it’s slice categories are toposes - see over-topos. Some other categories may be LCC without being a topos.
Re. #4
I thought any topos is locally Cartesian closed because all of it’s slice categories are toposes - see over-topos. Some other categories may be LCC without being a topos.
Fixed, thank you!
A quick check with the nLab tells me that any quasi-topos is LCC. I think the authors of that preprint are actually working with quasi-toposes in some places, since posets and Heyting algebras also make an appearance.
Has some managed to decipher at least part of the BB preprint and speaches?
E.g. I am stuck with sentence (that precedes formula (5.8) directly in v2 of BB: Natural morphisms between objects (F, A) and (F ′, A′) of AC are defined by a family of functors FU : FU → F ′U, such that for any morphism α : U → U′ in C
So, we have “class of presheaves” AC = Grp∧C (actually, it is category or presheaves, presheave is the functor from C to Set category). So, pre-sheave as functor is the object of A_C. But what this pre-sheave functor denotes? I understand that this pre-sheave is the functor from the category of C (category of graphs, category of architectures/connectivity among neurons) towards category of vectors of weights/biases (I am very uneasey here - there is no category of vectors…)… Maybe one can understand this presheave as the collection of assignments <graph, weights/biases>.
Antoher question is - what distinguishes those pre-sheaves from one antoher? In what manner presheave-functor italic-F_U is different from italic-F_U’? Maybe those presheaves are defined only on some parts of the graph, on some layers? E.g. presheave is collection of assignment of weights/biases to the sub-network?
I can try to understand why we consider the collection of w-b assignments… Maybe different learning algorithms give different assignments, or different data give different assignments…
And, further,… we have decided that presheaves are denoted by italic-F_U and italic-F_U’ for different parts (layers) U and U’ of networks. By why the authors speak about objects (italic-F, A) and (italic-F, A’)? A may be assignment of w-b? Or A may be the group element that corresponds to the actual assignment of w-b? Italic-F as pre-sheaf already contains the assignment of w-b, so, the addition of A means the addition of the interpretation of numerical w-b assignment into the elements of group (or subgroup, or factor-class of group, or another collection of group elements).
As I read this article (and I am reading it a lot and going over video presentations) and as I read the other materials about co-homology of information, toposes, sheaves, I am fighting with trying to make sense what each letter, each denotation, each formula means in this article. It is so hard to understand and check it.
It would be nice it we could be the glossary of terms, letters, denotations used in this article with explicit explanation for each letter (as it is used in several contexts).
It would be great to hear any more thoughts and insights into this. Thanks for sharing!
Re #6:
I am stuck with sentence (that precedes formula (5.8) directly in v2 of BB: Natural morphisms between objects (, ) and (, ) of are defined by a family of functors , such that for any morphism in .
This bit (from the beginning of section 5.2 up to the paragraph preceding the Remark after equation (5.9)) is a summary of the material discussed from the beginning of section 2.2 (v1: 4.2) up to equation (2.6). As indicated in the first paragraph of section 5.2 (v1: 7.2), the reader is referred to sections 3 and 4 of v1(!) (these are sections 1 and 2 of v2) for details.
So, we have “class of presheaves” (actually, it is category or presheaves, presheave is the functor from to Set category). So, pre-sheave as functor is the object of . But what this pre-sheave functor denotes? I understand that this pre-sheave is the functor from the category of (category of graphs, category of architectures/connectivity among neurons) towards category of vectors of weights/biases (I am very uneasey here - there is no category of vectors…)…
They are talking about (pre-)stacks, which are (pre-)sheaves of categories. Section 2.1 discusses their proposal for studying invariance in DNNs, using stacks and their construction of a topos for DNNs, while section 2.2 goes over some theory about stacks. As noted just before section 2.3, the theory in sections 2.2 is treated in greater detail in arXiv:2107.04417, the new preprint by Caramello and Zanfa.
Lets stick to the basics. There is sentence “category freely generated by the graph” in page 5-9 (at the start of Ch 1.2) of BBv2. What does this phrase mean? Is https://en.wikipedia.org/wiki/Free_category the mentioned freely generated category? Do I understand correctly, that is not just collection of graphs (i.e. each object is some graph and each morphism is some transformation between objects/graphs), but that is collection of objects and morphisms that is somehow generated (details in referenced wiki) by some fixed graph (which can be space-time expansion of the graph of some fixed RNN as well)?
p.s. I have read some 80% of the preprint (last portions of the last 2 chapters pending), but now I will go through entire preprint to stick with and clarify every notion (of course - I have lot of category theory books to read about fibred categories, Grothendieck constructions, sheaves myself, so, no questions about them). I hope that we can go together and create a guide for others during our walk. Thanks in advance!
There is sentence “category freely generated by the graph” in page 5-9 (at the start of Ch 1.2) of BBv2. What does this phrase mean? Is https://en.wikipedia.org/wiki/Free_category the mentioned freely generated category?
Yes, see also the nLab page for free category.
is not just collection of graphs (i.e. each object is some graph and each morphism is some transformation between objects/graphs)
is not even that. The objects of are the vertices of , not graphs. The morphisms between two objects are paths that exist on between them.
Let’s stick to the basics.
I agree, which is why reading the FST paper, Backprop as Functor (arXiv:1711.10455, LICS’19, with the arXiv version being the complete version), is strongly recommended for understanding the BB preprint. That paper was published in LICS’19, so it was written specifically for an audience of computer scientists and software engineers who’re not all experts in category theory, unlike the BB preprint. It builds up a category of neural networks (called “learners” there) from the very basics, so you get a pretty good understanding of what it takes to get such a category. That definitely helps with understanding BB, particularly section 1.2.
Hi Rongmin,
thanks for bringing in expertise. Whatever solid statements there are to be made about category theoretic notions brought to bear on neural networks, please feel invited to record them in our entry neural network. Incrementally recording pertinent definitions/theorems/proofs there might also help the discussion here.
HI Urs,
Thank you for pointing out that entry. I was actually wondering where on the nLab I could do that.
I have managed to understood encodings of activations and weights. Now I am reading last paragraph of p5-9 of BBv2 and first full paragraph of p6-10 of BBv2.
There are couple of uncertainties. First - as I understand, then "singleton in " is the same thing as (terminal) "object in " and it is the same terminal object that enters into the fibration/classification square as demonstrated in p3 and p4 of https://arxiv.org/abs/1012.5647v3 - for providing and example how classifying object can be seen as the the Boolean values (true, false) and how fibration be seen as the pre-image of the characteristic function. Why they are using and for the same notion? And about semantics - such fibration square essentially is very bold assertion, which essentially says: specific set of weights (as encoded by specific functor ) can emerge from different set of activations (as encoded by specific functor or set of specific functors). Well - I am not sure - whether different set activations can indeed correspond to the same set of weights or is it otherwise. I have never though about such relationship and categorical formalisation puts pressure on use to take one side…
The next paragraph is the hardest one. What is ? Usually this symbol is used as the classifying object in the fibration. It can be true/false set in logic examples from https://arxiv.org/abs/1012.5647v3 or it can be set of specific -functors from the preceding paragraph (e.g. specific set of weights classifies all the possible sets of activations). But then I don’t understand how (as specific functor - they are using this letter both for the specific functor and for the category ((sub-)topos as part of ) of functors) can be made the container of subobjects. is constant-weight assignment to all the weights and what can be subobject of such assignment? Assignment for only part of the graph? Maybe is the denotation for the set of weights for the subgraph of ? Maybe it is so. By why they are speaking about logic at this moment? Upon the first reading of the paper I had the impression that is somehow denoting the emergent group or semantics from the weights or activations. I.e. maybe the activations or weights of the trained neural network admit the group of transformations that transform some subset of weights/activations into another subset (or something like this) and this set of transformation can be some specific group and this group may be THE semantics of the trained neural network? But they are still speaking about weights/activations only in this paragraph.
Reference "[Pro19] Alain Prout´e. Introduction ‘a la logique cat´egorique. MSc Course, Universit´e Paris Diderot, 2019." seems to be very interesting and accessible, unfortunately, it is in French and I am newbie in French. Maybe there is translation or alternative for that?
And of course - what about (hugely waited for preprints): [BBG20] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical information cells, Part I. Internal technical report, Huawei, 2020. [BBG21] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical information cells, Part II. Internal technical report, Huawei, 2021. I have checked Arxiv and they still are not there. Maybe some researchers have access to those preprints that are circulated privately (e.g. with the signed/stated non-disclosure agreement). If so, maybe someone can share them with me? I guess, that many miunderstanding can be solved by those examples.
Re. #12:
“singleton in ” is the same thing as (terminal) “object in ”
Why they are using and for the same notion?
This is because, as Leinster explains in arXiv:1012.5647, “[t]here is no virtue in distinguishing between one-element sets, so we might as well write instead of” choosing a singleton {}.
Note that what BB have written is that a singleton in is the same thing as “a morphism in from the final object ” “to the object ”, i.e. a morphism . This is a consequence of the Yoneda lemma: in the nLab entry I just linked to, see the section “Corollary I: Yoneda embedding” and its interpretation.
The Yoneda lemma is fundamental to understanding BB: Leinster expects readers to be comfortable with it, as he’s stated in p. 1 of his notes, but BB assume you know it thoroughly, and they use it all the time without comment.
it is the same terminal object that enters into the fibration/classification square
It’s known as a pullback square.
such fibration square essentially is very bold assertion, which essentially says: specific set of weights (as encoded by specific functor ) can emerge from different set of activations (as encoded by specific functor or set of specific functors).
It would be bold indeed if they’ve actually asserted that, but all they’re saying is that the system of weights of each edge of their graph , which they write as the functor , corresponds to the fibre of the projection . At this stage in the preprint, is just a chain with no branches, so there is only one weight on each edge.
What is ?
This is the subobject classifier of : see the nLab entry on the category Quiv of quivers (or directed graphs) for more details on how looks like.
Maybe is the denotation for the set of weights for the subgraph of ?
As explained in the nLab entry on Quiv, is the set of edges of , which can be thought of as the set of generic edges of . Note that BB throws away self-loops, but this doesn’t affect the shape of .
why they are speaking about logic at this moment?
Good question. I think what they’re saying is that a neural network builds up a proposition in the language of the topos, and that this proposition would change as you add more layers to it.
And of course - what about (hugely waited for preprints)
I have checked Arxiv and they still are not there.
Authors routinely announce forthcoming papers that may take a long time to arrive. They will be there when the authors feel that they’re ready to be there. It’s better to have a finished product, rather than a rough draft.
many misunderstandings can be solved by those examples.
Those preprints are unlikely to contain examples that would be enlightening to a beginner. They’re meant to be reporting on the results of the experiments that the authors claimed they’ve done, and they’re likely to be making use of the theory described in this preprint to interpret those results.
Hi Bruno, Matteo and any other ACT people following this,
I’m sorry if this thread hasn’t been very stimulating, but I’ve been trying to cater to OP’s level. Please feel free to chip in with any comments or questions that may steer this thread in a more interesting direction. Cheers.
Pleasant news: the first of the promised preprints “Logical information cells. Part 1” has been published by Arxiv this morning https://arxiv.org/abs/2108.04751 .
As expected and warned, it does not clarify the main BB article. Actually - its essence can be summarized as the modelling/simulation research of one kind of neural coding. Neural coding is wide field and it ranges from the attempts to construct minimal neural codes for the logical operations in the setting of hybrid neural-symbolic computing (d’Avila Garcez https://arxiv.org/abs/1905.06088 ) (algebraic semantics of logic allow such mapping to even numbers “Abstract Algebraic Logic. An Introductory Textbook by Josep Maria Font (Author)”) to the general framework of predictive coding (https://arxiv.org/abs/2107.12979). Of course, there is lot of literature on biological neural codes - sparse coding, population codes, all the research that is done in deciphering partial or even full connectomes (e.g. C. elegans. BTW, I guess, I am the first one, who mentions C. elegans here). My understanding is that BB & BBG is close to the consideration of the full space of possible neural codes and category theory can be the tool to do this, but still, they are concerned with few codes. One of the authors of BB is the expert on the coding for communication, so, there is expertise in coding.
I also use this opportunity to thank rongmin for replies. I am still reading papers and I will have more thoughts that I will try to express after week or so. Now I am reading about their papers information co-homology, it seems to me that it is big part of their reasoning: they consider neuron activations as the random variables and - as I start to understand - they consider the derived/emergent/aggregate quantities (who may encode the real semantics, logics) which aggregated neuronal activations and all this forms the information structure.
It is not so much that I am bad with category theory (e.g. Seven Sketches by Spivak is fantastic work about sheafs topos, logic and classifying objects - the 7th chapter), my great confusion arises from not understanding what is meant e.g. by fibration italic-F. Now I perceive italic-F fibrations as the sets of neurons (subnetworks) whose dynamic/values encode the logical entities and which are mapped to the single neurons (that is why italic-F over italic-C). My continuous reading of BB confirms this understanding. From the other side I think that logic/semantics/theory are the emergent variables of the whole network and I am not sure that it can be dissected (fibred) to individual subpopulations of the neurons. Or even more - to the individual subpopulations of the individual layers of the network (they constantly stress that logic/semantic/theory develops into more determined-unambiguous theory with each layer and it is not one theory for which the network does series of inferences/computations).
Where all this leads? What is the benefit of the use of categories? As I understand then there are 3 Holy Grails of NN and General Intelligence: 1) finding the optimal architecture of NN for the learning/inference task. Empirically it can be done by evoluational neural architecture search; 2) finding the optimal learning algorithm; 3) finding the optimal neural coding scheme, i.e. mapping: neural activities to the logical quantities and reasoning algorithms. And this is not only the mapping of the output layer to the real world data, but it is also the internal working/inference of NN to explain the functioning process and to guide this process (e.g. to deliver bias in the form of human generated knowledge to reduce the learning time) So - all this 3 things: architecture (italic-C), algorithm (natural transformation among X-functors) and neural coding (partially expressed by italic-F and functors X) are expressed in categorical terms by BB. But why to do this?
My hope is that category theory can be used to find some kind “optimal category”, “optimal functors” and “optimal natural transformation”… So - I would like to see the example how category theory so far has been applied to determine the optimal category, obejcts, functors and how the real world entities (I mean the actual graph of NN, the actual learning algorithm, the actual coding scheme) can be synthesized from the optimal categorical entities. I don’t know whether the optimal categorial entities (if their exist) give the specification with enough detail (constraints and suggestions) to do such synthesis (e.g. program synthesis) but my understanding is that categorification can be used in such way. There is very scarce references to the optimization https://ncatlab.org/nlab/show/generalisation+as+an+adjunction#4_adjoints_as_the_solution_to_optimisation_problems and Google also finds some articles for “category theory and solution search”
I know that my expectations of category theory are very high and people are usually annoyed by them (and that is why I feel rejected), but it is my intuition that cats are not just for systematic description, but can also be used for the construction and synthesis. I would be glad to hear if there is similar reasoning or references along such lines.
Hi Shiva @16.
“Homotopy Theoretic and Categorical Models of Neural Information Networks” by Yuri Manin and Matilde Marcolli
Thank you for bringing that up, because I followed up on who cited MM and found a bunch of preprints on quivers and machine learning, which may be more accessible:
Marco Antonio Armenta, Pierre-Marc Jodoin, The representation theory of neural networks, arXiv:2007.12213.
George Jeffreys, Siu-Cheong Lau, Kähler geometry of quiver varieties and machine learning, arXiv:2101.11487.
Iordan Ganev, Robin Walters, The QR decomposition for radial neural networks, arXiv:2107.02550.
“Group entropies, correlation laws and zeta functions” by Piergiulio Tempesta.
generalized cohomology theory -> formal group law -> entropic functional?
Thank you for pointing out Tempesta’s work. He’s done a lot of work on how you can get entropic information measures from formal group law, so the “formal group law -> entropic functional” link seems solid enough. I’ll have to read his papers carefully to say more.
Hi Tomr @15.
my expectations of category theory are very high
Given that you’ve claimed to be a beginner, your expectation of how quickly you can understand the category theory needed to read BB is very high indeed. BB’s preprint isn’t an easy read even for me, so I can only imagine how difficult it must be for a beginner. Please be patient with yourself.
hybrid neural-symbolic computing (d’Avila Garcez https://arxiv.org/abs/1905.06088)
general framework of predictive coding (https://arxiv.org/abs/2107.12979)
Thank you for pointing out these preprints.
What is the benefit of the use of categories?
At the moment, people don’t quite understand how to explain what neural networks are doing. This is an important problem, particularly in the EU, since the EU has legislated a general right to explanation for people affected by automated decision-making.
Category theory may be able to help, since it has very deep connections to logic: see the nLab page on computational trilogy for more details. It has certainly proven itself in functional programming, so there are hopes that it could help us understand machine learning as well. This is why BB and others are interested in getting a topos out of DNNs, since that’d lead to the ability to use dependent type theory and higher-order logic on DNNs, and perhaps that could lead us to a better understanding of DNNs.
cats are not just for systematic description, but can also be used for the construction and synthesis.
We’re not yet fully capable of the former, so it’s premature to think about the latter. Being able to do all of the above is the goal, of course, but it will still take time.
@tomr I don’t have access to that, can you do me a favour? Email findable on the nLab.
1 to 24 of 24