At last it has happened: the Huawei Research collaboration has started to publish its groundbreaking work on the formalization and theory of deep neural networks and deep learning. https://arxiv.org/abs/2106.14587v1 is the first published preprint (already more than 100 pages), and it references several to-be-published preprints and also a whole book of papers.
I can handle the theory as a beginner, but this preprint is full of musings and is not formatted as theorems, lemmas, proofs, etc.
So - can we have some discussions about it? Some elaboration? Reformulation?
I have asked a first question on Math StackExchange: https://math.stackexchange.com/questions/4189699/topos-and-stacks-of-neural-networks-how-to-understand-sentence-values-in-t
But there will be more and more such questions. I hope that the preprint will see further versions, and I especially hope to read the forthcoming reference: [BBG20] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical information cells. Internal technical report, Huawei, 2020.
But DNN theory (especially as used for guidance in the selection of DNN architectures and the discovery of deep learning algorithms) is such a hot topic that it is impossible to wait.
Actually, this is the kind of moment that Gates and Allen experienced some 40 years ago: a trillion-dollar business opportunity. And a huge step for civilization as well, not just for science.
I have scanned through the article arXiv:2106.14587 now.
The proposed translation from neural networks to toposes seems to happen in section 3.2 (p. 7), where the “possible activities of the population of neurons” arranged in a graph $\Gamma$ is claimed to be encoded by a (co-)presheaf of sets on (the category generated by) $\Gamma$.
This statement is vaguely plausible, but rather unspecific. Where in the article is this perspective put to use for DNNs?
Probably Section 3.4, where back-propagation is meant to be interpreted topos-theoretically. But is it? Equations (2) - (11) recall basic formulas of the standard theory (though it remains unclear what their import is here). Surprisingly, after (11) we see a Theorem 1 on p. 11 suddenly state that “Backpropagation is a flow of natural transformations”. There seems to be no connection between these words and what happened before (or after) this statement.
Yet, this Theorem 1 appears to be the last attempt in the article to relate toposes to DNNs. What follows is a kind of review of a plethora of concepts from category theory and homotopy theory.
For example, on p. 91 we suddenly read “The map $\mathcal{C} \mapsto \mathcal{M}_{\mathcal{C}}$” is an example of a derivator and that: “A derivator generalizes the passage from a category to its topos of presheaves”. But the latter is not a characterization of the concept of derivators, and the former remains unspecified: There is no “map $\mathcal{C} \mapsto \mathcal{M}_{\mathcal{C}}$” around p. 91. Probably this refers way back to section 5, where $\mathcal{M}_{\mathcal{C}}$ denoted model categories of $\mathcal{M}$-valued presheaves on $\mathcal{C}$.
A few pages later the article ends with “The characterization of fibrant and cofibrant objects in $\mathcal{M}_{\mathcal{C}}$ was the main result of section 5” (looking through section 5, this claim seems to refer to Theorem 4 (p. 32), which mentions “type theory structure” and claims to prove that something is a fibration, essentially by saying that this has been argued elsewhere.)
The last sentence of the article then is:
How to use this fact for functioning networks?
That, indeed, seems to be the general open question here.
NB: There is now a v2 of arXiv:2106.14587, and this comment has been edited to reflect the new pagination and numbering in brackets, while retaining the old ones for the convenience of those who have the older version.
Re. #1:
their groundbreaking work on the formalization and theory of deep neural networks and deep learning
There’s been plenty of work done in this area^{1} since 2017, when Fong et al. (“FST”) posted Backprop as Functor ^{2} on the arXiv. Earlier attempts at applying category theory to the study of neural networks go back at least to 1999.^{3}
What the authors (“BB”) claim to have done in arXiv:2106.14587 (“the BB preprint”) is to apply topos theory to the study of equivariance and invariance in deep neural networks (DNNs).
this preprint is full of musings and is not formatted in the form of theorems, lemmas and proofs. etc. […C]an we have some discussions about it? Some elaboration? Reformulation?
The videos of the talks given by the two authors^{4} at Toposes Online are now available, so you can watch them elaborate on their own work. I’ll just give some motivating remarks and comment on what I think BB are claiming in this preprint.
DISCLAIMER: The following is only my imperfect understanding of the claims in the preprint. Whether or not anything claimed in the preprint is true is outside of the scope of this comment.
A major problem with DNNs now is that our ability to reason about them is inadequate. One possible solution is to see if we can get a topos out of a suitably defined category of DNNs. Since a topos is a locally Cartesian closed category, this allows us to use Martin-Löf dependent type theory and higher-order logic.
The structure of a DNN can be modelled as a directed graph, possibly with cycles, that has labelled edges and vertices. The edge labels are usually called weights, while the vertex labels are called biases by FST and activities in the BB preprint. Graphs are not directly considered in the FST paper, because I think they wanted a more prosaic approach, and people didn’t seem to know at the time whether a category of DNNs even makes sense. The point of the FST paper was to show that a category of DNNs exists, but it also pointed out that backprop is needed for composition in the category to work.
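To make the graph picture concrete, here is a minimal sketch (my own illustration, not code from either paper; all names are assumptions) of a two-layer “chain” network: weight matrices live on the edges, biases on the vertices, and the forward pass composes along the unique path of the chain.

```python
# Toy sketch: a feed-forward "chain" network as a directed graph with
# weights on edges and biases on vertices, as described above.
import math

# vertices 0 -> 1 -> 2; each edge carries a weight matrix, each vertex a bias
weights = {(0, 1): [[0.5, -0.2], [0.1, 0.3]],   # edge labels ("weights")
           (1, 2): [[1.0, 0.7]]}
biases  = {1: [0.1, -0.1], 2: [0.0]}            # vertex labels ("biases")

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(w, b, x):
    """Apply one edge's weight matrix and the target vertex's bias."""
    return [sigmoid(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

def forward(x):
    """Compose the layers along the unique path 0 -> 1 -> 2 of the chain."""
    h = layer(weights[(0, 1)], biases[1], x)
    return layer(weights[(1, 2)], biases[2], h)

print(forward([1.0, 2.0]))
```

The point of the sketch is only that the network is the pair (graph, labels) and that evaluation is composition along paths, which is what makes the functorial reading below plausible.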
The BB preprint comes after all that, and its focus is on analysing the properties of a specific DNN, so a graph is fixed from the start. It also considers functor categories (more precisely, (pre-)sheaves) directly, so it models the edge and vertex labels as functors, unlike the FST approach.^{5} Since backprop is now modelled as a morphism of functors, rather than a morphism of objects, it is thus a natural transformation rather than a functor.
Note that since BB considered the simplest case of a “chain” (a directed graph without branches) in section 3.2 (v2: 1.2), they needed to deal with branching in section 3.3 (v2: 1.3), something which FST did not have to do. This is done in the BB preprint in a way which I don’t quite understand: the reasoning is given in a “Remarks” paragraph on p. 14 (v2: 16) in section 3.5 (v2: 1.5), and seems to involve the avoidance of some difficulty with constructing a covering.
In any case, section 3.3 (v2: 1.3) also claims to show how the sheaves associated with a DNN can be endowed with a Grothendieck topology, and thus become toposes. For the FST approach, it has only been in the past year or so that preprints have emerged that claim to have constructed a topos for DNNs based on the FST construction.^{6}
Since BB are also interested in the invariance structure of certain DNN architectures, they move on to consider the classifying topos construction, stacks over the DNN topos and model categories. They also claim to have adapted a homological method, which was developed by the second author and Baudot^{7} to study entropy, to this setting in section 5.4 (v2: 3.4), and to have developed a preliminary theory of “semantic information” in an attempt to explain the efficacy of some DNN architectures in handling natural language processing tasks.
See arXiv:2106.07032 for a review by Shiebler et al., especially section 2. ↩
Michael J Healy, in a paper at IJCNN’99, used category theory as a tool to model concept formation and recall in neural networks. ↩
Links to the first author’s talk and the second author’s talk. ↩
An updated FST approach is given in arXiv:2103.01931, which constructs a bicategory of DNNs. ↩
This work is mostly due to Spivak, see arXiv:2003.04827, arXiv:2005.0189 and arXiv:2103.01189 ↩
Pierre Baudot and Daniel Bennequin, The homological nature of entropy, Entropy 2015, 17, 3253-3318; doi:10.3390/e17053253 ↩
re #3
One possible solution is to see if we can get a topos out of a suitably defined category of DNNs. If we can get a Grothendieck topos out of it, that would be even better, because every such topos is a locally Cartesian closed category (see Proposition 3.1 of this nLab entry), which supports Martin-Löf dependent type theory and higher-order logic.
Huh? I thought any topos is locally Cartesian closed because all of its slice categories are toposes - see over-topos. Some other categories may be LCC without being a topos.
Re. #4
I thought any topos is locally Cartesian closed because all of it’s slice categories are toposes - see over-topos. Some other categories may be LCC without being a topos.
Fixed, thank you!
A quick check with the nLab tells me that any quasi-topos is LCC. I think the authors of that preprint are actually working with quasi-toposes in some places, since posets and Heyting algebras also make an appearance.
Has someone managed to decipher at least part of the BB preprint and the talks?
E.g. I am stuck with the sentence that directly precedes formula (5.8) in v2 of BB: Natural morphisms between objects $(\mathcal{F}, A)$ and $(\mathcal{F}', A')$ of $\mathcal{A}_{\mathcal{C}}$ are defined by a family of functors $F_U \colon \mathcal{F}_U \to \mathcal{F}'_U$, such that for any morphism $\alpha \colon U \to U'$ in $\mathcal{C}$
So, we have the “class of presheaves” $\mathcal{A}_{\mathcal{C}} = Grp^{\wedge}_{\mathcal{C}}$ (actually, it is a category of presheaves; a presheaf is a functor from $\mathcal{C}$ to the category Set). So a presheaf, as a functor, is an object of $\mathcal{A}_{\mathcal{C}}$. But what does this presheaf functor denote? I understand that this presheaf is a functor from the category $\mathcal{C}$ (a category of graphs, of architectures/connectivity among neurons) towards a category of vectors of weights/biases (I am very uneasy here - there is no category of vectors…)… Maybe one can understand this presheaf as a collection of assignments ⟨graph, weights/biases⟩.
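For what it’s worth, one toy way to make the “functor on the graph category” picture concrete (a sketch of one possible reading, not BB’s actual definition; as noted later in the thread they work with prestacks) is to assign a set of possible activities to each vertex and a function to each edge, and then check functoriality on composite paths:

```python
# Toy sketch: a functor of sets on the category freely generated by a chain
# graph 0 -> 1 -> 2. X assigns to each vertex its set of possible activities,
# and to each edge a function between those sets. Names are illustrative.

X = {0: {0, 1}, 1: {0, 1}, 2: {0, 1}}          # activity sets at each vertex

def X_edge_01(a):          # function assigned to the edge 0 -> 1
    return 1 - a           # e.g. an inverting "layer"

def X_edge_12(a):          # function assigned to the edge 1 -> 2
    return a               # identity "layer"

def X_path_02(a):
    """Functoriality: the composite path 0 -> 2 must act as the composite map."""
    return X_edge_12(X_edge_01(a))

assert all(X_path_02(a) in X[2] for a in X[0])
print([X_path_02(a) for a in sorted(X[0])])    # [1, 0]
```

On this reading, the functor is determined by the weight/bias labels (they fix the edge functions), which is one way to make sense of the ⟨graph, weights/biases⟩ picture.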
Another question is: what distinguishes those presheaves from one another? In what way does the presheaf-functor $\mathcal{F}_U$ differ from $\mathcal{F}_{U'}$? Maybe those presheaves are defined only on some parts of the graph, on some layers? E.g. a presheaf is a collection of assignments of weights/biases to a sub-network?
I can try to understand why we consider the collection of weight/bias assignments… Maybe different learning algorithms give different assignments, or different data give different assignments…
And, further… we have decided that presheaves are denoted by $\mathcal{F}_U$ and $\mathcal{F}_{U'}$ for different parts (layers) $U$ and $U'$ of the network. But why do the authors speak about objects $(\mathcal{F}, A)$ and $(\mathcal{F}', A')$? $A$ may be an assignment of weights/biases? Or $A$ may be the group element that corresponds to the actual assignment of weights/biases? $\mathcal{F}$ as a presheaf already contains the assignment of weights/biases, so the addition of $A$ means adding an interpretation of the numerical weight/bias assignment as elements of a group (or a subgroup, or a coset of a group, or another collection of group elements).
As I read this article (and I am reading it a lot and going over the video presentations), and as I read the other materials about the cohomology of information, toposes and sheaves, I am struggling to make sense of what each letter, each notation and each formula means in this article. It is so hard to understand and check.
It would be nice if we could build a glossary of the terms, letters and notations used in this article, with an explicit explanation for each letter (as each is used in several contexts).
It would be great to hear any more thoughts and insights into this. Thanks for sharing!
Re #6:
I am stuck with sentence (that precedes formula (5.8) directly in v2 of BB: Natural morphisms between objects ($\mathcal{F}$, $A$) and ($\mathcal{F}'$, $A'$) of $\mathcal{A}_{\mathcal{C}}$ are defined by a family of functors $F_U \colon \mathcal{F}_U \to \mathcal{F}^{'}_U$, such that for any morphism $\alpha \colon U \to U'$ in $\mathcal{C}$.
This bit (from the beginning of section 5.2 up to the paragraph preceding the Remark after equation (5.9)) is a summary of the material discussed from the beginning of section 2.2 (v1: 4.2) up to equation (2.6). As indicated in the first paragraph of section 5.2 (v1: 7.2), the reader is referred to sections 3 and 4 of v1(!) (these are sections 1 and 2 of v2) for details.
So, we have the “class of presheaves” $\mathcal{A}_{\mathcal{C}} = Grp^{\wedge}_{\mathcal{C}}$ (actually, it is a category of presheaves; a presheaf is a functor from $\mathcal{C}$ to the category Set). So a presheaf, as a functor, is an object of $\mathcal{A}_{\mathcal{C}}$. But what does this presheaf functor denote? I understand that this presheaf is a functor from the category $\mathcal{C}$ (a category of graphs, of architectures/connectivity among neurons) towards a category of vectors of weights/biases (I am very uneasy here - there is no category of vectors…)…
They are talking about (pre-)stacks, which are (pre-)sheaves of categories. Section 2.1 discusses their proposal for studying invariance in DNNs, using stacks and their construction of a topos for DNNs, while section 2.2 goes over some theory about stacks. As noted just before section 2.3, the theory in sections 2.2 is treated in greater detail in arXiv:2107.04417, the new preprint by Caramello and Zanfa.
Let’s stick to the basics. There is the sentence “category $C^{o}(\Gamma)$ freely generated by the graph” on page 5-9 (at the start of Ch 1.2) of BBv2. What does this phrase mean? Is https://en.wikipedia.org/wiki/Free_category the freely generated category meant here? Do I understand correctly that $C^{o}(\Gamma)$ is not just a collection of graphs (i.e. each object some graph and each morphism some transformation between objects/graphs), but that $C^{o}(\Gamma)$ is a collection of objects and morphisms that is somehow generated (details in the referenced wiki) by some fixed graph $\Gamma$ (which can also be the space-time expansion of the graph of some fixed RNN)?
p.s. I have read some 80% of the preprint (the last portions of the last 2 chapters pending), but now I will go through the entire preprint to pin down and clarify every notion (of course, I have a lot of category theory books to read about fibred categories, Grothendieck constructions and sheaves myself, so no questions about those). I hope that we can walk through it together and create a guide for others along the way. Thanks in advance!
There is sentence “category $C^{o}(\Gamma)$ freely generated by the graph” in page 5-9 (at the start of Ch 1.2) of BBv2. What does this phrase mean? Is https://en.wikipedia.org/wiki/Free_category the mentioned freely generated category?
Yes, see also the nLab page for free category.
$C^{o}(\Gamma)$ is not just collection of graphs (i.e. each object is some graph and each morphism is some transformation between objects/graphs)
$C^{o}(\Gamma)$ is not even that. The objects of $C^{o}(\Gamma)$ are the vertices of $\Gamma$, not graphs. The morphisms between two objects are paths that exist on $\Gamma$ between them.
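A quick sketch of that description: in the free category on $\Gamma$, morphisms are directed paths, identities are empty paths, and composition is concatenation. The vertex and edge names below are illustrative, not from the preprint.

```python
# Sketch: morphisms of the free category on a small acyclic graph Γ are the
# directed paths of Γ; composition is path concatenation.

edges = {("u", "v"), ("v", "w"), ("u", "w")}   # a small graph Γ

def paths(src, tgt, seen=()):
    """Enumerate all directed paths src -> tgt (morphisms of the free category).
    The empty path at a vertex is its identity morphism."""
    if src == tgt:
        yield seen
    for (a, b) in edges:
        if a == src:
            yield from paths(b, tgt, seen + ((a, b),))

def compose(p, q):
    """Composition in the free category is concatenation of paths."""
    return p + q

morphs_uw = list(paths("u", "w"))
print(morphs_uw)   # two morphisms u -> w: the edge (u,w) and the path via v
```

So the objects really are the vertices of the one fixed graph $\Gamma$, and the hom-sets can be bigger than the edge set, because composites of edges are new morphisms.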
Let’s stick to the basics.
I agree, which is why reading the FST paper, Backprop as Functor (arXiv:1711.10455, with the arXiv version being the complete version), is strongly recommended for understanding the BB preprint. That paper was published at LICS’19, so unlike the BB preprint it was written specifically for an audience of computer scientists and software engineers who’re not all experts in category theory. It builds up a category of neural networks (called “learners” there) from the very basics, so you get a pretty good understanding of what it takes to get such a category. That definitely helps with understanding BB, particularly section 1.2.
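FST’s definition can be sketched in a few lines. The `Learner` class and the toy linear learner below are my own illustrative encoding (the names and the gradient-step updater are assumptions, not FST’s code); the composition rule follows the paper, and it is where backprop enters: the second learner’s request map feeds the first learner’s update.

```python
# Rough sketch of FST's "learner": a learner A -> B is a parameter space P
# with implement: P x A -> B, update: P x A x B -> P, request: P x A x B -> A.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Learner:
    p: Any                  # current parameter
    implement: Callable     # (p, a) -> b
    update: Callable        # (p, a, b) -> new p
    request: Callable       # (p, a, b) -> "corrected" input a

def compose(l2: Learner, l1: Learner) -> Learner:
    """The composite learner l2 ∘ l1, following FST's composition rule."""
    def implement(pq, a):
        p, q = pq
        return l2.implement(q, l1.implement(p, a))
    def update(pq, a, c):
        p, q = pq
        b = l1.implement(p, a)
        # l2's request propagates the error back to l1: this is backprop
        return (l1.update(p, a, l2.request(q, b, c)), l2.update(q, b, c))
    def request(pq, a, c):
        p, q = pq
        b = l1.implement(p, a)
        return l1.request(p, a, l2.request(q, b, c))
    return Learner((l1.p, l2.p), implement, update, request)

# A toy linear learner y = p * x with one gradient step (illustrative only).
def scale(p, a): return p * a
def upd(p, a, b): return p - 0.1 * (p * a - b) * a   # step on squared error
def req(p, a, b): return a - 0.1 * (p * a - b) * p

l1 = Learner(2.0, scale, upd, req)
l2 = Learner(3.0, scale, upd, req)
l = compose(l2, l1)
print(l.implement(l.p, 1.0))   # 3.0 * (2.0 * 1.0) = 6.0
```

The moral, as stated above, is that without the request maps (i.e. without backprop) the updates of the two learners could not be combined, so composition in the category would fail.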
Hi Rongmin,
thanks for bringing in expertise. Whatever solid statements there are to be made about category theoretic notions brought to bear on neural networks, please feel invited to record them in our entry neural network. Incrementally recording pertinent definitions/theorems/proofs there might also help the discussion here.
HI Urs,
Thank you for pointing out that entry. I was actually wondering where on the nLab I could do that.
I have managed to understand the encodings of activations and weights. Now I am reading the last paragraph of p5-9 of BBv2 and the first full paragraph of p6-10 of BBv2.
There are a couple of uncertainties. First, as I understand it, the "singleton $\mathit{w}$ in $\mathbb{W}$" is the same thing as the (terminal) "object $\textbf{1}$ in $\mathbb{W}$", and it is the same terminal object that enters into the fibration/classification square demonstrated on p3 and p4 of https://arxiv.org/abs/1012.5647v3, which give an example of how the classifying object can be seen as the Boolean values (true, false) and the fibration as the pre-image of a characteristic function. Why are they using $\mathit{w}$ and $\textbf{1}$ for the same notion? And about semantics: such a fibration square is essentially a very bold assertion, which says: a specific set of weights (as encoded by a specific functor $\mathbb{W}$) can emerge from different sets of activations (as encoded by a specific functor $X^{w}$, or a set of specific functors). Well, I am not sure whether different sets of activations can indeed correspond to the same set of weights, or whether it is the other way around. I have never thought about such a relationship, and the categorical formalisation puts pressure on us to take one side…
The next paragraph is the hardest one. What is $\Omega$? Usually this symbol is used for the classifying object in a fibration. It can be the true/false set in the logic examples of https://arxiv.org/abs/1012.5647v3, or it can be the set of specific $\mathbb{W}$-functors from the preceding paragraph (e.g. a specific set of weights classifies all the possible sets of activations). But then I don’t understand how $\textbf{1}$ (as a specific functor into $\mathbb{W}$; they are using this letter both for a specific functor and for the category of functors, as part of a (sub-)topos) can be made the container of subobjects. $C^{\wedge}\textbf{1}$ is a constant assignment to all the weights, and what can a subobject of such an assignment be? An assignment for only part of the graph? Maybe $\Omega(1)$ denotes the set of weights for a subgraph of $\Gamma$? Maybe it is so. But why do they speak about logic at this moment? Upon first reading of the paper I had the impression that $\Omega$ somehow denotes the group or semantics emerging from the weights or activations. I.e. maybe the activations or weights of a trained neural network admit a group of transformations that transform some subset of weights/activations into another subset (or something like this), and this set of transformations can be some specific group, and this group may be THE semantics of the trained neural network? But they are still speaking only about weights/activations in this paragraph.
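On the Set-level intuition behind $\Omega$: in Set, subobjects of $X$ correspond bijectively to characteristic maps into $\Omega = \{false, true\}$, with the subobject recovered as a preimage. A minimal sketch of that correspondence (illustrative names, my own toy example, not the preprint’s construction):

```python
# Sketch of the subobject classifier Ω in Set: Ω = {False, True}, and subsets
# S ⊆ X correspond bijectively to characteristic maps χ_S : X -> Ω, with S
# recovered as the preimage of True. This is the toy case of the
# fibration/classification square discussed above.
X = {1, 2, 3, 4}
S = {2, 4}                                 # a subobject of X

def chi(S):
    """Characteristic map χ_S : X -> Ω of the subset S."""
    return lambda x: x in S

def preimage_true(f):
    """Pull back True along f : X -> Ω, recovering the subobject."""
    return {x for x in X if f(x)}

assert preimage_true(chi(S)) == S          # the correspondence round-trips
print(sorted(preimage_true(chi(S))))       # [2, 4]
```

In a general topos the same square is what makes internal logic possible, which may be why the authors bring up logic at exactly this point.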
The reference "[Pro19] Alain Prouté. Introduction à la logique catégorique. MSc Course, Université Paris Diderot, 2019." seems to be very interesting and accessible; unfortunately, it is in French and I am a newbie in French. Maybe there is a translation or an alternative?
And of course, what about the eagerly awaited preprints: [BBG20] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical information cells, Part I. Internal technical report, Huawei, 2020. [BBG21] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical information cells, Part II. Internal technical report, Huawei, 2021. I have checked the arXiv and they are still not there. Maybe some researchers have access to those preprints, circulated privately (e.g. under a signed non-disclosure agreement). If so, maybe someone can share them with me? I guess many misunderstandings could be resolved by those examples.