Hi again.
As I said (here) in your other thread (btw, you don’t need to start a new thread for each issue in this context), for questions regarding the nLab’s server installation you’ll need to be speaking with our technical team.
I’ll drop them a note.
Hi Daniel,
[context: I help a bit on the server administration side of things here at nlab]
Cool to see your project. Preparing the nLab semantically could be a worthwhile endeavour.
Can you tell me a bit more about what you need support or help with? As Urs already pointed out, the raw content, as well as all the HTML-rendered pages, is available on GitHub; you can either parse the raw source with a customized (itex2MML) Markdown parser or try to parse the HTML pages directly.
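For anyone wanting to try the HTML route, here is a minimal stdlib-only sketch of stripping a rendered page down to its visible text; the `sample` string is a made-up stand-in for a downloaded page, and the real nLab markup may of course need more care:

```python
from html.parser import HTMLParser  # stdlib only, no external dependencies


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth counter for script/style nesting

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_text(html):
    """Return the page's visible text with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())


# Hypothetical miniature page, standing in for a real rendered nLab page.
sample = ("<html><body><h1>adjoint functor</h1>"
          "<p>An <em>adjoint functor</em> is ...</p></body></html>")
print(page_text(sample))  # → adjoint functor An adjoint functor is ...
```

This deliberately avoids third-party libraries; for heavier scraping something like BeautifulSoup would be the usual choice.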
Some parts of the nlab wiki already have categories, which you might be able to make use of, e.g. for people.
Sorry for the delay in replying. Please continue to shoot questions at us!
Bye!
I don’t need support so much; this is more about good manners.
In #1 there seemed to be the question of what to make of the entry count on the nLab.
Not sure what you are asking (I don’t know what it means that “the last corpus was built”), but otherwise it does not sound inappropriate.
As an aside, I can say that the pages with “empty” or “> history” in their title arise because there is no mechanism in our Instiki software for users to delete pages. Instead, there is a server-side command (which, however, has never been run so far) by which administrators can delete all “orphaned” pages, namely those that receive no link from any other nLab page. Hence the closest users can get to deleting a page is to rename it to a title that is unlikely to be referenced anywhere.
Some background for the interested: in 2020, Valeria de Paiva, Jacob Collard and Evan Patterson built a 500M NLP corpus of the nLab, nlab.conll. This is the first step in integrating WikiData with the nLab. Jacob has built Parmesan, which accesses nLab information from WikiData. I am beginning to document the build process at NetMath. One goal is to be able to create a corpus at will. The nLab has 3K or 4K more entries than it did in 2020. I use nlab.conll to do regular-expression queries.
Thanks for the information about history and empty pages. I’ll get feedback on whether to capture the page names for the corpus.
5242 pages are tagged with the people tag; judging by the creation frequency I observe in the latest pages, nearly half of these entries are rather recent, from the last 2 years or so.
I’m now finishing up creating a current corpus of the nLab. Of course some issues have come up; I’m not including blank pages or pages with “> history” in their name. There is also the question of whether to include pages from the people category. While some are notable people, these are largely entries for what other systems consider users; Wikipedia, for example, gives users their own namespace. What do people think: should all pages from the people category be included in the corpus?
The people pages are mostly about authors of references that are cited on nLab pages. The point is that readers interested in a reference they find on the nLab can click on the author name to find other writings by that author.
Thanks for the explanation, I missed the connection between authors and references.
I’m now working on running queries on the nLab corpus using Python. My near term goal is to be able to quickly answer elaborate questions about the corpus. So let me know if anyone has any questions about the nLab data.
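To illustrate the kind of regex query I mean, here is a minimal sketch run over a tiny made-up sample in a CoNLL-style token-per-line layout; the column order is an assumption, and the real nlab.conll layout may differ:

```python
import re

# Hypothetical sample in a CoNLL-like layout: one token per line,
# tab-separated columns (index, form, lemma, POS); blank line = sentence break.
sample_conll = """\
1\tadjoint\tadjoint\tADJ
2\tfunctor\tfunctor\tNOUN

1\tsheaf\tsheaf\tNOUN
"""


def query_tokens(conll_text, pattern):
    """Return the word forms (second column) matching the given regex."""
    rx = re.compile(pattern)
    hits = []
    for line in conll_text.splitlines():
        if not line.strip():
            continue  # sentence boundary
        cols = line.split("\t")
        if len(cols) > 1 and rx.search(cols[1]):
            hits.append(cols[1])
    return hits


print(query_tokens(sample_conll, r"functor|sheaf"))  # → ['functor', 'sheaf']
```

In practice one would stream the corpus file line by line instead of holding it in a string, but the matching logic is the same.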
I am now reconciling the figures I have from the 2020 nLab corpus with the current 2024 corpus. The Topos Institute nLab2024 corpus contains the code and data files for the project. The stats2024.xml file suggests the nLab has twice the content it had back in 2020, but the difference in file size suggests only a 33% increase in content. Do folks at the nForum have any estimates that would either support or invalidate these results?
I have no idea. It seems the one uniquely in position to answer such questions now is: you. :-)
My work for Valeria de Paiva supporting the identification of mathematical concepts in nLab continues. One consequence is that less and less of nLab is included in the 2024 corpus because we currently are only concerned with pages whose name is a mathematical concept. The following shows the page names and categories that are currently excluded from the current corpus. As a result only 13 K of the 19 K nLab pages are included. Since this change may impact the project’s usefulness to the nForum community I wanted to give folks a heads up. I’m now writing reports to provide different overviews of nLab. For example I am tracking the number of excluded pages by reason. Please let me know if this software’s functionality can be extended to serve the needs of this community.
Excluded page names:
Excluded categories:
we currently are only concerned with pages whose name is a mathematical concept
Excluded page names:
^\W regex matches initial non-word character
^\d regex matches initial digit
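A minimal Python sketch of how those two patterns act as title filters; the titles below are hypothetical examples, not taken from the actual exclusion list:

```python
import re

# Hypothetical page titles, chosen to exercise both patterns.
titles = ["category theory", "2-category", "(infinity,1)-topos", "!redirects test"]

excluders = [
    re.compile(r"^\W"),  # initial non-word character
    re.compile(r"^\d"),  # initial digit
]

kept = [t for t in titles if not any(rx.match(t) for rx in excluders)]
print(kept)  # → ['category theory']
```

As the discussion below points out, these filters also discard genuine concept pages such as “2-category”, so they are too aggressive on their own.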
For excluded categories you should probably provide counts of pages, examples, or even complete lists.
A page like 2-category is definitely a concept.
Thanks for the feedback, RodMcGuire; the page counts follow. I just modified the scraper to collect the information needed for a report with a complete list, which I should be able to post to GitHub later today. I have also modified the program so that it no longer excludes page names beginning with a digit. The list of concepts is preliminary and conservative in its inclusions: for the concepts corpus it is considered much better to exclude a valid concept than to include an invalid one.
Candidate concepts 12903
Excluded
There are also lots of concept-entries whose titles start with a parenthesis; you can see this at the beginning of the list of all pages.
Also the very first entry in the list, whose title starts with an exclamation mark, is a concept-entry.
On the other hand, !include-entries (e.g. this one) typically have alphabetical titles (except for a hyphen somewhere) and don’t carry a tag, but are not stand-alone concept-entries.
I’d suggest to instead go by the actual content of the entries:
If an entry contains a section header titled either “Idea” or “Definition” then it surely is a concept-entry,
and if it doesn’t then it’s probably not a concept entry or else we should go and add such a section header to it, anyways.
Just beware, when searching through an article’s content for section headers, that there is differing alternative syntax for them:
You want to be looking for “## Idea” or “\section{Idea}” (and allow for extra whitespace, of course).
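A sketch of that check in Python: the regex below accepts either header syntax for “Idea” or “Definition”, allowing extra whitespace and optional trailing `#` marks. The exact set of header variants in use on the nLab is an assumption here:

```python
import re

# Matches either "## Idea" / "## Definition" (Markdown-style, any # level,
# optional closing #s) or "\section{Idea}" / "\section{Definition}".
HEADER_RX = re.compile(
    r"^\s*(?:#{1,6}\s*(?:Idea|Definition)\s*#*\s*$"
    r"|\\section\{\s*(?:Idea|Definition)\s*\})",
    re.MULTILINE,
)


def looks_like_concept_entry(source):
    """Heuristic: entry has an Idea or Definition section header."""
    return bool(HEADER_RX.search(source))


print(looks_like_concept_entry("## Idea\n\nAn adjoint functor is ..."))  # → True
print(looks_like_concept_entry("\\section{ Definition }\nLet C be ..."))  # → True
print(looks_like_concept_entry("See the references below."))  # → False
```

This is only a heuristic, per Urs’s suggestion: entries lacking such a header are probably not concept-entries, or should have one added.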
Thanks Urs, I will implement your suggestions in the reports within the next day or two. My current focus is on writing the most revealing reports I can, because what I am supporting is the software development of a research project. We have many questions and few solid, agreed-upon facts to work with. Very exploratory.
The complete lists RodMcGuire suggested are now at nLab report.