Hi again.
As I said (here) in your other thread (btw, you don’t need to start a new thread for each issue in this context), for questions regarding the nLab’s server installation you’ll need to be speaking with our technical team.
I’ll drop them a note.
Hi Daniel,
[context: I help a bit on the server administration side of things here at nlab]
Cool to see your project. Preparing the nLab semantically could be a worthwhile endeavour.
Can you tell me a bit more about what you need support or help with? As Urs already pointed out, the raw content, as well as all the HTML-rendered pages, is available on GitHub; you can either parse the raw source with a customized (itex2MML) Markdown parser or try to parse the HTML pages directly.
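For anyone wanting to try the HTML route, here is a minimal stdlib-only sketch of stripping a rendered page down to its visible text; the `sample` string is a made-up stand-in for a downloaded page, and the real nLab markup may of course need more care:

```python
from html.parser import HTMLParser  # stdlib only, no external dependencies


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth counter for script/style nesting

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_text(html):
    """Return the page's visible text with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())


# Hypothetical miniature page, standing in for a real rendered nLab page.
sample = ("<html><body><h1>adjoint functor</h1>"
          "<p>An <em>adjoint functor</em> is ...</p></body></html>")
print(page_text(sample))  # → adjoint functor An adjoint functor is ...
```

This deliberately avoids third-party libraries; for heavier scraping something like BeautifulSoup would be the usual choice.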
Some parts of the nlab wiki already have categories, which you might be able to make use of, e.g. for people.
Sorry for the delay in replying. Please continue to shoot questions at us!
Bye!
I don’t need support so much; this is more about good manners.
In #1 there seemed to be the question of what to make of the entry count on the nLab.
Not sure what you are asking (I don’t know what it means that “the last corpus was built”), but otherwise it does not sound inappropriate.
As an aside, I can say that the pages with “empty” or “> history” in their title arise because there is no mechanism in our Instiki software for users to delete pages. Instead, there is a server-side command (which, however, has never been run so far) by which administrators can delete all “orphaned” pages, namely those that receive no link from any other nLab page. Hence the closest users can get to deleting a page is to rename it to a title that is unlikely to be referenced anywhere.
Some background for the interested: in 2020, Valeria de Paiva, Jacob Collard and Evan Patterson built a 500M NLP corpus of the nLab, nlab.conll. This is the first step in integrating WikiData with the nLab. Jacob has built Parmesan, which accesses nLab information from WikiData. I am beginning to document the build process at NetMath. One goal is to be able to create a corpus at will. The nLab has 3K or 4K more entries than it did in 2020. I use nlab.conll to do regular-expression queries.
Thanks for the information about history and empty pages. I’ll get feedback on whether to capture the page names for the corpus.
5242 pages are tagged with the people tag; judging by the creation frequency I observe in the latest pages, nearly half of these entries are rather recent, from the last 2 years or so.
I’m now finishing up creating a current corpus of the nLab. Of course some issues have come up; I’m not including blank pages or pages with “> history” in their name. There is also the question of whether to include pages from the people category. While some are notable people, these are largely entries for what other systems consider users; Wikipedia, for example, gives users their own namespace. What do people think: should all pages from the people category be included in the corpus?
The people pages are mostly about authors of references that are cited on nLab pages. The point is that readers interested in a reference they find on the nLab can click on the author name to find other writings by that author.
Thanks for the explanation, I missed the connection between authors and references.
I’m now working on running queries on the nLab corpus using Python. My near term goal is to be able to quickly answer elaborate questions about the corpus. So let me know if anyone has any questions about the nLab data.
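To illustrate the kind of regex query I mean, here is a minimal sketch run over a tiny made-up sample in a CoNLL-style token-per-line layout; the column order is an assumption, and the real nlab.conll layout may differ:

```python
import re

# Hypothetical sample in a CoNLL-like layout: one token per line,
# tab-separated columns (index, form, lemma, POS); blank line = sentence break.
sample_conll = """\
1\tadjoint\tadjoint\tADJ
2\tfunctor\tfunctor\tNOUN

1\tsheaf\tsheaf\tNOUN
"""


def query_tokens(conll_text, pattern):
    """Return the word forms (second column) matching the given regex."""
    rx = re.compile(pattern)
    hits = []
    for line in conll_text.splitlines():
        if not line.strip():
            continue  # sentence boundary
        cols = line.split("\t")
        if len(cols) > 1 and rx.search(cols[1]):
            hits.append(cols[1])
    return hits


print(query_tokens(sample_conll, r"functor|sheaf"))  # → ['functor', 'sheaf']
```

In practice one would stream the corpus file line by line instead of holding it in a string, but the matching logic is the same.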
I am now reconciling the figures I have from the 2020 nLab corpus with the current 2024 corpus. The Topos Institute nLab2024 corpus contains the code and data files for the project. The stats2024.xml file suggests the nLab has twice the content it had back in 2020, but the difference in file size suggests only a 33% increase in content. Do folks at the nForum have any estimates that would either support or invalidate these results?
I have no idea. It seems the one uniquely in position to answer such questions now is: you. :-)
My work for Valeria de Paiva supporting the identification of mathematical concepts in nLab continues. One consequence is that less and less of nLab is included in the 2024 corpus because we currently are only concerned with pages whose name is a mathematical concept. The following shows the page names and categories that are currently excluded from the current corpus. As a result only 13 K of the 19 K nLab pages are included. Since this change may impact the project’s usefulness to the nForum community I wanted to give folks a heads up. I’m now writing reports to provide different overviews of nLab. For example I am tracking the number of excluded pages by reason. Please let me know if this software’s functionality can be extended to serve the needs of this community.
Excluded page names:
Excluded categories:
we currently are only concerned with pages whose name is a mathematical concept
Excluded page names:
^\W regex matches initial non-word character
^\d regex matches initial digit
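A minimal Python sketch of how those two patterns act as title filters; the titles below are hypothetical examples, not taken from the actual exclusion list:

```python
import re

# Hypothetical page titles, chosen to exercise both patterns.
titles = ["category theory", "2-category", "(infinity,1)-topos", "!redirects test"]

excluders = [
    re.compile(r"^\W"),  # initial non-word character
    re.compile(r"^\d"),  # initial digit
]

kept = [t for t in titles if not any(rx.match(t) for rx in excluders)]
print(kept)  # → ['category theory']
```

As the discussion below points out, these filters also discard genuine concept pages such as “2-category”, so they are too aggressive on their own.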
For excluded categories you should probably provide counts of pages, examples, or even complete lists.
A page like 2-category is definitely a concept.
Thanks for the feedback, RodMcGuire; the page counts follow. I just modified the scraper to collect the information needed for a report with a complete list, which I should be able to post to GitHub later today. I have also modified the program so that it no longer excludes page names beginning with a digit. The list of concepts is preliminary and conservative in its inclusions: for the concepts corpus it is considered much better to exclude a valid concept than to include an invalid one.
Candidate concepts 12903
Excluded
There are also lots of concept-entries whose titles start with a parenthesis; you can see this at the beginning of the list of all pages.
Also the very first entry in the list, whose title starts with an exclamation mark, is a concept-entry.
On the other hand, !include-entries (e.g. this one) typically have alphabetical titles (except for a hyphen somewhere) and don’t carry a tag, but are not stand-alone concept-entries.
I’d suggest to instead go by the actual content of the entries:
If an entry contains a section header titled either “Idea” or “Definition” then it surely is a concept-entry,
and if it doesn’t then it’s probably not a concept entry or else we should go and add such a section header to it, anyways.
Just beware, when searching through an article’s content for section headers, that there is differing alternative syntax for them:
You want to be looking for “## Idea” or “\section{Idea}” (and allow for extra whitespace, of course).
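A sketch of that check in Python: the regex below accepts either header syntax for “Idea” or “Definition”, allowing extra whitespace and optional trailing `#` marks. The exact set of header variants in use on the nLab is an assumption here:

```python
import re

# Matches either "## Idea" / "## Definition" (Markdown-style, any # level,
# optional closing #s) or "\section{Idea}" / "\section{Definition}".
HEADER_RX = re.compile(
    r"^\s*(?:#{1,6}\s*(?:Idea|Definition)\s*#*\s*$"
    r"|\\section\{\s*(?:Idea|Definition)\s*\})",
    re.MULTILINE,
)


def looks_like_concept_entry(source):
    """Heuristic: entry has an Idea or Definition section header."""
    return bool(HEADER_RX.search(source))


print(looks_like_concept_entry("## Idea\n\nAn adjoint functor is ..."))  # → True
print(looks_like_concept_entry("\\section{ Definition }\nLet C be ..."))  # → True
print(looks_like_concept_entry("See the references below."))  # → False
```

This is only a heuristic, per Urs’s suggestion: entries lacking such a header are probably not concept-entries, or should have one added.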
Thanks Urs, I will implement your suggestions in the reports within the next day or two. My current focus is on writing the most revealing reports I can, because what I am supporting is the software development of a research project. We have many questions and few solid, agreed-upon facts to work with. Very exploratory.
The complete lists RodMcGuire suggested are now at nLab report.