
Welcome to nForum
    • CommentRowNumber1.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 29th 2024
    Hello. I'm having an odd problem parsing the results of the web scraping. nLab advertises that it has 19193 entries, which agrees with the number of entries on the All Pages page. But the resulting file nlab_pages.json has over 28000 entries. I do have 743 pages whose names end with "> history" that are not listed on All Pages, but that only accounts for a small portion of the additional page entries. Does anyone have any insight into this?
    • CommentRowNumber2.
    • CommentAuthorUrs
    • CommentTimeApr 29th 2024

    Hi again.

    As I said (here) in your other thread (by the way, you don’t need to start a new thread for each issue in this context), for questions regarding the nLab’s server installation you’ll need to speak with our technical team.

    I’ll drop them a note.

    • CommentRowNumber3.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 29th 2024
    Thanks for the feedback. My work includes both technical and data science issues. The short-term focus of the Network Mathematics project I'm on is to provide answers to questions about nLab. The current nLab corpus is four years old, and as Valeria de Paiva says, the capability of open-source NLP tools is rapidly increasing. My job is to see what is currently possible. For example, conservatively speaking, how many concepts does nLab contain, excluding people? This is obviously a difficult question, but the technology to investigate such questions has improved considerably in the last four years. The long-term objective is to be able to ask questions spanning multiple resources like nLab, PlanetMath, Encyclopedia of Math, and MathWorld. WikiData has been selected as the central hub of the project. Ideally, WikiData's SPARQL facility would then allow queries involving things like the intersection and union of data across different sources of mathematical information.
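    To make the SPARQL idea concrete, here is a rough sketch of building that kind of cross-source query. The property ids P0000/P0001 are placeholders, not real WikiData properties; they stand in for whatever external-id properties would end up linking WikiData items to nLab and PlanetMath pages.

    ```python
    def union_query(prop_a, prop_b):
        """Build a SPARQL query for items carrying either of two external-id
        properties (the union of two source catalogues). prop_a and prop_b
        are WikiData property ids; the ones used below are placeholders."""
        return (
            "SELECT DISTINCT ?item WHERE {\n"
            f"  {{ ?item wdt:{prop_a} ?ida . }}\n"
            "  UNION\n"
            f"  {{ ?item wdt:{prop_b} ?idb . }}\n"
            "}"
        )

    # Union of items linked to two hypothetical source catalogues;
    # swapping UNION for a second joined triple pattern gives the intersection.
    print(union_query("P0000", "P0001"))
    ```

    The same skeleton with both triple patterns joined (no UNION) would give the intersection instead.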
    • CommentRowNumber4.
    • CommentAuthorstrsnd
    • CommentTimeApr 29th 2024
    • (edited Apr 29th 2024)

    Hi Daniel,

    [context: I help a bit on the server administration side of things here at nlab]

    Cool to see your project. Prepping up the nLab semantically could be a worthwhile endeavour.

    Can you help me understand what you need support or help with? As Urs already pointed out, the raw content, as well as all the HTML-rendered pages, is available on GitHub; the raw content can be parsed with a customized (itex2MML) Markdown parser, or you can parse the HTML pages instead.
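    For the raw-Markdown route, a rough sketch of pulling inline itex math out of nLab source with a regex. This is a simplification for illustration; a real itex2MML-aware parser handles display math, nesting, and escaping that this pattern does not.

    ```python
    import re

    # nLab source uses itex delimiters like $...$ for inline math.
    # Non-greedy match so adjacent fragments don't merge.
    ITEX_INLINE = re.compile(r"\$(.+?)\$")

    def inline_math(source):
        """Return the inline $...$ math fragments found in a Markdown string."""
        return ITEX_INLINE.findall(source)

    print(inline_math("A $G$-set is a set with an action of $G$."))  # ['G', 'G']
    ```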

    Some parts of the nlab wiki already have categories, which you might be able to make use of, e.g. for people.

    Sorry for the delay to reply. Please continue to shoot questions at us!

    Bye!

    • CommentRowNumber5.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 29th 2024
    Howdy, strsnd. I don't need support so much; this is more about good manners. I wanted folks to know I was using the software from the 2020 scraping of nLab to get an image of nLab. I think I already have the data I need. The target I've been given is a pipeline supporting the following technology, which is mainly Python-based:
    Scrapy -> LaTeXML -> spaCy -> CoNLL format -> conllu-stats.pl.
    Right now I'm focusing on using the JSON image of nLab to ask questions using regular expressions instead of SQL. If you have any questions, feel free to let me know.
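    A minimal sketch of that kind of regex query over the scraped JSON image. The record layout here (a list of objects with "name" and "content" keys) is an assumption standing in for whatever nlab_pages.json actually contains:

    ```python
    import re

    # Hypothetical page records mimicking the structure of nlab_pages.json;
    # the real file's schema may differ.
    pages = [
        {"name": "topos", "content": "A topos is a category that behaves like Set."},
        {"name": "group", "content": "A group is a monoid with inverses."},
        {"name": "Sandbox", "content": ""},
    ]

    def grep_pages(pages, pattern):
        """Return names of pages whose content matches the regex pattern."""
        rx = re.compile(pattern, re.IGNORECASE)
        return [p["name"] for p in pages if rx.search(p["content"])]

    print(grep_pages(pages, r"\bcategory\b"))  # ['topos']
    ```

    In practice the `pages` list would come from `json.load` over the scraped file.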
    • CommentRowNumber6.
    • CommentAuthorstrsnd
    • CommentTimeApr 30th 2024
    Awesome! Please keep us updated in case you publish something.
    • CommentRowNumber7.
    • CommentAuthorUrs
    • CommentTimeApr 30th 2024

    I don’t need support so much; this is more about good manners.

    In #1 there seemed to be the question of what to make of the entry count on the nLab.

    • CommentRowNumber8.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 30th 2024
    Yes, I believe I've resolved the discrepancy. The software I'm using didn't overwrite the previous scrape; it appended the new run to the contents of the JSON file it was building.
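    For anyone hitting the same appended-scrape issue, a sketch of deduplicating by page name, keeping only the most recent record for each page (the list-of-dicts record shape with a "name" key is again an assumption about the JSON file):

    ```python
    def dedupe_pages(records):
        """Keep the last-seen record for each page name.
        Dicts preserve insertion order, so first-seen name order is kept
        while later duplicates overwrite earlier ones."""
        latest = {}
        for rec in records:
            latest[rec["name"]] = rec
        return list(latest.values())

    # Two scrape runs appended into one file produce duplicate names:
    records = [
        {"name": "topos", "rev": 1},
        {"name": "group", "rev": 1},
        {"name": "topos", "rev": 2},  # the second run's copy wins
    ]
    print(len(dedupe_pages(records)))  # 2
    ```

    Running this over the full file should bring the count back down toward the advertised 19193.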
    • CommentRowNumber9.
    • CommentAuthorDaniel Geisler
    • CommentTimeMay 3rd 2024
    I'm getting close to having a new corpus built, but there are some pages that probably shouldn't be included: the AUTOMATH page, the sandbox pages, and about one hundred empty pages. The empty pages seem to be a new occurrence since the last corpus was built. Does this sound appropriate?
    • CommentRowNumber10.
    • CommentAuthorUrs
    • CommentTimeMay 3rd 2024
    • (edited May 3rd 2024)

    Not sure what you are asking (I don’t know what it means that “the last corpus was built”), but otherwise it does not sound inappropriate.

    As an aside, I can say that the pages with “empty” or “> history” in their title arise because there is no mechanism in our Instiki software for users to delete pages. Instead, there is a server-side command (which however has never been run so far) for administrators to delete all “orphaned” pages, namely those that receive no link from any other nLab page. Hence the closest users can get to deleting a page is to rename it to a title that is unlikely to be referenced anywhere.

    • CommentRowNumber11.
    • CommentAuthorDaniel Geisler
    • CommentTimeMay 3rd 2024
    Some background for the interested. In 2020 Valeria de Paiva, Jacob Collard and Evan Patterson built a 500M [NLP corpus](https://en.wikipedia.org/wiki/Text_corpus) of nLab, nlab.conll. This is the first step in integrating WikiData with nLab. Jacob has built Parmesan, which accesses nLab information from WikiData. See [Parmesan](https://www.youtube.com/watch?v=-ZhZjMn1Zpk). I am beginning to document the build process at [NetMath](https://github.com/DanielLeeGeisler/NetMath). One goal is to be able to create a corpus at will. nLab has 3K or 4K more entries than it did in 2020. I use nlab.conll to do [regular expression](https://en.wikipedia.org/wiki/Regular_expression) queries.

    Thanks for the information about history and empty pages. I'll get feedback on whether to capture the page names for the corpus.
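    As a rough illustration of what the CoNLL side of those queries looks like, here is a sketch that counts sentences and tokens in CoNLL-U text. The sample lines are invented, the real nlab.conll columns may differ, and conllu-stats.pl reports far more than this:

    ```python
    # Minimal invented CoNLL-U sample: '#' lines are comments, token lines are
    # tab-separated, and a blank line ends a sentence.
    sample = """\
    # sent_id = 1
    1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_
    2\ttopos\ttopos\tNOUN\t_\t_\t0\troot\t_\t_

    # sent_id = 2
    1\tGroups\tgroup\tNOUN\t_\t_\t0\troot\t_\t_
    """

    def conllu_counts(text):
        """Count sentences (blank-line separated) and token lines."""
        sentences, tokens = 0, 0
        in_sentence = False
        for line in text.splitlines():
            if not line.strip():
                if in_sentence:
                    sentences += 1
                    in_sentence = False
                continue
            if line.lstrip().startswith("#"):
                continue
            tokens += 1
            in_sentence = True
        if in_sentence:  # file need not end with a blank line
            sentences += 1
        return sentences, tokens

    print(conllu_counts(sample))  # (2, 3)
    ```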
    • CommentRowNumber13.
    • CommentAuthorzskoda
    • CommentTimeMay 6th 2024

    5242 pages are tagged with the people tag; judging by the frequency of creation I observe in Latest Pages, nearly half of these entries are rather recent, say from the last 2 years or so.

    • CommentRowNumber14.
    • CommentAuthorDaniel Geisler
    • CommentTimeMay 6th 2024
    I'm currently building an nLab corpus for 2024 to complement the corpus built in 2020. Not only are there many new entries, but a large number of edits have been made on the nLab. I hope to be able to provide hard answers to questions like how many new entries and how many edited entries there are from the last four years. I also hope to be able to identify people who do not have an nLab people page.