Welcome to nForum
    • CommentRowNumber1.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 22nd 2024
    Hello, my name is Daniel Geisler and I'm new here. I'm providing software support to Valeria de Paiva. I'm working with the nLab corpus from a couple of years ago, but nLab is 90% larger now, so I'm using Python software to scrape nLab and build a current corpus. This is just a heads-up that I am scraping nLab. I'll try to access nLab with a short list of pages for debugging, but I just had to run up to XYZ before I generated an error.
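
A minimal sketch of the kind of polite scraping loop described above, not the actual script used here. The URL pattern in `BASE_URL`, the `User-Agent` string, and the helper names `page_url`, `http_get`, and `scrape` are all assumptions made for illustration. The loop pauses between requests and records failures instead of stopping at the first error, which helps when debugging against a short list of pages.

```python
import time
import urllib.parse
import urllib.request

# Assumed URL pattern for nLab pages; it is not stated in this thread.
BASE_URL = "https://ncatlab.org/nlab/show/"


def page_url(name):
    """Build the URL for a page name, percent-encoding spaces and the like."""
    return BASE_URL + urllib.parse.quote(name)


def http_get(url):
    """Download one URL and return its body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "nlab-corpus-sketch"}  # hypothetical agent name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(names, fetch=http_get, delay_seconds=1.0):
    """Fetch each page in turn, collecting failures instead of stopping."""
    pages, errors = {}, {}
    for index, name in enumerate(names):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to spare the server
        try:
            pages[name] = fetch(page_url(name))
        except Exception as exc:
            errors[name] = repr(exc)  # remember which page failed and why
    return pages, errors
```

Passing a stand-in function as `fetch` allows a dry run on a short page list without touching the server.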
    • CommentRowNumber2.
    • CommentAuthorUrs
    • CommentTimeApr 22nd 2024

    Hi Daniel,

    thanks for writing in; sounds interesting.

    In case there is anything concerning the nLab’s server, let me know and I can put you in contact with our technical team.

    • CommentRowNumber3.
    • CommentAuthorDmitri Pavlov
    • CommentTimeApr 22nd 2024

    Re #1: A repository with the source code of all nLab pages is available here:, and a repository with the compiled HTML code of all nLab pages is available here:

    • CommentRowNumber4.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 24th 2024
    Thank you, Dmitri. I was able to scrape nLab a couple of days ago by using the Python code from the previous scrape. The next task is to get a version of LaTeXML running on my computer and then use spaCy to build the corpus CoNLL file. Then I can run statistics using UD.
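
To illustrate the corpus file step, here is a rough sketch that writes one parsed sentence as tab-separated lines. It assumes the target is the ten-column CoNLL-U layout used by Universal Dependencies, which this thread does not confirm, and it assumes the parser output has already been copied into plain dictionaries. The function name `to_conllu` and the keys `form`, `lemma`, `upos`, `head`, and `deprel` are hypothetical; in practice the values would come from a parser such as spaCy.

```python
def to_conllu(tokens):
    """Render one sentence as CoNLL-U text: ten tab-separated columns per token."""
    lines = []
    for position, token in enumerate(tokens, start=1):
        columns = [
            str(position),                # ID: 1-based index within the sentence
            token["form"],                # FORM: the surface word
            token.get("lemma", "_"),      # LEMMA
            token.get("upos", "_"),       # UPOS: universal part-of-speech tag
            "_",                          # XPOS: left unspecified in this sketch
            "_",                          # FEATS: left unspecified in this sketch
            str(token.get("head", "_")),  # HEAD: ID of the governing token, 0 for the root
            token.get("deprel", "_"),     # DEPREL: dependency relation to the head
            "_",                          # DEPS: left unspecified in this sketch
            "_",                          # MISC: left unspecified in this sketch
        ]
        lines.append("\t".join(columns))
    # A blank line terminates each sentence.
    return "\n".join(lines) + "\n\n"
```

Missing fields are written as underscores, so a partially annotated sentence still yields a well-formed row of ten columns.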
    • CommentRowNumber5.
    • CommentAuthorzskoda
    • CommentTimeApr 24th 2024

    What does “to scrape” mean in this context?

    • CommentRowNumber6.
    • CommentAuthorUrs
    • CommentTimeApr 24th 2024

    Wikipedia: Web scraping

    • CommentRowNumber7.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 27th 2024
    I need to create documentation on how to build an nLab corpus from scratch, covering both an overview of the project and software installation and configuration. If I created an nLab corpus page or category on nLab, then other interested parties could contribute information or questions. Does this sound like a good approach?