
Welcome to nForum
    • CommentRowNumber1.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 29th 2024
    Hello. I'm having an odd problem parsing the results of the web scraping. nLab advertises that it has 19193 entries, which agrees with the number of entries on the All Pages page. But the resulting file nlab_pages.json has over 28000 entries. I do have 743 pages whose names end with "> history" that are not listed on All Pages, but that only accounts for a small portion of the additional page entries. Does anyone have any insight into this?
    • CommentRowNumber2.
    • CommentAuthorUrs
    • CommentTimeApr 29th 2024

    Hi again.

    As I said (here) in your other thread (by the way, you don’t need to start a new thread for each issue in this context), for questions regarding the nLab’s server installation you’ll need to speak with our technical team.

    I’ll drop them a note.

    • CommentRowNumber3.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 29th 2024
    Thanks for the feedback. My work includes both technical and data science issues. The short-term focus of the Network Mathematics project I'm on is to provide answers to questions about nLab. The current nLab corpus is four years old, and as Valeria de Paiva says, the capability of open-source NLP tools is rapidly increasing. My job is to see what is currently possible. For example, conservatively speaking, how many concepts does nLab contain, excluding people? This is obviously a difficult question, but the technology to investigate such questions has improved considerably in the last four years. The long-term objective is to be able to ask questions spanning multiple resources like nLab, PlanetMath, Encyclopedia of Math, and MathWorld. WikiData has been selected as the central hub of the project. Ideally, WikiData's SPARQL facility would then allow queries involving things like the intersection and union of data across different sources of mathematical information.
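    To make the SPARQL idea concrete, here is a rough sketch of building that kind of cross-source query. The property ids P0000/P0001 are placeholders, not real WikiData properties; they stand in for whatever external-id properties would end up linking WikiData items to nLab and PlanetMath pages.

    ```python
    def union_query(prop_a, prop_b):
        """Build a SPARQL query for items carrying either of two external-id
        properties (the union of two source catalogues). prop_a and prop_b
        are WikiData property ids; the ones used below are placeholders."""
        return (
            "SELECT DISTINCT ?item WHERE {\n"
            f"  {{ ?item wdt:{prop_a} ?ida . }}\n"
            "  UNION\n"
            f"  {{ ?item wdt:{prop_b} ?idb . }}\n"
            "}"
        )

    # Union of items linked to two hypothetical source catalogues;
    # swapping UNION for a second joined triple pattern gives the intersection.
    print(union_query("P0000", "P0001"))
    ```

    The same skeleton with both triple patterns joined (no UNION) would give the intersection instead.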
    • CommentRowNumber4.
    • CommentAuthorstrsnd
    • CommentTimeApr 29th 2024
    • (edited Apr 29th 2024)

    Hi Daniel,

    [context: I help a bit on the server administration side of things here at nlab]

    Cool to see your project. Prepping up the nLab semantically could be a worthwhile endeavour.

    Can you help me understand what you need support or help with? As Urs already pointed out, the raw content, as well as all the HTML-rendered pages, is available on GitHub; the raw content can be parsed with a customized (itex2MML) Markdown parser, or you can parse the HTML pages instead.
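    For the raw-Markdown route, a rough sketch of pulling inline itex math out of nLab source with a regex. This is a simplification for illustration; a real itex2MML-aware parser handles display math, nesting, and escaping that this pattern does not.

    ```python
    import re

    # nLab source uses itex delimiters like $...$ for inline math.
    # Non-greedy match so adjacent fragments don't merge.
    ITEX_INLINE = re.compile(r"\$(.+?)\$")

    def inline_math(source):
        """Return the inline $...$ math fragments found in a Markdown string."""
        return ITEX_INLINE.findall(source)

    print(inline_math("A $G$-set is a set with an action of $G$."))  # ['G', 'G']
    ```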

    Some parts of the nlab wiki already have categories, which you might be able to make use of, e.g. for people.

    Sorry for the delay to reply. Please continue to shoot questions at us!

    Bye!

    • CommentRowNumber5.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 29th 2024
    Howdy, strsnd. I don't need support so much; this is more about good manners. I wanted folks to know I was using the software from the 2020 scraping of nLab to get an image of nLab. I think I already have the data I need. The target I've been given is a pipeline supporting the following technology, which is mainly Python-based:
    Scrapy -> LaTeXML -> spaCy -> CoNLL format -> conllu-stats.pl.
    Right now I'm focusing on using the JSON image of nLab to ask questions using regular expressions instead of SQL. If you have any questions, feel free to let me know.
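    A minimal sketch of that kind of regex query over the scraped JSON image. The record layout here (a list of objects with "name" and "content" keys) is an assumption standing in for whatever nlab_pages.json actually contains:

    ```python
    import re

    # Hypothetical page records mimicking the structure of nlab_pages.json;
    # the real file's schema may differ.
    pages = [
        {"name": "topos", "content": "A topos is a category that behaves like Set."},
        {"name": "group", "content": "A group is a monoid with inverses."},
        {"name": "Sandbox", "content": ""},
    ]

    def grep_pages(pages, pattern):
        """Return names of pages whose content matches the regex pattern."""
        rx = re.compile(pattern, re.IGNORECASE)
        return [p["name"] for p in pages if rx.search(p["content"])]

    print(grep_pages(pages, r"\bcategory\b"))  # ['topos']
    ```

    In practice the `pages` list would come from `json.load` over the scraped file.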
    • CommentRowNumber6.
    • CommentAuthorstrsnd
    • CommentTimeApr 30th 2024
    Awesome! Please keep us updated in case you publish something.
    • CommentRowNumber7.
    • CommentAuthorUrs
    • CommentTimeApr 30th 2024

    I don’t need support so much; this is more about good manners.

    In #1 there seemed to be the question of what to make of the entry count on the nLab.

    • CommentRowNumber8.
    • CommentAuthorDaniel Geisler
    • CommentTimeApr 30th 2024
    Yes, I believe I've resolved the discrepancy. The software I'm using didn't overwrite the previous scrape; it appended the new run to the contents of the JSON file it was building.
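    For anyone hitting the same appended-scrape issue, a sketch of deduplicating by page name, keeping only the most recent record for each page (the list-of-dicts record shape with a "name" key is again an assumption about the JSON file):

    ```python
    def dedupe_pages(records):
        """Keep the last-seen record for each page name.
        Dicts preserve insertion order, so first-seen name order is kept
        while later duplicates overwrite earlier ones."""
        latest = {}
        for rec in records:
            latest[rec["name"]] = rec
        return list(latest.values())

    # Two scrape runs appended into one file produce duplicate names:
    records = [
        {"name": "topos", "rev": 1},
        {"name": "group", "rev": 1},
        {"name": "topos", "rev": 2},  # the second run's copy wins
    ]
    print(len(dedupe_pages(records)))  # 2
    ```

    Running this over the full file should bring the count back down toward the advertised 19193.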
    • CommentRowNumber9.
    • CommentAuthorDaniel Geisler
    • CommentTimeMay 3rd 2024
    I'm getting close to having a new corpus built, but there are some pages that probably shouldn't be included: the AUTOMATH page, the sandbox pages, and about one hundred empty pages. The empty pages seem to be a new occurrence since the last corpus was built. Does this sound appropriate?
    • CommentRowNumber10.
    • CommentAuthorUrs
    • CommentTimeMay 3rd 2024
    • (edited May 3rd 2024)

    Not sure what you are asking (I don’t know what it means that “the last corpus was built”), but otherwise it does not sound inappropriate.

    As an aside, I can say that the pages with “empty” or “> history” in their title arise because there is no mechanism in our Instiki software for users to delete pages. Instead, there is a server-side command (which however has never been run so far) for administrators to delete all “orphaned” pages, namely those that receive no link from any other nLab page. Hence the closest users can get to deleting a page is to rename it to a title that is unlikely to be referenced anywhere.

    • CommentRowNumber11.
    • CommentAuthorDaniel Geisler
    • CommentTimeMay 3rd 2024
    Some background for the interested. In 2020 Valeria de Paiva, Jacob Collard and Evan Patterson built a 500M [NLP corpus](https://en.wikipedia.org/wiki/Text_corpus) of nLab, nlab.conll. This is the first step in integrating WikiData with nLab. Jacob has built Parmesan, which accesses nLab information from WikiData. See [Parmesan](https://www.youtube.com/watch?v=-ZhZjMn1Zpk). I am beginning to document the build process at [NetMath](https://github.com/DanielLeeGeisler/NetMath). One goal is to be able to create a corpus at will. nLab has 3K or 4K more entries than it did in 2020. I use nlab.conll to do [regular expression](https://en.wikipedia.org/wiki/Regular_expression) queries.

    Thanks for the information about history and empty pages. I'll get feedback on whether to capture the page names for the corpus.
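    As a rough illustration of what the CoNLL side of those queries looks like, here is a sketch that counts sentences and tokens in CoNLL-U text. The sample lines are invented, the real nlab.conll columns may differ, and conllu-stats.pl reports far more than this:

    ```python
    # Minimal invented CoNLL-U sample: '#' lines are comments, token lines are
    # tab-separated, and a blank line ends a sentence.
    sample = """\
    # sent_id = 1
    1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_
    2\ttopos\ttopos\tNOUN\t_\t_\t0\troot\t_\t_

    # sent_id = 2
    1\tGroups\tgroup\tNOUN\t_\t_\t0\troot\t_\t_
    """

    def conllu_counts(text):
        """Count sentences (blank-line separated) and token lines."""
        sentences, tokens = 0, 0
        in_sentence = False
        for line in text.splitlines():
            if not line.strip():
                if in_sentence:
                    sentences += 1
                    in_sentence = False
                continue
            if line.lstrip().startswith("#"):
                continue
            tokens += 1
            in_sentence = True
        if in_sentence:  # file need not end with a blank line
            sentences += 1
        return sentences, tokens

    print(conllu_counts(sample))  # (2, 3)
    ```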
    • CommentRowNumber13.
    • CommentAuthorzskoda
    • CommentTimeMay 6th 2024

    5242 pages are tagged with the people tag; judging by the frequency of creation I observe in Latest Pages, nearly half of these entries are rather recent, say from the last 2 years or so.

    • CommentRowNumber14.
    • CommentAuthorDaniel Geisler
    • CommentTimeMay 6th 2024
    I'm currently building an nLab corpus for 2024 to complement the corpus built in 2020. Not only are there many new entries, but a large number of edits have been made on the nLab. I hope to be able to provide hard answers to questions like how many new entries and how many edited entries there are from the last four years. I also hope to be able to identify people who do not have an nLab people page.