    • CommentRowNumber1.
    • CommentAuthordeusexbit
    • CommentTimeDec 21st 2023
    • (edited Dec 21st 2023)

    Hello everyone. This is my first post on this forum. I was thrilled to find the nLab, as it’s the first time I’ve encountered a group dedicated to “that thing I’ve noticed in math and physics that I can’t quite find the name for”. Turns out the name is higher category theory / cohomology. While I’m only just caught up on the prerequisites for this field of study, I’m excited to dive in.

    But getting to the point: while I’ve been studying math and physics for about 20 years now, my formal education is in computer science, so machine learning has naturally caught my attention in recent years. I noticed that your forum’s search appears to rely on substring matching. I propose indexing the content for semantic search via AI embeddings. For those unfamiliar: an embedding converts a piece of text into a unit vector in a high-dimensional space (1536 dimensions, for instance, with OpenAI’s current embedding model). The training objective makes semantic relationships geometric; classically, word-level embeddings satisfy approximate identities like “king” - “man” + “woman” ≈ “queen” (before normalization). One of the key benefits is in search: you compute the embedding of a search query and return results ranked by the dot product between the query embedding and each text embedding in the index.
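
    To make the search step concrete, here is a minimal sketch of the ranking logic; the URLs, dimension, and random vectors below are placeholders standing in for a real precomputed index:

        import numpy as np

        # Placeholder index: in practice, one precomputed unit vector per nLab page.
        urls = ["/nlab/show/cohomology", "/nlab/show/sheaf", "/nlab/show/group"]
        vectors = np.random.default_rng(0).normal(size=(3, 1536))
        vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

        def search(query_vec, k=2):
            """Rank pages by dot product with the query embedding.
            With unit-normalized vectors this equals cosine similarity."""
            query_vec = query_vec / np.linalg.norm(query_vec)
            scores = vectors @ query_vec
            top = np.argsort(scores)[::-1][:k]
            return [(urls[i], float(scores[i])) for i in top]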

    It’s a tool that has been somewhat swept under the rug, since everyone tends to focus on the generative side of NLP, but it could be very useful in this context for building a graph of concepts. Embeddings are not only useful for search, either; they also lend themselves to clustering and categorization, as the sketch below illustrates.
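
    For instance, grouping pages into rough concept clusters by embedding proximity takes a few lines with scikit-learn (again with placeholder vectors):

        import numpy as np
        from sklearn.cluster import KMeans

        # Placeholder embeddings; in practice, one vector per nLab page.
        vectors = np.random.default_rng(0).normal(size=(100, 1536))
        vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

        # Assign each page to one of 10 rough concept clusters.
        labels = KMeans(n_clusters=10, n_init=10).fit_predict(vectors)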

    My proposal is that, with the help of the administrators, I could build a search index of embeddings that would let users find ideas related to their query, even when those ideas are stated in completely different wording. This is referred to as “semantic search”, and I think it could be of great benefit to this effort.

    Just an idea I had while diving in here. I look forward to seeing where nLab goes.

    • CommentRowNumber2.
    • CommentAuthorUrs
    • CommentTimeDec 21st 2023

    Welcome.

    You write:

    My proposal is that, with the help of the administrators,

    What help would you need?

    Let me highlight that all the source code of nLab pages is publicly available.

    For instance, the source code for the entry cohomology is at the URL

      https://ncatlab.org/nlab/source/cohomology
    
    • CommentRowNumber3.
    • CommentAuthordeusexbit
    • CommentTimeDec 21st 2023
    • (edited Dec 21st 2023)

    Well, I suppose I would simply need an understanding of how things are laid out. Ultimately, I need to compile a DB of the wiki with fields “url” and “content” (and perhaps “title”, though I assume that could easily be extracted from the content). The raw material for this project comes down to an index of the topics/urls; I can scrape based on that to create a local DB to work from. From there, it’s pretty much just me building something in the hope that you would use it (or at least confirm that you’re okay with me hosting it as an independent search engine that links back here, in the event that you don’t want to integrate it directly).
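
    Schematically, the local DB I have in mind is nothing more than the following (the file name and schema are just a sketch):

        import sqlite3

        con = sqlite3.connect("nlab.db")
        con.execute(
            """CREATE TABLE IF NOT EXISTS pages (
                   url     TEXT PRIMARY KEY,
                   title   TEXT,
                   content TEXT
               )"""
        )
        con.commit()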

    • CommentRowNumber4.
    • CommentAuthorDmitri Pavlov
    • CommentTimeDec 22nd 2023

    Re #3: The git repository for the nLab is available here:

    https://github.com/ncatlab/nlab-content.git

    and the compiled HTML pages are available here:

    https://github.com/ncatlab/nlab-content-html.git
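
    (Given a local clone of the first repository, filling a pages table like the one sketched above might look as follows; the directory layout and file extension assumed here should be checked against the actual repository.)

        from pathlib import Path
        import sqlite3

        # Assumes: git clone https://github.com/ncatlab/nlab-content.git
        # One source file per page is an assumption; check the repo layout.
        con = sqlite3.connect("nlab.db")
        for src in Path("nlab-content").rglob("*.md"):
            con.execute(
                "INSERT OR REPLACE INTO pages (url, title, content) VALUES (?, ?, ?)",
                ("https://ncatlab.org/nlab/show/" + src.stem, src.stem,
                 src.read_text(errors="replace")),
            )
        con.commit()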

    • CommentRowNumber5.
    • CommentAuthorUrs
    • CommentTimeDec 22nd 2023

    Here is a reaction from a member of the nLab’s technical team:

    Can they write a proposal which could be reviewed here?

    Using embeddings could certainly improve search on the website, but the embedding index comes with a strict dependence on the AI model used and on the longevity of the AI service providing it, plus possibly a cost for every search request (to compute the query embedding) and an upfront cost to compute the initial embeddings.

    Using local AI models would require checking whether we have the necessary compute power, what the resulting costs would be, etc., plus perhaps more quality assurance.

    My fear is that such a service requires more hand-holding than the full-text search features of the databases we already use. It would need someone committed to the project for the long term.

    • CommentRowNumber6.
    • CommentAuthorNikolajK
    • CommentTimeDec 22nd 2023
    • (edited Dec 22nd 2023)

    Relatedly, I’m reminded of the mLab (github, nCafe post).

    • CommentRowNumber7.
    • CommentAuthordeusexbit
    • CommentTimeDec 22nd 2023

    These are valid concerns, but in the worst case you simply fall back on what you already have. The cost is negligible (unless your traffic is orders of magnitude higher than I suspect) and something I would be taking on. There is indeed a dependence on the AI model used; I run local models, but I’m not sure I want to dedicate my GPU to this, and the latency would be absurd anyway.

    I suppose let’s table this, and I’ll come back to it if I have something more substantial to discuss. I just wanted to throw the idea out there.

    • CommentRowNumber8.
    • CommentAuthorstrsnd
    • CommentTimeDec 22nd 2023
    Urs quoted my reply above when discussing your proposal on our mailing list.

    My request for a more "formal" proposal was exactly about specifying hosting, operations, fallbacks, and maintainership, and how this service would be affiliated with ncatlab. Regarding costs: sure, the costs are negligible, but the actual problem might be that ncatlab does not have a long-term usable credit card to top up accounts. Maybe someone else knows better? If the service is truly remarkable and provides value, I guess we can't simply fall back without users complaining, so at some point we would be forced to adopt it. ;)

    I absolutely don't want to stop you from working on this, and maybe we can find a model that works. I guess (I have to verify with the other admins) it would be possible to set up a virtual host on our infrastructure and handle this as an external service on a subdomain? Depending on the load, we could probably also host it on the ncatlab servers. For that, it would be useful to see a demo of the proposed search functionality.

    In any case, all content should be readily available for you to process already, as Urs mentioned. Otherwise, please ask!

    PS: I have been playing with and enjoying datasette [1], which provides tooling to enrich databases with embeddings [2] and search capabilities. A lot of fun! (A sketch of the embedding step follows after the links.)

    [1] https://datasette.io/
    [2] https://datasette.io/plugins/datasette-openai
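
    (For a sense of what the embedding step involves, here is a minimal sketch against OpenAI's ada-002 embedding endpoint, reusing the hypothetical pages table from earlier in the thread; the truncation is deliberately crude.)

        import sqlite3
        import struct

        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        con = sqlite3.connect("nlab.db")
        con.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (url TEXT PRIMARY KEY, vector BLOB)"
        )

        for url, content in con.execute("SELECT url, content FROM pages").fetchall():
            resp = client.embeddings.create(
                model="text-embedding-ada-002",  # 1536-dim, unit-normalized vectors
                input=content[:8000],            # crude cap to stay under the token limit
            )
            vec = resp.data[0].embedding
            con.execute(
                "INSERT OR REPLACE INTO embeddings VALUES (?, ?)",
                (url, struct.pack(f"{len(vec)}f", *vec)),
            )
        con.commit()
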
    • CommentRowNumber9.
    • CommentAuthordeusexbit
    • CommentTimeDec 22nd 2023

    Hi @strsnd! (Not sure if this forum supports “mentioning”.) I appreciate your input on the matter; these are all important considerations. I hadn’t considered the “forced to adopt it ;)” aspect, but that’s a good point. I agree that a proof of concept is the natural next step. I found the wiki’s page index and had to eat my words, because I was not expecting 18,000+ articles! You guys have been busy! haha. That’s still a manageable amount, but definitely something that requires a more careful approach.

    That said, I think these tools might prove valuable to an effort with so many interrelated concepts expressed in so many different ways.

    I’m still on the fence about this project and made this post mainly to get first impressions of the method. But I completely agree: the next step is a demo on a separate domain that enables semantic search against vector/url pairs generated from a modest subset of articles (both closely and loosely related), to see how it handles the nature of the content. While these models are very effective on English prose, I’m skeptical about their capacity to handle TeX and the like.
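
    Concretely, the demo’s query path would be little more than this (reusing the hypothetical embeddings table from above; ada-002 vectors are unit length, so the dot product is cosine similarity):

        import sqlite3
        import struct

        import numpy as np
        from openai import OpenAI

        client = OpenAI()
        con = sqlite3.connect("nlab.db")

        def semantic_search(query, k=5):
            """Embed the query, then rank indexed pages by dot product."""
            q = np.array(
                client.embeddings.create(
                    model="text-embedding-ada-002", input=query
                ).data[0].embedding
            )
            scored = []
            for url, blob in con.execute("SELECT url, vector FROM embeddings"):
                v = np.array(struct.unpack(f"{len(blob) // 4}f", blob))
                scored.append((url, float(v @ q)))
            return sorted(scored, key=lambda r: -r[1])[:k]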

    Re: PS: I hadn’t heard of datasette. Clever name. I’ll take a look; it sounds interesting.

    Thanks again for your time and I’ll let you know if I follow through with this prospect. Sounds like it’s hypothetically viable but contingent upon its usefulness as determined by a proof of concept / demo / spot-check.

    • CommentRowNumber10.
    • CommentAuthordeusexbit
    • CommentTimeDec 22nd 2023

    Afterthought on being “skeptical about its capacity to handle TeX”: I think this is where open-source models may become necessary. Several were trained on research papers in math and physics, and such a corpus may well be required to handle this content effectively. The overall cost/benefit of such an undertaking would be hard to establish without further investigation.
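
    As a baseline, running an open model locally is straightforward with sentence-transformers; the general-purpose checkpoint named below is just a stand-in, and a math/physics-trained one could be swapped in:

        from sentence_transformers import SentenceTransformer

        # General-purpose open model; swap in a math/physics-trained
        # checkpoint by changing the name.
        model = SentenceTransformer("all-MiniLM-L6-v2")
        vecs = model.encode(
            ["A sheaf is a presheaf satisfying a gluing condition.",
             "Cohomology measures the failure of a sequence to be exact."],
            normalize_embeddings=True,
        )
        print(vecs.shape)  # (2, 384)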

    • CommentRowNumber11.
    • CommentAuthordeusexbit
    • CommentTimeDec 22nd 2023

    Pardon the third consecutive comment here, but I’m afraid I missed #4 in this conversation. That’s a great help. Thank you, Dmitri.

    • CommentRowNumber12.
    • CommentAuthorUrs
    • CommentTimeDec 22nd 2023

    Probably comment #6 was meant to address the issue of handling TeX:

    The mLab is an example of some kind of AI playing with the full nLab content, including TeX equations.

    (Or so it seems; I am not expert enough to judge what is really going on.)

    • CommentRowNumber13.
    • CommentAuthorstrsnd
    • CommentTimeDec 23rd 2023
    A few ideas regarding TeX:

    arXiv started offering papers in HTML a few days ago [1]; maybe by reusing their parser one could convert TeX into something more sensible for generating embeddings (a much cruder stopgap is sketched after the links below).

    I would also have a look at whether langchain [2] supports TeX (though I had no good experience with langchain a while ago and find the concept dubious; I only use it for inspiration, sometimes...).

    [1] https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv-now-offers-papers-in-html-format/
    [2] https://github.com/langchain-ai/langchain
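
    (The cruder stopgap mentioned above would be to drop the math outright before embedding and rely on the surrounding prose; a sketch, assuming dollar-sign math delimiters:)

        import re

        def strip_math(src):
            """Crudely remove display and inline math from page source,
            keeping the surrounding English as the semantic signal."""
            src = re.sub(r"\$\$.*?\$\$", " ", src, flags=re.S)  # display math
            src = re.sub(r"\$[^$\n]*\$", " ", src)              # inline math
            return src
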
    • CommentRowNumber14.
    • CommentAuthordeusexbit
    • CommentTimeDec 23rd 2023

    Just a heads-up in case you’re unfamiliar: http://llamahub.ai

    It seems that’s more in line with simple parsing and chunking. Langchain is an interesting tool, but it seems it would only be useful if we were looking to build a generative AI; for generating embedding vectors for search, the llama loaders seem more appropriate. They have a PDF loader, but I think that would ultimately just be moving the goalposts. I don’t know… the TeX might ultimately be unnecessary to the efficacy of the approach: the English around it alone should provide enough “semantic anchors” to enable effective search. I would probably want to take the “hypothetical document” (HyDE) approach.
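
    For reference, the “hypothetical document” trick amounts to embedding a generated draft answer instead of the raw question; a sketch against OpenAI’s API, with the model names only as examples:

        from openai import OpenAI

        client = OpenAI()

        def hyde_query_vector(question):
            # Draft a hypothetical passage that answers the question...
            draft = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{
                    "role": "user",
                    "content": "Write a short encyclopedia-style passage "
                               "answering: " + question,
                }],
            ).choices[0].message.content
            # ...then embed the draft, which lies closer in embedding space
            # to the indexed articles than the bare question does.
            return client.embeddings.create(
                model="text-embedding-ada-002", input=draft
            ).data[0].embedding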