Welcome to nForum
    • CommentRowNumber1.
    • CommentAuthorRichard Williamson
    • CommentTimeJul 16th 2018
    • (edited Jul 16th 2018)

    I have made a start on one of the Technical TODO list (nlabmeta) items, namely to create a dashboard with some statistics about the nLab, for example number of page views. I am not yet at the stage where we can get graphs of page views, but just to illustrate it a bit, here is a ’snapshot’ of how it looks at the moment.

    https://ncatlab.org/grafana/dashboard/snapshot/a4K1AOyDB0YDq1tQv2RlwprCxsKWW3QB

    There is one statistic there, which shows the number of page edits/creations over the last 24 hours. This snapshot will be deleted after 7 days from now.

    The dashboard tool is called Grafana. It is extremely powerful and flexible. In the snapshot, you cannot do much except look at the statistic, but the actual dashboard is live (though password protected), and one can create all kinds of graphs, change the time range over which one searches, etc.

    To get statistics on page views, since we do not have any metrics inside Instiki itself, probably the quickest will be to use the nginx logs. But this will require a bit more infrastructure, so will take a bit more time.

    The statistic that is there uses the nLab database as its ’data source’.

    For now, let me know what kind of graphs/statistics you would be looking for, and I’ll see what I can do to create them.

    Probably in the end we’ll host daily snapshots of the dashboard which anybody can see, and certain people will be given access to the live dashboard.

    • CommentRowNumber2.
    • CommentAuthorRichard Williamson
    • CommentTimeJul 18th 2018
    • (edited Jul 18th 2018)

    I have now done some more work on putting in place infrastructure for analytics. The short message is that the dashboard now shows something more useful, namely page views, both graphed and total. I have created users now for Urs and Mike to take a look, and will send the login details over email. You will have ’viewer’ permissions, which means you cannot create any graphs, etc, but you at least see the dashboard updating in real time, and you can change the time frame for the search.

    If anyone else wishes to take a look, let me know, and I will create a user for you. Otherwise, here is a snapshot of how it looks for the last 7 days (excluding today). The x-axis interval for the page views graphs is 10 minutes, i.e. the y value is the number of page views in that 10 minute interval.

    So far, to avoid blowing things up, I have imported only the last 8 or so days of data. The graph looks a bit weird for today, but I think this is due to the way that the log rotation works; part of the data has not been processed yet. I guess that all will be fine from sometime tomorrow (UTC).

    In case you would like to know the technical details, there were quite a few layers involved in setting this up. The main goal was to get the nginx logs into something called ’Elasticsearch’ in a nice format. This is achieved by a couple of other tools: ’Filebeat’ and ’Logstash’. Grafana (the dashboard) then queries Elasticsearch to get the data about page views from the nginx logs (to be precise, it matches on GET calls to /nlab/show which have a 200 response).
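    To make the matching rule concrete, here is a minimal Python sketch of the same count done directly on raw log lines: it keeps GET requests to /nlab/show with a 200 response and groups them into 10 minute buckets. It assumes the default nginx 'combined' log format, which may not be what the nLab server actually writes, and the function names are made up for illustration. It is not the Filebeat/Logstash/Elasticsearch pipeline itself.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumption: the default nginx "combined" log format. The real nLab log
# layout may differ; adjust the pattern accordingly.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def is_page_view(method, path, status):
    """A page view is a GET call to /nlab/show that got a 200 response."""
    return method == "GET" and path.startswith("/nlab/show") and status == "200"


def bucket_start(ts, minutes=10):
    """Round a timestamp down to the start of its bucket."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def page_views_per_bucket(lines, minutes=10):
    """Count page views per time bucket; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        if not is_page_view(m["method"], m["path"], m["status"]):
            continue
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        counts[bucket_start(ts, minutes)] += 1
    return counts
```

    Each key of the result is the start of a 10 minute interval and each value is the y value of the graph for that interval.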

    Elasticsearch has a search API, but if one wishes to dig deeper and search the logs, one usually uses yet another tool, typically ’Kibana’ or ’Graylog’. I will look into that when I get the chance: we can already see some spikes in the page views graphs that should be investigated, for example.

    • CommentRowNumber3.
    • CommentAuthorRodMcGuire
    • CommentTimeJul 19th 2018

    I tried the snapshot. While there is a way to “zoom out” the time series (magnifying glass icon), I couldn’t find any way to zoom in. Reloading the page in place remembers the zoom level rather than resetting it; however, closing the page and then reopening it “resets” the zoom level to the full week.

    • CommentRowNumber4.
    • CommentAuthorUrs
    • CommentTimeJul 19th 2018

    Thanks, Richard! A naive question: Why does viewing the dashboard information require accounts and password protection? Could this not just be made publicly available?

    • CommentRowNumber5.
    • CommentAuthorRichard Williamson
    • CommentTimeJul 19th 2018
    • (edited Jul 19th 2018)

    Good question, Urs! My original reason was that these tools are very powerful, and need to be used responsibly. Once one has defined a graph, it updates in real time, every so often (one can set the refresh interval). This update involves querying ElasticSearch or the nLab database for each counter/graph. Now, if one is a bit naïve, one might choose a long time period to analyse with a short refresh interval. This is going to put a very heavy load on the nLab server. The same happens if many people have the dashboard open. So if the dashboard were completely unprotected, we would be exposing ourselves to quite a bit of risk that the whole nLab might be brought down (often ElasticSearch and Grafana would be on a different server from the web server to mitigate this problem, but we obviously do not at the moment have this option).

    However, after seeing your question and thinking about it some more, and since the nLab has happily not really been targeted by this kind of attack so far, maybe I was being overcautious. I think we can open it to anybody who is sufficiently motivated, in the following way. Anybody can use username

    nlab

    and password

    MmQ2ZGFmMDJhZGI4MzAzMjgxOTA4YmM3

    to login at the following link.

    https://ncatlab.org/grafana

    Just give it a go! This user has viewer permissions, so you can adjust the timeframe as you wish and see it update in realtime, but cannot create any graphs, etc. Let me know if you need any guidance. As above, just be a bit responsible: it is OK to use large timeframes, but do so sparingly, e.g. adjust back to a shorter timeframe once you have found the info you need, or use a long refresh interval.

    Let’s maybe not distribute this username and password too widely, just spread it by ’word of mouth’. I assume most regulars will see this message and be able to write down the password if they are interested. Please do not change the password (once you have logged in once, it will be cached anyway).

    I have changed Urs and Mike to admins now, which I think means they should have the same permissions as me: they can add users, and do what they like with graphs, including creating them. I will add Adeel as well. If anyone else would like to be an admin, just let me know. There is also an ’editor’ role, where one can create graphs but not do things like add users. If you would like to add a graph or change one of the existing ones, it would be great if you ran it by me here first. You will probably need to know what the JSON structure of the nginx logs in ElasticSearch looks like.

    The graph for the 18th is still weird; it is a consequence of the fact that I was ingesting old logs at the same time as the live ones were coming in. I will fix it (re-ingest the ElasticSearch data for the 18th) when I have a moment.

  1. And as requested in #1, let me know what you’d like to see on the dashboard! I have a few ideas, but am likely to implement a request more quickly!

  2. Re #3: thanks for trying the snapshot, Rod! That behaviour seems a bit unintuitive, I agree! You can change the timerange by clicking on the description of the range (to the right of the magnifying glass). The zoom out seems to be just a kind of shortcut for that (I don’t think I had ever used it before!).

    (Perhaps I should say, if it was not clear, that the dashboard is not written by me! It is a widely used tool in the software industry. I have just set it up on the nLab server with the appropriate config, and defined the counters/graph you see).

  3. I have begun re-ingesting the data for the 18th into ElasticSearch. It will take a little while before it’s finished; you will see a gap in the grafana page views graph until then.

    • CommentRowNumber9.
    • CommentAuthorUrs
    • CommentTimeJul 20th 2018
    • (edited Jul 20th 2018)

    Thanks, Richard. I see, so using the dashboard does affect the server, I hadn’t guessed that. Maybe the data could simply be cached? But it’s maybe not so important.

    I found it entertaining to look at the numbers, though I am not sure what to make of them. Any technical conclusions that you are drawing from this?

    • CommentRowNumber10.
    • CommentAuthorMike Shulman
    • CommentTimeJul 20th 2018

    One important technical conclusion is that knowing the average/maximum server load will help us decide how powerful of a (virtual) server we need when we get a new one.

    • CommentRowNumber11.
    • CommentAuthorDavidRoberts
    • CommentTimeJul 20th 2018

    Can we tell how much of the traffic is bots/crawlers and how much is actual human visitors? A tech-savvy acquaintance who saw the visitor numbers plot when I was looking at it was suspicious about that big spike on the 16th July.

    • CommentRowNumber12.
    • CommentAuthorMike Shulman
    • CommentTimeJul 20th 2018

    Bots do seem like a likely culprit for the big spikes, at least. Are there standard ways of distinguishing bots from humans?

    • CommentRowNumber13.
    • CommentAuthorTim_Porter
    • CommentTimeJul 20th 2018

    Today’s spike at 12.45 to 13.00 again looks like bots. What about their origins?

  4. Will respond to the questions in a little while. Just wished to inform that I will be deleting and regenerating some data shortly, so the graphs will look empty/weird for a while.

    • CommentRowNumber15.
    • CommentAuthorRichard Williamson
    • CommentTimeJul 20th 2018
    • (edited Jul 20th 2018)

    Still working on getting the data to be gap-less. In the meantime…

    Re #9:

    I found it entertaining to look at the numbers, though I am not sure what to make of them. Any technical conclusions that you are drawing from this?

    The immediate goal was just to provide an indication of the amount of traffic the nLab has, and whether there are any patterns in usage. The amount of traffic had been requested a few times, for example for the purposes of grant applications (indicating how much the nLab is used) or for evaluating how large a server we might need if we use the cloud. In the long-term, we can see whether the amount of traffic goes up or down, etc.

    But actually you are correct that in the software industry, dashboards like this are used principally for monitoring performance: they should make it easier to identify anomalies, and investigate them. The most obvious ones so far are the troughs and spikes. The nLab went down a bit on the 17th, and one can see the drop in traffic then.

    People have already been commenting on the spikes above! [Edit: updated!] The one around 13.30 UTC on the 14th was caused by a bot, ’SemanticScholarBot’. The others I investigated do not seem to be caused by bots; they seem to be isolated individuals downloading the entire nLab by means of a wget call. The IP address for the spike on the 16th (again around 13.30 UTC, though this seems a coincidence) was 177.220.99.26, and one of the earlier ones was 2601:644:400:27bb:6149:eb0f:faa4:2f6a. This could be well-intentioned behaviour, just people wanting a local copy of the nLab. There are things we can do to block this, or to introduce rate-limiting to slow down how quickly they can do it; but it doesn’t seem an urgent priority for now. Today’s spike that Tim mentioned seems to have been bot related, but these bots seem legitimate (one of them is bingbot, from the Bing search engine).

    A well-behaved bot will identify itself in the user-agent string, which appears in the nginx logs, so I should be able to filter these out or in. It is probably useful to keep them in for the main statistics, though, for server load is server load, regardless of whether it is human or automated. In general, as I mentioned in an earlier comment, one just has to dig into the logs, typically using a tool like Kibana or Graylog. In this case, I just dug into the logs manually for the affected times.
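    As a rough illustration of filtering on the user-agent string, here is a minimal Python sketch. The list of markers is a guess made up for this example, not what any of the tools above use, and it only catches bots that announce themselves; a plain wget download, for instance, is not flagged by it.

```python
import re

# Hypothetical marker list, chosen for illustration only. Real user-agent
# parsers use much larger, maintained pattern sets.
BOT_MARKERS = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def is_self_identified_bot(user_agent):
    """True if the user-agent string contains one of the bot markers."""
    return bool(BOT_MARKERS.search(user_agent))


def split_by_bot(user_agents):
    """Return (human_count, bot_count) for an iterable of user-agent strings."""
    humans = 0
    bots = 0
    for ua in user_agents:
        if is_self_identified_bot(ua):
            bots += 1
        else:
            humans += 1
    return humans, bots
```

    Counting both groups separately, rather than discarding the bots, keeps the total available for judging server load.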

    Later on, we can add further graphs to help us home in on issues. For instance, we should certainly provide some graphs of CPU, memory usage, disk space, etc. of the nLab server. One would also have graphs coming from the actual application, Instiki in this case. But at least we have a start now.

  5. Data is now correct and up to date from the 10th of July onwards. I am also ingesting the data for the first part of July. I don’t think I’ll go further back than this, for now at least.

  6. It seems that many of the bots which crawl the nLab self-identify in such a way that the Logstash parsing of the nginx logs classifies them as device ’Spider’. This means that I am able to exclude them, and I have now added a pair of graphs in which self-identifying bots are indeed excluded.

    • CommentRowNumber18.
    • CommentAuthorDavid_Corfield
    • CommentTimeJun 17th 2019

    Am I right in thinking that this dashboard facility is unavailable now?

  7. It is still set up, but it is down, yes (needs a little clearing of old data). I can bring it back up when I get a chance if you wish.

    • CommentRowNumber20.
    • CommentAuthorDavid_Corfield
    • CommentTimeJun 20th 2019
    • (edited Jun 20th 2019)

    If that wouldn’t be too much bother. Alternatively if we already have data from a recent period of the rate of access to the nLab, that would be great.

    • CommentRowNumber21.
    • CommentAuthorDavid_Corfield
    • CommentTimeJun 24th 2019
    • (edited Jun 24th 2019)

    So any rough idea of traffic? This site suggests 28K/month, but in 2012 we were supposedly receiving 16K/day.

    Mind you, the former figure is ’organic traffic’, whatever that is.

    • CommentRowNumber22.
    • CommentAuthorDavid_Corfield
    • CommentTimeJun 24th 2019

    The latter figure is in line with the estimate here of 425K/month.

    • CommentRowNumber23.
    • CommentAuthorRichard Williamson
    • CommentTimeJul 16th 2019
    • (edited Jul 16th 2019)

    Apologies that it took a long time to get to it, but the dashboard is back up now. Only has data from this month plus a few random days in June.

    I will add a cron job which will hopefully keep the dashboard up from now on. I plan to keep about 30 days’ worth of data.

    Interestingly, if one looks at the 1st of July, one can see a massive spike, and this was definitely an attack. IP address 104.200.153.82 made numerous GET calls per second (programmatically) to the nLab home page and to a couple of other URLs which do not exist. I have of course now blocked this IP address, which seems to be known to be dodgy, and I do not think the attack had any significant impact. Also I think that Cloudflare kicked in after a while. Still it is a reminder that we have to be careful; an attack of this nature by somebody who knew more about the nLab and knew how to really cause trouble could cause a lot of headaches.
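    For anyone wanting to spot this kind of pattern in the logs themselves, here is a minimal Python sketch that flags IP addresses making more than a given number of requests within a single second. The threshold of 5 and the function name are assumptions chosen for illustration; this is not how the attack above was actually found.

```python
from collections import Counter


def noisy_ips(requests, max_per_second=5):
    """Find IPs that exceed max_per_second requests in any one second.

    requests: iterable of (ip, unix_timestamp) pairs.
    Returns a dict mapping each offending IP to its peak requests per second.
    """
    per_ip_second = Counter((ip, int(ts)) for ip, ts in requests)
    peaks = {}
    for (ip, _second), n in per_ip_second.items():
        if n > max_per_second:
            peaks[ip] = max(peaks.get(ip, 0), n)
    return peaks
```

    A result such as a single address with a peak well above the threshold would be a candidate for blocking or rate-limiting, after checking by hand that it is not a legitimate crawler.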

    Back to the traffic estimates, the ’Total non-bot page views’ statistic is 12287 for 00:00:00 July 15th to 00:00:00 July 16th UTC. This is not distinct users, just distinct page views excluding bots (and also does not include edits, nForum, or any other nLab functionality). It seems like a reasonably typical day. That’s about one page view every 7 seconds. The actual nLab server gets significantly more frequent hits.

    • CommentRowNumber24.
    • CommentAuthorRichard Williamson
    • CommentTimeJul 16th 2019
    • (edited Jul 16th 2019)

    PS - For those who very helpfully like to keep an eye on things, e.g. checking the Latest Revisions page for rogue behaviour, as I know some people do, one could do a lot worse than keep a regular eye on the dashboard and let me know if you see something strange!

    • CommentRowNumber25.
    • CommentAuthorDavid_Corfield
    • CommentTimeSep 2nd 2019

    Re #24, there was a very large spike on 1 July.