The single most important thing for the protection of the nLab against a ’disaster’ is that we have regular backups of the database in different locations. Nowadays a lot of systems would do this in the cloud, but we can achieve something more or less as good if people are willing to volunteer to regularly download a backup.
An item on the Technical TODO list (nlabmeta), currently item 20, has been to make this possible technically. Prompted by the fact that Jake is making preparations for working on the nLab frontend and needed this, I have now done it: there is now a password-protected endpoint one can call to generate an SQL dump and download it.
My question is: would some people be willing to make the downloads regularly (as many as possible!)? And what would be most convenient for you? A user interface where one clicks a button? A script which you run on your own machine? Something like a cron job which runs the script once a day, or week, or whatever on your own machine?
Great that you bring this up. Every now and then I get nervous about this.
I bet there are many people out there, not all of them vocal here, who will be interested in making backups to their private machines. I am thinking of people like Jake, who are not active editors but active readers of the Lab. So this should be very worthwhile.
For somebody like me an interface with a button would be most convenient. People more like you and Jake would probably prefer something fancier. But the most important thing is that there is some mechanism at all, so if in doubt about what to set up, I’d suggest you start with the option that is easiest for you to set up!
Hehe, what I have already made is enough for myself and Jake :-). Yes, I think we should definitely have a user interface with the click option; it can also show when a backup was last made. The disadvantage, though, is that it is manual. Whereas one could have something that runs every day, say, on your computer and tries to fetch a backup, without you needing to do or remember anything. Let me know if that would be of interest (no problem if not). I could make it a mobile app instead, but I guess one typically does not want 2GB downloading to one’s mobile!
Sure, I’d be interested in having an automatic backup to my machine! Hopefully the job of convincing my machine to do that can be reduced to one or two button clicks, though? Say to hit “download”, then “execute”, then done? :-)
What happened to https://github.com/ncatlab/nlab-content? It used to be a very convenient way to make backups, but the last commit was on April 16, 2018.
I’ve not looked into it yet, but my first guess would be simply that there is too much content for github.
But in any case there is a big difference between nlab-content and the database: there are all kinds of crucial data in the latter. But I do intend to provide exactly the same functionality for downloading rendered content as for the SQL dumps.
In earlier discussion the idea was to simply ask the github staff for more space.
Thanks Bas! My feeling, though, is that github is not really the appropriate place for this (they discourage SQL dumps, for instance, and this is not much different). If we host an endpoint on our own server for downloading the rendered content, we achieve more or less the same when it comes to quick recovery of current content, as long as there are people downloading regularly, of course. In other words, storing the database dumps allows us to recover everything, whilst storing rendered content complements this by allowing us to recover more quickly to something usable.
I’d be happy to do this. I’d want an automated script that does it nightly. But I can set up cron on my own, and write my own shell scripts, so the easiest interface for me would be something like a single URL that my cron job could hit with a wget. But if you want to write a simple script that does something more complicated (maybe it would be best to keep multiple backups around at once, in case stuff gets corrupted and recent backups inherit the corruption), I could run that in my cron too.
About how big would a single backup be?
I’d be happy to do this.
Fantastic!
The easiest interface for me would be something like a single URL that my cron job could hit with a wget.
We more or less have this now, except that there are two URLs involved (one a sub-URL of the other), and one of them is a POST, not a GET. But if you are on a unix machine, you will almost certainly have the command line tool curl, which is just as easy to run as wget, so you can achieve everything in a two-line script. I can provide you with the curl syntax. [Edit: actually you should be able to use two wget commands instead of two curl commands if you prefer!]
I will create the credentials for you soon (this evening, maybe), and then you can just try it out.
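For example, the wget version could look something like the following (just a sketch: ((AUTH)) stands for the personal authentication hash, and the local output path is made up):
#!/usr/bin/bash
# Hypothetical wget variant: the POST returns the id of a freshly generated dump,
# which the second request then downloads.
SQL_DUMP_ID=$(wget -q -O - --post-data='' --header="Authorization: Basic ((AUTH))" https://ncatlab.org/sqldump)
wget -q --header="Authorization: Basic ((AUTH))" -O "$HOME/nlab-backup.sql" "https://ncatlab.org/sqldump/$SQL_DUMP_ID"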
But if you want to write a simple script that does something more complicated (maybe it would be best to keep multiple backups around at once, in case stuff gets corrupted and recent backups inherit the corruption)
For now I was thinking just of the simplest possible thing that simply generates and downloads the SQL dump, overwriting the old one. But you’re right that it would be a good idea in the end to keep a few hanging around (e.g. one that is updated only per week/month) rather than just one. Just one which is overwritten is much better than the current situation, though, and should protect us from many/most scenarios.
About how big would a single backup be?
At the moment it’s 2.3GB. If that size is problematic for people, we can look into a more complicated (but also more risk-prone) alternative (e.g. something which does a diff between a particular SQL dump and the current database, and makes only those changes to the previous SQL dump).
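Just to illustrate the rough idea (purely a hypothetical sketch, done client-side on the dumps rather than against the database, and not something the endpoint currently does; file names are made up): one could keep an occasional full dump and store only compressed diffs against it.
diff -u weekly-full.sql todays-dump.sql | gzip > todays-changes.diff.gz
# Later, reconstruct today's dump from the full dump plus the stored diff:
gunzip -c todays-changes.diff.gz | patch -o todays-dump.sql weekly-full.sql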
I’ve finally created a user for Mike and given him instructions on how to do it. I am running a daily cron job to make backups myself; I hope Mike will be able to do something similar. For the moment I am not overwriting each day; I am managing it manually.
I’ll see if the process goes smoothly for Mike, and then I can pass on instructions to Urs and anybody else who would be willing to do this.
Re #11: Good ideas! But I’m a little short of time at the moment, so I’ll keep the current simple design for now. I am happy to create a user for anybody willing, though, so that they can do it; it is not difficult.
Thanks Richard! The script you sent me works perfectly; I am setting up a cron job now. I also added a line to the script that gzips the dump once downloaded, which reduces the size down to ~600M. I think I will also set up a cron job to delete sufficiently old backups, since I don’t want to run out of disk space.
Excellent! Thanks very much! Could you possibly paste your scripts here (minus the authentication header)? I think others might like that behaviour too.
The basic daily backup.sh script is a slight modification of the one you sent me:
SQL_DUMP_ID=$(curl -X POST -H "Authorization: Basic ((AUTH))" https://ncatlab.org/sqldump)
OUT_FILE=$HOME/nlab-backups/daily/$(date +"%Y-%m-%dT%H:%M:%S").sql
curl -H "Authorization: Basic ((AUTH))" https://ncatlab.org/sqldump/$SQL_DUMP_ID > $OUT_FILE
gzip $OUT_FILE
where ((AUTH)) should be replaced with the personal authentication hash. The next script, link-to.sh, can be run weekly, monthly, etc. with a command-line parameter to hardlink the most recent daily backup to a weekly or monthly directory of longer-ago backups:
FILE=$(ls -t $HOME/nlab-backups/daily | head -1)
ln $HOME/nlab-backups/daily/$FILE $HOME/nlab-backups/$1
Finally the daily delete-old.sh script deletes all but the most recent few of each backup:
find $HOME/nlab-backups/daily/ -mtime +7 -exec rm {} \;
find $HOME/nlab-backups/weekly/ -mtime +56 -exec rm {} \;
find $HOME/nlab-backups/monthly/ -mtime +365 -exec rm {} \;
The idea is that I want to keep some backups from up to a year ago, but those don’t need the finer-grained day-by-day nature of backups from the past week. Here’s my crontab:
0 2 * * * /home/shulman/nlab-backups/backup.sh >/dev/null 2>&1
20 2 * * 0 /home/shulman/nlab-backups/link-to.sh weekly >/dev/null 2>&1
40 2 1 * * /home/shulman/nlab-backups/link-to.sh monthly >/dev/null 2>&1
0 3 * * * /home/shulman/nlab-backups/delete-old.sh >/dev/null 2>&1
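(For reference, the schedule above runs backup.sh daily at 02:00, link-to.sh at 02:20 on Sundays and at 02:40 on the first of each month, and delete-old.sh daily at 03:00.)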
I tested all this as much as I could, but there could still be bugs in it since not even one day has actually gone by. (-:
All the scripts should begin with #!/usr/bin/bash of course, but the markdown parser here has trouble with that. (-:
Great, thanks very much! I can see from the logs that your cron job is downloading an SQL dump once a day. That’s really good. I’ll create a user for Urs as well when I get a chance.
Have created a user for Urs now, and sent instructions over email. I offered him your help as well as mine, Mike, with setting up a cron job :-).
If anybody else would like to help out, just let me know. The more the merrier: the more people we have with cron jobs running at different times of day, the better the chance that we lose as little data as possible in a ’crisis’.
Hi Dmitri, I have sent this to you over email now.
Hi Owen, yes, you are welcome! Could you let me know an email address to which I can send the necessary info, and then I’ll do so as soon as I can?
Hi Owen, apologies for taking an eternity to respond, I have created a user for you now and sent the details of how to make a backup over email to you.
I think that when I wrote that (if indeed it was me who wrote it!) I had intended to make that functionality available in a similar way as to the database backup functionality. I have not in fact written that API yet, but it is fairly trivial; I will try to get to it as soon as I can. Would you prefer to specify individual pages to download, or to download everything?
Hi, there are no restrictions in general, but it would be good to not create too many SQL dumps, e.g. once a day or less frequently would be reasonable. Do you have a particular purpose in mind? For those who are downloading the dumps mostly to protect the nLab against disaster, all tables are downloaded, but I can exclude user data in your case (there is nothing very sensitive (passwords are hashed), but it does contain email addresses, which shouldn’t end up everywhere to avoid spam, etc.).
There is an email address attached to your nForum account; can I send a username and password to this?
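(For the curious, on the server side excluding a table amounts to something like the following, assuming a MySQL database; the database and table names here are just placeholders, not the actual ones.)
# Hypothetical sketch: dump everything except a user table, then compress.
mysqldump --single-transaction --ignore-table=nlab_db.users nlab_db | gzip > nlab-without-users.sql.gz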
https://ncatlab.org/sqldump appears to give
504 Gateway Time-out
when I try to refresh my local copy of the nLab.
Hi Dmitri, I have tweaked something now, and think it may have solved the issue. Please check :-).
Incidentally, there has recently (maybe in the last three weeks, not sure) been an issue where, sometimes, submitting an edit just brings back the edit page; when that happens, sometimes the edit has been saved nevertheless and sometimes it has not.
I haven’t been able to reproduce this in a minimal example. But when it happens, so far, another attempt to save will either actually save the edit or make the system admit that it has already been saved, so it’s not a big problem. Or was not, in case this is related to the tweak you just made.
Thanks for letting me know; that sounds a bit strange. If you are able to reproduce it, do let me know. It should not be related to the change I just made, as the SQL dump mechanism is on a different application server.
The summer holiday is approaching in Europe, and this is probably the best chance I will have this year to try to carry out the migration to the cloud, which should be a fresh start with regard to the kind of issue you describe.
Now I get a new error:
524 Origin Time-out
Hmm, thanks for letting me know, it worked for me earlier, but I’ll try to reproduce it. Might not be until tomorrow unfortunately.
Hi Dmitri, I tried again now and couldn’t reproduce it (it worked). There might, however, be some timeout that is kicking in for you and just about not for me. Could you try running the following script manually, with the correct authentication token, and post the output? I particularly wish to see how long the first of the two commands takes.
#!/usr/bin/bash
SQL_DUMP_ID=$(curl -X POST -H "Authorization: Basic TODO" https://ncatlab.org/sqldump)
curl -H "Authorization: Basic TODO" https://ncatlab.org/sqldump/$SQL_DUMP_ID > /tmp/$(date +"%Y-%m-%dT%H:%M:%S").sql
Yes, I retried it and it now works (the database downloaded correctly).
The problem was with the first command; the error messages I reproduced above were caused by it, not by the second command.
OK, great. As I mentioned, we might be on the boundary of a timeout; if the issue recurs and you can provide the information that I asked for in #35, that would be very helpful.
I appear to have dropped the ball on this amidst the general chaos of this era, but I’m still interested in getting hold of a dump.
Hi, there are no restrictions in general, but it would be good to not create too many SQL dumps, e.g. once a day or less frequently would be reasonable.
I’d definitely be updating my copy daily or less, quite likely closer to weekly.
Do you have a particular purpose in mind?
I’m initially interested in making the nLab browseable from my reMarkable e-paper tablet, so that I can look up concepts while reading papers or taking notes. Depending on their size (and given the lack of restrictions), I would also be interested in mirroring the dumps so that others have free access to the raw page data.
For those who are downloading the dumps mostly to protect the nLab against disaster, all tables are downloaded, but I can exclude user data in your case (there is nothing very sensitive (passwords are hashed), but it does contain email addresses, which shouldn’t end up everywhere to avoid spam, etc.).
I’m quite happy to do without those.
There is an email address attached to your nForum account; can I send a username and password to this?
Yes, that works :)
Thanks. In case they haven’t seen this yet, I am now forwarding your message to the technical board. Also, we might need to post a general update on the matter.
I see https://github.com/ncatlab/nlab-content is back, as is https://github.com/ncatlab/nlab-content-html, which is extremely convenient.
Not only does it allow for backups of the nLab, but also for offline viewing. Perhaps this could be reflected at the appropriate meta pages (HowTo, maybe)?
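For anyone wanting a local copy for offline viewing, it should just be a matter of something like:
git clone https://github.com/ncatlab/nlab-content.git
git clone https://github.com/ncatlab/nlab-content-html.git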
Please be invited to add a comment at HowTo! You know more about this than I do.
Added a description of the git repositories to the HowTo page. Should the material about the SQL database be deleted? It appears that the old endpoint https://ncatlab.org/sqldump is no longer operational.
I don’t actually know. But if it’s not operational and also redundant now, it seems best to delete the corresponding text.