There's a discussion on Latest Changes about making an HTML copy of the n-lab downloadable every now and then. Jacques has disabled the official HTML export facility because of severe server overload.
I'll need to experiment, but I think that what Zoran wants could be achieved with a simple wget command, which would also have the advantage of making sure that all the links are correct (and wouldn't require any changes to Instiki). I presume that, as Zoran wants to be able to do stuff offline, this would be desirable. What else would be useful for this?
Now that I've read the wget manual, this does seem eminently possible. The exact command depends a little on exactly what should be downloaded. Presumably what one wants is all the existing pages in their most recent form. Thus one doesn't care about all the fiddly bits concerning histories and edit boxes. What one wants is everything of the form:
http://ncatlab.org/nlab/show/some+random+page
plus everything required to display that page properly (icons, stylesheets, etc.). To play nice with the server, a page should only be downloaded if the server copy is newer than the local one.
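As a minimal sketch of that behaviour for a single page (HomePage is just an arbitrary example):

    # fetch one page, plus whatever it needs to display properly (-p),
    # but only if the server copy is newer than the local one (-N)
    wget -N -p http://ncatlab.org/nlab/show/HomePage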
The first step is to get a list of all the pages; we do this by downloading the All Pages page and extracting the page names (via a perl script). We feed this back into wget as a list of pages to get (using the -i option). Then, for each downloaded page:

- -p ensures that we have the required extras (images, stylesheets) to display it correctly;
- -k converts the links so that they work correctly: links to downloaded files point to the downloaded copies, links to non-downloaded files point back to the server;
- -N uses time-stamping so that we only get new or updated pages;
- -K keeps the original files, which is needed for time-stamping to work correctly since we're doing a little post-processing;
- -E converts filenames to the html extension, since no matter how they were generated, they are now boring html (well, okay, xhtml+mathml+svg) files.
    # download the list of all pages, extract the URL of each page, and
    # hand the list to a second wget to do the actual mirroring
    wget --output-document=- http://ncatlab.org/nlab/list \
      | perl -lne '/<div id="allPages"/ and $print = 1;   # start at the "all pages" list
          /<div id="wantedPages"/ and exit;               # stop at the "wanted pages" list
          /href="([^"]*)"/ and $print and print "http://ncatlab.org$1";' \
      | wget -i - -kKEpN                                  # flags as described above
I haven't tried this yet. I would suggest that, at least the first time anyone does this, it be done at light usage times. Not completely sure when they are, though.
> The first step is to get a list of all the pages; we do this by downloading the All Pages page and extracting the page names (via a perl script).
Or if you want the Markdown source as well, you can download the Markdown export and get the titles from that. I'm inclined to guess that this would be cleaner. But hey, if you wrote the script already …
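Something along these lines might do it, though I haven't checked the exact export URL or how the files inside the zip are named, so treat it as an untested sketch:

    # untested: assumes the Markdown export is still enabled and lives at
    # /nlab/export_markdown, unpacking to one source file per page
    wget http://ncatlab.org/nlab/export_markdown -O nlab-markdown.zip
    unzip nlab-markdown.zip -d nlab-src
    ls nlab-src > titles.txt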
> light usage times. Not completely sure when they are
In my experience, weekends (dang, just missed one!) and between 7:00 and 9:00 UTC.
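So if anyone wants to automate this, a crontab entry along these lines would keep it in those windows (nlab-mirror.sh being a hypothetical wrapper around the pipeline above):

    # run the mirror every Saturday at 07:30 UTC; assumes the machine's
    # clock is on UTC and the pipeline is saved as ~/bin/nlab-mirror.sh
    30 7 * * 6 ~/bin/nlab-mirror.sh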
I'm amazed that a script did what it was intended to do the first time! Mind you, the real test is when you run it the second time and see if it genuinely only downloads the pages that have been updated.
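A rough way to check that, assuming the list of URLs from the first stage has been saved to urls.txt, is to count how many files wget actually reports as saved on the second pass:

    # with -N, only pages that changed on the server should be re-fetched,
    # so this count should be small on an immediate second run
    wget -i urls.txt -kKEpN 2>&1 | grep -c ' saved '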
I can empathise with the hacking/mathing dilemma. The two are so similar that it's sometimes hard to allocate time appropriately. The old "why study" syllogism springs to mind:
I do maths, therefore I hack
The more I hack, the less time I have
The less time I have, the less I do maths
So why do maths?
Okay, not perfect. But if I add polishing poetry into my day I'd never get anything done.
Since you are talking about WEBrick and Ruby on Rails, I presume that you mean the Instiki process underlying the n-lab, rather than what's running this forum.
At the moment, Instiki uses mongrel as its server and is proxied through a lighttpd server. The other relevant fact is that it uses sqlite3 for its database. There's been no discussion about shifting from lighttpd to apache, but when we migrate we will shift from sqlite3 to mysql. There are no applications served on the n-lab server other than the Instiki process underlying the n-lab (and the other labs, but that's the same Instiki process).
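For the curious, the lighttpd side of such a setup is essentially just mod_proxy pointing at the mongrel port. A sketch, with a made-up port (this is not the actual n-lab configuration):

    # hypothetical lighttpd front end proxying to a mongrel on port 2500
    server.modules += ( "mod_proxy" )
    $HTTP["host"] == "ncatlab.org" {
        proxy.server = ( "" => ( ( "host" => "127.0.0.1", "port" => 2500 ) ) )
    }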
Thanks, it is very helpful when one person does the 40 min wget and then makes a nice compressed file for everybody else to download. (^_^)
Zoran, Instiki uses sqlite3 by default because it can be run entirely internally. With mysql you need to set up two things: a mysql database and the rails app. With sqlite3 everything is in the rails app, so it's easier to set up, and for small installations there's not enough of a difference to make mysql the default. However, as we're finding out, there's a reason that mysql is preferred to sqlite3! We do intend to migrate soon.
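For what it's worth, the usual rough route for such a migration is to dump the sqlite3 database and load it into mysql, with some hand-fixing in between; the names below are illustrative only:

    # untested sketch: the dump will need massaging (quoting, AUTOINCREMENT,
    # etc.) before MySQL will accept it
    sqlite3 db/production.db.sqlite3 .dump > dump.sql
    # ... edit dump.sql into MySQL-compatible SQL ...
    mysql -u instiki -p instiki_production < dump.sql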
Thanks also for putting the download online. I like the idea of distributing the n-lab a little, not only to save bandwidth (which we have to pay for) but also to make it feel more like a distributed project (which it is). Somewhere (I can't remember where off the top of my head) I've pondered how to set up static mirrors of the n-lab ...