
• CommentRowNumber1.
• CommentAuthorAndrew Stacey
• CommentTimeJul 12th 2009

There's a discussion on Latest Changes about making an HTML copy of the n-lab downloadable every now and then. Jacques has disabled the official HTML export facility because it caused severe server overload.

I'll need to experiment, but I think that what Zoran wants could be achieved with a simple wget command, which would also have the advantage of making sure that all the links are correct (and wouldn't require any changes to Instiki). I presume that, as Zoran wants to be able to work offline, this would be desirable. What else would be useful for this?

• CommentRowNumber2.
• CommentAuthorAndrew Stacey
• CommentTimeJul 13th 2009
• (edited Jul 13th 2009)

Now that I've read the wget manual, this does seem eminently possible. The exact command depends a little on exactly what should be downloaded. Presumably what one wants is all the existing pages in their most recent form, so one doesn't care about the fiddly bits concerning histories and edit boxes. What one wants is everything of the form:

http://ncatlab.org/show/some+random+page


plus everything required to display that page properly (icons, stylesheets, etc.). To play nicely with the server, a page should only be downloaded if the server copy is newer than the local one.

The first step is to get a list of all the pages. We do this by downloading the All Pages page and extracting the list of pages from it (via a Perl script), then feeding that back into wget as a list of pages to fetch (the -i option). For each downloaded page we make sure we have the extras required to display it correctly (-p), and we convert the links so that they work correctly: links to downloaded files point to downloaded files, links to non-downloaded files point to non-downloaded files (-k). We use time-stamping so that we only fetch pages that have changed (-N), but because we're doing a little post-processing we need to keep the original files for time-stamping to work correctly (-K). Files are also given an html extension (-E) since, no matter how they were generated, they are now boring HTML (well, okay, XHTML+MathML+SVG) files.

wget --output-document=- http://ncatlab.org/nlab/list \
| perl -lne '/<div id="allPages"/ and $print = 1; /<div id="wantedPages"/ and exit; /href="([^"]*)"/ and $print and print "http://ncatlab.org$1";' \
| wget -i - -kKEpN


I haven't tried this yet. I would suggest that, at least the first time anyone does this, it be done at light usage times. I'm not completely sure when those are, though.

• CommentRowNumber3.
• CommentAuthorTobyBartels
• CommentTimeJul 13th 2009

The first step is to get a list of all the pages; we do this by downloading the All Pages page and extracting a list of the pages (via a Perl script).

Or if you want the Markdown source as well, you can download the Markdown export and get the titles from that. I'm inclined to guess that this would be cleaner. But hey, if you wrote the script already …
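A sketch of that route, assuming the export arrives as a tarball with one Markdown file per page; the layout, and the idea of rebuilding show-page URLs from the member names, are guesses rather than the actual Instiki export format, so a stand-in tarball is built locally here for illustration:

```shell
# Stand-in for the Markdown export: one .md file per page (the layout is a guess)
mkdir -p export
printf 'stub\n' > export/category.md
printf 'stub\n' > export/functor.md
tar -czf export.tar.gz -C export .

# Recover page titles from the member names and rebuild show-page URLs
tar -tzf export.tar.gz \
  | sed -n -e 's|^\./||' -e 's|\.md$||p' \
  | sed -e 's|^|http://ncatlab.org/nlab/show/|' \
  > pages.txt
sort pages.txt
```

The upside would be having the Markdown source on hand as well; the downside is depending on an export format we haven't confirmed.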

light usage times. Not completely sure when they are

In my experience, weekends (dang, just missed one!) and between 7:00 and 9:00 UTC.

• CommentRowNumber4.
• CommentAuthorzskoda
• CommentTimeJul 14th 2009
The script does work properly. I did one successful run from an old RedHat 9 station. Thank you very much for thinking the problem through! I should warm up and start writing similar code myself when needed (I used to be a good programmer around 1998, with a focus on languages and compiler design, but practically stopped soon after that). From 14.9.2001 (the date I flew to the conference where I gave my first conference talk; in the months just after that I was finishing the last chapters of my thesis, and so on, and it never stopped), I have had almost no time for anything but my math/physics career, and my computer skills degenerated to about 10% of what they were, while the technology went on developing without me following it... Maybe it all comes back one day...

Zoran
• CommentRowNumber5.
• CommentAuthorAndrew Stacey
• CommentTimeJul 16th 2009
• (edited Jul 16th 2009)

I'm amazed that a script did what it was intended to do the first time! Mind you, the real test is running it a second time and seeing whether it genuinely only downloads the pages that have been updated.

I can empathise with the hacking/mathing dilemma. The similarity in both is so strong it's sometimes hard to allocate time appropriately. The old "why study" springs to mind:

I do maths, therefore I hack
The more I hack, the less time I have
The less time I have, the less I do maths
So why do maths?

Okay, not perfect. But if I add polishing poetry into my day I'd never get anything done.

• CommentRowNumber6.
• CommentAuthorzskoda
• CommentTimeJul 28th 2009
This is totally opposite to Littlewood's point of view that really creative mathematics can only be done if it is not done for long hours. He says at most 5 hours of creative thinking a day; any more and things degenerate.

Anyway, I cannot find basic facts on the forum about what kind of system you have: what is your underlying server -- is it the default WEBrick or Apache (Ruby on Rails works with both)? What other principal applications are served?
• CommentRowNumber7.
• CommentAuthorAndrew Stacey
• CommentTimeJul 29th 2009

Since you are talking about WEBrick and Ruby on Rails, I presume that you mean the Instiki process underlying the n-lab, rather than what's running this forum.

At the moment, Instiki uses mongrel as its server and is proxied through a lighttpd server. The other relevant fact is that it uses sqlite3 for its database. There's been no discussion about shifting from lighttpd to apache, but when we migrate we will shift from sqlite3 to mysql. There are no applications served on the n-lab other than the Instiki process underlying the n-lab (and the other labs, but it's the same Instiki process).
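For the curious, that kind of proxying amounts to a few lines of lighttpd configuration along these lines (the host name and port are placeholders, not the actual nLab settings):

```
# Forward incoming requests for the wiki's host to the mongrel process
server.modules += ( "mod_proxy" )

$HTTP["host"] == "ncatlab.org" {
  proxy.server = ( "" => ( ( "host" => "127.0.0.1", "port" => 8000 ) ) )
}
```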

• CommentRowNumber8.
• CommentAuthorzskoda
• CommentTimeAug 3rd 2009
Thank you for the answers. I am a bit surprised by the content of the answer, e.g. because mysql is the default choice in almost all Ruby on Rails manuals I found on the web.
• CommentRowNumber9.
• CommentAuthorzskoda
• CommentTimeAug 7th 2009
Today, to have a fresh version before going partially offline (partial vacation), I ran the script from above on another Linux machine; it took wget about 40 minutes to execute, with all the pauses included. Here is the outcome, tarred and gzipped, at only a bit over 10 MB, while the unpacked .tar version is, I think, about 77 MB:

http://www.irb.hr/korisnici/zskoda/nlab.tar.gz
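For anyone repeating this, packaging the mirror is essentially a one-liner: wget normally drops everything into a directory named after the host, and gzip does well on repetitive XHTML. A stand-in tree is built here so the sketch is self-contained:

```shell
# Stand-in for wget's output directory (wget mirrors into one named after the host)
mkdir -p ncatlab.org/nlab/show
printf '<html>stub</html>\n' > ncatlab.org/nlab/show/category.html

# Package the whole tree for redistribution
tar -czf nlab.tar.gz ncatlab.org/
tar -tzf nlab.tar.gz | sort
```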
• CommentRowNumber10.
• CommentAuthorTobyBartels
• CommentTimeAug 7th 2009

Thanks, it is very helpful when one person does the 40 min wget and then makes a nice compressed file for everybody else to download. (^_^)

• CommentRowNumber11.
• CommentAuthorAndrew Stacey
• CommentTimeAug 10th 2009

Zoran, Instiki uses sqlite3 as the default because it can be run entirely internally. With mysql you need to set up two things: a mysql database and a rails app. With sqlite3, everything is in the rails app. That makes it easier to set up, and for small installations there's not enough of a difference to make mysql the default. However, as we're finding out, there's a reason that mysql is preferred to sqlite3! We do intend to migrate soon.
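For reference, the difference shows up in the Rails app's config/database.yml; a sketch, with database names and credentials made up rather than the actual nLab settings:

```yaml
# sqlite3: the database is just a file inside the Rails app directory
production:
  adapter: sqlite3
  database: db/production.db

# mysql: a separately administered database server (all values are placeholders)
# production:
#   adapter: mysql
#   database: nlab_production
#   host: localhost
#   username: nlab
#   password: secret
```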

Thanks also for putting the download online. I like the idea of distributing the n-lab a little, not only to save bandwidth, which we have to pay for, but also to make it feel more like a distributed project (which it is). Somewhere (can't remember where off the top of my head) I've pondered how to set up static mirrors of the n-lab ...