
• CommentRowNumber1.
• CommentAuthorAndrew Stacey
• CommentTimeJul 12th 2009

There's a discussion on Latest Changes about making an HTML copy of the n-lab downloadable every now and then. Jacques has disabled the official HTML export facility because it caused severe server overload.

I'll need to experiment, but I think that what Zoran wants could be achieved with a simple wget command, which would also have the advantage of making sure that all the links are correct (and wouldn't require any changes to Instiki). I presume that, as Zoran wants to be able to work offline, this would be desirable. What else would be useful for this?

• CommentRowNumber2.
• CommentAuthorAndrew Stacey
• CommentTimeJul 13th 2009
• (edited Jul 13th 2009)

Now that I've read the wget manual, this does seem eminently possible. The exact command depends a little on exactly what should be downloaded. Presumably what one wants is all the existing pages in their most recent form, so one doesn't care about the fiddly bits concerning histories and edit boxes. What one wants is everything of the form:

http://ncatlab.org/show/some+random+page


plus everything required to display that page properly (icons, stylesheets, etc.). To play nicely with the server, a page should only be downloaded if the server copy is newer than the local one.

The first step is to get a list of all the pages. We do this by downloading the All Pages page and extracting the list of pages from it (via a Perl script), then feeding that back into wget as a list of pages to fetch (the -i option). For each downloaded page we make sure we have the extras required to display it correctly (-p), and we convert the links so that they work correctly: links to downloaded files point to downloaded files, links to non-downloaded files point to non-downloaded files (-k). We use time-stamping so that we only fetch pages that have changed (-N), but because we're doing a little post-processing we need to keep the original files for time-stamping to work correctly (-K). Files are also given an html extension (-E) since, no matter how they were generated, they are now boring HTML (well, okay, XHTML+MathML+SVG) files.

wget --output-document=- http://ncatlab.org/nlab/list \
| perl -lne '/<div id="allPages"/ and $print = 1; /<div id="wantedPages"/ and exit; /href="([^"]*)"/ and $print and print "http://ncatlab.org$1";' \
| wget -i - -kKEpN


I haven't tried this yet. I would suggest that, at least the first time anyone does this, it be done at light usage times. I'm not completely sure when those are, though.

• CommentRowNumber3.
• CommentAuthorTobyBartels
• CommentTimeJul 13th 2009

The first step is to get a list of all the pages; we do this by downloading the All Pages page and extracting a list of the pages (via a Perl script).

Or if you want the Markdown source as well, you can download the Markdown export and get the titles from that. I'm inclined to guess that this would be cleaner. But hey, if you wrote the script already …
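A sketch of that route, assuming the export arrives as a tarball with one Markdown file per page; the layout, and the idea of rebuilding show-page URLs from the member names, are guesses rather than the actual Instiki export format, so a stand-in tarball is built locally here for illustration:

```shell
# Stand-in for the Markdown export: one .md file per page (the layout is a guess)
mkdir -p export
printf 'stub\n' > export/category.md
printf 'stub\n' > export/functor.md
tar -czf export.tar.gz -C export .

# Recover page titles from the member names and rebuild show-page URLs
tar -tzf export.tar.gz \
  | sed -n -e 's|^\./||' -e 's|\.md$||p' \
  | sed -e 's|^|http://ncatlab.org/nlab/show/|' \
  > pages.txt
sort pages.txt
```

The upside would be having the Markdown source on hand as well; the downside is depending on an export format we haven't confirmed.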

light usage times. Not completely sure when they are

In my experience, weekends (dang, just missed one!) and between 7:00 and 9:00 UTC.

• CommentRowNumber4.
• CommentAuthorzskoda
• CommentTimeJul 14th 2009
The script does work properly. I did one successful run from an old RedHat 9 station. Thank you very much for thinking the problem through! I should warm up and start writing similar code myself when needed (I used to be a good programmer around 1998, with a focus on languages and compiler design, but practically stopped soon after that). From 14.9.2001 (the date I flew to the conference where I gave my first conference talk; in the months just after that I was finishing the last chapters of my thesis, and so on, and it never stopped), I have had almost no time for anything but my math/physics career, and my computer skills degenerated to about 10% of what they were, while the technology went on developing without me following it... Maybe it all comes back one day...

Zoran
• CommentRowNumber5.
• CommentAuthorAndrew Stacey
• CommentTimeJul 16th 2009
• (edited Jul 16th 2009)

I'm amazed that a script did what it was intended to do the first time! Mind you, the real test is running it a second time and seeing whether it genuinely only downloads the pages that have been updated.

I can empathise with the hacking/mathing dilemma. The similarity in both is so strong it's sometimes hard to allocate time appropriately. The old "why study" springs to mind:

I do maths, therefore I hack
The more I hack, the less time I have
The less time I have, the less I do maths
So why do maths?

Okay, not perfect. But if I add polishing poetry into my day I'd never get anything done.

• CommentRowNumber6.
• CommentAuthorzskoda
• CommentTimeJul 28th 2009
This is totally opposite to Littlewood's point of view that really creative mathematics can only be done if it is not done for long hours. He says at most 5 hours of creative thinking a day; any more and things degenerate.

Anyway, I cannot find basic facts on the forum about what kind of system you have: what is your underlying server -- is it the default WEBrick or Apache (Ruby on Rails works with both)? What other principal applications are served?
• CommentRowNumber7.
• CommentAuthorAndrew Stacey
• CommentTimeJul 29th 2009

Since you are talking about WEBrick and Ruby on Rails, I presume that you mean the Instiki process underlying the n-lab, rather than what's running this forum.

At the moment, Instiki uses mongrel as its server and is proxied through a lighttpd server. The other relevant fact is that it uses sqlite3 for its database. There's been no discussion about shifting from lighttpd to apache, but when we migrate we will shift from sqlite3 to mysql. There are no applications served on the n-lab other than the Instiki process underlying the n-lab (and the other labs, but it's the same Instiki process).
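For the curious, that kind of proxying amounts to a few lines of lighttpd configuration along these lines (the host name and port are placeholders, not the actual nLab settings):

```
# Forward incoming requests for the wiki's host to the mongrel process
server.modules += ( "mod_proxy" )

$HTTP["host"] == "ncatlab.org" {
  proxy.server = ( "" => ( ( "host" => "127.0.0.1", "port" => 8000 ) ) )
}
```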

• CommentRowNumber8.
• CommentAuthorzskoda
• CommentTimeAug 3rd 2009
Thank you for the answers. I am a bit surprised by the content of the answer, e.g. because mysql is the default choice in almost all Ruby on Rails manuals I found on the web.
• CommentRowNumber9.
• CommentAuthorzskoda
• CommentTimeAug 7th 2009
Today, to have a fresh version before going partially offline (partial vacation), I ran the script from above on another Linux machine; it took wget about 40 minutes to execute, with all the pauses included. Here is the outcome, tarred and gzipped, at only a bit over 10 MB, while the unpacked .tar version is, I think, about 77 MB:

http://www.irb.hr/korisnici/zskoda/nlab.tar.gz
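For anyone repeating this, packaging the mirror is essentially a one-liner: wget normally drops everything into a directory named after the host, and gzip does well on repetitive XHTML. A stand-in tree is built here so the sketch is self-contained:

```shell
# Stand-in for wget's output directory (wget mirrors into one named after the host)
mkdir -p ncatlab.org/nlab/show
printf '<html>stub</html>\n' > ncatlab.org/nlab/show/category.html

# Package the whole tree for redistribution
tar -czf nlab.tar.gz ncatlab.org/
tar -tzf nlab.tar.gz | sort
```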
• CommentRowNumber10.
• CommentAuthorTobyBartels
• CommentTimeAug 7th 2009

Thanks, it is very helpful when one person does the 40 min wget and then makes a nice compressed file for everybody else to download. (^_^)

• CommentRowNumber11.
• CommentAuthorAndrew Stacey
• CommentTimeAug 10th 2009

Zoran, Instiki uses sqlite3 as the default because it can be run entirely internally. With mysql you need to set up two things: a mysql database and a rails app. With sqlite3, everything is in the rails app. That makes it easier to set up, and for small installations there's not enough of a difference to make mysql the default. However, as we're finding out, there's a reason that mysql is preferred to sqlite3! We do intend to migrate soon.
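For reference, the difference shows up in the Rails app's config/database.yml; a sketch, with database names and credentials made up rather than the actual nLab settings:

```yaml
# sqlite3: the database is just a file inside the Rails app directory
production:
  adapter: sqlite3
  database: db/production.db

# mysql: a separately administered database server (all values are placeholders)
# production:
#   adapter: mysql
#   database: nlab_production
#   host: localhost
#   username: nlab
#   password: secret
```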

Thanks also for putting the download online. I like the idea of distributing the n-lab a little, not only to save bandwidth, which we have to pay for, but also to make it feel more like a distributed project (which it is). Somewhere (can't remember where off the top of my head) I've pondered how to set up static mirrors of the n-lab ...