The redirection of things like nlab/new/Gamma-space to nlab/show/Gamma-space was put in to counter the problem that /new/ links to existing pages were appearing in Google searches, so the redirection is not the source of the problem. I’ve not figured out where these search results are coming from, but my hypothesis is that the links exist on pages that Google indexes and so it lists them. The robots.txt blocks Google from following them, so Google never actually learns that new/existing+page redirects to show/existing+page, and I don’t think that the status code has anything to do with it.
It’s quite possible that I’m wrong; I’ve not done extensive analysis. But before I start changing the code I’d like to know how to detect whether or not the redirection code is making a difference as I prefer to keep the nLab’s code as close to the main instiki line as possible.
See http://nforum.mathforge.org/discussion/4884/odd-result-in-google for the original discussion on this.
Why is this not a problem experienced by other wikis, like Wikipedia, that also contain links to not-yet-existent pages?
Redlinks on Wikipedia are of the form .../w/index.php?title=...&action=edit&redlink=1, which looks much less like a legitimate URL than .../new/... does. But I doubt that’s the whole story.
Putting something in the robots.txt file does not prevent it from showing up in Google search. On the other hand, a 301 redirect will definitely remove /new/* pages from Google search, provided that they are not blocked by robots.txt in the first place.
So it seems like a better solution here would be to remove /new/* pages from robots.txt and make them into 301 redirects instead of 302.
More information on Google and 301 redirects: https://support.google.com/webmasters/answer/93633
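As a rough illustration of the suggestion, here is a minimal Python sketch of the decision a /new/ handler would make. It is not Instiki's code; the function name new_page_response, the EXISTING_PAGES set and the permanent flag are all made up for the example, and the only difference between the two redirect flavours is the status line.

```python
from http import HTTPStatus

# Hypothetical stand-in for the wiki's page store.
EXISTING_PAGES = {"Gamma-space"}


def new_page_response(name, existing=EXISTING_PAGES, permanent=True):
    """Return (status_line, headers) for a request to /nlab/new/<name>.

    Illustrative only: an existing page is redirected to its /show/ URL,
    anything else gets the normal page-creation form (modelled as 200).
    """
    if name not in existing:
        # The page does not exist yet, so there is nothing to redirect to.
        return "200 OK", []
    # 301 says the move is permanent, 302 says it is temporary.
    status = HTTPStatus.MOVED_PERMANENTLY if permanent else HTTPStatus.FOUND
    status_line = f"{status.value} {status.phrase}"
    return status_line, [("Location", f"/nlab/show/{name}")]


print(new_page_response("Gamma-space"))
print(new_page_response("Gamma-space", permanent=False))
```

Whether a search engine treats the two status codes differently for a URL it is not allowed to crawl is exactly what the sketch cannot show.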
Given that these pages were turning up before the redirection code was put in place, I’d still like to see more evidence that this change would fix things before making the change.
The show/* pages are not in robots.txt. The /new/ pages are.
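One way to double-check which paths such a rule covers is Python's standard urllib.robotparser. The sketch below assumes a robots.txt consisting of a single Disallow: /nlab/new rule under a wildcard user agent; the real file may well contain more, and the helper name may_fetch is just for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, not a copy of the real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /nlab/new
"""


def may_fetch(path, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if robots_txt allows `agent` to request `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The rule is a plain prefix match, so it covers /new/ but not /show/.
print(may_fetch("/nlab/new/Gamma-space"))
print(may_fetch("/nlab/show/Gamma-space"))
```

This only says what a well-behaved crawler is permitted to request. It says nothing about what a search engine chooses to list.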
The really odd thing here is that the page /show/Gamma-space is not showing up at all in a search. Even if I explicitly search for ncatlab.org/nlab/show/Gamma-space or Gamma-space site:ncatlab.org then it doesn’t appear. This seems to be more than a redirect would cause. Again, this problem predates the redirection.
Sorry, I meant /new/* pages instead of /show/* pages.
For example, the page http://www.hochmanconsultants.com/articles/301-versus-302.shtml says “If a 302 is used instead of a 301, search engines might continue to index the old URL, and disregard the new one as a duplicate.”
This seems to be the case here: Google indexes /new/Gamma-space, which 302 redirects to /show/Gamma-space, and the latter is ignored because the redirect is a 302. With a 301 redirect, the former would be ignored instead.
That still doesn’t explain why we were seeing this behaviour before the redirects were put in place, since then there was no link between the /new/ and the /show/ pages so the /show/ pages should have shown up in the search. But they didn’t. There are no links that I can find to /new/Gamma-space and there are plenty of links to /show/Gamma-space. I can understand that when Google follows /new/Gamma-space and gets to /show/Gamma-space then it doesn’t index it, but it ought to index it if it gets there directly, shouldn’t it?
As far as search engines are concerned, the best thing to do with /new/ pages that do actually exist is simply to send a 404. However, that’s not helpful to people, which is why we chose the redirection method.
The real difficulty here is that it is very hard to find out what Google, and other search engines, actually do. That article you linked to is full of supposition. So it’s very hard to judge whether changes we make actually fix things in the Google search list. What is easier to do is to ensure that if Google indexes the wrong thing then someone ends up on the right page when they get to the nLab.
Ideally we’d like to do both.
Nonetheless, I’ve made the change in the code and we can monitor what happens.
It would be useful to know when Google indexed that /new/ address and where it got it from. I don’t know how to get that information, though.
“but it ought to index it if it gets there directly, shouldn’t it?”
I am not sure about this. Google actively tries to remove duplicates from its search results, and redirects are considered duplicates, so it seems plausible that it simply does not index /show/Gamma-space.
I think it’s best to wait a month and see if Google’s results change.
I also think it makes sense to remove /nlab/new from robots.txt, at least for the duration of the experiment, because Google might be confused by it. In fact, I think Google won’t even see the 301 redirect unless it is allowed to access /new/Gamma-space, which is only possible if /nlab/new is removed from robots.txt.
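The last point can be modelled with a toy "polite crawler" in Python. This is only a sketch of the reasoning, not a description of how Google works: the function discovered_target, the agent name ExampleBot and the REDIRECTS table are invented for the example, and the robots.txt lines are an assumed minimal file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical redirect table standing in for the wiki's behaviour.
REDIRECTS = {"/nlab/new/Gamma-space": "/nlab/show/Gamma-space"}


def discovered_target(path, robots_lines, redirects=REDIRECTS, agent="ExampleBot"):
    """Return the redirect target a polite crawler would learn for `path`.

    Returns None when robots.txt stops the crawler from requesting the URL,
    because in that case it never sees any status code, 301 or 302.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if not parser.can_fetch(agent, path):
        return None
    return redirects.get(path)


with_rule = ["User-agent: *", "Disallow: /nlab/new"]
without_rule = []  # robots.txt with the /nlab/new rule removed

print(discovered_target("/nlab/new/Gamma-space", with_rule))
print(discovered_target("/nlab/new/Gamma-space", without_rule))
```

In this model the choice between 301 and 302 can only matter once the Disallow rule is gone, which is the reason for suggesting the experiment.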
I am somewhat wary of removing the /new/ from robots.txt simply because the vast majority of /new/ pages do not exist, and for those we really would rather that the bots didn’t follow them.
But your last remark confuses me a little: if it didn’t see the 301, surely it wouldn’t see the 302. The problem, as you describe it, is that the 301 is stopping it from removing the /new/Gamma-space from its index. I don’t care if it doesn’t see that /new/Gamma-space redirects to /show/Gamma-space because no-one should be going to /new/Gamma-space. What bothers me is that /new/Gamma-space is being indexed and /show/Gamma-space is not. So if the robots.txt is stopping Google from following /new/Gamma-space and figuring out where it goes to, that’s not a problem so long as it is picking up /show/Gamma-space from somewhere. What you’ve suggested is that the 301 means that when it reaches /show/Gamma-space then it figures “I already know this as /new/Gamma-space so I’ll not bother with it.” and that the 302 means that it now thinks “Ah, even though this is linked to /new/Gamma-space then this is the one I should have.”. As /show/ isn’t blocked by robots.txt then it will do so. It would be a pretty poor design if the robot block on /new/ also applied to /show/!
So I think I’ll leave robots.txt alone for now. We can revisit it later.
FWIW, 301 makes more sense than 302 to me anyway. We don't expect that page to go anywhere.
I think I got 301 and 302 confused in my previous post. Hopefully it is understandable.
Just to prove that I didn’t get confused when doing the implementation:
HTTP/1.1 301 Moved Permanently
Date: Wed, 04 Sep 2013 06:33:08 GMT
Server: Apache/2.2.16 (Ubuntu)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.7
Cache-Control: no-cache
X-Runtime: 7
Location: http://ncatlab.org/nlab/show/Gamma-space
Content-Length: 106
Status: 301
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
<html><body>You are being <a href="http://ncatlab.org/nlab/show/Gamma-space">redirected</a>.</body></html>
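To repeat this check for other pages, a response head like the one above (however it was captured) can be reduced to the two fields that matter here. The following is a small sketch using only the Python standard library; the function name summarize_redirect and the shortened SAMPLE text are illustrative.

```python
from email.parser import Parser


def summarize_redirect(raw_head):
    """Return (status_code, location) from a raw HTTP response head."""
    # The first line is the status line; the rest are ordinary headers.
    status_line, _, header_text = raw_head.partition("\n")
    code = int(status_line.split()[1])
    headers = Parser().parsestr(header_text, headersonly=True)
    return code, headers["Location"]


# Abbreviated copy of the response head quoted in this thread.
SAMPLE = (
    "HTTP/1.1 301 Moved Permanently\n"
    "Location: http://ncatlab.org/nlab/show/Gamma-space\n"
    "Content-Type: text/html; charset=utf-8\n"
)

print(summarize_redirect(SAMPLE))
```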
Apparently 301 redirects didn’t help: searching for Gamma-space on Google still gives the same result. It seems that the underlying cause of this is that robots.txt has a Disallow: /nlab/new line.
When Google indexes http://ncatlab.org/nlab/new/Gamma-space, it appears that it sees the 301 redirect to http://ncatlab.org/nlab/show/Gamma-space, and, given the fact that /nlab/new/ is disallowed by robots.txt, it automatically infers that the corresponding /nlab/show/ page should not be indexed either.
From the page https://www.google.com/webhp#filter=0&q=%22/nlab/new/%22+site:ncatlab.org it seems that whenever an /nlab/new/ page is linked, Google simply will not index the corresponding /nlab/show/ page. In other words, the scenario in Andrew Stacey’s remark “It would be a pretty poor design if the robot block on /new/ also applied to /show/!” seems to be what actually happens.
Perhaps removing the Disallow: /nlab/new line from robots.txt will resolve the problem? Of course, bots would then index it, but since these pages have no content, they will rank very low in search results.