Thursday, June 29, 2006

robots.txt FAQ

When does a robots.txt file take effect?
Many search engines only download the file once per day, but they may send in many robots later that same day. So, it is extremely important to make the change to the robots.txt file one day, and then wait until the next day to create a subdirectory and place files into it. Always plan your changes one day ahead of time, or you may be very surprised to see your hidden pages appear in the SERPs.
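
As a rough way to confirm that a new rule is in place before you create the subdirectory, the sketch below uses Python's standard urllib.robotparser module to test a URL against the text of a robots.txt file. The /private/ directory, the ExampleBot user agent, and the is_allowed helper are hypothetical names chosen for illustration. It only shows how a well-behaved parser reads the rules, not when any particular search engine will refresh its copy of the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block every robot from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/page.html"):
        url = "http://www.mydomain.com" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Running the same check against the file that is actually live on your server, one day before the new pages go up, is one way to follow the "plan one day ahead" rule.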

Will a robots.txt file stop a page from appearing in the SERPs?
No. Even if you follow the procedure above, most robots will follow a link up to that page, and then cache just a link to that page, which includes the page title (in the head section of the HTML code) and the URL of the page itself. There will not be a description, and it will not likely rate very highly in the SERPs, but it will definitely be there and you can find it by using the "site:" command (or by searching for the entire page title in quotation marks).

What happens to a page that is already in the SE's cache?
The robot will try to access and update the page, but it will be blocked, so it will keep the cache that it already has. In a few weeks (or months, depending upon the engine) the snippet and the link to the cache of the page will eventually disappear, but the link to the page will remain in the search engine. Also, the page title that was in the cache prior to the change will appear in the SERPs for a very long time, even if you change the page title after blocking the robots' access to the page.

Will a robots.txt file stop every robot?
No. There are many, many rogue robots that ignore the robots.txt file. Examples of this are sweepstakes scraper sites that want to show every sweepstakes for their visitors. Their bots completely ignore the robots.txt file (and all other META tag robot instructions).

Is a robots.txt file necessary?
No, there are lots of sites on the web that do not have this file in place. However, you will see lots of 404 errors related to the robots.txt file in your server logs, so creating a robots.txt file that specifically allows the robots in will eliminate these errors and reduce your server log file size.
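
A minimal sketch of generating such a file is shown here, assuming the common convention that an empty Disallow line under "User-agent: *" lets every robot in. The build_robots_txt helper and its argument are hypothetical, not part of any standard tool.

```python
def build_robots_txt(disallowed_paths=()):
    """Build robots.txt text for all robots.

    With no paths given, the file contains an empty Disallow line,
    which conventionally means nothing is blocked.
    """
    lines = ["User-agent: *"]
    if disallowed_paths:
        for path in disallowed_paths:
            lines.append("Disallow: " + path)
    else:
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Allow-all file: its presence is what stops the 404 entries.
    print(build_robots_txt())
    # Same helper, blocking one hypothetical directory.
    print(build_robots_txt(["/private/"]))
```

Saving the first output as robots.txt in the root of the site gives the robots a file to find, so requests for it no longer show up as 404 errors.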

Is a robots.txt file needed if there are no links to any pages in a site?
No, the robots do not snoop around looking for hidden subdirectories with hidden files. HOWEVER, depending upon the registrar, new domains get submitted to search engines automatically and they send their robots out to index the site. ALSO, any link from a page that is part of the existing "visible" web will provide a path for the robots to follow, and they will all eventually follow that link in and cache whatever pages they can "get" so it is best to create a robots.txt file first. This will make certain that no legitimate search engine gets your hidden pages and caches them.

Can a robots.txt file delete pages from the search engines?
No, not by itself. Only Google has a URL removal tool that can be used in conjunction with the robots.txt file, and great care must be taken or your entire site can be deleted from their index for 180 days. Note that pages will return to Google's index after 180 days if the robots.txt restriction is removed. None of the other engines have a URL removal tool at the moment, so if they already have a cache of the page, they will not delete it based on the robots.txt file. The page will remain for what seems like an eternity, even after it is putting out a 404 server header code (page not found).

Tuesday, June 27, 2006

Comparing the "site:" command results

The results from the "site:" command should ideally be identical in all three search engines (Google, Yahoo, and MSN) for a well constructed site that has no problems. There are always minor variations, but comparing the results can often point out large problems that need to be fixed.

First, you need to know exactly how many pages are in your site. Then, when you run the site command, you should see that same number of pages in each search engine's index (give or take a few pages due to indexing variations for PDF, SWF files, etc.). If you see any major difference between the number of pages in the index and the actual number of pages in your site, then you have a problem and need to look closer at the pages that are returned.
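
The comparison can be sketched in a few lines, assuming you type in the counts yourself after running the site command in each engine. The function name, the sample numbers, and the 5% tolerance are all made up for illustration; pick whatever "give or take a few pages" means for your site.

```python
def find_index_discrepancies(actual_pages, reported_counts, tolerance=0.05):
    """Return the engines whose count is off by more than the tolerance.

    actual_pages: the real number of pages in the site.
    reported_counts: dict of engine name -> count from the site: command.
    tolerance: allowed difference as a fraction of actual_pages.
    The result maps each flagged engine to (reported - actual).
    """
    flagged = {}
    for engine, reported in reported_counts.items():
        difference = reported - actual_pages
        if abs(difference) > actual_pages * tolerance:
            flagged[engine] = difference
    return flagged


if __name__ == "__main__":
    # Hypothetical counts entered by hand for a 200-page site.
    counts = {"Google": 203, "Yahoo": 340, "MSN": 198}
    print(find_index_discrepancies(200, counts))
```

A positive difference suggests extra URLs in the index (old or duplicate pages), while a negative one suggests pages that are not being indexed.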

This can be a real problem if you have more than 1,000 pages, because that is the maximum number of pages that Google and Yahoo return at the moment, while MSN only returns 250 pages max. If you do have more pages in your site, you will need to be more specific in your search string by adding a search term that should be found on each page, as in, "blue widget site:www.mydomain.com" and be sure to enter a space between each search term. It appears that the search term can be either before or after the site command, but MSN shows a larger number of pages returned if the search terms are placed before the site command.
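
If you run a lot of these searches, a small helper can keep the query strings consistent. This is only a sketch; build_site_query is a hypothetical name, and it places the search terms before the site command, following the MSN observation above.

```python
def build_site_query(domain, terms=()):
    """Build a search string such as 'blue widget site:www.mydomain.com'.

    Search terms go first, separated by single spaces, and the
    site: command goes last.
    """
    parts = list(terms) + ["site:" + domain]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_site_query("www.mydomain.com"))
    print(build_site_query("www.mydomain.com", ["blue", "widget"]))
```

Paste the resulting string into each search engine; using a different term that appears on a different group of pages lets you work through a large site in slices that stay under the result limits.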

At the moment, MSN has the fewest supplemental pages (usually old, dead links), followed by Google, and then Yahoo, which seems to hang onto dead links forever. This may be part of Yahoo's claim that their index is larger than Google's, but it seems pointless to me to keep pages in the index that have been putting out 404 server header codes for over a year.

Monday, June 26, 2006

Finding Duplicate Content with the "site:" command

One way to find duplicate content problems is to use the "site:" command in all of the search engines. This will show you exactly what pages the search engine sees in a site, and how it sees them. This command may be entered with or without the www prefix (site:mydomain.com or site:www.mydomain.com), and the total number of results can differ between the two forms if you have subdomains.

If you do have subdomains, leaving the www out of the query will bring up all of the subdomain URLs in addition to the primary domain name. You can isolate subdomains by entering the entire subdomain string (site:subdomain.mydomain.com) so you can look at just those pages.

If you start looking and finally see the dreaded comment, "repeat the search with the omitted results included" at the end of the results, there is a duplicate content problem. Clicking on that link will bring up everything, including the duplicate results. However, the duplicates will be mixed up in the standard results, so it is a good idea to print out the standard results beforehand so you can compare the results and search for the duplicates.

Yahoo - repeat the search with the omitted results included.
Google - repeat the search with the omitted results included.
MSN - does not display this option at this time.
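
Instead of comparing printouts by eye, the two result lists can be compared with a short script. This sketch assumes you have copied the URLs from both result sets into two lists by hand; find_omitted_urls and the sample URLs are hypothetical.

```python
def find_omitted_urls(standard_results, full_results):
    """Return URLs that only appear once omitted results are included.

    standard_results: URLs from the normal site: listing.
    full_results: URLs from the listing with omitted results included.
    The order of full_results is preserved.
    """
    seen = set(standard_results)
    return [url for url in full_results if url not in seen]


if __name__ == "__main__":
    standard = [
        "http://www.mydomain.com/",
        "http://www.mydomain.com/widgets.html",
    ]
    full = [
        "http://www.mydomain.com/",
        "http://www.mydomain.com/index.html",
        "http://www.mydomain.com/widgets.html",
    ]
    for url in find_omitted_urls(standard, full):
        print(url)
```

The URLs it prints are the ones the engine filtered out of the standard results, which makes them the first place to look for duplicate content.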

Wednesday, June 07, 2006

SEO 101 - Validate your HTML code first...

The very first thing to look at when a page is having trouble getting ranked is the document structure of the page itself. If there are HTML formatting errors in the page, it is going to have a hard time getting into spot #1.

Think like a search engine for a second: you are comparing two pages for a position in your SERPs, and both pages score equally in all respects except for one small point - one page is correctly formatted, while the other has numerous HTML errors. Which page would you put on top of the other in your SERPs and deliver to your customer first?

I once heard a Google engineer say that 60% of the pages on the web fail the W3C validation test, but they do not omit those pages because that would eliminate 60% of the web pages in their SERPs. However, just because they include those pages does not mean that they will rate them very highly, especially when they are so concerned with delivering solid, relevant results to their customers. If a page did not appear correctly in all browsers because it was full of HTML formatting errors, would you really want to put that page up on top of your SERPs?

The obvious answer is, "No, of course not," so when two pages would otherwise tie for a spot, the validated page will usually win over the page with errors.

So, lesson #1 is to be certain that the HTML code on your page validates correctly, and then you can start to look at other issues of SEO.

The W3C validator is here: http://validator.w3.org/