Sunday, April 30, 2006

How to find Duplicate Content in a search engine.

A salesman for a client of mine bought his own domain when he joined the company, and then set up a program to copy the client's entire web site. He then installed this copy on his own domain. He changed only the phone number and replaced a graphic image at the top of the page with one that advertised his location, but everything else in the web site was identical to (in other words, duplicate content of) the manufacturer's web site. Then, he forgot all about it.

He didn't even realize that it would be a problem. It was only when I came on the scene a few years later and started looking at their duplicate content problem that I discovered this web site and asked the question, "Who is this?"

Nobody at my client's company had any idea, so I contacted him via the e-mail address supplied in the WHOIS record and finally got a reply that said, "I work for this company. I am one of the salesmen."

I had him replace the copied site on his URL with a single landing page, and within a few weeks, my client's web site began to climb up to the top of the results.

So, how did I find this duplicate content?

There is a web site called "copyscape.com", but it will not find 100% of the duplicate content on the web. It will help you find blatant examples of stolen duplicate content on the Internet. However, there are more subtle examples, and these require more work to verify. Fortunately, each search engine will allow you to find your own examples of duplicate content in its very own cache.

Although it can be very time consuming, it is well worth spending the time to uncover duplicate content via this simple procedure:

Grab a random section of text on a page, and paste that into a search window inside quotation marks. You may need to run several different sections of the text to uncover all of the duplicates, but this is the only way to truly see what is in each search engine's cache.
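If you have many pages to check, the procedure above can be partly automated. The following is only a rough sketch of one way to do it: it picks random runs of consecutive words from a page's text and wraps each one in quotation marks, ready to paste into a search window. The function names, the default phrase length of eight words, and the example search address are all my own assumptions, not anything a search engine prescribes.

```python
import random
from urllib.parse import quote_plus


def quoted_queries(page_text, count=3, words_per_query=8, seed=None):
    """Pick random runs of consecutive words and wrap each in quotation marks."""
    words = page_text.split()
    # If the page is shorter than one phrase, use all of its text as the query.
    if len(words) <= words_per_query:
        return ['"' + " ".join(words) + '"']
    rng = random.Random(seed)
    queries = []
    for _ in range(count):
        start = rng.randrange(len(words) - words_per_query + 1)
        phrase = " ".join(words[start:start + words_per_query])
        queries.append('"' + phrase + '"')
    return queries


def search_url(base, query):
    """Append the URL-encoded query to a search address.

    `base` is a hypothetical address such as "https://search.example.com/?q=";
    substitute the real address of each engine you want to check.
    """
    return base + quote_plus(query)
```

Run the resulting queries in each engine separately, and note any result that is not one of your own pages.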

You will need to do this search in each search engine, because each engine has its own cache and its own criteria for detecting duplicate content. Some are better than others at detecting the problem, but if you can find and correct it in any one engine, the problem will gradually fade away from ALL of the search engines.

Thursday, April 27, 2006

Duplicate Content on Windows servers

It is very ironic that the inventors at Microsoft would allow capitalization to be random on a Desktop computer. In other words, “MyFile.HTM” is the same as “myfile.htm” on a Windows PC or server (both capitalization spellings will bring up the same file).

However, their very own search engine behaves the way a case-sensitive UNIX system would, and sees the variations in capitalization as duplicate content. The MSN search engine sees these as two separate files, and will index them as two unique URLs.

In fact, ALL of the search engines see these capitalization variations as two unique pages, even though it is the same page. Remember that because UNIX is case sensitive, it sees those two names as completely different files, which forces programmers to write code and create links with consistent capitalization.

Unfortunately, Microsoft’s servers allow programmers to use lowercase in one link and uppercase in another link, and both links deliver the same results. So, if you use the uppercase “MyFile.HTM” in one place on a web site, and use a lowercase “myfile.htm” in a different link within the same web site, all of the search engines see this as two completely different pages with exactly identical content, and the duplicate content filter is tripped, which ultimately pushes both pages down in the SERPs.
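One way to catch this on your own site is to collect every link your pages use and look for addresses that differ only in capitalization. The snippet below is a minimal sketch of that idea, assuming you already have the list of links (from a crawl or from your server logs). The function name is my own, and lowercasing the whole address is a simplification that only makes sense for a server that ignores case.

```python
from collections import defaultdict


def case_variants(urls):
    """Group URLs that are identical once capitalization is ignored.

    Returns only the groups with more than one spelling, keyed by the
    lowercase form of the address.
    """
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    return {key: sorted(spellings)
            for key, spellings in groups.items()
            if len(spellings) > 1}


links = [
    "http://www.mydomain.com/MyFile.HTM",
    "http://www.mydomain.com/myfile.htm",
    "http://www.mydomain.com/about.htm",
]
print(case_variants(links))
```

Every group it reports is a page that the search engines may be indexing more than once. Pick one spelling and change the other links to match.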

Wednesday, April 26, 2006

Is there a Duplicate Content Penalty?

There have been a few comments on the web that the Duplicate Content Penalty is a myth.

There may not be a duplicate content penalty, but it certainly feels like a penalty when the filter is applied. Whether the penalty is a myth or not, duplicate content will push your page down in the SERPs. Therefore, duplicate content is something you need to avoid at all costs.

So, how do you know that you have tripped the duplicate content filter?

The classic symptoms are:

You make a drastic change to a page, and it appears at the top of the results for a short time, but then it starts to sink down in the results over time. This occurs because for a brief moment in time, the page content is unique and it ranks highly. Then, the robot comes into that page via a link from a different place on the web, reads the same content, and then compares the fingerprints of the two pages. As soon as the search engine determines that both of the pages are identical, it trips the duplicate filter, and both of the pages get pushed down in the results.
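The search engines do not publish how they build or compare these fingerprints, so the following is only an illustration of the general idea, not their actual method. It assumes the simplest possible fingerprint: a hash of the page text after whitespace and capitalization are normalized. The function names are my own.

```python
import hashlib
import re


def fingerprint(page_text):
    """Return a hash of the page text with whitespace and case normalized."""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(text_a, text_b):
    """Two pages count as duplicates here only if their fingerprints match exactly."""
    return fingerprint(text_a) == fingerprint(text_b)
```

A hash like this only catches exact copies. A real engine presumably tolerates small differences, such as a changed phone number, which this sketch would not.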

What are other symptoms of duplicate content?

A very good indicator is that the search engine uses a description for the snippet below the page title that does not appear on the page itself. The description may come from the DMOZ description, or it may come from a directory within the search engine itself (the Yahoo! Directory results, for example).

So, what are the causes of duplicate content within a site?

Affiliate strings are one of the main causes of duplicate content, and typically take the form of a query string, such as “http://www.mydomain.com/default.aspx?source=affilitate1” (without the quotation marks).

Even MSN, who provided the query string capabilities, cannot discern the difference, so each query string URL is seen as a unique page, and the duplicate content filter gets tripped.

Why? The search engine robot will follow the link from an affiliate’s web site back to the main URL and see the link as pointing to a unique page. Once it compares the fingerprint of that affiliate link with the fingerprint of the original page on your web site, it discovers that they are identical, and then it trips the duplicate content filter.
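To see how many of your own URLs collapse into the same page, you can strip the affiliate parameter and compare what is left. This is a rough sketch under my own assumptions: the parameter name "source" is taken from the example URL above, and the function name is hypothetical. Adjust the set of names to whatever your affiliate links actually use.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "source" is the affiliate parameter, as in the example URL.
TRACKING_PARAMS = {"source"}


def canonical_url(url, tracking_params=TRACKING_PARAMS):
    """Return the URL with the affiliate parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name.lower() not in tracking_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


print(canonical_url("http://www.mydomain.com/default.aspx?source=affilitate1"))
```

Any two inbound links that reduce to the same address this way are candidates for tripping the filter.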

So, why can’t the search engine determine that my main page is the original one, and only punish the second example that it finds on the web?

How is it supposed to know that? It is easy to fake a “last-modified” setting in the header response. The simplest way is to set the date in the server one day ahead of every other server on the planet (who’s to say that the server is on one side or the other of the “International Date Line”).

The other really big problem is “spider time.” Suppose the spider indexes the main site’s page on Monday, you make a change to that same page on Tuesday, and then the spider follows another link on the web into that same page on Wednesday. How is the search engine supposed to decide which came first? Is it the cache of that page from Monday? Or is it the cache of the same page from Wednesday?

What about the robot’s cache from another link on Thursday, which may be the same as the cache from Wednesday? Or, maybe you decide to undo your Tuesday changes, and go back to Monday’s version on Wednesday?

With so many possibilities, the easiest thing for the engine to do is to “assume” that it is being “spammed,” and both pages get pushed down in the results.

After all, the very last thing that any search engine wants is to display ten identical pages for the top ten results, so the easiest thing is to push the confusion down and move on to the next site.

So, there may not be a duplicate content penalty, but there certainly is a filter, and it can seem like a penalty to any Webmaster who is unlucky enough to accidentally encounter it.