There have been a few comments on the web that the Duplicate Content Penalty is a myth.
There may not be a duplicate content penalty, but it certainly feels like a penalty when the filter is applied. Whether the penalty is a myth or not, duplicate content will push your page down in the SERPs. Therefore, duplicate content is something you need to avoid at all costs.
So, how do you know that you have tripped the duplicate content filter?
The classic symptoms are:
You make a drastic change to a page, and it appears at the top of the results for a short time, but then it sinks down in the results over time. This occurs because, for a brief moment, the page content is unique and it ranks highly. Then the robot comes into that page via a link from a different place on the web, reads the same content, and compares the fingerprints of the two pages. As soon as the search engine determines that the two pages are identical, it trips the duplicate filter, and both pages get pushed down in the results.
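The fingerprint comparison described above can be pictured with a small sketch. Search engines do not publish how they fingerprint pages, so everything here is an assumption made for illustration: the SHA-256 hash, the whitespace and case normalization, and the function names fingerprint and is_duplicate are all hypothetical.

```python
import hashlib
import re


def fingerprint(page_text: str) -> str:
    """Return a hex digest of the page text after simple normalization."""
    # Collapse runs of whitespace and ignore case, so that trivial
    # formatting differences do not change the fingerprint.
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(text_a: str, text_b: str) -> bool:
    """Treat two pages as duplicates when their fingerprints match."""
    return fingerprint(text_a) == fingerprint(text_b)


# The same content reached through two different links (made-up text).
original = "Widgets for sale.\nBest prices on the web."
same_page_other_link = "Widgets for sale.   Best prices on the web."
# The page after a drastic change.
changed_page = "Widgets for sale. New lower prices this week."

print(is_duplicate(original, same_page_other_link))
print(is_duplicate(original, changed_page))
```

An exact hash like this only catches identical text. A real engine presumably tolerates small differences, which this sketch does not attempt.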
What are other symptoms of duplicate content?
A very good indicator is that the snippet the search engine shows below the page title uses a description that does not appear on the page itself. The description may come from DMOZ, or it may come from a directory within the search engine itself (the Yahoo! Directory results, for example).
So, what are the causes of duplicate content within a site?
Affiliate strings are one of the main causes of duplicate content, and typically take the form of a query string, such as “http://www.mydomain.com/default.aspx?source=affiliate1” (without the quotation marks).
Even MSN, which provided the query string capabilities, cannot discern the difference, so each query string URL is seen as a unique page, and the duplicate content filter gets tripped.
Why? The search engine robot will follow the link from an affiliate’s web site back to the main URL and see the link as pointing to a unique page. Once it compares the fingerprint of that affiliate link with the fingerprint of the original page on your web site, it discovers that they are identical, and then it trips the duplicate content filter.
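To see why the robot counts an affiliate link as a separate page, compare the URLs as plain strings: they differ, even though the server returns the same content for both. The sketch below uses Python's standard urllib.parse module. The TRACKING_PARAMS set and the canonical_url helper are hypothetical names chosen for this example, and the assumption that the “source” parameter never changes the page content is ours, not the search engine's.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters that only identify the referrer
# and do not change what the page displays.
TRACKING_PARAMS = {"source"}


def canonical_url(url: str) -> str:
    """Return the URL with referrer-only query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    # Rebuild the URL without the dropped parameters or any fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


main_url = "http://www.mydomain.com/default.aspx"
affiliate_url = "http://www.mydomain.com/default.aspx?source=affiliate1"

# As raw strings the two URLs are different "pages".
print(main_url == affiliate_url)
# Once the affiliate parameter is ignored, they are the same page.
print(canonical_url(main_url) == canonical_url(affiliate_url))
```

A robot that does not know which parameters are safe to ignore has no choice but to treat every query string as a distinct URL, which is the situation described above.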
So, why can’t the search engine determine that my main page is the original one, and only punish the second example that it finds on the web?
How is it supposed to know that? It is easy to fake a “last-modified” setting in the header response. The simplest way is to set the date on the server one day ahead of every other server on the planet (who’s to say which side of the “International Date Line” the server is on?).
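The reason the header cannot be trusted is that it is only a text value the server chooses to send. As a rough sketch, the snippet below builds a Last-Modified value with Python's standard email.utils helpers. The last_modified_header function, its clock_skew parameter, and the date used are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_header(changed_at: datetime,
                         clock_skew: timedelta = timedelta(0)) -> str:
    """Build a Last-Modified value from whatever time the server reports.

    clock_skew stands in for a server clock that is deliberately set wrong.
    """
    return format_datetime(changed_at + clock_skew, usegmt=True)


# A made-up moment when the page really changed.
real_change = datetime(2006, 3, 6, 12, 0, tzinfo=timezone.utc)

honest = last_modified_header(real_change)
skewed = last_modified_header(real_change, clock_skew=timedelta(days=1))

print("Last-Modified:", honest)
print("Last-Modified:", skewed)

# Both values are well-formed; nothing in the header reveals the skew.
print(parsedate_to_datetime(skewed) - parsedate_to_datetime(honest))
```

Both headers parse cleanly, so a robot that receives only one of them cannot tell whether the server's clock was honest.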
The other really big problem is “spider time.” Suppose the spider indexes the main site’s page on Monday, you make a change to that page on Tuesday, and then the spider comes into that same page via another link on the web on Wednesday. How is the search engine supposed to decide which came first? Is it the cache of that page from Monday? Or is it the cache of the same page from Wednesday?
What about the robot’s cache from another link on Thursday, which may be the same as the cache from Wednesday? Or maybe you decide to undo your Tuesday changes, and go back to Monday’s version on Wednesday?
With so many possibilities, the easiest thing for the engine to do is to “assume” that it is being “spammed,” and both pages get pushed down in the results.
After all, the very last thing that any search engine wants is to display ten identical pages for the top ten results, so the easiest thing is to push the confusion down and move on to the next site.
So, there may not be a duplicate content penalty, but there certainly is a filter, and it can seem like a penalty to any Webmaster that is unlucky enough to accidentally encounter it.