Reasons for Replicating Data
According to a study by Krishna Bharat and Andrei Broder, there are several reasons why
data are replicated or mirror sites are created: load balancing, high availability,
multi-lingual replication, franchises or local versions, database sharing, virtual hosting,
and maintaining pseudo identities.
In load balancing, data is replicated to reduce the load on each server.
Instead of having just one server handle all the traffic from web surfers interested in the data
or content, the site is mirrored or the data replicated so that the traffic is split between two
or more servers.
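The traffic splitting described above can be illustrated with a minimal sketch of round-robin rotation, one simple way to spread requests across mirrors. The mirror host names and the helper name make_round_robin are hypothetical and do not come from the study. Real load balancers usually also weigh server health and capacity.

```python
from collections import Counter
from itertools import cycle

# Hypothetical mirror host names, used only for illustration.
MIRRORS = ["mirror-a.example.com", "mirror-b.example.com", "mirror-c.example.com"]

def make_round_robin(servers):
    """Return a function that hands out servers in rotating order."""
    rotation = cycle(servers)

    def pick_server():
        return next(rotation)

    return pick_server

pick_server = make_round_robin(MIRRORS)

# Simulate nine incoming requests and count how many each mirror receives.
load = Counter(pick_server() for _ in range(9))
print(load)
```

With three mirrors and nine requests, each mirror handles three requests instead of one server handling all nine.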
Data are also replicated to improve availability.
For example, an organization may mirror its data across its own sites in different geographic
locations so that the data are readily accessible from each of them.
Multi-lingual replication of data is also very common. Data translated into different languages
are very useful for reaching a wider audience whose members all need access to the same data.
Good examples of multi-lingual replication are the many Canadian sites that are identical in everything
except the language of the content, which is either English or French.
Data is also replicated for franchises or local versions. This happens when data or
content is franchised to another company, which then offers the very same data or product
under different branding.
Sometimes data is replicated unintentionally. This happens when two independent websites
share a common database or file system. Sharing a database sometimes results in mirroring
even though the websites did not intend it.
Virtual hosting also sometimes results in mirroring. This happens with services that have different
websites and host names but use the same IP address and server. The path on one site is the
valid one, while the same path on the other site simply returns an identical web page.
The last reason, unlike the first six, is often not a valid reason for site mirroring.
Mirroring to maintain pseudo identities is often done to spam search engines with
different websites carrying the same content as a means of getting a higher page ranking.
This practice is considered unacceptable and is one of the main reasons why search engines tend to
be averse to identical content or replicated data.
Google’s Webmaster Guidelines on Duplicate Content
Search engines are so opposed to replicated data that Google even warns
against it in its Webmaster Guidelines. Google’s Webmaster Guidelines are a list of
Do’s and Don’ts that websites ought to follow to help the search engine find,
index, and rank them. Following the Do’s increases the chance that
Google will list a specific website and rank it favorably. Doing any of the
Don’ts, however, can detract from a website’s rank.
The quality section of the guidelines states clearly that websites
should not create multiple pages, subdomains, or domains with substantially duplicate content.
The term duplicate content is ambiguous, however, since it isn’t clear how much duplicated text
it takes for search engines like Google to penalize a page. It might take ten words, a
sentence, a paragraph, or an entire document or page for content to be considered
duplicate. The key thing to remember is that the guideline warns against pages with
substantially duplicate content, so to be on the safe side it is better to always publish
fresh, original content. This is not always possible, however, especially when quoting articles, so
it is your call to decide whether the duplicate content might get your website penalized.
If the duplicate content is there for the user’s benefit and not to boost
your page ranking, the crawlers will hopefully interpret it that way and not penalize your site.
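Since the guideline does not say how much overlap counts as substantial, it can help to put a number on the overlap between two of your own pages. The sketch below uses word shingles (overlapping runs of k words) and Jaccard similarity, a common way to estimate resemblance between texts. It is an illustration under stated assumptions, not Google’s actual criterion, which is not public. The function names, the sample sentences, and the choice k=3 are all assumptions.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Texts shorter than k words form a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)

# Illustrative sample texts, not taken from any real site.
page_a = "Data are replicated to balance load across several servers."
page_b = "Data are replicated to balance load across many servers."
page_c = "Search engines rank pages by relevance to the query."

print(resemblance(page_a, page_b))  # one changed word still leaves high overlap
print(resemblance(page_a, page_c))  # unrelated text shares no shingles
```

A score near 1.0 suggests the two pages are close to copies, while a score near 0.0 suggests they share little text. Where to draw the line between the two is a judgment call.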
Annoyed Surfers and Speedy Crawlers
Search engines exist to point surfers to websites containing the information relevant to
their search string. They do not exist, however, to point surfers to different websites
containing exactly or nearly the same information. When surfers click on different
links, they expect to get different web pages, perhaps with the same or a different take on
the same topic but definitely with different content. However, there are many sites out there with
partially duplicated content, and even sites whose content is simply replicated in full. Clicking on
mirror sites irritates surfers, since waiting for the same thing to load twice or
more is a waste of time. This is especially irritating if the site happens to be a spam site whose
content is of poor quality. Because of this problem, web crawlers now skip web pages or
websites that a previous crawl has identified as exact or near duplicates.
This means that the mirror sites not crawled will not even make it to the search engine’s results
listing, since only one of the duplicates is indexed by the web crawler. As a result, search engines
will not have more than one of the mirror sites in their results listing, thus avoiding irritating the
surfers.
Satisfied surfers are not the only result of the new technique crawlers use. Search engines benefit
as well, since not having to crawl mirrored pages lessens the load on the crawlers and thus
speeds up crawling. Bandwidth is also saved, resulting in a faster, more efficient
crawling operation in which the web crawler can cover and index more significant websites.
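The idea of indexing only one copy of duplicated content can be sketched as follows. This is a minimal illustration, not how any particular search engine works. It detects only exact duplicates, after lowercasing and collapsing whitespace, by comparing content hashes. Near duplicates would need a similarity measure rather than an exact hash. The URLs, page texts, and function names are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Hash the page text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def index_unique(pages):
    """Keep the first URL seen for each distinct content fingerprint.

    `pages` is an iterable of (url, text) pairs standing in for fetched pages.
    Returns a pair of lists: (indexed_urls, skipped_urls).
    """
    seen = set()
    indexed, skipped = [], []
    for url, text in pages:
        key = fingerprint(text)
        if key in seen:
            skipped.append(url)  # same content already indexed elsewhere
        else:
            seen.add(key)
            indexed.append(url)
    return indexed, skipped

# Hypothetical crawl results: the second page mirrors the first.
pages = [
    ("http://site-one.example/", "Welcome to our   catalogue."),
    ("http://mirror.example/", "welcome to our catalogue."),
    ("http://other.example/", "A different page entirely."),
]

indexed, skipped = index_unique(pages)
print(indexed)
print(skipped)
```

Only the first copy of the mirrored page reaches the index. Once a crawler has recorded a site as a mirror, it can also skip fetching it on later crawls, which is where the savings in load and bandwidth come from.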
Valid Mirrored Sites
However, owners of valid mirror sites like those mentioned above (multi-lingual, franchise, etc.)
should have little to worry about, since search engines have provisions for such cases and take
the motive behind them into account. You can help your mirror site by making sure that you follow
all the other guidelines to get noticed and ranked by Google.
Following the guidelines is likely to help your ranking not only with Google but with other
search engines as well.