Maximize Crawlability of WordPress Blogs and Prevent Duplicate Content - Possible WordPress Crawling Issues (Page 2 of 4)
Category, post, and archived pages contain the same content.
As on other blog publishing platforms, a new WordPress post appears by default at its post URL and also in its categories and archives. The post URL is the actual location of the post. It is the URL you want search engines to index, because it is the one that can carry keywords.
WordPress categories contain those same posts, arranged by topic. Their purpose is to classify posts, which greatly helps blog visitors. When search engine bots visit the category pages, however, they find the same information as in the posts, which creates a duplicate content problem.
Google uses the PageRank system to score and rank the importance of documents. PageRank measures popularity based on the back links pointing to a URL. This means that if a WordPress category page earns a higher PageRank than the post itself, the category page will be placed in the main Google index, while the far more important post will be buried deep in the index as a "second priority" document.
Default URLs are complex, long, and do not contain the targeted keywords.
WordPress's default URLs are unfriendly and confusing, and they can cause crawling problems, especially when they are long. The default URLs contain query strings (marked by "?") and other characters that do not help search engines crawl and index the pages. The longer those URLs are, the greater the risk that they will simply be treated as "second priority" documents in the crawling process.
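To make the query-string problem concrete, here is a small Python sketch that reports whether a URL depends on a query string. The host example.com and both sample URLs are assumptions made up for illustration; the "?p=" form followed by a post ID is the usual shape of a default WordPress permalink.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Return (is_dynamic, params) for a URL.

    is_dynamic is True when the URL carries a query string;
    params maps each query parameter to its list of values.
    """
    parts = urlsplit(url)
    return bool(parts.query), parse_qs(parts.query)

# Hypothetical URLs, for illustration only.
default_style = "http://www.example.com/?p=123"
keyword_style = "http://www.example.com/prevent-duplicate-content/"

print(describe_url(default_style))  # (True, {'p': ['123']})
print(describe_url(keyword_style))  # (False, {})
```

The second URL tells both readers and crawlers what the page is about, while the first only exposes an internal post ID.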
WordPress's front page contains the same content as the post page, archive page and category pages.
By default, WordPress's front page contains the same content as the post pages. This setup creates even more problems with Google because of its PageRank system. If the front page has more PageRank than the URLs of your posts (in practice, this is almost always the case), the front page will be prioritized in the search results, not the post URLs. Again, those post URLs will be classified as "second priority" documents and placed in the supplemental index. The risk is smaller if you post daily: the front page content then changes constantly, so viewers tend to bookmark or link to the post URLs instead.
RSS feeds and trackback pages get crawled and contain the same information as front pages, category pages and archived pages.
By default, WordPress always generates RSS feeds. These feeds are indexable by default and, again, contain the same content as the post, front page and archived pages. RSS feeds exist only to syndicate content and to notify users of updates, so their URLs are not as important as the post URL and front page.
WordPress admin pages, such as wp-login, are indexable.
Dynamic URLs in WordPress are also crawlable.
Dynamic URLs have no place in WordPress if you need to maximize search engine crawling of your blog. Dynamic URLs are used in previews, blog search results and admin pages, which are not important pages.
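The low-value URL types described so far (admin pages, dynamic URLs, feeds, trackbacks and categories) can be told apart by their shape. The following Python sketch labels a URL accordingly. The path patterns are assumptions about a typical WordPress install with keyword-style permalinks, and the function name is hypothetical; adjust both to your own setup.

```python
from urllib.parse import urlsplit

def classify_wordpress_url(url):
    """Roughly label a WordPress URL by the kind of page it points to.

    The patterns below are assumptions about a typical install,
    not a complete description of every WordPress configuration.
    """
    parts = urlsplit(url)
    path = parts.path.lower()
    trimmed = path.rstrip("/")
    if path.startswith("/wp-admin") or path.endswith("/wp-login.php"):
        return "admin"
    if parts.query:
        # Previews, blog search results (?s=...) and default ?p= links.
        return "dynamic"
    if trimmed.endswith("/feed"):
        return "feed"
    if trimmed.endswith("/trackback"):
        return "trackback"
    if path.startswith("/category/"):
        return "category"
    return "other"

# Hypothetical URLs, for illustration only.
for sample in (
    "http://www.example.com/wp-login.php",
    "http://www.example.com/?s=seo",
    "http://www.example.com/my-post/feed/",
    "http://www.example.com/my-post/",
):
    print(sample, "->", classify_wordpress_url(sample))
```

A list of crawled URLs run through a function like this shows quickly how many of them are duplicates or unimportant pages rather than actual posts.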
A default WordPress installation does not 301-redirect non-www pages to the www version.
Have you noticed that when you type the non-www version of your blog into your browser's address bar, you get the same information as the www version? This creates crawling problems and keeps search engine bots, Google's in particular, from crawling your site as fully as they could. If search engines get the chance to crawl the non-www pages, those pages end up as duplicate content and signal that your blog is confusing to crawl. When search engines cannot determine on their own which documents to index, you must provide some guidance or clues. In the SEO industry, having both www and non-www pages indexed is called a "canonical issue."
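The decision behind such a redirect is simple, as the Python sketch below shows: given a requested URL, it returns the www address that a 301 response would point to, or None when the request is already on the www host. The function name and the example.com URLs are hypothetical, and the sketch assumes the www version is the one you want indexed.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_www_url(url):
    """Return the www version of a URL, or None if it is already on www."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        return None  # Already canonical, no redirect needed.
    return urlunsplit(
        (parts.scheme, "www." + host, parts.path, parts.query, parts.fragment)
    )

# Hypothetical URLs, for illustration only.
print(canonical_www_url("http://example.com/my-post/"))
# http://www.example.com/my-post/
print(canonical_www_url("http://www.example.com/my-post/"))
# None
```

On a live site the web server would apply this rule itself and send the result in the Location header of a 301 response, so that only one version of each page gets indexed.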
Keep in mind that some installations of WordPress do not return a 404 header status for a URL that does not exist at all, but a 200 header status instead.
It is true that some WordPress blogs on some servers do not return a 404 header status even when the page does not exist. This is a strange and rare problem, but if your blog suffers from it, your pages will not get maximum crawling priority. The obvious issue is that if search engines have indexed pages that used to return 200 (OK) and should now return 404, those URLs will still return 200 (OK). Depending on your setup, they may even serve live documents from your blog, creating duplicate content.
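One way to find out whether a blog has this problem is to request an address that cannot exist and look at the status code that comes back. The Python sketch below does that under a few assumptions: the function names are hypothetical, the probe path is a random string, and http_status uses the standard library's urllib, which follows redirects before reporting a status.

```python
import urllib.error
import urllib.request
import uuid

def http_status(url):
    """Return the final HTTP status code for a URL."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx and 5xx responses; the code is on the error.
        return err.code

def returns_soft_404(base_url, fetch_status=http_status):
    """Request a path that should not exist and report whether the
    server wrongly answers 200 (OK) instead of 404."""
    probe = base_url.rstrip("/") + "/" + uuid.uuid4().hex + "/"
    return fetch_status(probe) == 200
```

Calling returns_soft_404 with your blog's address performs one real request. A result of True means the server answered 200 for a page that does not exist, which is the behavior described above.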
The current WordPress installation does not offer XML or static sitemaps to guide visitors and search engine crawlers.
This is a particular annoyance with WordPress: by default, the admin pages offer no feature for automatically creating a sitemap. Googlebot uses XML sitemaps to find the important URLs in your blog, which helps maximize the crawling of those URLs.
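For reference, an XML sitemap is just a list of URLs wrapped in a few standard tags. The Python sketch below builds a minimal one with the standard library. The function name and the post URLs are made up for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build a minimal XML sitemap string listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical post URLs, for illustration only.
print(build_sitemap([
    "http://www.example.com/my-first-post/",
    "http://www.example.com/my-second-post/",
]))
```

Listing only the post URLs, and leaving out categories, archives and feeds, tells the crawler which of the duplicate pages are the ones that matter.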
By Codex-M