Crawler challenges and why they affect you

By Dixon Jones October 15, 2012

This post has come from two imperatives. Firstly, I was talking about crawler issues with Vanessa Fox and David Burgess a couple of weeks ago at SMX Advanced in New York and wanted to share the presentation. Secondly, our users are saying that Google is not showing the effects of link removals in the light of Penguin fast enough for them. This is also affecting other link intelligence data sets, and frankly I think Majestic is doing better than most, but we are nowhere near perfect yet.

MajesticSEO cannot speak for Google. We do not “scrape” Google or try in any way to replicate their index. Our data is our own and there are remarkable differences. However, there are some interesting challenges for search engine spiders when it comes to scale. Even though you can replicate a spider over many, many machines, ultimately, however big your ability to crawl the web, you have to make choices on how best to marshal your crawl resources. If you are a “Johnny come lately” crawler that needs to build up its database as quickly as possible, then you would naturally focus on crawling new web pages, or pages on websites where you knew there were lots of outbound links (rather than ones you knew were of high quality, for example). You have limited resources, so you have to make choices. The problem here is that the more you push your efforts toward “discovery”, the less you push your resources towards “verification”, by which I mean going back to pages you have already seen to check whether they have changed.
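
The discovery versus verification trade-off can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how MJ12bot or Googlebot actually schedule their crawls: the function name `plan_crawl`, the `discovery_share` parameter and the idea of a single fixed `budget` per crawl cycle are all assumptions made for the example.

```python
def plan_crawl(new_urls, known_urls, budget, discovery_share):
    """Split a fixed crawl budget between discovery and verification.

    new_urls        -- URLs never crawled before (discovery candidates)
    known_urls      -- URLs crawled before (verification candidates)
    budget          -- total number of fetches available this cycle (assumed fixed)
    discovery_share -- fraction of the budget spent on new URLs, 0.0 to 1.0
    """
    if budget < 0 or not 0.0 <= discovery_share <= 1.0:
        raise ValueError("budget must be >= 0 and discovery_share within [0, 1]")

    # Slots for new pages, capped by how many new URLs are actually queued.
    discovery_slots = min(len(new_urls), int(budget * discovery_share))
    # Whatever is left over goes to rechecking pages we already know about.
    verification_slots = min(len(known_urls), budget - discovery_slots)

    return new_urls[:discovery_slots], known_urls[:verification_slots]


new = [f"http://example.com/new/{i}" for i in range(10)]
known = [f"http://example.com/old/{i}" for i in range(10)]

to_discover, to_verify = plan_crawl(new, known, budget=10, discovery_share=0.5)
print(len(to_discover), len(to_verify))  # 5 5
```

Pushing `discovery_share` towards 1.0 leaves no fetches at all for rechecking known pages, which is exactly the situation in which a removed link stays in the data for a long time.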

A search engine crawler’s primary goal is not to weed out the bad links. It is to find content so that another part of the algorithm can weed out the good from the bad and return the GOOD to users. If, at the crawl level, a search engine can discount large amounts of the bad by not even recrawling it, because it is perhaps not of a quality that merits revisiting very often, then this can dramatically improve the efficiency of the crawler resources in focussing on the stuff that matters.

If you are being penalised for the bad stuff, this really isn’t very helpful. Majestic solves its own dilemma of showing quality and current relevance by distinguishing between fresh data (links from URLs we have seen within a two month timespan) and everything else going back over five years. After 60 days, if a link isn’t worth seeing, we won’t have recrawled it and it will have dropped out of the Fresh Index but will still be in the Historic Index. But Fresh helps to see the GOOD stuff, not the bad stuff. One way Google tries to solve this is by giving you “fetch as Googlebot” in Webmaster Tools, so you can effectively tell Google when a site has changed. Another way is by asking you to use caching commands… see the presentation below for more on this.
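
To make the Fresh versus Historic distinction concrete, here is a minimal Python sketch. It assumes the Fresh window is exactly 60 days and that the only thing that matters is the date the source URL was last crawled. The function name `index_for` and the labels it returns are made up for illustration; this is not Majestic's actual code.

```python
from datetime import date, timedelta

# Assumption: the "two month timespan" is treated as exactly 60 days.
FRESH_WINDOW = timedelta(days=60)


def index_for(last_crawled: date, today: date) -> str:
    """Say whether a link would still count as fresh (hypothetical labels)."""
    age = today - last_crawled
    # A link whose source page was seen recently enough is "fresh";
    # anything older is only visible in the historic data.
    return "fresh" if age <= FRESH_WINDOW else "historic-only"


today = date(2012, 10, 15)
print(index_for(date(2012, 10, 1), today))  # fresh
print(index_for(date(2012, 7, 1), today))   # historic-only
```

The point of the sketch is that a link drops out of the fresh view simply because nobody recrawled its page, not because anyone verified that the link is gone.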

If you have removed links and you need them to get updated, maybe there is a need to add a step in your cleanup process: ask the site owner not only to remove the offending links, but also to “fetch the page as Googlebot” afterwards, to help Google update faster at its end. The only suggestion I can make is to reassure the site owner that this is quick and painless for them and will only serve to show Google that the site is improving its focus on quality. I cannot speculate on whether the site owners will agree with you, though.

So back to the question of allocating resources for large spiders. Majestic does this through its Crawler Controller, and the thing is, our Crawler Controller will never really run out of things to crawl, so we need to maintain rules of engagement for the controller that keep a sense of balance. We need to look at new content, but also respect that old content may change, and keep one eye on the most important pages and maybe revisit these more often than others. Webmasters can help tremendously in making any crawler more efficient, not least by avoiding duplicate content, which forces spiders to recrawl exactly the same thing twice or more, often on what is apparently, to a human, exactly the same URL. But a URL to a computer is like a phone number. Put +1 in front of a phone number and many humans will know that this is not required for people dialling within the US. For a computer, this is a different phone number unless the programmer has gone out of his way to merge different variations of the same number onto one record.
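
The phone number analogy can be shown with a small Python sketch that merges a few common variations of the same URL onto one record. It uses only the standard library's `urllib.parse`. The list of tracking parameters and the function name `normalize_url` are assumptions for the example, and a real crawler would need far more rules than this (the sketch also ignores edge cases such as IPv6 hosts and login details in the URL).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters only track visitors and never change the page.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Collapse trivially different spellings of a URL into one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"

    # An empty path and "/" point at the same page.
    path = parts.path or "/"

    # Drop tracking parameters and sort the rest so order does not matter.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))

    # The fragment (#...) never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, query, ""))


print(normalize_url("HTTP://Example.COM:80/page?utm_source=x&b=2&a=1#top"))
# http://example.com/page?a=1&b=2
```

Note that the sketch deliberately does not merge `www.example.com` with `example.com`: those can be genuinely different hosts, so it is up to the webmaster to redirect one to the other rather than up to the crawler to guess.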

So crawler issues affect search engines dramatically, which is probably why the session at SMX Advanced was so popular.

I have previously posted that presentation on the blog – but just for completeness, here it is again. I hope there are some takeaways for you on how to make your site more spider friendly. In the process you will be doing the world a favour and saving little spider legs from becoming over-tired.

I am sorry I do not have a transcript of what was said at the session.

Dixon.

Posted In: Commentary

14 Responses to “Crawler challenges and why they affect you”

  1. Adrian Land said:

    October 15, 2012 at 12:27 pm

    Thanks for sharing.

    It seems that the more complex things seem to get, the more simple you need to look. It is a classic case of prioritization from spiders mixed with how difficult we make it for them to carry this out when they are on our sites.

    The bit that is a brilliant reminder is that if everything is as good as it gets on your own site health, then look at the physical server connections and therefore the hosting solution. Enjoying your neighbourhood tool too. Thanks Dixon.

  2. Segun Ketiku said:

    October 15, 2012 at 4:19 pm

    Please sir, I am having a problem on my site, as the last crawled date is as far back as September 30th. I have tried listing my website on many blogs through commenting but I can't see any improvement, and I am updating content on my website daily. Please, what is the problem? Thanks and God bless.

    • Dixon said:

      October 15, 2012 at 9:28 pm

      This post may help to answer your question.

  3. Stephen L. Nelson said:

    October 15, 2012 at 8:58 pm

    Great info, thanks Dixon!

    Can I ask sort of a related question? How would you see DMCA requests in light of all this? I.e., if webmaster uses DMCA requests to get Google to successfully remove “bad neighborhood” pages with spammy links or duplicate content, is that a win? Or a win-win or win-win-win? It seems like it might be, if I follow your logic.

    • Dixon said:

      October 15, 2012 at 9:12 pm

      Well.. not win-win-win for us… because we still end up crawling the junk content. But if a DMCA request that is acted upon could reduce the crawl overhead for Google, that would be great for them, except that I would imagine the scaling issues they have in humans having to get involved in DMCAs probably negate the cost benefits! But it does come back to the root issue. Crawling crap is expensive and wastes resources, and we often don’t realize that our sites, servers and tracking strings are so expensive because we do not look at the web from the point of view of a universal crawler like MJBot or GoogleBot.

  4. MetaTrader Programming said:

    October 16, 2012 at 6:36 am

    Speaking of the MajesticSEO index, I notice that it is only good as an estimation.

    I know for a fact that one of our domains has at least 596 referring domains, but the MajesticSEO report says we have around 191.

    That's a big difference.

    • Dixon said:

      October 16, 2012 at 9:24 am

      Every search engine’s Index is also only an estimation. But… Click the “historic” button or read this as to reasons why this may be.

  5. Stephane Bottine said:

    October 17, 2012 at 3:42 am

    Hi Dixon, to add to your post, Google updates its Penguin algorithm periodically (every 4 to 6 weeks approximately), which also explains why removing “bad” links won’t have an immediate impact on rankings (crawl considerations aside).

  6. organic seo said:

    October 18, 2012 at 1:03 pm

    your why-isnt-my-new-backlink-in-majestic-site-explorer has really helped me to understand more about Google nature.

  7. Lalit Kumar said:

    October 19, 2012 at 3:36 pm

    To sum it up, the Majestic SEO crawler’s job is more complex than Google's since they have to consider cr@p links too.

    - lalit kumar

    • Dixon said:

      October 19, 2012 at 10:00 pm

      Hehe! well – I would not say ours is more complex, because we do not try and analyze images, videos and all sorts of other stuff. Also, Majestic now knows which links are good and which are not so good, so we can also make some determinations based on this.

  8. Oscar Gonzalez said:

    October 24, 2012 at 5:15 am

    Thanks for sharing this presentation. Awesome info in it. At the very least thanks for the heads up about this tool. I didn’t know it existed and this is awesome. Now… in the presentation it says to avoid shared hosting or check the tool… but what if you have shared hosting and obtain a dedicated ip for your account. Does this alleviate the problem?

    http://www.majesticseo.com/reports/neighbourhood-checker

    • Dixon said:

      October 24, 2012 at 9:59 am

      Hi Oscar,
      Thanks for the feedback. I would imagine that you should do a little more digging as to how the server is configured. None of us can be experts in every field, but my intuition would say that the unique IP solves some issues, but not others. In particular, if there are 99 sites on one IP and 1 site on another IP, all on the same server, and the server itself gets overloaded, then the 500 errors will start popping up without regard to the IP number being used.

      • Oscar Gonzalez said:

        October 25, 2012 at 1:36 pm

        Right on, yeah I figured performance would still be an issue. Thanks for the response!