Archive for the ‘Research’ Category
Analysing the web’s biggest sites using Majestic SEO
Although Majestic SEO has a massive crawl capability, every system at scale has technology limitations. Google, for example, has considerable limitations which it disguises brilliantly:
Whilst the search giant may TELL you that there are 1.4 billion results for a particular keyword, it only shows you the top 10 (or, if you change the settings, the top 100 or so) at a time. Further, if you use Google Webmaster Tools to download your backlinks to a given page, Google limits the CSV file to 200 lines of data.
Majestic SEO goes back much further than this by default. Even in the Site Explorer mode, we’ll show you up to 10,000 rows of link data in a platinum level subscription and up to 20,000 in a standard report. We’ll also give you everything we have for most sites in an advanced report.
But with really big domains, giving you every link is not only technically difficult on our end, but also a huge challenge to analyse. Take Wikipedia.org’s historic index. If we actually gave you a CSV file with nearly 8 BILLION lines of data about anchor text and ACRank, what on earth would you do with it? Let’s face it, Excel struggles with 100,000 lines of data. You are not going to be able to slice and dice this:
So here are some tips for analysing backlink profiles for large sites, using Wikipedia to demonstrate. I will assume that people analysing the large sites are not confined to the free version of our site – and I will personally be using a Platinum subscription for my numbers and charts – but many of these tips will work on a silver subscription.
Using the backlink history comparison graphs
Most of our users will already be using the backlink history graph with the historic data. In case you aren’t, here’s the history of Wikipedia vs About.com and YouTube:
The two charts here show that YouTube has, since around April 2009, started to outpace Wikipedia. It suggests that Video is overtaking text as a way to communicate ideas, but also shows that YouTube has a much broader appeal – covering everything from news and education through to celebrity gossip and music. Wikipedia, by contrast, is limited to what one might describe as “evergreen” content – although in this regard they are certainly outpacing About.com.
The second way to use the backlinks charts is to understand the fresh charts and look at the “decay rate” of links. The chart below compares the Daily Fresh review rate of links to Wikipedia with links to StumbleUpon and Tumblr.
I have switched the chart to a column based output, as this will make more sense. Here we see how many links have been seen over a 30 day period (whether they are new links or not) and how early in the 30 days we saw those links. So whilst at first glance you might say that the “best two” are Wikipedia and Tumblr, on closer inspection you can see that after one day, the Tumblr links drop off much more rapidly than Wikipedia’s. About 5 days in they come back, but what this is telling me is that both Wikipedia and Tumblr have a similar link strength at the head of their link funnel, as these links are being seen every day (or every hour). However, many of Tumblr’s links rapidly decay into areas of the internet that our crawlers frequent less regularly. We do get there, but the chances are that fewer people visit these pages than the Wikipedia pages, because Wikipedia has stronger links from “B-league” pages. If you wanted, you could more or less plot this decay rate and use it as a measure of link longevity.
Worldwide Distribution
Because it is impractical to get a list of all the links in an advanced report, the ‘links by country’ report is also not practical. But there is another way to look at how Wikipedia is disseminating itself across the globe. You can look at ‘links by language’. Whilst the compare backlinks history charts only work at the TLD level, Majestic Millions allows you to compare up to 10 sites at the sub-domain level. This means we can see the ‘links by language’.
The great thing about the Majestic Millions chart is that it gets updated daily and if you were to look at this data over time, you would also see the global ranks against all other sites trend up or down. It would be reasonable to look at the spread of links by language as a proxy for traffic by language should you so wish.
Looking at the strongest pages in Wikipedia
In Site Explorer, we show you the top pages. This report breaks down the site into the number of links to each constituent page within the site. This will tell us different things for different sites. However, I would urge Google’s Matt Cutts to have a look at just how much interest there is in the links to the Search Engine Optimisation page... Maybe a topic for another post.
Here we can see that Wikipedia is proving especially useful in helping us understand the Syrian uprising, the Indian social activist Anna Hazare and the Asian (US) census, which all have pages in the top 100 list on Wikipedia. It would not be a wasted exercise to go through this list of 100 and look at what is capturing the imagination of what must be a fairly intelligent audience, to select maybe 5 or so topics that might be the basis for blogs and developing social media equity.
Looking at the top line numbers
One set of metrics that does not fall down on scale is the headline counts. Here we have a number of useful metrics. I am using the fresh data here – and plotting the changes to this fresh data daily over time is even more helpful. Here are some observations:
Looking at the educational and government strength
Not surprisingly, Wikipedia has plenty of links from educational domains. Comparing this with another site (as a percentage of total referring domains, rather than as an absolute comparison) will give a good indication as to whether Wikipedia is trusted amongst the educational sites more than other sites. One potentially alarming item here is just how many government-run sites link to the public encyclopaedia. Wikipedia is (as the links to the Search Engine Optimisation page make clear) extremely easy to change and manipulate. If governments are sending their citizens to Wikipedia as the fount of all wisdom, then there surely has to be a question as to whether there is the potential to undermine sectors of society by changing these pages slightly.
Another Decay Metric
These top line numbers also give another indication of decay, with the “deleted” figure in the fresh index. Increasingly, links are coming via blog posts, and the most active blogs add content every day. This means that within a reasonably short amount of time, a link may drop off the home page of a website. After that point, it will be marked as “deleted” when our crawlers next come to the home page, but will still be present on the inner pages. From the numbers above, we can see that about 7.6% of links decay over a 30 day period for Wikipedia (11.9K/156.7K). Again, this looks like the site is stronger than Tumblr, which has a comparative decay rate of 11.3%.
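As a rough illustration of the arithmetic behind this decay figure, here is a minimal Python sketch. The function name `decay_rate` is hypothetical and not part of any Majestic SEO tool; the only inputs are the two Wikipedia numbers quoted above (deleted links and total links in the fresh index).

```python
def decay_rate(deleted_links: float, total_links: float) -> float:
    """Share of links marked as deleted over the fresh index window."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    return deleted_links / total_links


# Figures quoted in the text for Wikipedia: 11.9K deleted out of 156.7K.
wikipedia_rate = decay_rate(11_900, 156_700)
print(f"Wikipedia 30-day decay rate: {wikipedia_rate:.1%}")
```

Running the same function on the headline counts of another site gives a like-for-like comparison, provided both sets of numbers come from the same index and the same time window.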
Looking at individual Backlinks by Page, not by Site
Usually, when we pull a full report, we simply pull a report for a domain. With a site like Wikipedia, this would give you 20,000 links to the home page in a standard report. This barely scratches the surface. However, by pulling reports by PAGE/URL you get much more targeted data. For example, one of the top 20 pages on Wikipedia is the page about the Libyan conflict. This has 19,000 links to it. If you click the little "report" tab next to that link, you can get a standard page report containing all 19,000 links to that individual page - and it only costs one standard report! Quick - Focused - and Detailed.
And there's More...
I think the conclusion is that when trying to analyse sites with billions of links, you need to first focus on what you are really trying to measure, before going too granular in your analysis. We would love to hear other thoughts from readers as to how they look at very large websites.
Q: How large does a computer need to be for good SEO? A: TeraFlops!

Dixon Jones shows off new Teraflop Cluster
Today, Majestic 12, the company powering MajesticSEO, announces the commissioning of a new cluster of super computers.
Whilst most search engine marketers get their data from their PC and whatever Google and Yahoo hand them on a plate, a relatively small subset of the search community build their own programs and their own technologies to lift their understanding of search engine algorithms.
Most agencies can build these tools on PCs, drawing data from other sources – but eventually, someone needs to either get the data from first principles or rely on the benevolence of a search engine’s data… which, as Internet marketers will know, can be a fickle master.
So how far beyond a standard PC do you need to go to get the web’s back-link data from first principles, for example? (Back-link data is information about how different websites link to each other.)
Today, Majestic-12 – a leader in this technology – has announced that it has commissioned a new Teraflop Cluster to do its data crunching and storage. So is that big? Well some PCs now come with hard drives of a terabyte in size. Majestic almost have that much RAM. (That’s the fast memory bit, used for doing lots of fast calculations.)
When it comes to hard drives, their new computer cluster has raw storage of some 300 terabytes. So about 300-600 PCs worth… except that the speed with which they can be accessed is considerably faster. The new computer cluster comprises 12 dual-processor nodes based on the recently released six-core Intel Xeon X5670 (2.93 GHz) processors. Each node in the 12-node cluster was tested with the Intel Linpack benchmark and produced in excess of 130 GFLOPS, giving a total processing capacity in excess of 1.5 TFLOPS.
The Cluster runs bespoke software, which will enable Majestic to increase the speed of production of its Trillion-Scale Back-links index – which Majestic believes is the largest index of its type commercially available in the world today.
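For readers who want to check the headline figure, the sketch below simply multiplies out the numbers quoted above. It treats the "in excess of 130 GFLOPS" per node as a lower bound, so the result is a floor rather than a measured total; the variable names are illustrative only.

```python
NODES = 12                    # nodes in the cluster
PROCESSORS_PER_NODE = 2       # dual-processor nodes
CORES_PER_PROCESSOR = 6       # six-core Xeon X5670
MIN_GFLOPS_PER_NODE = 130     # quoted Linpack lower bound per node

total_cores = NODES * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR
min_total_tflops = NODES * MIN_GFLOPS_PER_NODE / 1000  # 1 TFLOPS = 1000 GFLOPS

print(f"Cores: {total_cores}")
print(f"Aggregate lower bound: {min_total_tflops:.2f} TFLOPS")
```

The lower bound works out slightly above the 1.5 TFLOPS quoted, which is consistent with the "in excess of" wording.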
In addition to the Teraflop cluster, Majestic uses hundreds of donor computers around the world to crawl the web through its distributed crawler and additionally has servers in a data-center dedicated to the web interface for users to be able to access the compiled version of the database for under UK£10 (US$15) a month.
To subscribe to the system, choose a plan online at http://www.majesticseo.com/subscription-packages.php.
The hardware was supplied by Gigaserver – our supplier based in Holland.
iPad Competition Shortlist
Our iPad competition received quality over quantity when it came to entries and I am delighted that we received some well thought out suggestions, which are listed in the comments at http://blog.majesticseo.com/general/win-ipad/.
It is now up to Alex to have a look at the ideas and suggestions to decide on a winner.
New link building relevance case study
On Monday, John Straw announced a new product, Influence Finder, which uses MajesticSEO’s Enterprise level API as a significant part of its algorithms. His presentation rocked the room, and hours earlier the product just PIPPED MajesticSEO in a survey of link tools by the highly respected link builder Weip, who gave MajesticSEO 90% and Influence Finder 92%. Before I go too far, let me say that MajesticSEO will work hard to try to catch up on the missing 2% in Weip’s comparison.
His presentation was really interesting and was based around a case study with econsultancy (who gave us first prize in their technology and innovation awards earlier this year).
John set out to build a system which solved the problem of having huge numbers of link targets, when a company really wanted to know a few that would be worth developing a meaningful relationship with. His model would take the best dataset of links around, then run a series of extra algorithms on the data to filter out those which gave off poor quality signals.
So his first challenge was to decide what dataset to use. Since Yahoo only gives up 1000 results, he really only had two datasets to choose from, and he compared these with Google’s WMT list for econsultancy. So what did he find?
Well he found first and foremost (from our perspective) that MajesticSEO showed more links than Google WMT. We found links from 8,448 referring domains whilst Google found (or at least reported) links from 5,189 domains. The commonality chart above shows only about a 30% overlap – so this needs explaining. About 18 months ago, e-consultancy.com changed to econsultancy.com and I suspect (but have not yet tested) that Google and Majestic report these differently.
Influence Finder then goes considerably further, by applying several algorithms such as their “heartbeat” algorithm to find sites that are active but not hyperactive, and therefore better prospects than sites which are either left decaying on some far flung corner of Blogspot or collect all their data through RSS feeds and have no relevant human involvement.
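A comparison like this can be sketched with plain set operations on two lists of referring domains. The example below is only an illustration: the function `domain_overlap` and the toy domain lists are hypothetical, and the text does not say how the roughly 30% overlap in the chart was calculated, so the shared-over-union ratio used here is an assumption.

```python
def domain_overlap(source_a, source_b):
    """Compare two lists of referring domains, ignoring case and whitespace."""
    a = {d.strip().lower() for d in source_a}
    b = {d.strip().lower() for d in source_b}
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Assumed overlap measure: shared domains as a share of all domains seen.
        "overlap_ratio": len(shared) / len(union) if union else 0.0,
    }


# Toy data standing in for two exported referring-domain lists.
majestic_domains = ["a.example", "b.example", "c.example", "d.example"]
google_domains = ["B.example", "c.example ", "e.example"]

result = domain_overlap(majestic_domains, google_domains)
print(result)
```

Normalising case and whitespace before comparing matters in practice, because exports from different tools rarely format domains identically.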
In order to do this, of course, they need to actively go and spider the sites in our dataset, to pick up new quality signals. This is the sort of added value that we have been working with our Enterprise API partners to deliver. Influence Finder is still in its embryonic form – but it is VC backed and has an impressive pedigree of people and contacts behind its management.
The resulting list was a small set of high value blogs and news sites with a natural affinity to econsultancy, rather than the 8,000 initial target sites.
It will be interesting to see how the algorithm progresses.
If you are one of those people who gets overwhelmed by MajesticSEO’s data and would like someone or something to do the analysis and data crunching for you, I urge you to give Influence Finder a run through with their free seven day trial if you have a chance.
Competitors Analysis
This research section contains articles that have been linked to historically. For our users’ convenience, research material is now posted onto our blog, rather than taking up a separate section. However, as a service to our customers and user base, we have retained some of the articles. The layout has been altered as a part of the migration to new webserver technology, but we have tried to ensure the content has not been meaningfully altered.
Breaking news!
8 Oct 2009 – we’ve broken Google’s size of web milestone!
Majestic SEO was in development for a long time and we have invested a lot of time and money into making sure that our publicly available index is the best in the world. The first prototype of our index was created around February 2007 and then a lot of work went into making sure we can scale to rival and even beat top search engines in the world. The history of our indices is shown below – one picture tells the story better than a thousand words!
On the chart above you can see our index size (blue line going up pretty steeply) measured in unique URLs as it went through very intensive development in the last 18 months. We actually released the first version of the index publicly early in 2008, even though we had fairly large scale test indices made in 2007. We did not release these smaller indices because we felt that we did not have sufficient depth and quality in our index to actually make it public. It took another half a year after the first public release for us to actually start selling competitive information.
Since we are dealing with competitive intelligence it would have been wrong to avoid comparing us with the known competitor – Yahoo Site Explorer. They’ve been around for longer than us, so we kept an eye on how close our index is compared to theirs. Finally this month we have reached the point where we have a much bigger index than they have! Catching up took a long time but we are finally here.
As you search for URLs in our index you will find that we always include links to the Google and Yahoo Site Explorer (YSE) link: commands, to make a quick comparison of how many backlinks they have against our database. We do that because we are confident that in the vast majority of cases for established websites we have more data. Just for this article we ran a few searches and found this relevant blog post that announced the arrival of YSE. Searching for backlinks in our database here shows that we have 9 external backlinks to it from 7 domains – query Yahoo for the same data and at the time of writing this article they have only shown 2 external backlinks! This is the approach we used to check how well we do for the list of top world domains, plus a couple of interest to us, shown in the table below:
| # | Checked homepage of domain | External backlinks: Majestic SEO | External backlinks: Yahoo Site Explorer | External backlinks: SEOmoz Linkscape |
|---|---|---|---|---|
| 1 | google.com | 1,399,343,388 | 288,913,788 (21%) | 87,046,699 (6%) |
| 2 | yahoo.com | 188,798,830 | 60,165,558 (32%) | 7,319,669 (4%) |
| 3 | blogspot.com | 3,216,258 | 69,894 (2%) | 64,285 (2%) |
| 4 | adobe.com | 11,553,778 | 3,416,027 (30%) | 946,488 (8%) |
| 5 | microsoft.com | 20,439,892 | 4,724,728 (23%) | 1,299,931 (6%) |
| 6 | wikipedia.org | 67,862,860 | 5,454,297 (8%) | 4,575,880 (7%) |
| 7 | w3.org | 6,963,932 | 1,953,095 (28%) | 711,314 (10%) |
| 8 | amazon.com | 17,952,311 | 63,447,096 (353%) | 896,255 (5%) |
| 9 | geocities.com | 1,847,450 | 157,320 (9%) | 49,796 (3%) |
| 10 | youtube.com | 23,328,644 | 12,030,984 (52%) | 2,477,407 (11%) |
| 11 | myspace.com | 9,134,606 | 4,118,036 (45%) | 1,757,238 (19%) |
| 12 | msn.com | 26,124,250 | 7,680,464 (29%) | 2,960,549 (11%) |
| 13 | wordpress.org | 300,892,396 | 154,338,507 (51%) | 36,464,098 (12%) |
| 14 | macromedia.com | 3,370,861 | 1,258,302 (37%) | 182,500 (5%) |
| 15 | aol.com | 13,066,744 | 14,102,296 (108%) | 1,004,143 (8%) |
| 16 | apple.com | 10,637,612 | 2,777,000 (26%) | 731,189 (7%) |
| 17 | bbc.co.uk | 9,030,447 | 2,795,333 (31%) | 678,642 (8%) |
| 18 | sourceforge.net | 7,648,850 | 2,894,128 (38%) | 827,101 (11%) |
| 19 | tripod.com | 109,322 | 38,269 (35%) | 9,262 (8%) |
| 20 | cnn.com | 20,102,927 | 10,722,839 (53%) | 1,783,825 (9%) |
| 21 | seomoz.org | 263,297 | 119,208 (45%) | 37,588 (14%) |
| 22 | seobook.com | 465,112 | 160,352 (34%) | 79,915 (17%) |
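The percentages in parentheses express each competitor's count relative to the Majestic SEO count for the same domain. As a minimal sketch of that calculation, the Python below recomputes three rows copied from the table; the function name `percent_of_baseline` is hypothetical, and rounding to the nearest whole percent is an assumption that happens to match the published figures for these rows.

```python
def percent_of_baseline(count: int, baseline: int) -> int:
    """Competitor count as a whole-number percentage of the baseline count."""
    return round(100 * count / baseline)


# (Majestic SEO, Yahoo Site Explorer, SEOmoz Linkscape) copied from the table.
rows = {
    "google.com": (1_399_343_388, 288_913_788, 87_046_699),
    "amazon.com": (17_952_311, 63_447_096, 896_255),
    "aol.com": (13_066_744, 14_102_296, 1_004_143),
}

percentages = {
    domain: (percent_of_baseline(yahoo, majestic),
             percent_of_baseline(seomoz, majestic))
    for domain, (majestic, yahoo, seomoz) in rows.items()
}

# Domains where Yahoo Site Explorer reports more backlinks than the baseline.
yahoo_ahead = [d for d, (yahoo_pct, _) in percentages.items() if yahoo_pct > 100]

for domain, (yahoo_pct, seomoz_pct) in percentages.items():
    print(f"{domain}: Yahoo {yahoo_pct}%, SEOmoz {seomoz_pct}%")
print("Yahoo ahead for:", yahoo_ahead)
```

Any value above 100% flags a row where the competitor reports more backlinks than the baseline, which is how the two exceptions in the table stand out.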
In this table we assume that Majestic SEO external backlink counts are 100% and calculate the percentage of our figure that our competitors have. If this figure is less than 100% then it means they have fewer backlinks for the same URL, and in this case we show it using red colour. There are just two cases where we have fewer backlinks than Yahoo – aol.com (marginally fewer), and also amazon.com (this is a suspected anomaly – wrong reporting at YSE – previously they reported a much lower figure). The really interesting part is that we have more backlinks to our competitors’ home sites than they do themselves! The chart below shows this comparison much more clearly:
We think this chart clearly demonstrates that our index is well ahead of our competitors. Does the size matter? This is not a rhetorical question and we do have a definite answer – yes it does! If you look at our anchor index quality assessment article you will see that we have been running quality checks on our index every time we made a new one. The matching ratio was growing with our index size, clearly showing that small databases are not sufficiently close to the big databases used by the search engines. What this means is that using data from much smaller databases is likely to show you only a partial picture that is far away from the picture seen by the major search engines.
We now believe we have the biggest publicly available database of backlinks and anchor text. Is this the end of the road for us? Not at all! Google recently made a post about the size of the web saying that they have found 1 trillion (1,000,000,000,000) unique URLs. We think that they are likely to be telling the truth in this case, unlike the backlink counts they show on their website! Let’s now go back to the first chart comparing index sizes, which will now include Google’s secret web-graph index of 1 trillion unique URLs that they use so effectively to rank sites:
Suddenly the picture is very different: the addition of Google’s secret index size changes the scale big time! Now that is the real competitor we have out there! Our goal is to understand the web as well as the best search engine does, and we are not resting on our laurels – we are actively working towards catching up with Google. Next year we expect to substantially close the gap between our index and Google’s internal webgraph database, which they keep secret for a good reason – understanding the effects of backlinks and anchor text on ranking is the key to great rankings.
Conclusion
We hope that this article helped you learn more about our work and showed that even though we are ahead of our natural competition we still have a big job ahead of us to reach our goal of being able to see the web the way the best search engine sees it. We expect big index growth in 2009 and we are well positioned to actually reach our goal. You can help us and yourself by becoming our customer – we offer a great range of domains with excellent competitive link data!
This article will be updated when we make our next big index update. We expect a very good increase in index size, so be sure to check back soon!
Please note that Majestic SEO is not affiliated with Google, Yahoo, SEOmoz or any other company mentioned in the text unless specifically stated. All trademarks belong to their rightful owners.
