Archive for the ‘Research’ Category

Majestic Gives Away A Million – A Majestic Million!

By Steve December 25, 2011

As it is Christmas Day, Majestic SEO is releasing data on the top one million websites as a downloadable CSV file under a Creative Commons ShareAlike license, allowing web users to create derived works and research (subject to attribution). The files are available at the end of this post.

Majestic SEO launched Majestic Million on May the 19th, and it has caused ripples of interest from time to time, and has found a nice niche to power Buzz League Tables.

 

We have altered the algorithms behind the Majestic Million, generating the list on the Referring C-subnet count rather than the Domain Count. This has shifted the top ten and increased the number of well-known domains in the Majestic Million.
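To illustrate how the two orderings can differ, here is a minimal sketch. It assumes a "C-subnet" means the first three octets of an IPv4 address, and the domains, IP addresses and function names are invented for the example. It is not Majestic's actual algorithm.

```python
from collections import defaultdict

def c_subnet(ip: str) -> str:
    """Return the first three octets of an IPv4 address (assumed meaning of 'C-subnet')."""
    return ".".join(ip.split(".")[:3])

def rank_by_domains(links):
    """Rank target domains by the number of distinct referring domains."""
    domains = defaultdict(set)
    for target, ref_domain, _ref_ip in links:
        domains[target].add(ref_domain)
    return sorted(domains, key=lambda d: (-len(domains[d]), d))

def rank_by_subnets(links):
    """Rank target domains by the number of distinct referring C-subnets."""
    subnets = defaultdict(set)
    for target, _ref_domain, ref_ip in links:
        subnets[target].add(c_subnet(ref_ip))
    return sorted(subnets, key=lambda d: (-len(subnets[d]), d))

# Hypothetical data: a.example has three referring domains on one subnet,
# b.example has two referring domains on two different subnets.
links = [
    ("a.example", "x1.example", "192.0.2.10"),
    ("a.example", "x2.example", "192.0.2.11"),
    ("a.example", "x3.example", "192.0.2.12"),
    ("b.example", "y1.example", "198.51.100.5"),
    ("b.example", "y2.example", "203.0.113.7"),
]
print(rank_by_domains(links))  # a.example ranks first
print(rank_by_subnets(links))  # b.example ranks first
```

Counting subnets rather than domains means many referring domains hosted on the same network block count only once.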

Today though, we thought we would do something different. Majestic has a long history of making its data publicly accessible, and we would like to think that it has bought us a certain amount of goodwill in the wider internet community. So we have a surprise gift for the internet analytics community (and who knows – perhaps some statisticians too): we are making a snapshot of the entire Majestic Million list available to download.

As a sanity check, we ran a couple of plots using the Statistical Computing package “R”:

A graph of referring C-subnet count against Majestic Million Rank:

 

Again, but just for the top 250:

And a Graph of the referring IP Address count against the C-subnet count:

We would love to hear about any conclusions you come to using the data. So what are you waiting for? It is downloadable in Excel or TXT below:

[ download Excel file here  NB: 1,000,000 records in an Excel file is 60 MB. You need a modern version of Excel. Save to Disk first]

[ download full file here This is the 25 MB .TXT file ZIPPED, Tab delimited and much smaller - but it is still a million lines of data!]
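If Excel struggles with a million rows, the tab-delimited file can be read a few rows at a time. The sketch below is one way to do that in Python. The column names and the presence of a header row are assumptions, since the post does not describe the file layout, and the in-memory sample stands in for the real download.

```python
import csv
import io

def read_top_rows(handle, limit=10):
    """Read up to `limit` data rows from a tab-delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) >= limit:
            break
    return rows

# Stand-in for the real file: these column names are hypothetical.
sample = io.StringIO(
    "Rank\tDomain\tRefSubNets\n"
    "1\texample.com\t1000\n"
    "2\texample.org\t900\n"
)
for row in read_top_rows(sample, limit=2):
    print(row["Rank"], row["Domain"], row["RefSubNets"])
```

With the real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`, where `path` points at the unzipped TXT file.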

Creative Commons License
This Majestic Million Data is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

 

If you would like to use this data or re-release it, you should reference Majestic SEO as follows, providing a link to this blog post should the medium support it:

Backlink Data sourced from the MajesticSEO.com public release of Majestic Millions Dataset – generated on 22nd December 2011

Posted In: Research

Google’s Strongest Current Backlinks

By Dixon October 18, 2011

Back in January we revealed Google’s top 50,000 strongest backlinks. Back then, we had not released our FRESH Index. Now that we have, we can give you an even better list.

We are pleased to give you Google’s Strongest backlinks seen by our crawlers in the last 30 days. To keep the list clean, we have chosen to release the strongest link only from each of nearly 30,000 referring domains. We sorted the referring domains based on the number of referring sub-nets to those domains. This reduces undue influence from site-wide links and home grown referring domains.  These are free for you to download from the bottom of this post – no login required. If you do any analysis on the data though, be sure to tell people who you got the data from, please.
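The selection described above (one link per referring domain, ordered by how many sub-nets refer to that domain) can be sketched as follows. The field names and the `strength` score are hypothetical, so this is an illustration of the idea rather than the query we actually ran.

```python
def strongest_per_domain(backlinks):
    """Keep the strongest link from each referring domain, then sort the
    survivors by the number of sub-nets linking to that referring domain.

    `backlinks` is a list of dicts with hypothetical keys:
    'ref_domain', 'url', 'strength', 'ref_subnets'.
    """
    best = {}
    for link in backlinks:
        current = best.get(link["ref_domain"])
        if current is None or link["strength"] > current["strength"]:
            best[link["ref_domain"]] = link
    return sorted(best.values(), key=lambda l: l["ref_subnets"], reverse=True)

# Invented sample rows.
backlinks = [
    {"ref_domain": "a.example", "url": "http://a.example/1", "strength": 3, "ref_subnets": 50},
    {"ref_domain": "a.example", "url": "http://a.example/2", "strength": 7, "ref_subnets": 50},
    {"ref_domain": "b.example", "url": "http://b.example/1", "strength": 5, "ref_subnets": 900},
]
result = strongest_per_domain(backlinks)
for link in result:
    print(link["ref_domain"], link["url"])
```

Keeping a single link per domain is what stops a site-wide footer link from filling the list with thousands of near-identical rows.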

Some interesting observations about the top 50,000 links to Google are:

Some of the anchor text is listed with “??” question marks. This is not an error in the Majestic data. We collected the anchor text accurately, but it contains characters, typically Chinese or Japanese, that were not preserved when the data was converted to CSV. The web interface at Majestic SEO shows the correct anchor text for Eastern and other character sets if you want this level of data.
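For readers wondering how text turns into question marks, here is a small Python illustration of one common cause. It shows a general encoding behaviour, not the specific export process used for this file, and the anchor text is an invented example.

```python
anchor_text = "谷歌搜索"  # invented example of anchor text in Chinese characters

# Encoding to a character set that cannot represent these characters,
# with errors="replace", substitutes "?" for each one.
lossy = anchor_text.encode("latin-1", errors="replace").decode("latin-1")
print(lossy)  # ????

# Encoding as UTF-8 instead keeps the text intact on a round trip.
safe = anchor_text.encode("utf-8").decode("utf-8")
print(safe == anchor_text)  # True
```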

Download Now

You can download the top 30,000 backlinks to Google in this CSV file.

Majestic SEO updates this data up to three times a day – so it differs from day to day. If you would like to analyze the backlink profile of other sites, our subscriptions start at less than 50 dollars a month.

 


How Fresh is their Index? Actually?

By Dixon September 19, 2011

How do you test how fresh a search index’s data is?

We decided to check – for ourselves – exactly how fresh (or stale) data is in various indexes around the web. We’ll show you comparison data for an example checked this morning (19th September 2011).

We are going to compare:

  • MajesticSEO Fresh Index
  • Google
  • Yahoo
  • Bing

The example website that we will be using at this stage is http://status.aws.amazon.com/.

If you don’t care about the methodology – only the research output… here you go:

Search Index Tested              Date Seen by Index
Actual: status.aws.amazon.com    19th September 2011
Majestic SEO Fresh Index         17th September
Google.com                       19th September
Yahoo.com                        14th September
Bing.com                         15th September

Here’s exactly how we got this data. There are a few steps that you will need to take, which I will take you through right now.

Finding a base line

Right now, http://status.aws.amazon.com/ shows today’s date in its title. That’s a great place to start, because nobody is going to say that Amazon is likely to try to manipulate the date, so every time an index updates, you’ll see the date that the page was actually crawled in the title, regardless of when the new information becomes live. Once you have the example web page up, right click on your mouse and select ‘View Page Source’. A separate box should pop up (like the one in the picture).

 

There you will need to look for the title that any and every crawler will see. (The title is highlighted in the picture). So in this case, the title for this particular page is ‘AWS Service Health Dashboard – Sep 19, 2011’. The date will change depending on the day that you complete this action.
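If you would rather not read the date out of the title by eye, a short script can extract it. This is a minimal sketch that assumes the title keeps the “Sep 19, 2011” style shown above; the function name is made up for the example.

```python
import re
from datetime import datetime

def title_date(title: str):
    """Extract a date like 'Sep 19, 2011' from a page title, or None if absent."""
    match = re.search(r"([A-Z][a-z]{2}) (\d{1,2}), (\d{4})", title)
    if not match:
        return None
    # Rebuild the three captured parts as "Sep 19 2011" and parse them.
    return datetime.strptime(" ".join(match.groups()), "%b %d %Y").date()

print(title_date("AWS Service Health Dashboard – Sep 19, 2011"))  # 2011-09-19
```

The same function can be run against the title each index reports, which gives the crawl date in a form that is easy to compare.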

Testing Majestic SEO Fresh Index

Now that you have found a base line, you just need to check this against all the indexes of the web that you would like. So for Majestic SEO, go to majesticseo.com and enter the address http://status.aws.amazon.com/ into the bar and press explore. You should get this:

Hopefully, you should come to a page that gives information on the website’s Backlink History, Referring Domains and Top Backlinks; basically it is a complete summary of the web domain that you want information on. If you don’t, try logging in first!

The picture above shows the title of the web domain, its URL, the date that it was last crawled, its External Backlinks and the number of Referring Domains. So this example website was last crawled one day ago on 18th September 2011 and has 1,062 Referring Domains – but the TITLE says that the crawl date was actually the 17th. It is perhaps our own bad luck that our crawlers are using London time and Amazon is using a US time. Otherwise the crawl date in our system and the date in the page title should be the same. But we want to compare like with like, and the Amazon title is the base line, so we’ll take the 17th September as the ACTUAL crawl date, using Amazon’s server time.
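The one-day difference is easy to reproduce. The sketch below uses a made-up crawl time and assumes Amazon's clock runs on US Pacific time, which the page does not confirm; it only shows that the same moment can fall on two different calendar dates.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for September 2011: British Summer Time (UTC+1) and,
# as an assumption about Amazon's server clock, US Pacific Daylight Time (UTC-7).
LONDON = timezone(timedelta(hours=1))
US_PACIFIC = timezone(timedelta(hours=-7))

# Hypothetical crawl time: early morning on the 18th in London.
crawl_london = datetime(2011, 9, 18, 2, 30, tzinfo=LONDON)
crawl_us = crawl_london.astimezone(US_PACIFIC)

print(crawl_london.date())  # 2011-09-18
print(crawl_us.date())      # 2011-09-17
```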

Testing Google

After testing Majesticseo.com you will need to test Google. Type the same URL into the Google search bar; the result should match the picture below.

From this picture, the information that Google gives you is spot on with today’s date. In fact, the information was last updated two hours ago (from when this was written). Interestingly, if you were to grab the “cache” of Google’s data set, it suggests the cached information is sometimes older, but again – let’s go like for like, using the independence of Amazon’s title as the baseline.

Testing Yahoo

Once you have tested Google, the next step is to test Yahoo.com. Once you are there, you will again need to type in the same link that has been used in the first, second and third steps. At the very top of the page, you will see the information you will need.

The information that yahoo.com has given can be seen above. The date given by Yahoo for the same link (all this information was collected on the 19th September) is September 14th, so that is around five days ago!

Testing Bing

The final step in collecting this information is to go onto bing.com. Again, you will need to type in the same link here. Once you have done that, a page should appear that looks like the one below.

The title here is still the same as in each of the other steps, but the date given by bing.com is the 15th September. This is nearer to today’s date than Yahoo, but is not as up to date as Majestic or Google.

In Conclusion

You can see exactly which index gives you the freshest data and which gives you the most out of date, without having to rely on claims. As the table shows, the index that gave the most out of date information is Yahoo, whilst Google.com gave the most up to date information: it was updated two hours before this post was written. Majestic SEO gave the second most up to date information, beating Bing by two days and Yahoo by three.
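The comparison can be reduced to simple date arithmetic. This short sketch uses the dates from the table in this post; the variable names are just for illustration.

```python
from datetime import date

checked = date(2011, 9, 19)  # date shown in the live page title
seen = {
    "Majestic SEO Fresh Index": date(2011, 9, 17),
    "Google.com": date(2011, 9, 19),
    "Yahoo.com": date(2011, 9, 14),
    "Bing.com": date(2011, 9, 15),
}

# Age of each index's copy, in whole days behind the live page.
age_days = {name: (checked - d).days for name, d in seen.items()}
for name, age in sorted(age_days.items(), key=lambda item: item[1]):
    print(f"{name}: {age} day(s) behind")
```

Repeating the check on several pages and on several days would give a fairer picture than this single sample.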


Analysing the web’s biggest sites using Majestic SEO

By Dixon September 14, 2011


Although Majestic SEO has a massive crawl capability, every system at scale has technology limitations. Google, for example, has considerable limitations which it disguises brilliantly:

Whilst the search giant may TELL you that there are 1.4 billion results for a particular keyword, it only shows you the top 10 (or if you change the settings the top 100 or so) at a time. Further, if you use Google Webmaster tools to download your backlinks to a given page, Google limits the CSV file to 200 lines of data.

Majestic SEO goes back much further than this by default. Even in the Site Explorer mode, we’ll show you up to 10,000 rows of link data in a platinum level subscription and up to 20,000 in a standard report. We’ll also give you everything we have for most sites in an advanced report.

But with really big domains, giving you every link is not only technically difficult on our end, but also a huge challenge to analyse. Take Wikipedia.org’s historic index. If we actually gave you a CSV file with nearly 8 BILLION lines of data about anchor text and ACRank, what on earth would you do with it? Let’s face it, Excel struggles with 100,000 lines of data. You are not going to be able to slice and dice this:

So here are some tips for analysing backlink profiles for large sites, using Wikipedia to demonstrate. I will assume that people analysing the large sites are not confined to the free version of our site – and I will personally be using a Platinum subscription for my numbers and charts – but many of these tips will work on a silver subscription.

Using the backlink history comparison graphs

Most of our users will already be using the backlink history graph using the historic data. In case you aren’t, here’s the history of Wikipedia vs about.com and Youtube:



The two charts here show that YouTube has, since around April 2009, started to outpace Wikipedia. It suggests that Video is overtaking text as a way to communicate ideas, but also shows that YouTube has a much broader appeal – covering everything from news and education through to celebrity gossip and music. Wikipedia, by contrast, is limited to what one might describe as “evergreen” content – although in this regard they are certainly outpacing About.com.

The second way to use the backlinks charts is to understand the fresh charts and look at the “decay rate” of links. The chart below compares the Daily Fresh review rate of links to Wikipedia with links to StumbleUpon and Tumblr.



I have switched the chart to a column based output – as this will make more sense. Here we see how many links have been seen over a 30 day period (whether they are new links or not) and how early in the 30 days we saw those links. So whilst at first glance you might say that the “best two” are Wikipedia and Tumblr, on closer inspection you can see that after one day, the Tumblr links drop off much more rapidly than Wikipedia. About 5 days in they come back, but what this is telling me is that both Wikipedia and Tumblr have a similar link strength at the head of their link funnel – as these links are being seen every day (or every hour). However, many of Tumblr’s links rapidly decay into areas of the internet that our crawlers frequent less regularly. We do get there, but the chances are that fewer people visit these pages than the Wikipedia pages, because Wikipedia has stronger links from “B-league” pages. If you wanted, you could more or less plot this decay rate and use it as a measure of link longevity.

Worldwide Distribution

Because it is impractical to get a list of all the links in an advanced report, the ‘links by country’ report is also not practical. But there is another way to look at how Wikipedia is disseminating itself across the globe. You can look at ‘links by language’. Whilst the compare backlinks history charts only work at the TLD level, Majestic Millions allows you to compare up to 10 sites at the sub-domain level. This means we can see the ‘links by language’.



The great thing about the Majestic Millions chart is that it gets updated daily and if you were to look at this data over time, you would also see the global ranks against all other sites trend up or down. It would be reasonable to look at the spread of links by language as a proxy for traffic by language should you so wish.

Looking at the strongest pages in Wikipedia

In Site Explorer, we show you the top pages. This report breaks down the site into the number of links to each constituent page within the site. This will tell us different things for different sites, however I would urge Google’s Matt Cutts to have a look at just how much interest there is in the links to the Search Engine Optimisation page... Maybe a topic for another post.

Here we can see that Wikipedia is proving especially useful in helping us understand the Syrian uprising, the Indian social activist Anna Hazare and the Asian (US) census, which all have pages in the top 100 list on Wikipedia. It would not be a wasted exercise to go through this list of 100 and look at what is capturing the imagination of what must be a fairly intelligent audience, to select maybe 5 or so topics that might be the basis for blogs and developing social media equity.

Looking at the top line numbers

One set of metrics that does not fall down on scale is the headline counts. Here we have a number of useful metrics. I am using the fresh data here – and plotting the changes to this fresh data daily over time is even more helpful. Here are some observations:


Looking at the educational and government strength

Not surprisingly, Wikipedia has plenty of links from Educational domains. Comparing this with another site (as a percentage of total referring domains, rather than as an absolute comparison) will give a good indication as to whether Wikipedia is trusted amongst the educational sites more than other sites. One potentially alarming item here is just how many government-run sites link to the public encyclopaedia. Wikipedia is (as the links to the Search Engine Optimisation page make clear) extremely easy to change and manipulate. If Governments are sending their citizens to Wikipedia as the fount of all wisdom, then there surely has to be a question as to whether there is the potential to undermine sectors of society by changing these pages slightly.

Another Decay Metric

These top line numbers also give another indication of decay, with the “deleted” figure in the fresh index. Increasingly, links are coming via blog posts, and the most active blogs add content every day. This means that within a reasonably short amount of time, a link may drop off the home page of a website. After that point, it will be marked as “deleted” when our crawlers next come to the home page, but will still be present on the inner pages. From the numbers above, we can see that about 7.6% of links decay over a 30 day period for Wikipedia (11.9K/156.7K). Again, this looks like the site is stronger than Tumblr, which has a comparative decay rate of 11.3%.
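The decay figure is simply the deleted count divided by the total. A quick check of the Wikipedia number, using a hypothetical helper function:

```python
def decay_rate(deleted: float, total: float) -> float:
    """Share of links marked as deleted, as a percentage of all links seen."""
    return 100.0 * deleted / total

# Wikipedia figures quoted above: 11.9K deleted out of 156.7K.
wikipedia = decay_rate(11_900, 156_700)
print(f"{wikipedia:.1f}%")  # 7.6%
```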

Looking at individual Backlinks by Page, not by Site

Usually, when we pull a full report, we simply pull a report for a domain. With a site like Wikipedia, this would give you 20,000 links to the home page in a standard report. This barely scratches the surface. However, by pulling reports by PAGE/URL you get much more targeted data. For example, one of the top 20 pages on Wikipedia is the page about the Libyan conflict. This has 19,000 links to it. If you click the little "report" tab next to that link, you can get a standard page report containing all 19,000 links to that individual page - and it only costs one standard report! Quick - Focused - and Detailed.

 

 

 

 

And there's More...

I think the conclusion is that when trying to analyze sites with billions of links, you need to first focus on what you are really trying to measure before going too granular in your analysis. We would love to hear other thoughts from readers as to how they look at very large websites.

 


Q: How large does a computer need to be for good SEO? A: TeraFlops!

By Dixon June 25, 2010
Majestic 12's Dixon Jones shows off new Teraflop Cluster


Today, Majestic 12, the company powering MajesticSEO, announces the commissioning of a new cluster of super computers.

Whilst most search engine marketers get their data from their PC and whatever Google and Yahoo hand them on a plate, a relatively small subset of the search community build their own programs and their own technologies to lift their understanding of search engine algorithms.

Most agencies can build these tools on PCs, drawing data from other sources – but eventually, someone needs to either get the data from first principles or rely on the benevolence of a search engine’s data… who, as Internet marketers will know, can be a fickle master.

So how far beyond a standard PC do you need to go to get the web’s back link data from first principles, for example? (Back-link data is information about how different websites link to each other.)

Today, Majestic-12 – a leader in this technology – has announced that it has commissioned a new Teraflop Cluster to do its data crunching and storage. So is that big? Well some PCs now come with hard drives of a terabyte in size. Majestic almost have that much RAM. (That’s the fast memory bit, used for doing lots of fast calculations.)

When it comes to hard drives, their new computer cluster has raw storage of some 300 terabytes. So about 300-600 PCs worth… except that the speed with which they can be accessed is considerably faster. The new cluster comprises 12 dual-processor nodes based on the recently released six-core Intel Xeon X5670 (2.93 GHz) processors. Each node in the 12-node cluster was tested with the Intel Linpack benchmark and produced in excess of 130 GFLOPS, giving a total processing capacity in excess of 1.5 TFLOPS.
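The totals follow directly from the per-node figures. A quick check, using the lower bound of 130 GFLOPS per node quoted above:

```python
NODES = 12
PROCESSORS_PER_NODE = 2   # dual-processor nodes
CORES_PER_PROCESSOR = 6   # six-core Xeon X5670
GFLOPS_PER_NODE = 130     # lower bound per node from the Linpack test

total_cores = NODES * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR
total_gflops = NODES * GFLOPS_PER_NODE
total_tflops = total_gflops / 1000

print(total_cores)   # 144
print(total_gflops)  # 1560
print(total_tflops)  # 1.56
```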

The Cluster runs bespoke software, which will enable Majestic to increase the speed of production of its Trillion-Scale Back-links index – which Majestic believes is the largest index of its type commercially available to the public in the world today.

In addition to the Teraflop cluster, Majestic uses hundreds of donor computers around the world to crawl the web through its distributed crawler and additionally has servers in a data-center dedicated to the web interface for users to be able to access the compiled version of the database for under UK£10 (US$15) a month.

To subscribe to the system, choose a plan online at http://www.majesticseo.com/subscription-packages.php.

The hardware was supplied by Gigaserver – our supplier based in Holland.
