How big is Google? Why should you care?

Remember way back when Google used to show the number of pages it had indexed on its home page? Remember the war between Yahoo and Google where they competed to get the most pages indexed? It seemed that the operators of the big engines felt that if their index was bigger, they were the better search engine. It was fun to watch at the time, but eventually those numbers quietly disappeared.

However, when the Cuil search engine came out last year, its creators made the bold claim that it was the largest search engine in the world. Its home page, as of this blog post, proclaims its index to be 127 billion pages. Interestingly, just three days before Cuil officially launched on July 28, 2008, Google made the statement that their search engine was aware of one trillion URLs, but added that they "don't index every one of those trillion pages." For a moment I wondered whether the number wars were going to start up again. They didn't.

But is it true? Does Cuil index more pages than Google? And why should you care either way?

First, let's see if it's true, then we'll talk about the implications for you as a webmaster.

Here's a simple way to find out: search for a single-word term in Cuil and Google and see how many pages come back in the result counts. For optimal results, the word should be extremely common, likely to appear on just about every single (English) content page indexed by both engines.

For example, the word "the". I searched for "the" in Cuil and Google (and, for fun, Yahoo and Bing). Here are the numbers I got back, sorted with the engines having the most results first.

Results for "the"
Cuil 89,042,476,840
Yahoo 32,700,000,000
Google 14,850,000,000
Bing 6,700,000,000

 

Rather dramatic differences! Based on these results, Cuil does seem to index a lot more pages than Google and the other major search engines (at least pages written in English).

Remember, though: Google made it clear that they don't index every page they are aware of. In fact, assuming that Cuil actually indexes most of the publicly available content on the web, that means that Google is choosing not to index more than 80% of pages which contain the word "the" (which it's safe to say appears on pretty much all content written in English).
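As a quick sanity check on that "more than 80%" figure, the arithmetic can be sketched in a few lines of Python. The counts are the ones from the table above; the function name is just an illustration, and the calculation rests on the same assumption made in the text, namely that Cuil's count approximates the full set of English pages.

```python
# Result counts for the query "the", copied from the table above.
result_counts = {
    "Cuil": 89_042_476_840,
    "Yahoo": 32_700_000_000,
    "Google": 14_850_000_000,
    "Bing": 6_700_000_000,
}


def share_of_largest(counts):
    """Return each engine's count as a fraction of the largest count."""
    largest = max(counts.values())
    return {engine: count / largest for engine, count in counts.items()}


shares = share_of_largest(result_counts)

# Assumption: Cuil's count stands in for "everything that is out there".
google_unindexed = 1 - shares["Google"]

print(f"Google's count as a share of Cuil's: {shares['Google']:.1%}")
print(f"Implied share Google leaves out:     {google_unindexed:.1%}")
```

Google's count works out to roughly 17% of Cuil's, which leaves roughly 83% unaccounted for.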

What causes Google to filter a page from its index? The previously referenced blog post on Google's blog says that "many [pages] are similar to each other, or represent auto-generated content … that isn't very useful to searchers."

Google is notorious for making vague statements that are understood by just about nobody. So what's the real truth? Let's dissect their statement a bit and find out.

Google says that pages which are "similar to each other" aren't necessarily indexed. These kinds of statements from Google have really caused a lot of misunderstanding and the dissemination of misinformation by self-proclaimed gurus of search engine optimization, who often claim that your page will get penalized if it's a duplicate of some other page.

We can prove from Google's own results that the engine does, in fact, index duplicate content. How? It's easy: Hop over to EzineArticles.com and grab the title of the most published article in any given category, then search for that title in Google using the "intitle:" operator.

For example, the most published article in the last 60 days in the Finance category right now is titled "Same Day Loans - When You Are Running Out of Options!"

Go to Google and search for this:

intitle:"Same Day Loans - When You Are Running Out of Options!"

Right now 7 results show up. When I click the link at the bottom of the results to show duplicates as well, I get 87 results. That means Google has 87 copies of the same article in its index. Clearly, the fact that the content is the same doesn't prevent Google from adding a page to its index.
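If you want to repeat this check for a batch of article titles, the query string can be built programmatically. The snippet below is only a sketch: the helper name is made up for illustration, and it does nothing more than assemble the URL you would otherwise type into the browser. It does not fetch or parse any results.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def intitle_search_url(title, base="https://www.google.com/search"):
    """Build a search URL for an exact-title query (hypothetical helper)."""
    query = f'intitle:"{title}"'
    # urlencode handles the quotes, colon, spaces and punctuation.
    return f"{base}?{urlencode({'q': query})}"


url = intitle_search_url("Same Day Loans - When You Are Running Out of Options!")
print(url)

# Decode the URL again to confirm the query survives the round trip.
decoded_query = parse_qs(urlsplit(url).query)["q"][0]
print(decoded_query)
```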

Reading Google's own words about duplicate content in their support material gives the impression that when they refer to duplicate content, they're mostly talking about content on the same site. They also state that "duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results."

So Google appears to be indicating that similar content is not indexed if it's perceived to be for search engine manipulation.

That may be their goal, but it's not the case in actuality. Content is very often created and syndicated for the purpose of building links that get a page ranked in Google. That practice works very well, too. Perhaps if the duplication is egregious enough… but unless you're doing some large scale duplication and distribution you generally have nothing to worry about.

The second part of Google's statement indicated that a page may not be indexed if it's "auto-generated content … that isn't very useful to searchers." An example they give is a calendar script that would create an infinite number of pages if the search engine crawler kept following all of the links for all of the dates going forward in time. Google was not specifically talking about page content that's generated by software.

Again, we can prove that Google indexes software-generated content by using their own index. I went to Google's shopping page and clicked on one of the "recently found" links (in this case, it was "bicycle trailers"). The first result was from Amazon.com, which has an API that allows people to use Amazon product information on their own sites (e.g., you can create Amazon-like product pages using software).

I searched at Google for two sentences found in the product description (with quotes):

"The Burley Solo trailer is the top of the line when it comes to a single child trailer. With its newly designed reclining seat and full suspension axel your child will be riding in comfort."

That search returned 10 pages, and when I clicked the link for showing filtered results, 46 pages. So obviously Google has no problem indexing that kind of content, either.

So what is it, then, that will prevent Google from indexing a page? The answer is simple, and it's one that you'll probably never hear from Google.

Sometimes I wonder why Google bothers putting up support materials when they never give you a real answer to anything. They want to be vague to prevent people from manipulating the results, but guess what? People manipulate it all the time anyway!

So here it is, the real answer to why Google's index is 80% smaller than Cuil's, and why so many pages go unindexed:

No links = No Indexing

That is, if a page is a duplicate of some other page on some other site, but the page has no links to it, Google will crawl the page — and even put it in their index for a while — but after a few days or weeks the page will usually be removed from their index.

I say "usually", because if the site that the duplicate page appears on has enough links to it overall, then the page will stay indexed even if there aren't any links directly to it. That's why a site like EzineArticles.com, for instance, has 4,690,000 pages in Google's index even though many of those pages don't have any external links to them: the site as a whole has enough links for Google to feel it's worth keeping anything that appears on that site indexed.

That makes sense, right? Why muck up the index with massive amounts of duplicate content that isn't important enough for anybody to link to it?

What all of this means for you as a webmaster is simple: if you're going to distribute articles or other duplicate content in order to build links to your web site and rank better in Google, you need to make sure that the content you distribute is linked to by other external pages. Whether you accomplish that by social bookmarking, by writing additional articles on EzineArticles that link to your articles on "lesser" sites, or through some other method, you need to be sure that the content is linked to.

So, sure, Cuil's database might be a lot larger than Google's, but that doesn't mean that it's better — Google just does more filtering. That's important for you, because if you want your content to stick around, you need to make sure Google considers it valuable by throwing some links at it.

Please post your thoughts and questions in a comment below.

Like what you see? Then subscribe to Marketing Insiders and reap big benefits!

By subscribing to my free Marketing Insiders email list, you will regularly receive special member-only insider information, discounts and freebies. You will also be notified when new articles are posted here at the blog.

It's absolutely free to subscribe, and you can leave the list at any time.

For subscribing today, I will give you a valuable free gift as well!


The problem with link-based search results.

If you've been following my blog for any time at all, then you know that I am fascinated with search engines, ranking, algorithms and the like. It's my dream one day to design a better search engine, and I'm always tinkering and working on ideas to that end.

As I run queries on the major search engines these days, I'm finding that link-based ranking of pages has a major drawback: the most relevant results often don't make it to the top.

Let's take, for instance, the query "acne home remedies". Run that phrase through Google and you'll get back the results that are the best optimized (that is, that have the most in-bound links related to the query).

As of right now, the #1 ranking result in Google for "acne home remedies" is rather mediocre. You have to click through to a bunch of other pages linked from that page to get to any real information. It's time-consuming and difficult.

For me, the number one result would ideally contain a general summary of information related to the query. That is, "acne home remedies" should show pages that list a number of home remedies for acne on the ranking page — not just links to other pages that talk about those remedies. And the top ranking pages should talk about a number of remedies, not just one. Also, the remedies that are talked about should be well known and referenced on other web pages so that I, the searcher, can have a reasonable amount of trust in the information.

How well-linked a page is should play a part, because those links help establish some authority for the page, but they should not be so strong a factor that the links cause mediocre pages to rank the way they do for a lot of queries in Google and Yahoo and Bing these days.

So how do we get search results that use linking to help judge authority, but that contain solid information that is reasonably trustworthy?

That's the goal of my latest search engine, Shablast. The way it works is pretty simple, but very effective (in my opinion):

  1. Get the top ranking pages from a major search engine (in this case, Bing).
  2. Analyze each result to see what topics are being discussed on the ranking pages.
  3. Re-sort the results, showing the pages that touch on the greatest number of popular topics first.
  4. Filter out the obvious spam.
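To make the four steps concrete, here is a toy sketch in Python. It is not Shablast's actual code. The topic extraction is a crude stand-in (distinct non-stopword terms), a topic counts as "popular" when it shows up on at least two of the pages, and the spam check is a made-up rule that flags pages dominated by one repeated word. All function names, thresholds and the sample pages are assumptions for illustration only.

```python
import re
from collections import Counter

# Small illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "for", "in", "on", "is", "it", "with"}


def extract_topics(text):
    """Crude stand-in for topic analysis: the set of non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


def looks_like_spam(text, max_repeat_ratio=0.5):
    """Hypothetical spam rule: a single word makes up most of the page."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return True
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) > max_repeat_ratio


def rerank(pages, min_pages=2):
    """pages: list of (url, text) pairs in the source engine's order (step 1)."""
    # Step 4, applied up front here: drop the obvious spam.
    clean = [(url, text) for url, text in pages if not looks_like_spam(text)]
    # Step 2: work out which topics each page discusses.
    topics_by_url = {url: extract_topics(text) for url, text in clean}
    # A topic is "popular" if it appears on at least min_pages pages.
    page_counts = Counter(t for topics in topics_by_url.values() for t in topics)
    popular = {t for t, n in page_counts.items() if n >= min_pages}
    # Step 3: pages covering the most popular topics come first.
    # sorted() is stable, so ties keep the source engine's order.
    return sorted(
        (url for url, _ in clean),
        key=lambda url: len(topics_by_url[url] & popular),
        reverse=True,
    )


sample_pages = [
    ("a.example/links", "acne remedies links links links links links"),
    ("b.example/one", "honey mask for acne"),
    ("c.example/many", "acne remedies: honey, tea tree oil, aloe vera"),
    ("d.example/two", "tea tree oil and aloe for acne"),
]
ranking = rerank(sample_pages)
print(ranking)
```

In this made-up sample, the page that lists several remedies moves to the top, the single-remedy page drops to the bottom, and the page stuffed with one repeated word is filtered out entirely.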

The results I'm seeing from this four-step process are pretty good so far, but I need a lot more people to test it out and let me know what they think about it and where it falls short (which, no doubt, this early in the game it does for some queries).

So what I'd like for you to do is hop over to Shablast.com and run some subjects through it that you are familiar with and let me know if the results are good or poor, and why the results are good or poor. Keep in mind that Shablast is designed primarily for informational queries, so don't expect grand results when doing product-based searches.

You can post a comment here, or (preferably) you can go to the Shablast Forum and post a message with the keywords you searched and what was good or bad about the results returned by Shablast. I can then ask you questions (if needed) and refine the algorithm to improve the process.

If this is something you're interested in experimenting with, why not take a moment to hop over to Shablast and give it a go?

Thanks in advance, and please feel free to post your thoughts and questions in a comment below.

Niche site case study 2 year update.

Back in August of 2007 I decided to perform a case study by building a small, 10-page niche content site from scratch to see how well it performed over time. The prime purpose of that case study was to prove that my 3WayLinks.net network was a powerful way to get sites ranked in Google, and keep them there.

It's been a little over two years since I created that case study, and I thought you might be interested in knowing how things are going with that little content site. Yes, it's still up and running. Yes, it's still well ranked in Google. Yes, it's still making money. No, I haven't done any additional work to keep it ranked.

As a recap, here's what I did:

  1. I did some research and discovered a niche in the fitness market that I felt was ripe for the picking (today I use Niche Horde for that; it's a lot easier).
  2. I used my Instant Article Wizard software to create 10 unique articles that would make up the site content.
  3. I submitted an additional 10 articles to EzineArticles.com so that each of the inner pages would have a few links to it.
  4. I added the site to my 3WayLinks.net network to grow the backlinks to it.

When I set up this site, there were some dissenters. "Oh yeah," they said, "it does well right now, but Almighty Google is going to catch on and deindex the site, just you wait!"

Well, that's dissenters for you. Over two years later, here's my latest AdSense report from that little niche site that sits untouched, happily generating income for me month after month:

If you were around when I did the original case study, then you might recall that my goal for the site was $3 a day, or about $1,000 a year. As you can see from last month's AdSense revenue, the site is doing much better than that. It actually earned over $7 a day — more than twice my original goal.

But was that a fluke? How has the site done overall? Here's the total 26 month report:

Yup, this little 10 page site (which doesn't look very professional, btw, and only took about 5 hours to create) is about to hit the $5,000 earnings mark. That means that the site has earned, on average, $6 a day since I first created it — twice my goal. It also means the site will soon have earned me $1,000 for each hour of work I put into it.
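For anyone who wants to check the math, the per-day and per-hour figures can be reproduced roughly like this. The inputs are the rounded numbers from the paragraph above (the site is only "about to hit" $5,000), and the days-per-month conversion is an approximation, so treat the output as a ballpark.

```python
# Rounded figures from the post; the total is approximate.
total_earnings = 5_000   # dollars, "about to hit" this mark
months_online = 26
hours_of_work = 5

# Approximate number of days in 26 months.
days_online = months_online * 365 / 12

per_day = total_earnings / days_online
per_hour = total_earnings / hours_of_work

print(f"Average per day:          ${per_day:.2f}")
print(f"Earned per hour of work:  ${per_hour:,.0f}")
```

That gives a little over $6 a day and $1,000 per hour of work.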

It has required no extra work on my part, with one small exception: I had a 3-day server outage in January of last year that caused the site to drop out of the Google rankings until I installed a blog and threw up some fresh, relevant content for a couple of weeks. That would not ordinarily be required, but since the site disappeared for three days Google wanted some affirmation that it was not dead and gone, and fresh content was the ticket to get the rankings restored.

I just can't emphasize enough how many people threw up contrary opinions, proclaiming how the Google Deity in its all-knowing wisdom and power would discover and neutralize my 3WayLinks.net network. And yet the site still ranks #4 and #5 for its primary and secondary keyword phrases, and has done so consistently for the last two years — and it is not alone, not by a long shot. 3WayLinks is more powerful than ever since I've continually made improvements to the way the network builds and maintains links to your site.

You may not be able to live off of $5,000 in two years, but imagine building 50 or 100 successful sites like this one (certainly possible considering it only took five hours to begin with). Even if you could only spare 10 hours a week, that's two sites a week, or more than 100 sites a year. Even if you could only reach my original $3 per day per site goal, that's $300 a day, which comes to over 100,000 powerful reasons each year to start building content sites and putting them into 3WayLinks.net.
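The projection above is simple multiplication, sketched below. It assumes that every site takes five hours to build and that every site hits the $3 a day goal, which is a big "if" that one case study can't establish, so read the numbers as an illustration rather than a forecast.

```python
# Inputs taken from the paragraph above.
hours_per_site = 5
hours_per_week = 10
goal_per_site_per_day = 3   # dollars
portfolio_size = 100        # number of sites

sites_per_week = hours_per_week / hours_per_site
sites_per_year = sites_per_week * 52

# Assumption: every site in the portfolio meets the daily goal.
daily_income = portfolio_size * goal_per_site_per_day
yearly_income = daily_income * 365

print(f"Sites built per year: {sites_per_year:.0f}")
print(f"Daily income:         ${daily_income}")
print(f"Yearly income:        ${yearly_income:,}")
```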

Please post your questions and thoughts in a comment below.


Rodney's 404 Handler Plugin plugged in.