Google numbers don't seem to add up
-
Most results from Google.com or .co.uk would be English sites, and English is quite common even for websites located in countries where English is not the official language. A finger-in-the-air guess from me would say a third of the internet uses English, or at least has the option of being viewed in English, and would be indexed by Google and returned when searching via Google.com or .co.uk. Pushing the boat out, I still don't think I could get above 4 million sites from Google even if I scanned every one of its country-based servers.

I can tell you for a fact that some domain parks run about 20,000 sites each, hosting sites that do nothing but relay Google AdWords adverts, though I wouldn't know how many of those Google filters out of its results. If these fake parked sites didn't get any hits, they wouldn't bother. See http://ww25.krvkr.com/[^] or http://ww2.bangalorewalkin.com/[^]

Richard, I didn't know you were running a search engine :) http://www.searchinguncovered.com/?pid=7PO3Q136C&dn=RichardMacCutchan.com&rpid=7POL08WI8[^] The fake ones I am talking about work a bit like this, but just work off the parked domain name.
Fantastic information. Great analysis you are doing. Really cool stuff. :thumbsup:
-
Dr Gadgit wrote:
and have reached about 850k domains in the database ...I read once that the internet contains about a billion websites...I am seeing does not look anywhere near the 5-20 million that I was originally expecting.
Expecting it why? Google uses algorithms to return 'best' matches. They vary the results for various reasons, but that doesn't mean the 20-millionth site is suddenly going to show up at the number one spot, especially since each query only returns a fixed, much smaller result set anyway. So that smaller set is more likely the one whose contents vary.
Okay, for argument's sake let's say there are 1 billion sites out there (unlikely). If only the 850K sites in the search results are really accessible, then that's all users ever see anyway. No one would ever find the others, right? Interesting.
-
I've been scanning Google search results for months now, using only Google's .com, .be and .co.uk servers, but for the life of me I cannot seem to get more than about 850k unique domain names.

At first I used about one thousand hand-coded search terms and read the first ten pages returned by Google, at about a hundred results per page. That soon got me to about half a million domains, but then it slowed to a stop as it reached 700k. Six months later I ran a scan to see how many of the 700k sites had gone 404 and also had their DNS entries removed, and found the number of sites still up and running was dropping like flies (off the top of my head, about 20% gone).

Now I have a whopping twenty-seven thousand search terms pulled from metadata and have been throwing these (very slowly) at Google, and have reached about 850k domains in the database. It has all but come to a stop again, and that 850k includes the roughly 20% that went dead in the check six months previously.

I read once that the internet contains about a billion websites, not that I believed those numbers, and many of the domains I have collected point to foreign sites in places like China or Korea, so it is a bit of a mixed bag of results. I also understand that many domains have been parked (lots are linked back to fake sites sharing the same IP and running AdWords; Google does not mind), but the 850k number I am seeing does not look anywhere near the 5-20 million that I was originally expecting.
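For reference, the dead-domain re-scan described above can be sketched roughly like this (a minimal sketch, assuming Python and a hypothetical hosts.txt with one host name per line; this is not the original scanner's code):

import socket
import urllib.error
import urllib.request

def check_host(host, timeout=10):
    # Classify a bare host name as 'no-dns', '404', 'no-answer' or 'alive'.
    try:
        socket.gethostbyname(host)                     # is the DNS entry still present?
    except socket.gaierror:
        return "no-dns"
    try:
        req = urllib.request.Request("http://" + host + "/", method="HEAD")
        urllib.request.urlopen(req, timeout=timeout)
        return "alive"
    except urllib.error.HTTPError as err:              # a server answered, but with an error code
        return "404" if err.code == 404 else "alive"
    except OSError:                                    # timeout, connection refused, etc.
        return "no-answer"

if __name__ == "__main__":
    with open("hosts.txt") as f:                       # hypothetical: one host name per line
        for host in (line.strip() for line in f if line.strip()):
            print(host, check_host(host))

A HEAD request is enough here because the re-scan only cares whether anything still answers for the domain, not what the page contains.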
Google's search algorithm ranks websites mainly by their number of links from other external domains. I think the explanation for the limited number of sites you found is that not all of the billion websites in existence have enough links pointing at them to gain a ranking that justifies (for Google) their inclusion in search results.
-
If I run a query over the 850k domains that groups by IP and counts domains, the winner is WordPress, with 1485 domains on the same IP address, 192.0.78.16. Using

SELECT Host, IP, ASN, DateScan FROM dbo.Hosts WHERE (Host LIKE '%WordPress.%') ORDER BY Host

gets me 3235 results, the first host being '02varvara.wordpress.com' and the last being 'zxksinglegaymendating.wordpress.com' << SORRY, DO NOT VIEW. (Those two domains might not be on 192.0.78.16.)

If you check http://www.my-ip-neighbors.com/?domain=192.0.78.16[^] for that address you get 263 results, against the 1485 domain names I found on it, so I must be getting close with my research if I have nearly six times that number of domains recorded on that single IP address. I will eat my hat if anyone can find a host name below "02varvara" that ends in ".wordpress.com" on any English Google search results page.
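For illustration, the same "how many domains share one IP" count can be done outside the database (a sketch in Python, assuming a hypothetical hosts.csv export of dbo.Hosts with Host and IP columns; it is not the query used above):

import csv
from collections import defaultdict

per_ip = defaultdict(list)
with open("hosts.csv", newline="") as f:               # hypothetical export of dbo.Hosts
    for row in csv.DictReader(f):
        per_ip[row["IP"]].append(row["Host"])

# Busiest shared IP addresses first, with a few sample host names each.
for ip, hosts in sorted(per_ip.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]:
    print(ip, len(hosts), sorted(hosts)[:3])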
-
Quote:
the 5-20 million that I was originally expecting
Well, there's your problem! I see no reason to expect anything of the sort, especially when you've conducted the search in such an arbitrary fashion. How many sites are intentionally excluded from Google? How many are referenced by their IP address only? Casting your net in the Google Sea is no more likely to identify every domain than casting it in the Indian Ocean is to get you every species of fish.
-
"How many sites are intentionally excluded from Google" Lot's i would say but it never use to be that way and was more a level playing field but today you are forced to play by googles rules, google special meta tags in HTML pages, Google spyware scripts on the site and then to not poke your neck out too much. Censorship in my book but most people don't see it. Well done, nail on the head
-
"Google search algorithm ranks websites mainly by their number of links from other external domains." To some degree you must be right because lots of people banging on about SEO who earn a living from it spend their nights spamming sites (Organic grouth thye call it) but nothing beats paying google some money to hit top of the page. I was reading 1000 links per search term and i don't think i would have got many more unique Url's even if i read 10,000 per search, they just ran out of domains as far as i could see at about 800k I am sure i could get lots more domains from alexa.com and faster by running a web-bot but the point I want to make is that Google now hides most of the internet, goodbye little guys, well hello controlled opersition or paying customers. Gone are the days where "Little Mrs Smith" will help you knit a pair of socks and pay for her site using a few adverts, no sir, it's all in Ebay or Wikipedia,Facebook it's all you need to know in life. Google is in court every other week and having to pay fines so it's not like we can trust them to give us the truth and they are part of a monoply, we all know the names of the next six players in the game and i for one don't think that this is a good thing for any of us.
-
To continue the fishing analogy, this is casting the net 27,000 times in different locations within the sea; I'd say it should return a decent reflection of the overall fish population in the Google ocean. Good experiment. I always doubted the billions upon billions of web pages supposedly searched to return my results in a fraction of a second.
-
I love this experiment in reality vs. theory! I agree with your premise and expected results; a simple .com test alone should return results in the tens of millions. http://www.internetlivestats.com/total-number-of-websites/

Is there any way to take the same experiment and run it against Bing and Ask? I expect you are scraping the Google results DOM or something like that, so maybe it's not feasible.

No matter how "big data" you are, there just isn't enough computing power to iterate through billions of *raw* records for 5.7 billion daily searches. You transform the data, then search on the transformed result. This may very well be the Google data transformation layer filtering out the millions of sites that don't have a high confidence score for your search terms.

I think you are finding out how powerful search engines are... they define our internet reality. Companies go out of business when Google drops their URL. Just in the past few weeks, Windows developers have complained about their Google ad revenue dropping by up to 95%. Must feel good to have the power to nuke entire companies and industries by dropping them from search results.

Robert
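A rough idea of what "scraping the results DOM" means in practice (a sketch only, assuming Python with BeautifulSoup and a hypothetical saved page results.html; every engine's markup, rate limits and terms of service differ, and this is not the poster's actual scraper):

from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup                # third party: pip install beautifulsoup4

def unique_domains(html):
    domains = set()
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        href = a["href"]
        if href.startswith("/url?"):         # Google-style redirect links carry the target in q=
            href = parse_qs(urlparse(href).query).get("q", [""])[0]
        host = urlparse(href).netloc.lower().split(":")[0]
        if host and "." in host:
            domains.add(host)
    return domains

with open("results.html", encoding="utf-8") as f:      # hypothetical saved results page
    print(len(unique_domains(f.read())))

In principle the same function works on a saved Bing or Ask results page; only the link-unwrapping step would need to change.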
-
"Expecting it why? Google uses algorithms to return 'best' matches" No they don't and instead always try to sell you something. Don't take my word for it and read up on bidding for Goiogle add-words or read comments from people who complain about how google returns results. I did 1000 hand coded search request and took the first 1000 links and towards the end i found that i had recorded each host name already in 99.999% of cases so i must had about maxed google out. Not convinced by this, 6 months later I decided to try 27,000 unique search terms to see if i could tease any more names from google and did manage another 150k ish but in the same period of time 150k names had been removed, 404, no DNS entry. I am working on the logic that if i play enought hands of cards then sooner or later I will come to learn just how many cards their are in the pack. Maybe the only google search know to man that will return abc123def456ghi.com is to type that domain into the google search box but having tried 27,000 search terms only to get no new results in 99.99999 of cases says to me that I am about home for english results. Please advise of a better method if you know of one or a dictionary I should be using and i will give it a try.
Dr Gadgit wrote:
No they don't and instead always try to sell you something.
Since the vast majority of sites do not have any financial relationship with Google, then by your very own assumption that would explain why there are so few results. But Google doesn't base results solely on paid relationships.
Dr Gadgit wrote:
Don't take my word for it and read up on bidding for Google AdWords
You do of course realize that the vast majority of those 1 billion sites have nothing to do with AdWords? So that point is meaningless, except for the fact that the sites which do pay will occupy higher spots and thereby guarantee that some slots way down the list get pushed out.
Dr Gadgit wrote:
Please advise of a better method
The facts are:
1. There are about 1 billion sites.
2. Results are sorted by some 'best' criteria.
3. Results are limited to 1000 per query.
4. Certainly some fraction of those sites are not in English.
Taken together, those seem to me to ensure that there is no way one is going to get close.
Dr Gadgit wrote:
I decided to try 27,000 unique search terms
Oxford says there are 170,000 English words. If each of those words returned 1000 unique sites, that would be 170,000,000 sites. Naturally there is no way you are going to get a unique set for each word, so 170 million is the absolute theoretical maximum, and common sense suggests the real figure is far lower.
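Putting numbers on that, using only the figures already quoted in this thread (the ceilings assume zero overlap between queries, which never happens in practice):

WORDS_OXFORD  = 170_000     # English words, per the post above
TERMS_USED    = 27_000      # search terms the scan actually used
RESULTS_PER_Q = 1_000       # links read per query (ten pages at about 100 results)

print("ceiling if every English word were queried:", WORDS_OXFORD * RESULTS_PER_Q)  # 170,000,000
print("ceiling for the 27k terms actually used   :", TERMS_USED * RESULTS_PER_Q)    #  27,000,000
print("unique domains actually collected         :", 850_000)                       # about 3% of that ceiling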
-
newton.saber wrote:
Okay, for argument's sake let's say there are 1 billion sites out there (unlikely)
That is however what they say there are. http://www.internetlivestats.com/total-number-of-websites/[^]
newton.saber wrote:
No one would ever find the others, right?
No one? Obviously someone created each of them, so presumably that person can find it, and some sites are intentionally not supposed to be findable. The topic here isn't whether one can get to a site at all, but whether one can get to it via a search engine. And per my other reply, that seems unlikely (on average).
-
If you make enough requests and read 1000 links each time, then sooner or later you will cover every site that Google is not hiding, and no, you don't need to pay Google money to be on page ten of the search results. Google never used to be like this, and I think the nanny state is working to limit just what you can read on the internet. Where else have the other 90-plus percent of sites all gone?