Google numbers don't seem to add up
-
Most results from Google.com or .co.uk would be English sites, and English is quite common even for websites located in countries where English is not the official language. A finger-in-the-air guess from me would say a third of the internet uses English, or at least has the option of being viewed in English, and would be indexed by Google and returned when searching via Google.com or .co.uk. Pushing the boat out, I still don't think I could get above 4 million sites from Google even if I scanned every one of its country-based servers.

I can tell you for a fact that some domain parks run about 20,000 sites each, hosting sites that do nothing but relay Google AdWords adverts, though I wouldn't know how many of those Google filters out of its results. If these fake parked sites didn't get any hits, they wouldn't bother. See http://ww25.krvkr.com/[^] or http://ww2.bangalorewalkin.com/[^]

Richard, I didn't know you were running a search engine :) http://www.searchinguncovered.com/?pid=7PO3Q136C&dn=RichardMacCutchan.com&rpid=7POL08WI8[^] The fake ones I am talking about work a bit like this, but just work off the parked domain name.
Fantastic information. Great analysis you are doing. Really cool stuff. :thumbsup:
-
Dr Gadgit wrote:
and have reached about 850k domains in the database ...I read once that the internet contains about a billion websites...I am seeing does not look anywhere near the 5-20 million that I was originally expecting.
Expecting it why? Google uses algorithms to return 'best' matches. They vary the results for various reasons, but that doesn't mean the 20-millionth site is suddenly going to show up at the number one spot, especially since each query only returns a fixed, much smaller result set anyway. So that smaller set is more likely the one whose contents vary.
Okay, for argument's sake let's say there are 1 billion sites out there (unlikely). If only the 850K sites in the search results are really accessible, then that's all users ever see anyway. No one would ever find the others, right? Interesting.
-
I've been scanning Google search results for months now, using only Google's .com, .be and .co.uk servers, but for the life of me I cannot seem to get more than about 850k unique domain names.

At first I used about one thousand hand-coded search terms and read the first ten pages returned by Google, at about a hundred results per page. That soon got me to about half a million domains, but then it slowed to a stop as it reached 700k. Six months later I ran a scan to see how many of the 700k sites had gone 404 and also had their DNS entries removed, and found the number of sites still up and running was dropping like flies (off the top of my head, about 20% gone).

Now I have a whopping twenty-seven thousand search terms pulled from metadata and have been throwing these (very slowly) at Google, and have reached about 850k domains in the database. It has all but come to a stop again, and that 850k includes the roughly 20% that went dead in the check six months previously.

I read once that the internet contains about a billion websites, not that I believed those numbers, and many of the domains I have collected point to foreign sites in places like China or Korea, so it is a bit of a mixed bag of results. I also understand that many domains have been parked (lots are linked back to fake sites sharing the same IP and running AdWords; Google does not mind), but the 850k number I am seeing does not look anywhere near the 5-20 million that I was originally expecting.
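For reference, the dead-domain re-scan described above can be sketched roughly like this (a minimal sketch, assuming Python and a hypothetical hosts.txt with one host name per line; this is not the original scanner's code):

import socket
import urllib.error
import urllib.request

def check_host(host, timeout=10):
    # Classify a bare host name as 'no-dns', '404', 'no-answer' or 'alive'.
    try:
        socket.gethostbyname(host)                     # is the DNS entry still present?
    except socket.gaierror:
        return "no-dns"
    try:
        req = urllib.request.Request("http://" + host + "/", method="HEAD")
        urllib.request.urlopen(req, timeout=timeout)
        return "alive"
    except urllib.error.HTTPError as err:              # a server answered, but with an error code
        return "404" if err.code == 404 else "alive"
    except OSError:                                    # timeout, connection refused, etc.
        return "no-answer"

if __name__ == "__main__":
    with open("hosts.txt") as f:                       # hypothetical: one host name per line
        for host in (line.strip() for line in f if line.strip()):
            print(host, check_host(host))

A HEAD request is enough here because the re-scan only cares whether anything still answers for the domain, not what the page contains.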
Google's search algorithm ranks websites mainly by their number of links from other external domains. I think the explanation for the limited number of sites you found is that not all of the billion websites in existence have enough links pointing at them to gain a ranking that justifies (for Google) their inclusion in search results.
-
If I run a query over the 850k domains that groups by IP and counts domains, the winner is WordPress, with 1485 domains on the same IP address, 192.0.78.16. Using

SELECT Host, IP, ASN, DateScan FROM dbo.Hosts WHERE (Host LIKE '%WordPress.%') ORDER BY Host

gets me 3235 results, the first host being '02varvara.wordpress.com' and the last being 'zxksinglegaymendating.wordpress.com' << SORRY, DO NOT VIEW. (Those two domains might not be on 192.0.78.16.)

If you check http://www.my-ip-neighbors.com/?domain=192.0.78.16[^] for that address you get 263 results, against the 1485 domain names I found on it, so I must be getting close with my research if I have nearly six times that number of domains recorded on that single IP address. I will eat my hat if anyone can find a host name below "02varvara" that ends in ".wordpress.com" on any English Google search results page.
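For illustration, the same "how many domains share one IP" count can be done outside the database (a sketch in Python, assuming a hypothetical hosts.csv export of dbo.Hosts with Host and IP columns; it is not the query used above):

import csv
from collections import defaultdict

per_ip = defaultdict(list)
with open("hosts.csv", newline="") as f:               # hypothetical export of dbo.Hosts
    for row in csv.DictReader(f):
        per_ip[row["IP"]].append(row["Host"])

# Busiest shared IP addresses first, with a few sample host names each.
for ip, hosts in sorted(per_ip.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]:
    print(ip, len(hosts), sorted(hosts)[:3])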
-
Quote:
the 5-20 million that I was originally expecting
Well, there's your problem! I see no reason to expect anything of the sort, especially when you've conducted the search in such an arbitrary fashion. How many sites are intentionally excluded from Google? How many are referenced by their IP address only? Casting your net in the Google Sea is no more likely to identify every domain than casting it in the Indian Ocean is to get you every species of fish.
-
"How many sites are intentionally excluded from Google" Lot's i would say but it never use to be that way and was more a level playing field but today you are forced to play by googles rules, google special meta tags in HTML pages, Google spyware scripts on the site and then to not poke your neck out too much. Censorship in my book but most people don't see it. Well done, nail on the head
-
"Google search algorithm ranks websites mainly by their number of links from other external domains." To some degree you must be right because lots of people banging on about SEO who earn a living from it spend their nights spamming sites (Organic grouth thye call it) but nothing beats paying google some money to hit top of the page. I was reading 1000 links per search term and i don't think i would have got many more unique Url's even if i read 10,000 per search, they just ran out of domains as far as i could see at about 800k I am sure i could get lots more domains from alexa.com and faster by running a web-bot but the point I want to make is that Google now hides most of the internet, goodbye little guys, well hello controlled opersition or paying customers. Gone are the days where "Little Mrs Smith" will help you knit a pair of socks and pay for her site using a few adverts, no sir, it's all in Ebay or Wikipedia,Facebook it's all you need to know in life. Google is in court every other week and having to pay fines so it's not like we can trust them to give us the truth and they are part of a monoply, we all know the names of the next six players in the game and i for one don't think that this is a good thing for any of us.
-
To continue the fishing analogy, this is casting the net 27,000 times in different locations within the sea; I'd say it should return a decent reflection of the overall fish population in the Google ocean. Good experiment. I always doubted the billions upon billions of web pages supposedly searched to return my results in a fraction of a second.
-
I love this experiment in reality vs. theory! I agree with your premise and expected results; a simple .com test alone should return results in the tens of millions. http://www.internetlivestats.com/total-number-of-websites/

Is there any way to take the same experiment and run it against Bing and Ask? I expect you are scraping the Google results DOM or something like that, so maybe it's not feasible.

No matter how "big data" you are, there just isn't enough computing power to iterate through billions of *raw* records for 5.7 billion daily searches. You transform the data, then search on the transformed result. This may very well be the Google data transformation layer filtering out the millions of sites that don't have a high confidence score for your search terms.

I think you are finding out how powerful search engines are... they define our internet reality. Companies go out of business when Google drops their URL. Just in the past few weeks, Windows developers have complained about their Google ad revenue dropping by up to 95%. Must feel good to have the power to nuke entire companies and industries by dropping them from search results.

Robert
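A rough idea of what "scraping the results DOM" means in practice (a sketch only, assuming Python with BeautifulSoup and a hypothetical saved page results.html; every engine's markup, rate limits and terms of service differ, and this is not the poster's actual scraper):

from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup                # third party: pip install beautifulsoup4

def unique_domains(html):
    domains = set()
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        href = a["href"]
        if href.startswith("/url?"):         # Google-style redirect links carry the target in q=
            href = parse_qs(urlparse(href).query).get("q", [""])[0]
        host = urlparse(href).netloc.lower().split(":")[0]
        if host and "." in host:
            domains.add(host)
    return domains

with open("results.html", encoding="utf-8") as f:      # hypothetical saved results page
    print(len(unique_domains(f.read())))

In principle the same function works on a saved Bing or Ask results page; only the link-unwrapping step would need to change.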
-
"Expecting it why? Google uses algorithms to return 'best' matches" No they don't and instead always try to sell you something. Don't take my word for it and read up on bidding for Goiogle add-words or read comments from people who complain about how google returns results. I did 1000 hand coded search request and took the first 1000 links and towards the end i found that i had recorded each host name already in 99.999% of cases so i must had about maxed google out. Not convinced by this, 6 months later I decided to try 27,000 unique search terms to see if i could tease any more names from google and did manage another 150k ish but in the same period of time 150k names had been removed, 404, no DNS entry. I am working on the logic that if i play enought hands of cards then sooner or later I will come to learn just how many cards their are in the pack. Maybe the only google search know to man that will return abc123def456ghi.com is to type that domain into the google search box but having tried 27,000 search terms only to get no new results in 99.99999 of cases says to me that I am about home for english results. Please advise of a better method if you know of one or a dictionary I should be using and i will give it a try.
Dr Gadgit wrote:
No they don't and instead always try to sell you something.
Since the vast majority of sites do not have any financial relationship with Google, then by your very own assumption that would explain why there are so few results. But Google doesn't base results solely on paid relationships.
Dr Gadgit wrote:
Don't take my word for it and read up on bidding for Google AdWords
You do of course realize that the vast majority of those 1 billion sites have nothing to do with AdWords? So that point is meaningless, except for the fact that the sites which do pay will occupy higher spots and thereby guarantee that some slots way down the list get pushed out.
Dr Gadgit wrote:
Please advise of a better method
The facts are:
1. There are about 1 billion sites.
2. Results are sorted by some 'best' criteria.
3. Results are limited to 1000 per query.
4. Certainly some fraction of those sites are not in English.
Taken together, those seem to me to ensure that there is no way one is going to get close.
Dr Gadgit wrote:
I decided to try 27,000 unique search terms
Oxford says there are 170,000 English words. If each of those words returned 1000 unique sites, that would be 170,000,000 sites. Naturally there is no way you are going to get a unique set for each word, so 170 million is the absolute theoretical maximum, and common sense suggests the real figure is far lower.
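Putting numbers on that, using only the figures already quoted in this thread (the ceilings assume zero overlap between queries, which never happens in practice):

WORDS_OXFORD  = 170_000     # English words, per the post above
TERMS_USED    = 27_000      # search terms the scan actually used
RESULTS_PER_Q = 1_000       # links read per query (ten pages at about 100 results)

print("ceiling if every English word were queried:", WORDS_OXFORD * RESULTS_PER_Q)  # 170,000,000
print("ceiling for the 27k terms actually used   :", TERMS_USED * RESULTS_PER_Q)    #  27,000,000
print("unique domains actually collected         :", 850_000)                       # about 3% of that ceiling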
-
newton.saber wrote:
Okay, for argument's sake let's say there are 1 billion sites out there (unlikely)
That is however what they say there are. http://www.internetlivestats.com/total-number-of-websites/[^]
newton.saber wrote:
No one would ever find the others, right?
No one? Obviously someone created each of them, so presumably that person can find it, and some sites are intentionally not supposed to be findable. The topic here isn't whether one can get to a site at all, but whether one can get to it via a search engine. And per my other reply, that seems unlikely (on average).
-
If you make enough requests and read 1000 links each time, then sooner or later you will cover every site that Google is not hiding, and no, you don't need to pay Google money to be on page ten of the search results. Google never used to be like this, and I think the nanny state is working to limit just what you can read on the internet. Where else have the other 90-plus percent of sites all gone?