Code Project

Google numbers don't seem to add up

The Lounge
27 Posts 11 Posters
  Lost User wrote:

    Dr Gadgit wrote:

    foreign sites in places like China or Korea

    So sites in other countries are not foreign?

    Dr Gadgit wrote (#3):

    Most results from Google.com or Co.UK would be English sites, and English is quite common for websites located in countries where English is not the official language. A finger-in-the-air guess by me would say 1/3 of the internet uses English or has the option of being viewed in English, and would be indexed by Google and be returned when searching via Google.com or Co.UK. Pushing the boat out, I still don't think I could get above 4 million sites from Google even if I scanned every one of its country-based servers. I can tell you for a fact that some domain parks run about 20,000 sites each and they host sites that relay Google AdWords adverts, but I would not know just how many Google filters out from its results. If these fake parked sites didn't get any hits then they would not do it. See http://ww25.krvkr.com/[^] or http://ww2.bangalorewalkin.com/[^] Richard, I did not know you were running a search engine :) http://www.searchinguncovered.com/?pid=7PO3Q136C&dn=RichardMacCutchan.com&rpid=7POL08WI8[^] The fake ones I am talking about work a bit like this but just work off the parked domain name.

      Dr Gadgit wrote:

      I've been scanning Google search results for months now using only Google's .Com, .BE and Co.UK, but for the life of me I cannot seem to get more than about 850k unique domain names. At first I used about one thousand hand-coded search terms and read the first ten pages returned from Google, at about a hundred results per page, and soon got to about half a million domains, but then it slowed to a stop as it reached 700k. Six months later I ran a scan to see how many of the 700k sites had gone 404 and also had the DNS entry removed, and found that the sites up and running were dropping like flies (off the top of my head, about 20% gone). Now I have a whopping twenty-seven thousand search terms pulled from meta data and have been throwing these (very slowly) at Google and have reached about 850k domains in the database, and it has all but come to a stop again; the 850k includes about 20% that were pulled six months previously and have since gone dead. I read once that the internet contains about a billion websites, not that I believed these numbers. Many of the domains I have collected point to foreign sites in places like China or Korea, so it is a bit of a mixed bag of results, and I also understand that many domains have been parked (lots are linked back to fake sites sharing the same IP and running AdWords; Google does not mind), but this 850k number I am seeing does not look anywhere near the 5-20 million that I was originally expecting.
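
The bookkeeping described in this post (collect result links, keep only the unique host names, and watch how few are new in each batch) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the poster's actual code: `host_of`, `add_batch` and the example URLs are hypothetical, and the sketch counts host names rather than registrable domains, which would need a public-suffix list.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL, or None if it has none."""
    host = urlsplit(url).hostname
    return host.lower() if host else None


def add_batch(seen, urls):
    """Add the hosts from one batch of result URLs; return how many were new."""
    new = 0
    for url in urls:
        host = host_of(url)
        if host and host not in seen:
            seen.add(host)
            new += 1
    return new


# Hypothetical result URLs standing in for the links returned by two search terms.
batches = [
    ["http://example.com/a", "https://Example.com/b", "http://example.org/"],
    ["http://example.org/x", "http://example.net/", "not a url"],
]

seen = set()
for number, urls in enumerate(batches, start=1):
    new = add_batch(seen, urls)
    print(f"batch {number}: {new} new of {len(urls)} links, {len(seen)} hosts total")
```

When the "new" count stays near zero batch after batch, the collection has saturated for that set of search terms, which is the plateau reported at 700k and again at 850k.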

      Lost User wrote (#4):

      Dr Gadgit wrote:

      I've been scanning Google search results for months now using only googles .Com, .BE and Co.UK

      How? AFAIK, you'd need help from Google to go beyond a certain number of requests.

      Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^] (X-Clacks-Overhead: GNU Terry Pratchett)

        Amarnath S wrote (#5):

        Would some of the 'fake' ones be cyber-squatted ones too? (Perhaps the Bangalore walk-in site you mention is one such).

          GuyThiebaut wrote (#6):

          I think you will find that google is detecting your ip address scraping information and consequently is limiting what is being returned to you.

          “That which can be asserted without evidence, can be dismissed without evidence.”

          ― Christopher Hitchens

            Dr Gadgit wrote (#7):

            No, I don't need any help from Google, but I must admit that it's a bit of an art to fool Google into not blocking the searches. The first trick is going slow but also varying the delays between requests (30-60 seconds), and the second trick is to read the names and values of the HTML input boxes and buttons from the form to make up the next request URL needed for a perfect forgery. If you don't send the cookies back then they will in time block you, and it's best to use HTTPS because Google redirects to SSL after a while; it helps them to hide spyware scripts from most people. What I am doing might not work for much longer because Google is removing all traces of domain names from its search results on mobile devices, and if no one gets upset about that then they will, at a guess, do the same to all search results.

              Dr Gadgit wrote (#8):

              Indeed a good theory, and I am sure they could limit me to a subset of just 850k domain names, but in general when they catch you and think you are up to no good (try using Google from Tor), Google will send you to a CAPTCHA screen where you need to type in a number. They cannot just block based on the number of searches at such slow rates, because most workplaces sit behind a router so everyone in the office shares the same public IP address, and the code I use changes the User-Agent now and then to give them that impression. Sometimes I also switch over to a VPN so that the requests come from various locations around the world. Google are good, very good, with massive data centers that have more security than Fort Knox, but they still need to use web farms, and the IP address for Google.com gets changed all the time and you get a more local address depending on where you request the DNS lookup from. I don't need to spread the requests to take advantage of this but I would if I had to.

                Dr Gadgit wrote (#9):

                I think we get a bit of a mix but in general Google does manage to filter most out. Big domain parks pointing lots of domains to the same IP would show up but I don't see anything that stands out.

                  jschell wrote (#10):

                  Dr Gadgit wrote:

                  and have reached about 850k domains in the database ...I read once that the internet contains about a billion websites...I am seeing does not look anywhere near the 5-20 million that I was originally expecting.

                  Expecting it why? Google uses algorithms to return 'best' matches. They vary results for various reasons but that doesn't mean that the 20 million site is going to suddenly show up at the number one spot. Especially since the queries only return a fixed, much smaller size, anyways. So more likely that set is the one with varying results.

                    Dr Gadgit wrote (#11):

                    If I run a query on the 850k domains with GROUP BY IP and a count of domains, then the winner is WordPress with 1485 domains on the same IP address, 192.0.78.16. Using SELECT Host, IP, ASN, DateScan FROM dbo.Hosts WHERE (Host LIKE '%WordPress.%') ORDER BY Host gets me 3235 results, with the first host being '02varvara.wordpress.com' and the last one being 'zxksinglegaymendating.wordpress.com' << SORRY DO NOT VIEW (the above two domains might not be on 192.0.78.16). If you check out http://www.my-ip-neighbors.com/?domain=192.0.78.16[^] for the address then you get 263 results for the address that I found 1485 domain names on, so I must be getting close with my research by having nearly six times that number of domains hosted on that single IP address. I will eat my hat if anyone can find a host name that sorts before "02varvara" and ends in ".wordpress.com" on any Google search results page that works in English.
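
The GROUP BY query mentioned in this post is not quoted in the thread, so the following is only a sketch of what such a count per IP address might look like. It assumes a simplified `Hosts` table with just `Host` and `IP` columns, uses SQLite in memory rather than the SQL Server database implied by `dbo.Hosts`, and fills it with made-up rows from the documentation address range.

```python
import sqlite3

# In-memory stand-in for the poster's Hosts table (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hosts (Host TEXT PRIMARY KEY, IP TEXT)")

# Hypothetical rows; the real table is said to hold about 850k hosts.
rows = [
    ("a.example.com", "192.0.2.10"),
    ("b.example.com", "192.0.2.10"),
    ("c.example.com", "192.0.2.10"),
    ("example.org", "192.0.2.20"),
    ("example.net", "192.0.2.20"),
    ("example.edu", "192.0.2.30"),
]
conn.executemany("INSERT INTO Hosts (Host, IP) VALUES (?, ?)", rows)

# Count hosts per IP address, busiest address first.
query = """
    SELECT IP, COUNT(*) AS DomainCount
    FROM Hosts
    GROUP BY IP
    ORDER BY DomainCount DESC, IP
"""
top = conn.execute(query).fetchall()
for ip, count in top:
    print(ip, count)
```

For the figures quoted in the post, 1485 hosts against the 263 listed by the neighbour lookup is a ratio of roughly 5.6.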

                      Dr Gadgit wrote (#12):

                      "Expecting it why? Google uses algorithms to return 'best' matches" No they don't, and instead always try to sell you something. Don't take my word for it: read up on bidding for Google AdWords or read comments from people who complain about how Google returns results. I did 1000 hand-coded search requests and took the first 1000 links from each, and towards the end I found that I had recorded each host name already in 99.999% of cases, so I must have about maxed Google out. Not convinced by this, 6 months later I decided to try 27,000 unique search terms to see if I could tease any more names from Google and did manage another 150k-ish, but in the same period of time 150k names had been removed (404, no DNS entry). I am working on the logic that if I play enough hands of cards then sooner or later I will come to learn just how many cards there are in the pack. Maybe the only Google search known to man that will return abc123def456ghi.com is to type that domain into the Google search box, but having tried 27,000 search terms only to get no new results in 99.99999% of cases says to me that I am about home for English results. Please advise of a better method if you know of one, or a dictionary I should be using, and I will give it a try.
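
The "hands of cards" reasoning in this post resembles mark-and-recapture estimation, where the overlap between two samples is used to estimate the size of the whole population. The sketch below uses Chapman's variant of the Lincoln-Petersen estimator; it is a technique swapped in for illustration, not something the poster says he used. It assumes both samples are drawn independently and at random, which ranked search results are not, so on real result sets it would tend to underestimate. The function name and the card data are hypothetical.

```python
import random


def chapman_estimate(first_sample, second_sample):
    """Estimate population size from two samples using Chapman's estimator."""
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)  # items seen in both samples
    return (len(first) + 1) * (len(second) + 1) / (recaptured + 1) - 1


# Toy check with a 52-card pack: draw two hands of 20 and estimate the pack size.
rng = random.Random(42)
pack = [f"card{i}" for i in range(52)]
hand_one = rng.sample(pack, 20)
hand_two = rng.sample(pack, 20)
print(round(chapman_estimate(hand_one, hand_two), 1))
```

The more the second sample repeats the first, the smaller the estimate; a second sample that is almost entirely repeats, as reported here, points to a reachable population not much larger than what has already been collected.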

                        newton saber wrote (#13):

                        Fantastic information. Great analysis you are doing. Really cool stuff. :thumbsup:

                          newton saber wrote (#14):

                          Okay, for argument's sake let's say there are 1 billion sites out there (unlikely). If only the 850K sites in the search results are really accessible, then that's all users ever see anyway. No one would ever find the others, right? Interesting.

                            Dr Gadgit wrote (#15):

                            Well thank you sir

                              harrymc wrote (#16):

                              Google search algorithm ranks websites mainly by their number of links from other external domains. I think the explanation for the limited number of sites that you found is that not all of the billion websites in existence have enough links pointing at them to gain a ranking that justifies (for Google) their inclusion in search results.

                                Lost User wrote (#17):

                                What kind of hat? I just want to be sure it is worth the effort. :)

                                  9082365 wrote (#18):

                                  Quote:

                                  the 5-20 million that I was originally expecting

                                  Well, there's your problem! I see no reason to expect anything of the sort especially when you've conducted the search in such an arbitrary fashion. How many sites are intentionally excluded from Google? How many are referenced by their IP address only? Casting your net in the Google Sea is no more likely to allow you to identify every domain any more than casting it in the Indian Ocean would get you every species of fish.

                                    Dr Gadgit wrote (#19):

                                    "How many sites are intentionally excluded from Google" Lots, I would say, but it never used to be that way and was more of a level playing field. Today you are forced to play by Google's rules: Google's special meta tags in HTML pages, Google spyware scripts on the site, and then not sticking your neck out too much. Censorship in my book, but most people don't see it. Well done, nail on the head.

                                      Dr Gadgit wrote (#20):

                                      Go for it, straw hat

                                        Dr Gadgit wrote (#21):

                                        "Google search algorithm ranks websites mainly by their number of links from other external domains." To some degree you must be right, because lots of people banging on about SEO who earn a living from it spend their nights spamming sites (organic growth, they call it), but nothing beats paying Google some money to hit the top of the page. I was reading 1000 links per search term and I don't think I would have got many more unique URLs even if I read 10,000 per search; they just ran out of domains, as far as I could see, at about 800k. I am sure I could get lots more domains from alexa.com, and faster, by running a web-bot, but the point I want to make is that Google now hides most of the internet: goodbye little guys, hello controlled opposition or paying customers. Gone are the days where "Little Mrs Smith" will help you knit a pair of socks and pay for her site using a few adverts. No sir, it's all on eBay or Wikipedia; Facebook is all you need to know in life. Google is in court every other week and having to pay fines, so it's not like we can trust them to give us the truth, and they are part of a monopoly. We all know the names of the next six players in the game, and I for one don't think that this is a good thing for any of us.

                                          dannomanno wrote (#22):

                                          To continue the fishing analogy, this is casting the net 27,000 times in different locations within the sea; I'd say it should return a decent reflection of the overall fish in the Google ocean. Good experiment, I always doubted the billions upon billions of web pages supposedly searched to return my results in a fraction of a second.
