Code Project
making a Search Engine

9 Posts 3 Posters 0 Views 1 Watching
    sangeeta2009
    wrote on last edited by
    #1

    All you technically smart guys: I have decided to work on a search engine project on one of two platforms, JSP or PHP. I have worked on various projects related to this theme before, but this time I want it to work on the web. The problem for me is crawling the web. Could anyone please help me understand how I can access the URLs of all websites?

    OK, let me explain. In my previous projects I have already developed a way of ranking a particular URL stored in my table according to keyword frequency and a few other factors. So now I just want to ask two more things:

    1) Could anyone please tell me whether it is possible for me to crawl the web? That is, I want all the URLs available on the internet, not just the few that are stored in my tables.
    2) I would like a few more ideas on ranking web pages beyond frequency, such as how to avoid fake keywords in web pages, and any other good concepts that I can implement.

    I will be very thankful for any kind of related help. Please reply soon.
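Editor's sketch: the frequency-based ranking described in the question could look roughly like the following. This is only an illustration under stated assumptions: the class name `KeywordScorer`, the tokenising rule, and the logarithmic damping are all hypothetical and do not come from the poster's project. Damping is one simple way to stop a page that repeats a keyword many times from automatically outranking a page that uses it naturally; it is not a complete defence against fake keywords.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: scores a page for a query by damped keyword frequency.
class KeywordScorer {

    // Count how often each lowercase word occurs in the text.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // For each distinct query word found on the page, add 1 + ln(count).
    // Assumption: log damping, so repeating a word has diminishing returns.
    static double score(String pageText, String query) {
        Map<String, Integer> counts = termCounts(pageText);
        double total = 0.0;
        for (String word : termCounts(query).keySet()) {
            int n = counts.getOrDefault(word, 0);
            if (n > 0) {
                total += 1.0 + Math.log(n);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String natural = "Java search tutorial: build a small search index.";
        String stuffed = "search search search search";
        System.out.println("natural: " + score(natural, "java search"));
        System.out.println("stuffed: " + score(stuffed, "java search"));
    }
}
```

With raw counts the stuffed page would have four hits against three; with the damped score the page that matches both query words comes out ahead.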

      Nagy Vilmos
      wrote on last edited by
      #2

      You have no idea how far over your head this subject is. It's a case of "if you need to ask, you don't want to know"... May I suggest you read everything on the subject you can find; Wikipedia has a lot of information. Next you'll need to build a spider to go and read the web for you. It must be able to read eight pages at a time; that's why we call them spiders. Buy some really big storage devices: you're going to need something in the region of 843 petabytes [1] to store the indexes alone. The easy part is taking the HTML from the spiders doing the crawling and getting the keywords and any links. How you rank the page is up to you; personally I would +1 for every reference to bunnies and -1 for cats.

      [1] Figures are +/- 10 orders of magnitude.


      Panic, Chaos, Destruction. My work here is done.
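Editor's sketch: the "easy part" mentioned above, taking fetched HTML and getting the keywords and links out of it, might be approximated as follows. The class `PageParser` and its method names are hypothetical. Regular expressions are used only to keep the example dependency-free; they are an assumption of this sketch and will miss or mangle plenty of real-world markup, so a proper HTML parser is the safer choice in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls links and words out of one fetched page.
class PageParser {

    // Simplified: matches href="..." or href='...' inside <a ...> tags,
    // stopping before any #fragment.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"'#]+)[\"'#]",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1).trim());
        }
        return links;
    }

    static List<String> extractWords(String html) {
        // Drop script/style bodies first, then strip all remaining tags.
        // HTML entities such as &amp; are not decoded in this sketch.
        String text = html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]+>", " ");
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<p>Bunnies beat cats.</p><a href=\"/about.html\">About us</a>";
        System.out.println(extractLinks(html));
        System.out.println(extractWords(html));
    }
}
```

The links feed the crawler's queue of pages to visit, and the words feed whatever index and ranking scheme is chosen.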

        sangeeta2009
        wrote on last edited by
        #3

        Well, thanks for the information. Please tell me, isn't it possible to use some available spider code, or do I have to start from scratch? And regarding the large storage you mentioned, how can that be avoided by using the HTML from the spider, as you said? Please reply. I just want to develop a working framework, not an actual search engine comparable to Google [not even 20% in comparison].

          Nagy Vilmos
          wrote on last edited by
          #4

          This reply shows your total lack of understanding. The spider [plenty of crawlers available - try searching] retrieves the data for you. You then need to index the data. Search engines DO NOT look at every website when performing a search; instead they use a keyword ind... ...why am I bothering? Really, why?
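Editor's sketch: the point that a search looks things up in a keyword index rather than visiting websites can be illustrated with a small in-memory inverted index. Everything here is an assumption made for illustration: the class name `InvertedIndex`, the AND-style query semantics, and the example URLs are hypothetical, and a real index would live on disk and store positions and weights as well.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory keyword index: word -> set of URLs containing it.
class InvertedIndex {

    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index time: called once for each page the spider has retrieved.
    void addPage(String url, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(url);
            }
        }
    }

    // Query time: returns the URLs that contain every word of the query.
    // No page is fetched here; only the index is consulted.
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<String> urls = postings.getOrDefault(token, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(urls);   // copy, so the index is not modified
            } else {
                result.retainAll(urls);         // intersect with the next word's URLs
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.addPage("http://a.example/", "Bunnies and cats");
        index.addPage("http://b.example/", "Bunnies only");
        System.out.println(index.search("bunnies cats"));
    }
}
```

The expensive work happens in `addPage` while crawling; `search` is a handful of map lookups and set intersections, which is why a query does not need to touch the web at all.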


            sangeeta2009
            wrote on last edited by
            #5

            Thanks for understanding my lack of knowledge, respected sir. I have 6 months to develop it, and I need at most 2 weeks to understand the whole concept of crawling. Today I just wanted to know: what problems and difficulties will I face once I have proper knowledge of this concept, almost like yours? So please just tell me which concepts to learn and what will be difficult for me at that point. I hope you see why you are helping me? Please reply.

              Lost User
              wrote on last edited by
              #6

              Nagy Vilmos wrote:

              ...why am I bothering? Really, why?

              Probably because you are a nice guy, who is motivated by a desire to help others. And every so often you get the sort of feedback that makes it all worthwhile. ;)

                Nagy Vilmos
                wrote on last edited by
                #7

                Go back to my original answer. Look on Wikipedia and READ the articles. There is a lot of information there, so start reading.

                - You need to be able to retrieve and index pages.
                - As you index each page, pull out the links.
                - Retrieve and index those.

                The likes of Google or Bing are continuously crawling, looking for new or changed pages. The concept is simple; the practice is hard due to the sheer volume. Start a crawler going on your / your college's home page and index ONLY that domain; see how big it gets.
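Editor's sketch: the retrieve, index, pull out links, repeat loop above, restricted to a single domain, could be outlined like this. The class `DomainCrawler`, the `fetcher` parameter, and the `maxPages` limit are hypothetical names introduced for the example. Fetching is passed in as a function so the loop can be tried against an in-memory map before pointing it at a real site; a real fetcher would make HTTP requests, and should respect robots.txt and pause between requests.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical breadth-first crawler that never leaves the start page's host.
class DomainCrawler {

    // Simplified link pattern; captures the href value up to any #fragment.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // fetcher maps a URL to its HTML, or null if the page cannot be retrieved.
    // Returns every URL that was attempted, in visiting order.
    static Set<String> crawl(String startUrl, Function<String, String> fetcher, int maxPages) {
        String host = URI.create(startUrl).getHost();
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                       // already attempted
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue;                       // fetch failed; nothing to index
            }
            // Indexing of html would happen here.
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    // Resolve relative links against the current page's URL.
                    URI link = URI.create(url).resolve(m.group(1).trim());
                    if (host != null && host.equalsIgnoreCase(link.getHost())
                            && !visited.contains(link.toString())) {
                        queue.add(link.toString());
                    }
                } catch (IllegalArgumentException e) {
                    // Skip links that are not valid URIs.
                }
            }
        }
        return visited;
    }
}
```

Counting the pages this visits on a single college site gives a feel for how quickly the volume grows, which is the exercise suggested above.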


                  Nagy Vilmos
                  wrote on last edited by
                  #8

                  You should change that for a sarcasm icon! I am only cuddly in the way that lions are cuddly. Hear me growl "gerr - row - well", scared now? :rolleyes:


                    Lost User
                    wrote on last edited by
                    #9

                    AAArgh! Mummy! :wtf:
