Making a Search Engine

sangeeta2009

Hello to all you technically smart people. I have decided to work on a search engine project on one of two platforms: JSP or PHP. I have worked on various projects related to this theme before, but this time I want it to work on the Web. The problem for me is crawling the Web. Could anyone please explain how I can access the URLs of all websites?

Let me explain: in my previous projects I have already developed a way of ranking a particular URL stored in my table according to the frequency of keywords, plus a few other ideas. So now I just want to ask two more things:

1) Will it be possible for me to crawl the Web, i.e. get all the URLs available on the Internet and not just the few that are stored in my tables?
2) I would like a few more ideas on ranking web pages beyond keyword frequency, such as how to avoid fake keywords in web pages, and any other good concepts that I can implement.

I will be very thankful for any kind of related help. Please reply soon.
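To make the second question concrete, here is a minimal sketch of frequency-based ranking that dampens stuffed keywords. It is not code from any earlier project in this thread: the `score` function, the constant `k`, and the example URLs are all hypothetical. The idea of letting each term's contribution saturate is borrowed loosely from BM25-style scoring.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, page_text, k=2.0):
    """Score a page against a query.

    Each distinct query term contributes tf / (tf + k), which grows with the
    term's frequency (tf) but never reaches 1, so repeating a keyword hundreds
    of times gains almost nothing. k is an assumed tuning constant.
    """
    counts = Counter(tokenize(page_text))
    return sum(counts[t] / (counts[t] + k) for t in set(tokenize(query)))

# Hypothetical pages: one ordinary page and one stuffed with a single keyword.
pages = {
    "http://example.com/honest": "search engine basics: how a search engine crawls and ranks pages",
    "http://example.com/stuffed": "engine " * 200,
}
ranked = sorted(pages, key=lambda url: score("search engine", pages[url]), reverse=True)
print(ranked)
```

With a raw keyword count the stuffed page would win easily (200 hits against 4). With the saturating score, the page that actually mentions both query words comes first.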

Nagy Vilmos

You have no idea how far over your head this subject is. It's a case of: if you need to ask, you don't want to know...

May I suggest you read everything on the subject you can find; Wikipedia has a lot of information. Next you'll need to build a spider to go and read the web for you. It must be able to read eight pages at a time; that's why we call them spiders. Buy some really big storage devices: you're going to need something in the region of 843 petabytes [1] to store the indexes alone.

The easy part is to take the HTML from the spiders doing the crawling and extract the keywords and any links. How you rank the page is up to you; personally I would +1 for every reference to bunnies and -1 for cats.

[1] Figures are +/- 10 orders of magnitude.
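As an illustration of the "easy part", here is a minimal sketch that pulls words and links out of one page using Python's standard `html.parser`. The class name `PageParser` and the sample HTML are made up for the example; real pages are messier, and real keyword extraction would do more than split on non-alphanumerics.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import re

class PageParser(HTMLParser):
    """Collects absolute link targets and visible words from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))

# Hypothetical page used only to exercise the parser.
html = '<html><body><h1>Bunnies</h1><a href="/cats">About cats</a><script>var x = 1;</script></body></html>'
parser = PageParser("http://example.com/")
parser.feed(html)
parser.close()
print(parser.links)
print(parser.words)
```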

Panic, Chaos, Destruction. My work here is done.

sangeeta2009

Well, thanks for the information. Please tell me: isn't it possible to use some available spider code, or do I have to start from scratch? And about the large storage you mentioned: how can that be avoided by using the HTML from the spider, as you mention? Please reply. I just want to develop a working framework, not an actual search engine comparable to Google [not even 20% in comparison].

Nagy Vilmos

This reply shows your total lack of understanding. The spider [plenty of crawlers available - try searching] retrieves the data for you. You then need to index the data. Search engines DO NOT look at every website when performing a search; instead they use a keyword ind... ...why am I bothering? Really, why?
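For readers following along, the point about indexing can be shown with a tiny inverted index: a map from each word to the pages that contain it, so a search is a lookup rather than a scan of every page. This is only a sketch; `build_index`, `search`, and the sample pages are hypothetical names and data, not anything from the thread.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word, using index lookups only."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical crawled pages (URL -> extracted text).
pages = {
    "http://example.com/a": "Bunnies are small mammals",
    "http://example.com/b": "Cats chase small bunnies",
}
index = build_index(pages)
print(search(index, "small bunnies"))
print(search(index, "cats"))
```

The index is built once, at crawl time; the original page text is never touched when answering a query.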

sangeeta2009

Thanks for understanding my lack of knowledge, respected sir. I have 6 months to develop it, and I need at most 2 weeks to understand the whole concept of CRAWLING. Today I just wanted to know: what problems and difficulties will I face once I have proper knowledge of this concept, almost like yours? So please just tell me what concepts to LEARN and what will be difficult for me at that point. Hope you get why you are helping me?? Please reply.

Lost User

Nagy Vilmos wrote:

...why am I bothering? Really, why?

Probably because you are a nice guy, who is motivated by a desire to help others. And every so often you get the sort of feedback that makes it all worthwhile. ;)

Nagy Vilmos

Go back to my original answer. Look on Wikipedia and READ the articles. There is a lot of information there, so start reading.

- You need to be able to retrieve and index pages.
- As you index each page, pull out the links.
- Retrieve and index those.

The likes of Google or Bing are continuously crawling, looking for new or changed pages. The concept is simple; the practice is hard due to the sheer volume. Start a crawler going on your / your college's home page and index ONLY that domain; see how big it gets.
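The retrieve / pull out links / repeat loop, restricted to a single domain, might be sketched as below. Everything here is illustrative: `crawl`, `LinkParser`, the `fetch` callback, and the `college.example` site are assumptions made for the example, and an in-memory dictionary stands in for the network so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to start_url's host.

    fetch is a caller-supplied function (an assumption of this sketch) that
    takes a URL and returns the page's HTML, or None if it cannot be read.
    Returns a dict mapping each crawled URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url  # absolute, without #fragment
            # Stay on the starting host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# A tiny in-memory "site" (hypothetical) replaces real HTTP requests.
site = {
    "http://college.example/": '<a href="/about">About</a> <a href="http://other.example/">Out</a>',
    "http://college.example/about": '<a href="/">Home</a> <a href="/staff#top">Staff</a>',
    "http://college.example/staff": "No links here",
}
crawled = crawl("http://college.example/", site.get)
print(sorted(crawled))
```

Against a real site, `fetch` would wrap an HTTP client, and a polite crawler would also check robots.txt and limit its request rate. The `max_pages` cap is there because even a single domain can turn out to be far larger than expected.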

Nagy Vilmos

You should change that for a sarcasm icon! I am only cuddly in the way that lions are cuddly. Hear me growl "gerr - row - well", scared now? :rolleyes:

Lost User

AAArgh! Mummy! :wtf: