
Why would you want to make a search engine anyway

Why would you want to make a search engine anyway? There already is a search engine to rule them all.

After completing basic search functionality there’s a lot of work before anyone will want to use your new machine.

Another obstacle you will find when you start indexing real Internet content is the vast amount of useless junk floating around everywhere. Eventually your index will become full of spam, affiliate pages, parked domains, work-in-progress homepages without content, link farms used by search engine optimizers, mirror sites using data feeds to create thousands of pages with product listings or other reproduced content, and so on.

The same page is also often reachable under several URLs. If you don’t write special code to handle that, you’ll soon have 4 results in your search engine (one for every URL) all going to the same page.
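To make the duplicate URL problem concrete, here is a minimal sketch of the kind of special code that could collapse several spellings of one address into a single key. The function name canonical_url, the example.org addresses and the list of index file names are assumptions made for illustration. Treating "www." as optional and "index.html" as the directory page is a heuristic that will not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed names for directory index documents; real sites vary.
INDEX_NAMES = {"index.html", "index.htm"}

def canonical_url(url: str) -> str:
    """Map common spellings of one page to a single key (heuristic)."""
    # Bare host names like "example.org" get a default scheme first.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]              # assumption: www.example.org == example.org
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in INDEX_NAMES:
        path = head + "/"            # /index.html -> /
    # Keep the query string, drop the fragment.
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

urls = [
    "example.org",
    "www.example.org",
    "example.org/index.html",
    "www.example.org/index.html",
]
unique = {canonical_url(u) for u in urls}
print(unique)
```

Storing pages under the canonical key instead of the raw URL means the four spellings above end up as one index entry rather than four.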

Refusing to index pages with a query string works, but it will also remove a lot of legitimate content (think forums) from your index. Another option is to strip the query string from page URLs.

When indexing from the Internet you will have to find ways to filter out the junk content from what people are actually reading and searching for.

There are plenty of other details too, like robots.txt, site maps, redirects, proxies, recognizing content types, advanced ranking algorithms as well as handling terabytes of data. I’ll cover more detail in a future article.

PARSING WEBSITES

There’s a million ways, both right and wrong, to write HTML, and when you index from the Internet you will need to handle all of them. When parsing keywords from pages you not only need to handle the complete HTML standard but also all the non-standard ways that are unofficially supported by Internet browsers. This is a large part of the work on a general search engine: being able to read all sorts of content. To be able to read all pages you will also need to parse client-side JavaScript and handle frames, CSS and iframes.

To make good scoring you will also want to boost keywords found in the URL of the page and check the anchor text of inbound links.
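As a rough illustration of the parsing step, here is a minimal sketch built on Python's standard html.parser module, which tolerates a fair amount of sloppy markup. It collects the visible words of a page plus every link with its anchor text, and it skips script and style content. The names PageParser and parse_page are invented for this example, and the sketch does not attempt JavaScript, frames or the other hard cases mentioned above.

```python
from html.parser import HTMLParser
import re

class PageParser(HTMLParser):
    """Collect visible words and (href, anchor text) pairs from one page."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self.links = []              # list of (href, anchor_text)
        self._skip_depth = 0
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        if self._skip_depth:
            return                   # text inside <script> or <style>
        tokens = re.findall(r"[a-z0-9]+", data.lower())
        self.words.extend(tokens)
        if self._href is not None:
            self._anchor.extend(tokens)

def parse_page(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links

# Deliberately messy input: unclosed tags, unquoted attribute, odd casing.
words, links = parse_page(
    "<html><body><h1>Search Engines<p>Read the <a href=/crawl>crawler guide</A>"
    "<script>var x = 'ignored';</script>"
)
print(words)
print(links)
```

The anchor text collected in links is what you would later credit to the target page when scoring inbound links.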

Good luck with your next search engine project. You can use Google to find just about anything on the Internet, and I doubt you will ever have the same computing and storage capabilities as the big G.

The same page can show up under several URLs. Just look at this example: dmoz.org, www.dmoz.org, dmoz.org/index.html and www.dmoz.org/index.html.

There is also the possibility of query strings, where a session ID after the question mark in the URL will create an almost infinite number of URLs for the same web page. To the search engine there will be a really big number of pages all containing the same content. The quick fix of course is to not index pages that include a query string.

To start with you could limit how deep into subdirectories you crawl, how many link hops from a domain index page you crawl and how many links per web page to allow.
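The crawl limits mentioned above (subdirectory depth, link hops from the index page, links per page) can be sketched as a breadth-first walk. This is only an illustration under assumptions: crawl_order, its parameter names and default values are made up, and get_links stands in for the real fetch-and-parse step so the example runs without network access. It also applies the quick fix of skipping URLs that contain a query string.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl_order(start, get_links, max_hops=2, max_dir_depth=3,
                max_links_per_page=50):
    """Breadth-first crawl that enforces three simple limits.

    get_links is a caller-supplied function (hypothetical here) returning
    the outgoing URLs of a page; a real crawler would fetch and parse HTML.
    """
    seen = {start}
    queue = deque([(start, 0)])      # (url, link hops from the start page)
    order = []
    while queue:
        url, hops = queue.popleft()
        order.append(url)
        if hops == max_hops:
            continue                 # do not follow links any further out
        for link in get_links(url)[:max_links_per_page]:
            # Directory depth = path segments before the file name.
            segments = urlsplit(link).path.split("/")[:-1]
            depth = len([s for s in segments if s])
            if link in seen or depth > max_dir_depth or "?" in link:
                continue             # duplicate, too deep, or has a query string
            seen.add(link)
            queue.append((link, hops + 1))
    return order

# Tiny in-memory "site" used instead of real HTTP requests.
site = {
    "http://example.org/": [
        "http://example.org/a/",
        "http://example.org/a/b/c/d/deep.html",
        "http://example.org/forum?sid=123",
    ],
    "http://example.org/a/": ["http://example.org/a/page.html"],
    "http://example.org/a/page.html": ["http://example.org/a/more.html"],
}
order = crawl_order("http://example.org/", lambda u: site.get(u, []))
print(order)
```

In this run the deep page and the session-style URL are never queued, and the page three hops out is not reached because the walk stops following links at two hops.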

An index is not enough. Users will not like you.

To make a search engine you have to do four things:

1. Decide what pages to fetch and fetch them
2. Parse out words, phrases and links from the page
3. Give a score to every keyword or key phrase indicating how well the phrase relates to that page, and store the scores in the search engine index
4. Provide a way for users to query the index and get a list of matching web pages

This is not hard for a seasoned programmer. It can be done in a day if you know regular expressions and have some experience with HTML and databases. If you’re going for a general Internet search engine there are a lot more details you need to include.

Keeping track of inbound links is the most useful and most challenging of the above; you’ll need to keep a separate database table with info on all links between pages you index.

The third application is a customized, high-speed site search for your large (thousands of pages) website.

THE BASICS OF SEARCH

The basis of any BIG search engine is a word to web page index, basically a long list of words and how well they relate to different web pages. What’s challenging is how to score pages to give the end user the search results that are most relevant to his idea of what he is searching for.
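To show what a word to web page index might look like in its simplest form, here is a small in-memory sketch. The class name TinyIndex and the scoring rule (the share of a page's words that match the keyword) are assumptions chosen for brevity, not a recommendation. A real engine would use a database and a far more careful ranking.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyIndex:
    """Word -> {url: score} map; the score here is plain term frequency."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_page(self, url, text):
        words = tokenize(text)
        for word in set(words):
            # Assumed scoring: share of the page's words that are this word.
            self.postings[word][url] = words.count(word) / len(words)

    def search(self, query):
        totals = defaultdict(float)
        for term in tokenize(query):
            for url, score in self.postings.get(term, {}).items():
                totals[url] += score
        # Best-scoring pages first; ties broken by URL for stable output.
        return sorted(totals, key=lambda u: (-totals[u], u))

index = TinyIndex()
index.add_page("http://example.org/engines",
               "search engine basics and search ranking")
index.add_page("http://example.org/manual", "engine repair manual")
results = index.search("search engine")
print(results)
```

Both pages match "engine", but the first one also matches "search", so it collects the higher total and is listed first.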
