make a engine parts
enginesoparts


Making a search engine for the public Internet is tricky, and if you’re like me you like to solve tricky problems. You might also do it to become famous as the creator of the next big search engine, or because as a programmer or engineer you like challenges.

One of the tricky parts is that the same page can show up under several URLs. Just look at dmoz.org for an example. When the URLs differ only in a query string, the quick fix of course is to not index pages that include a query string.

The third application is a customized, high-speed site search for your large website with thousands of pages. An indexed search engine will be a lot faster than a full-text search function, and if Google’s site search isn’t flexible enough for your site you can make your own search functionality. It can be done in a day if you know regular expressions and have some experience with HTML and databases.

THE BASICS OF SEARCH

The basis of any big search engine is a word-to-web-page index, basically a long list of words and how well they relate to different web pages. You’ll need to decide how much weight to put on keywords in the title tag, the description and the main web page contents. You now have all the information you need to make a site search engine.
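As a rough illustration of the word-to-web-page index described above, here is a minimal in-memory sketch. The field weights (5, 3 and 1), the function names and the sample pages are my own assumptions for the example; the right weights are exactly the decision you have to make yourself, and a real site search would store the index in a database.

```python
import re
from collections import defaultdict

# Assumed weights: a keyword in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 5.0, "description": 3.0, "body": 1.0}

WORD_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return WORD_RE.findall(text.lower())


def new_index():
    """Create an empty index: word -> {url -> accumulated score}."""
    return defaultdict(lambda: defaultdict(float))


def add_page(index, url, fields):
    """Add one page; `fields` maps a field name (title, ...) to its text."""
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 0.0)
        for word in tokenize(text):
            index[word][url] += weight


def search(index, query):
    """Return (url, score) pairs for pages matching any query word, best first."""
    scores = defaultdict(float)
    for word in tokenize(query):
        for url, score in index.get(word, {}).items():
            scores[url] += score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pages, only for the example.
index = new_index()
add_page(index, "/engines.html", {
    "title": "Engine parts",
    "description": "Parts for engines",
    "body": "We sell engine parts and tools.",
})
add_page(index, "/tools.html", {
    "title": "Tools",
    "description": "Hand tools",
    "body": "Tools for working on an engine.",
})
results = search(index, "engine")
```

The page with the keyword in its title ranks first because the title weight dominates. Note that this sketch does no stemming, so "engines" and "engine" are treated as different words.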

There are a million ways, both right and wrong, to write HTML, and when you index from the Internet you will need to handle all of them. To be able to read all pages you will also need to parse client-side JavaScript and handle frames, CSS and iframes. This is a large part of the work on a general search engine: being able to read all sorts of content. When indexing from the Internet you will also have to find ways to filter out the junk content from what people are actually reading and searching for.

Keeping track of inbound links is the most useful and most challenging of the scoring signals; you’ll need to keep a separate database table with info on all links between the pages you index.

I’ll cover more detail in a future article. Good luck with your next search engine project.

WHY SO MANY URLS?

Finally, you’ll need to deal with the fact that many websites have many URLs pointing to the same web page, for example with and without the www. prefix and with and without a trailing index.html. All those URLs point to the same web page. If you don’t write special code to handle that, you’ll soon have 4 results in your search engine (one for every URL), all going to the same page.
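One way to handle the many-URLs problem is to reduce every URL to a canonical form before it goes into the index. The sketch below is only an illustration: the rules (drop a leading www., treat index.html and index.htm as the directory itself) are assumptions that hold for many sites but not all, and example.org is a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default document names; real servers can be configured differently.
DEFAULT_DOCUMENTS = ("index.html", "index.htm")


def canonical_url(url):
    """Map common variants of a URL to one canonical form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www.example.org serves the same site
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    for name in DEFAULT_DOCUMENTS:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # keep the trailing slash
            break
    # Drop the fragment; it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


variants = [
    "http://www.example.org",
    "http://example.org/",
    "http://www.example.org/index.html",
    "http://example.org/index.html",
]
canonical = {canonical_url(u) for u in variants}
```

All four variants collapse to a single key, so the index ends up with one result instead of four. A more careful crawler would verify that the variants really return the same content before merging them.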

So why then make your own search engine? To make money, of course!

To make a search engine you have to do four things:

1. Decide what pages to fetch and fetch them
2. Parse out words, phrases and links from the page
3. Give a score to every keyword or key phrase indicating how well the phrase relates to that page, and store the scores in the search engine index
4. Provide a way for users to query the index and get a list of matching web pages

This is not hard for a seasoned programmer. Now you have a working search engine; just add a lot of computers and hard drives and you’ll soon index all of the Internet. If you’re not prepared to go that far, a one-terabyte disk will hold an index of about 50 million pages.

To start with you could limit how deep into sub directories you crawl, how many link hops from a domain index page you crawl and how many links per web page to allow.

For URLs with a session ID in the query string, such as google.com?SID=4387483748377, one option is to strip the query string from pages. This works but will also remove a lot of legitimate content (think forums) from your index.

HOW TO SCORE PAGES

After completing basic search functionality there’s a lot of work before anyone will want to use your new machine. What’s challenging is how to score pages to give the end user the search results that are most relevant to their idea of what they are searching for. To make good scoring you will also want to boost keywords found in the URL of the page and check the anchor text of inbound links.
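The three crawl limits mentioned above (sub directory depth, link hops from the domain index page, links per page) can be sketched as a breadth-first walk. Everything here is a hypothetical illustration: the function names, the default limits and the get_links callback are my own, and the link graph is a plain dictionary so the example runs without touching the network.

```python
from collections import deque
from urllib.parse import urlsplit


def path_depth(url):
    """Count sub directory levels in the URL path, e.g. /a/b/page.html -> 2."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Heuristic (an assumption): a last segment containing a dot is a file name.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    return len(segments)


def plan_crawl(start_url, get_links, max_depth=3, max_hops=2, max_links_per_page=50):
    """Return the URLs to fetch, in breadth-first order, within the limits."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, hops = queue.popleft()
        order.append(url)
        if hops >= max_hops:
            continue  # do not follow links from pages at the hop limit
        for link in get_links(url)[:max_links_per_page]:
            if link in seen or path_depth(link) > max_depth:
                continue
            seen.add(link)
            queue.append((link, hops + 1))
    return order


# Hypothetical link graph standing in for fetched and parsed pages.
LINKS = {
    "http://example.org/": [
        "http://example.org/a/",
        "http://example.org/a/b/c/d/deep.html",
    ],
    "http://example.org/a/": ["http://example.org/a/page.html"],
    "http://example.org/a/page.html": ["http://example.org/too-far.html"],
}
order = plan_crawl("http://example.org/", lambda url: LINKS.get(url, []))
```

With the default limits the deep page is skipped because it sits four directories down, and too-far.html is skipped because it is three hops from the index page. In a real crawler get_links would fetch the page and parse out its links.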

If you’re going for a general Internet search engine there are a lot more details you need to include, like robots.txt, site maps, redirects, proxies, recognizing content types and advanced ranking algorithms, as well as handling terabytes of data. An index is not enough.

WHAT TO INDEX AND NOT TO INDEX

Another obstacle you will find when you start indexing real Internet content is the fact that there are vast amounts of useless junk floating around everywhere, and eventually your index will become full of spam, affiliate pages, parked domains, work-in-progress homepages without content, link farms used by search engine optimizers, mirror sites using data feeds to create thousands of pages with product listings or other reproduced content, and so on. Users will not like you.

There is also the possibility of query strings where a session ID after the question mark in the URL will create almost infinite URLs for the same web page, for example google.com?SID=4434324325325.
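A gentler way to deal with session IDs than throwing away every URL with a query string is to remove only the session parameter and keep the rest. This is a sketch under assumptions: the parameter names in SESSION_PARAMS are common guesses, not a complete list, and the example URLs are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names, compared case-insensitively.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def strip_session_id(url):
    """Remove session ID parameters from the query string, keep the others."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


first = strip_session_id("http://example.com/?SID=4387483748377")
second = strip_session_id("http://example.com/?SID=4434324325325")
forum = strip_session_id("http://example.com/forum?thread=12&SID=1")
```

The two session URLs now map to the same address, while the forum URL keeps its thread parameter and stays in the index.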
