
What is a web crawler and how does it work?

A web crawler is an automated program or script that methodically scans, or crawls, through websites to build an index of the data it is looking for. By creating an index of the particular words found on each page, a search engine can return results that are more relevant and accurate. Google's search engine uses approximately 200 parameters when indexing a website, such as the information in meta tags, the keywords used in the page, keyword density and so on. All of these contribute to the indexing and page ranking of a website.

Figure: Web Crawler Architecture

ALGORITHM FOR WEB CRAWLER

Step – 1 : Begin with the seed URLs (URLs already known to us)

Step – 2 : Fetch and parse them

Step – 3 : Extract the URLs they point to

Step – 4 : Place the extracted URLs on the queue

Step – 5 : Fetch each URL from the queue and repeat from Step 2

EXPLANATION :

A web crawler begins with the seed URLs, i.e. the URLs we already know. These pages are fetched and parsed, and the URLs they point to are enqueued. As long as the queue is not empty, the process repeats with the next unvisited URL from the queue. Because a crawler follows every URL that the seed pages point to, it is also sometimes called a spider or a robot. A minimal sketch of this loop is shown below.
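The sketch below shows the queue-based crawl loop described above using only the Python standard library. The seed URL, the 50-page limit and the absence of politeness rules (robots.txt, crawl delays, retries) are simplifying assumptions for illustration, not a production crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found while parsing a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page they came from.
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_urls, max_pages=50):
    queue = deque(seed_urls)          # Step 1: start from the seed URLs
    visited = set()

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                  # skip pages that fail to download

        parser = LinkExtractor(url)   # Step 2: fetch and parse the page
        parser.feed(html)

        for link in parser.links:     # Steps 3-4: extract URLs and enqueue them
            if link not in visited:
                queue.append(link)

        # Step 5 is the loop itself: keep fetching until the queue is empty.
    return visited


if __name__ == "__main__":
    print(crawl(["https://example.com"], max_pages=5))
```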

Web crawlers can be run as one-off jobs, but for long-term use, as in a search engine, they are automated to revisit previously crawled URLs periodically and watch for significant changes. For example, if a site is experiencing heavy traffic or technical difficulties, the spider may be programmed to note that and revisit the site later, hopefully after the issues have subsided.
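One way to implement such a revisit policy is to keep a schedule of when each known URL is next due to be crawled again. The sketch below is a hypothetical illustration: the once-a-day interval and the fetch_and_index() helper are assumptions, not part of any particular crawler.

```python
import heapq
import time

REVISIT_INTERVAL = 24 * 60 * 60  # revisit each known page once a day (assumed)


def fetch_and_index(url):
    """Placeholder for the crawler's real fetch-and-index step."""
    print(f"re-indexing {url}")


def revisit_loop(known_urls):
    # Min-heap ordered by the time each URL is next due to be crawled.
    schedule = [(time.time(), url) for url in known_urls]
    heapq.heapify(schedule)

    while schedule:
        due_at, url = heapq.heappop(schedule)
        wait = due_at - time.time()
        if wait > 0:
            time.sleep(wait)          # sleep until the next URL is due
        fetch_and_index(url)
        # Re-schedule the same URL for its next periodic visit.
        heapq.heappush(schedule, (time.time() + REVISIT_INTERVAL, url))
```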