Commit 90070b11 authored by jsclose

crawler.h and tests

parent d9a730a4
.idea/*
.vagrant/*
CMakeLists.txt
cmake-build-debug/*
Vagrantfile
/*
 * A crawler must provide:
 *
 * Robustness:
 *   The Web contains servers that create spider traps: generators of web pages that mislead crawlers into getting stuck fetching an unbounded number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side effect of faulty website development.
 *
 * Politeness:
 *   Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
 */
class Crawler {
//robots.txt cache
public:
};
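A minimal sketch of how the robustness and politeness requirements above could be enforced per host. Everything here is an assumption for illustration: `HostPolicy`, `tryAcquire`, the fixed minimum delay, and the per-host page cap are hypothetical names and choices, not part of `crawler.h`. A fixed page cap is only one simple guard against spider traps, and a real politeness delay would normally also take robots.txt into account.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper (not in crawler.h): decides whether a host may be
// fetched right now, based on a minimum delay and a per-host page cap.
class HostPolicy {
public:
    using Clock = std::chrono::steady_clock;

    HostPolicy(std::chrono::milliseconds minDelay, std::size_t maxPagesPerHost)
        : minDelay_(minDelay), maxPagesPerHost_(maxPagesPerHost) {}

    // Returns true and records the fetch if `host` may be visited at `now`.
    // Returns false if the host hit its page cap or was visited too recently.
    bool tryAcquire(const std::string& host, Clock::time_point now) {
        auto it = hosts_.find(host);
        if (it == hosts_.end()) {
            if (maxPagesPerHost_ == 0) return false;
            hosts_.emplace(host, State{now, 1});
            return true;
        }
        State& state = it->second;
        // Robustness: stop fetching from a host that keeps producing pages.
        if (state.pagesFetched >= maxPagesPerHost_) return false;
        // Politeness: leave at least minDelay_ between visits to one host.
        if (now - state.lastFetch < minDelay_) return false;
        state.lastFetch = now;
        ++state.pagesFetched;
        return true;
    }

private:
    struct State {
        Clock::time_point lastFetch;
        std::size_t pagesFetched;
    };

    std::chrono::milliseconds minDelay_;
    std::size_t maxPagesPerHost_;
    std::unordered_map<std::string, State> hosts_;
};
```

A spider would call `tryAcquire(host, HostPolicy::Clock::now())` before each fetch and put the URL back on the frontier when it returns false. If several spiders share one `HostPolicy`, the call needs to be guarded by a mutex, which this sketch leaves out.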
// spiders: threads that do the work of fetching URLs.
// houseKeeper: a thread that is generally quiescent, except that it wakes up once every few seconds to
//   log crawl progress statistics (URLs crawled, frontier size, etc.), decide whether to terminate the
//   crawl, or (once every few hours of crawling) checkpoint the crawl. In checkpointing, a snapshot of
//   the crawler's state (say, the URL frontier) is committed to disk. In the event of a catastrophic
//   crawler failure, the crawl is restarted from the most recent checkpoint.
\ No newline at end of file
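The houseKeeper behaviour described in the comments of `crawler.h` could be sketched roughly as follows. All names (`CrawlStats`, `HouseKeeper`, `tick`, `run`) are hypothetical and do not appear in the commit. The termination rule (URL budget reached or frontier empty) and the "checkpoint every N wake-ups" schedule are assumptions chosen to keep the example small, and the checkpoint itself is left to a caller-supplied callback.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <iostream>
#include <thread>
#include <utility>

// Hypothetical counters that the spider threads would update.
struct CrawlStats {
    std::atomic<std::size_t> urlsCrawled{0};
    std::atomic<std::size_t> frontierSize{0};
};

// Hypothetical housekeeper: logs progress, checkpoints periodically, and
// decides when the crawl should end.
class HouseKeeper {
public:
    HouseKeeper(CrawlStats& stats, std::size_t maxUrls,
                std::chrono::milliseconds wakeInterval,
                std::size_t wakesPerCheckpoint,
                std::function<void()> checkpoint)
        : stats_(stats), maxUrls_(maxUrls), wakeInterval_(wakeInterval),
          wakesPerCheckpoint_(wakesPerCheckpoint),
          checkpoint_(std::move(checkpoint)) {}

    // One wake-up. Returns true if the crawl should terminate.
    bool tick() {
        const std::size_t crawled = stats_.urlsCrawled.load();
        const std::size_t frontier = stats_.frontierSize.load();
        std::clog << "crawled=" << crawled << " frontier=" << frontier << '\n';

        ++wakes_;
        // Checkpoint on every wakesPerCheckpoint_-th wake-up (0 disables it).
        if (wakesPerCheckpoint_ != 0 && wakes_ % wakesPerCheckpoint_ == 0 &&
            checkpoint_) {
            checkpoint_();
        }
        // Assumed termination rule: URL budget reached or nothing left to fetch.
        return crawled >= maxUrls_ || frontier == 0;
    }

    // Sleeps between wake-ups until tick() asks to terminate or the caller
    // sets stopRequested. Intended to be the body of the housekeeper thread.
    void run(const std::atomic<bool>& stopRequested) {
        while (!stopRequested.load()) {
            std::this_thread::sleep_for(wakeInterval_);
            if (tick()) break;
        }
    }

private:
    CrawlStats& stats_;
    std::size_t maxUrls_;
    std::chrono::milliseconds wakeInterval_;
    std::size_t wakesPerCheckpoint_;
    std::function<void()> checkpoint_;
    std::size_t wakes_ = 0;
};
```

The crawler would start this with something like `std::thread t([&] { keeper.run(stop); });` and join it on shutdown. Sleeping for a fixed interval keeps the sketch short. A condition variable would let the thread react to a stop request without waiting out the interval.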
<!DOCTYPE html>
<html>
<head>
<!-- HTML Codes by Quackit.com -->
<title>
Story of Cat</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="keywords" content="cat story">
<meta name="description" content="This is the tale of a cat named Joe">
<style>
body {background-color:#ffffff;background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}
h1{font-family:Arial, sans-serif;color:#000000;background-color:#ffffff;}
p {font-family:Georgia, serif;font-size:14px;font-style:normal;font-weight:normal;color:#000000;background-color:#ffffff;}
</style>
</head>
<body>
<h1>Joe the cat</h1>
<p>On Saturday, Joe the cat went to the store. He climbed up a mountain? It was weird. The store was called Food Store.</p>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<!-- HTML Codes by Quackit.com -->
<title>
Food store is here</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="keywords" content="store food dinner lunch">
<meta name="description" content="The food store sells cat food for dinner, lunch, and breakfast.">
<style>
body {background-color:#ffffff;background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}
h1{font-family:Arial, sans-serif;color:#000000;background-color:#ffffff;}
p {font-family:Georgia, serif;font-size:14px;font-style:normal;font-weight:normal;color:#000000;background-color:#ffffff;}
</style>
</head>
<body>
<h1>Come shop Come shop at our Store</h1>
<p>Please come to our store!</p>
</body>
</html>