crawler.h and tests

Commit 90070b11 authored 7 years ago by jsclose
--- a/.gitignore
+++ b/.gitignore
+.idea/*
+.vagrant/*
+CMakeLists.txt
+cmake-build-debug/*
+Vagrantfile
--- a/crawler/crawler.h
+++ b/crawler/crawler.h
+
+
+/*
+ * Must provide
+ *
+ * Robustness:
+ * The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
+ *
+ * Politeness:
+ * Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
+ */
+
+
+class Crawler {
+
+    //robots.txt cache
+
+
+
+
+public:
+
+
+
+
+
+
+
+
+
+
+};
+
+
+// spiders: threads that do the work of fetching URLs
+// houseKeeper: this thread is generally quiescent, except that it wakes up once every few seconds to log crawl
+// progress statistics (URLs crawled, frontier size, etc.), decide whether to terminate the crawl, or (once every
+// few hours of crawling) checkpoint the crawl. In checkpointing, a snapshot of the crawler's state (say, the URL
+// frontier) is committed to disk. In the event of a catastrophic crawler failure, the crawl is restarted from
+// the most recent checkpoint.
\ No newline at end of file
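The houseKeeper comment in crawler/crawler.h describes checkpointing the URL frontier to disk and restarting from the most recent checkpoint. The following is only a minimal sketch of that idea, not code from this commit. The names `saveCheckpoint` and `loadCheckpoint`, the one-URL-per-line file format, and the use of a `std::deque<std::string>` as the frontier are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <deque>
#include <fstream>
#include <string>

// Hypothetical helper, not part of the commit.
// Writes the frontier to `path`, one URL per line. The data is written to a
// temporary file first and then renamed over `path`, so a crash in the middle
// of a write does not corrupt the previous checkpoint. Replacing an existing
// file with std::rename works on POSIX systems; other platforms may need a
// different replace step.
bool saveCheckpoint(const std::deque<std::string>& frontier, const std::string& path) {
    assert(!path.empty());
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        if (!out) return false;
        for (const std::string& url : frontier) out << url << '\n';
        out.flush();
        if (!out) return false;
    }  // the stream is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Hypothetical helper, not part of the commit.
// Reads a checkpoint written by saveCheckpoint. A missing or unreadable file
// yields an empty frontier, which the caller can treat as "start from seeds".
std::deque<std::string> loadCheckpoint(const std::string& path) {
    std::deque<std::string> frontier;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) frontier.push_back(line);
    }
    return frontier;
}
```

A houseKeeper thread could call `saveCheckpoint` on its periodic wake-up while holding whatever lock protects the frontier, and the crawler could call `loadCheckpoint` once at startup. A real checkpoint would probably also need the set of already-visited URLs, which this sketch leaves out.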
--- a/tests/cats.html
+++ b/tests/cats.html
+<!DOCTYPE html>
+<html>
+<head>
+<!-- HTML Codes by Quackit.com -->
+<title>
+Story of Cat</title>
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<meta name="keywords" content="cat story">
+<meta name="description" content="This is the tale of a cat named Joe">
+<style>
+body {background-color:#ffffff;background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}
+h1{font-family:Arial, sans-serif;color:#000000;background-color:#ffffff;}
+p {font-family:Georgia, serif;font-size:14px;font-style:normal;font-weight:normal;color:#000000;background-color:#ffffff;}
+</style>
+</head>
+<body>
+<h1>Joe the cat</h1>
+<p>On Saturday, Joe the cat went to the store. He climbed up a mountain? It was weird. The store was called Food Store.</p>
+</body>
+</html>
--- a/tests/store.html
+++ b/tests/store.html
+<!DOCTYPE html>
+<html>
+<head>
+<!-- HTML Codes by Quackit.com -->
+<title>
+Food store is here</title>
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<meta name="keywords" content="store food dinner lunch">
+<meta name="description" content="The food store sells cat food for dinner, lunch, and breakfast.">
+<style>
+body {background-color:#ffffff;background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}
+h1{font-family:Arial, sans-serif;color:#000000;background-color:#ffffff;}
+p {font-family:Georgia, serif;font-size:14px;font-style:normal;font-weight:normal;color:#000000;background-color:#ffffff;}
+</style>
+</head>
+<body>
+<h1>Come shop at our Store</h1>
+<p>Please come to our store!</p>
+</body>
+</html>
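The Robustness and Politeness requirements in the crawler/crawler.h comment could be enforced with a small piece of per-host bookkeeping. The sketch below is one possible shape for it, not something the commit contains. The class name `HostGate`, the fixed minimum delay, and the per-host page budget are assumptions. A real crawler would likely take the delay from each host's robots.txt and use a more careful trap heuristic than a flat page cap.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper, not part of the commit: tracks per-host state so a
// spider can ask whether it may fetch from a host at a given moment.
class HostGate {
public:
    using Clock = std::chrono::steady_clock;

    HostGate(std::chrono::milliseconds minDelay, std::size_t maxPagesPerHost)
        : minDelay_(minDelay), maxPagesPerHost_(maxPagesPerHost) {
        assert(minDelay.count() >= 0);
    }

    // Returns true and records the visit if `host` may be fetched at `now`.
    // Returns false if the host was fetched less than minDelay ago
    // (politeness) or has used up its page budget (a crude guard against
    // spider traps that generate pages without end).
    bool tryAcquire(const std::string& host, Clock::time_point now) {
        HostState& state = hosts_[host];
        if (state.pagesFetched >= maxPagesPerHost_) return false;
        if (state.pagesFetched > 0 && now - state.lastFetch < minDelay_) return false;
        state.lastFetch = now;
        ++state.pagesFetched;
        return true;
    }

    // Number of fetches granted so far for `host`.
    std::size_t pagesFetched(const std::string& host) const {
        auto it = hosts_.find(host);
        return it == hosts_.end() ? 0 : it->second.pagesFetched;
    }

private:
    struct HostState {
        Clock::time_point lastFetch{};
        std::size_t pagesFetched = 0;
    };

    std::chrono::milliseconds minDelay_;
    std::size_t maxPagesPerHost_;
    std::unordered_map<std::string, HostState> hosts_;
};
```

The timestamp is passed in rather than read inside `tryAcquire`, which keeps the class easy to test. Because several spider threads would share one gate, callers would need to guard it with a mutex. The sketch itself is not thread-safe.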