In a 21st-century world, we're increasingly living our lives online, sharing everything in one quick click from vacation photos and observances of daily life to breaking news stories and expressions of our deepest insecurities. It's easy to ignore or forget the millions of people who suddenly have access to your life online, but behind the clever screen names and witty captions hides a dark digital world with real dangers and risks.
Runtime: 60 minutes
Web of Lies - Web crawler - Netflix
A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering). Web search engines and some other sites use Web crawling or spidering software to update their web content or indices of others sites' web content. Web crawlers copy pages for processing by a search engine which indexes the downloaded pages so users can search more efficiently. Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and “politeness” come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For instance, including a robots.txt file can request bots to index only parts of a website, or nothing at all. The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today relevant results are given almost instantly. Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping (see also data-driven programming).
Web of Lies - Crawling the deep web - Netflix
A vast amount of web pages lie in the deep or invisible web. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find these pages if there are no links that point to them. Google's Sitemaps protocol and mod oai are intended to allow discovery of these deep-Web resources. Deep web crawling also multiplies the number of web links to be crawled. Some crawlers only take some of the URLs in form. In some cases, such as the Googlebot, Web crawling is done on all text contained inside the hypertext content, tags, or text. Strategic approaches may be taken to target deep Web content. With a technique called screen scraping, specialized software may be customized to automatically and repeatedly query a given Web form with the intention of aggregating the resulting data. Such software can be used to span multiple Web forms across multiple Websites. Data extracted from the results of one Web form submission can be taken and applied as input to another Web form thus establishing continuity across the Deep Web in a way not possible with traditional web crawlers. Pages built on AJAX are among those causing problems to web crawlers. Google has proposed a format of AJAX calls that their bot can recognize and index.
Web of Lies - References - Netflix