The Knowledgebase contains a variety of information about 80legs.
General
How does 80legs work?
We've put together a distributed grid computing system that connects over 50,000 computers to provide computing power and bandwidth for 80legs users. When a user submits a job, that job is processed by 80legs' back-end and then URLs to crawl are distributed (along with the user's 80app) to the thousands of computers in the grid. Once each computer crawls and processes the content at its designated set of URLs, the results are sent back to 80legs and compiled for the user.
What is an 80legs job?
An 80legs job is a custom crawl that is run just for you. You specify a set of seed URLs and criteria for your crawl, and 80legs runs the crawl in our distributed cloud and gives you your results. Simple 80legs jobs take advantage of our built-in regex/string matching. More advanced 80legs jobs use an 80app to control the crawl and/or the analysis that is done on the page content accessed during the crawl. To run your first job, start at the portal and follow the instructions on the wiki. Also, see the FAQ - Jobs section below.
What is an 80app?
An 80app is custom code created by a third party to run in 80legs for the purpose of controlling a crawl or analyzing content. The two main components of an 80app are:
- parseLinks() - choose exactly which links you want to follow from any given document. This is typically used to find links in HTML files, but it can easily be used to process other document types (e.g. .pdf, .doc, or even images) as well.
- processDocument() - extract or compute exactly what you want from any given document. You can use processDocument() to process HTML, images, or any other document type to perform exactly the analysis you need.
80apps always implement the I80App interface. To make an 80app, start with the 80app page, but also see the information about the I80App interface and read the Custom Code FAQ.
How fast is 80legs?
Depending on the workload on 80legs at the time of your job and the amount of analysis you are performing on each page, your crawl will probably proceed at a rate of about 1 million pages every 5-10 minutes. If your crawl is broad, is not getting throttled due to a lack of domain diversity, has a light processing load, and there aren't many competing jobs on the system, it might go significantly faster.
How does 80legs crawl web pages?
There are many factors that affect the results of any web crawler. Here's how 80legs handles a few of these issues:
User Agent
We identify our web spider as 008. It crawls with its user agent set to the latest Firefox release.
Robots.txt
008 obeys access restrictions in all robots.txt files. 80legs servers retrieve the robots.txt file for each domain that we are actively crawling once every 3 hours, although in some circumstances the robots.txt file may be retrieved more frequently.
How we crawl
008 uses regular expressions to retrieve URLs from HTML elements that contain the attributes "href" and "src". It ignores URLs that are contained within <script> elements and URLs that begin with "javascript:", "mailto:", "#", or "about:".
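The link-extraction rules above can be illustrated with a small sketch. This is not the actual 008 implementation: the class name `LinkExtractorSketch` and both regular expressions are assumptions made for illustration, and the sketch drops whole <script> elements, including any attributes on the <script> tag itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based link extraction; not the real 008 code.
final class LinkExtractorSketch {
    // Matches href="..." or src="..." with single or double quotes (illustrative pattern).
    private static final Pattern ATTR =
        Pattern.compile("(?i)\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1");
    // Matches whole <script>...</script> elements so that URLs inside them are ignored.
    private static final Pattern SCRIPT =
        Pattern.compile("(?is)<script\\b.*?</script>");
    // URL prefixes that the FAQ says are ignored.
    private static final String[] SKIPPED_PREFIXES = {"javascript:", "mailto:", "#", "about:"};

    static List<String> extract(String html) {
        String withoutScripts = SCRIPT.matcher(html).replaceAll("");
        List<String> urls = new ArrayList<>();
        Matcher m = ATTR.matcher(withoutScripts);
        while (m.find()) {
            String url = m.group(2).trim();
            if (!url.isEmpty() && !isSkipped(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    private static boolean isSkipped(String url) {
        String lower = url.toLowerCase();
        for (String prefix : SKIPPED_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Calling `LinkExtractorSketch.extract(html)` returns the attribute values in document order. Relative URLs are returned as-is, so a real crawler would still need to resolve them against the page URL.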
DNS resolution
80legs servers resolve the DNS for all domains that we are actively crawling once an hour, although in some circumstances this may occur more frequently. As of now, all 80legs DNS resolution servers are located in Houston, Texas.
Rate limiting
80legs defaults to an average rate limit of 1 page per domain per second, but this rate limit may be increased over time for certain domains. Some sites have a higher average rate limit. Additional rate limiting rules follow:
- We rate limit each individual IP to 10 hits per second, regardless of domain.
- Due to the distributed nature of the 80legs crawler, 008 might hit a domain or IP more frequently than its rate limit for a brief interval, followed by an interval with no hits. 80legs rate limits are computed as average rates over time.
- 80legs can manually override the rate limit for any domain based on the domain-owner's request. Rate limits can be increased or decreased. Contact us with concerns.
- Domain-specific rules:
  - We rate limit domains based on a guess at their parent domain. For the gTLD domains listed here, we assume a parent domain has two levels (for example, google.com). For all others, we assume three levels (for example, google.co.uk).
  - If the rate at which 80legs receives requests to crawl a domain exceeds the rate limit for that domain significantly over a sustained period of time, 80legs will automatically double the rate limit for that domain, up to 10 pages per domain per second.
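As a rough illustration of the domain-specific rules, the sketch below guesses a parent domain and doubles a rate limit up to the stated cap. The class name `RateLimitSketch` is hypothetical, and the short gTLD set is only a stand-in for the list the FAQ links to; the actual 80legs logic may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the parent-domain guess and rate-limit doubling described above.
final class RateLimitSketch {
    // Assumed stand-in for the gTLD list referenced in the FAQ; not the real list.
    private static final Set<String> ASSUMED_GTLDS =
        new HashSet<>(Arrays.asList("com", "net", "org", "info", "biz"));
    static final double MAX_PAGES_PER_SECOND = 10.0;

    /** Guesses the parent domain: two levels for assumed gTLDs, three levels otherwise. */
    static String guessParentDomain(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        String tld = labels[labels.length - 1];
        int levels = ASSUMED_GTLDS.contains(tld) ? 2 : 3;
        int start = Math.max(0, labels.length - levels);
        return String.join(".", Arrays.copyOfRange(labels, start, labels.length));
    }

    /** Doubles a rate limit, capped at 10 pages per domain per second. */
    static double doubleRateLimit(double currentPagesPerSecond) {
        return Math.min(currentPagesPerSecond * 2.0, MAX_PAGES_PER_SECOND);
    }
}
```

With these assumptions, `www.google.com` and `news.google.co.uk` map to `google.com` and `google.co.uk` respectively, and a limit of 8 pages per second doubles only as far as 10.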
What is the release schedule?
See the release page for coming features and historical releases.
How can I retrieve all the content from the pages I crawl?
This is the most common thing that people want to do when they first come to 80legs. It is possible, but it is not the best use of 80legs. See this FAQ question for more details.
Jobs
How do I use 80legs jobs?
Jobs are the basic entity you create when using 80legs. Jobs control when you want to crawl, how often you want to crawl, what you want to crawl, how you want to analyze content, and much more. Jobs are divided into four major settings categories:
- Job settings: provide scheduling control
- Crawl settings: provide control over crawling settings (which parts of the web you want to access)
- Analysis settings: provide control over what content to analyze and how to analyze it
- Result settings: provide control over how you want to retrieve results
Should I submit my job to the sandbox or the live server?
The sandbox is for running small test jobs in a controlled environment. When you run a sandbox job, it runs on one of our servers and we save the stdout from your I80App methods for you to see in the portal. This is very useful for debugging your code. If you are using the built-in string/regex matching functionality without custom code, there is no reason to use the sandbox. Also, when you are ready for larger tests, you will need to use the live server.
Why does my crawl fail to go past my seed link(s)?
There are many reasons for this. You can find the reason the page was skipped in your Crawled URLs results file. The most common reasons are:
- Robots.txt - We obey robots.txt, so if access to the page you are attempting to access is restricted by the domain's robots.txt file we will skip it.
- The page does not exist, contains no outgoing links, or was not read properly for some reason. If you believe that we should be able to read the page, please contact support.
Why is my job so slow?
Almost without exception, slow running jobs are due to throttling. We limit the rate we crawl single domains and single IP addresses (see the rate limiting section above). In the worst case, if your job is limited to a single domain it will run at a rate of slightly less than 1-10 pages/second depending on the domain. This will be a frustratingly slow 3600-36000 pages/hour. To ensure your crawl works as fast as possible, please follow these tips:
- Do not limit your job to staying on specific domains unless necessary. This includes making your "Crawl Regular Expression" as broad as possible, or even blank if appropriate.
- If you implement a custom parseLinks() method in your I80App, ensure that it is generating sufficient diversity for your crawl.
Can I control the crawl?
Yes, you can control the crawl at least three different ways:
- You can use the settings in the 80legs Web Portal to control the way your job runs. All of the Crawl Settings control the way your job moves from page to page.
- You can create an 80app with a parseLinks() method. This method lets you control exactly which pages you want to visit from each document. You can use this to do more sophisticated link following so you can:
  - find extra links
  - choose to not follow certain links
  - extract links from unusual sources (e.g. javascript, pdfs, docs, images, etc)
  - speculatively create links to try based on the current URL
- You can use the 80legs API to start a job, add URLs to the job, and retrieve results dynamically. This will let you control every aspect of choosing the pages you need to visit. This is a very advanced feature that will not be needed by many users (most custom applications will be able to use parseLinks() above), but it will be quite powerful for those that need it.
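To make the idea of speculatively creating links more concrete, here is a minimal sketch of a helper that a parseLinks() method could call. It assumes a hypothetical `page=N` query parameter; the class name `SpeculativeLinkSketch` and the URL layout are illustrative only, and real sites will use other pagination schemes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses the "next page" URL from the current URL.
final class SpeculativeLinkSketch {
    // Assumes a "page=N" query parameter with at most nine digits (illustrative only).
    private static final Pattern PAGE_PARAM =
        Pattern.compile("([?&]page=)(\\d{1,9})(?!\\d)");

    /** Returns the URL with its page number incremented, or empty if there is none. */
    static Optional<String> nextPageUrl(String currentUrl) {
        Matcher m = PAGE_PARAM.matcher(currentUrl);
        if (!m.find()) {
            return Optional.empty();
        }
        int next = Integer.parseInt(m.group(2)) + 1;
        String url = currentUrl.substring(0, m.start(2)) + next + currentUrl.substring(m.end(2));
        return Optional.of(url);
    }
}
```

A parseLinks() implementation could add the returned URL to its link collection when it is present. The speculative URL may not exist, in which case the crawl simply records a failed fetch for it.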
Is there an API that we can use instead of the Web Portal to control jobs programmatically?
Yes, we have released Java and .NET APIs. See the 80legs API page for details about the available APIs.
Results
What are the 80legs results?
When your jobs complete, 80legs will create result files that you can download. See more information about results.
What are the two file types that I get in my results?
The first one is the Crawled URLs. This file is a comma-separated-values (.csv) file that contains the URLs that were crawled, the crawl status (e.g. 200, 404, robots.txt, etc), and the analysis status. The second file type is the Analysis Results. If you used our built-in string/regex matching functionality, the results for the various options are described here. Otherwise, the results are in an 80legs-specific file format (.80 files) that contains pairs of URLs and your custom results as returned from processDocument(). To read these results in Java, you can use the CustomerResults class.
Why do I get NO_PROCESS (or other) values in my crawl results file?
See here for a complete description of the values. Specifically, NO_PROCESS shows up in your crawl results when the document from the current url cannot or should not be processed. Examples of reasons why it might not be processed are that your MIME type excludes it, the page was not crawled for some reason, or your analysis regular expression excluded the url.
How big are the results?
The Crawled URLs files average about 25-30MB per million URLs. The Analysis Results files can be much larger because you control the size of the analysis results. For example, the method mentioned above to extract all the page contents will produce huge results files that you will need to download. A 100-million-page crawl where each result is 10KB will return 1TB (one terabyte) of results. This will take over 18 days to download if you have a fairly typical 5Mbps broadband connection and will still take about 1 day if you have a 100Mbps connection. So please consider doing as much processing inside 80legs as possible to reduce the size of your results!
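The download-time figures can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1KB = 1,000 bytes, 1Mbps = 1,000,000 bits per second) and a connection that sustains its full line rate; the class name `DownloadTimeSketch` is just for illustration.

```java
// Worked arithmetic for the results-size example; assumes decimal units and a fully used link.
final class DownloadTimeSketch {
    /** Days needed to download the given number of bytes at the given line rate. */
    static double daysToDownload(double totalBytes, double megabitsPerSecond) {
        double bits = totalBytes * 8.0;
        double seconds = bits / (megabitsPerSecond * 1_000_000.0);
        return seconds / 86_400.0; // seconds per day
    }

    public static void main(String[] args) {
        // 100 million results of 10KB each = 1TB
        double totalBytes = 100_000_000.0 * 10_000.0;
        System.out.printf("5 Mbps:   %.1f days%n", daysToDownload(totalBytes, 5));
        System.out.printf("100 Mbps: %.1f days%n", daysToDownload(totalBytes, 100));
    }
}
```

Under these assumptions the result is roughly 18.5 days at 5Mbps and roughly 0.9 days at 100Mbps. Real transfers are usually slower than the line rate.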
How long do you keep my results?
While in beta, we don't keep them for a specified length of time, but we intend to keep them for weeks. We will soon add additional results storage rules. We are considering keeping the results files for seven days.
How do I get my custom results into a database?
After running your job with your 80app, to use your job results in a database or your own application you will first need to extract them from the portal (or the soon-to-be-released API). After you extract your results, you will have lots of url/result pairs. The url is just the String representing the document that was crawled. The result is the byte[] that you returned from your 80app's processDocument() function. You need to deserialize the result yourself based on how you created the byte[] in your processDocument() code. Once everything is extracted, you can populate your own database or custom application as needed.
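Because you choose the byte[] layout yourself, the encoder and decoder need to agree on it. The sketch below shows one possible layout for a hypothetical `PageSummary` record (a title and a word count). The class and its fields are assumptions for illustration and are not part of the 80legs API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical result record with a matching encoder and decoder.
final class PageSummary {
    final String title;
    final int wordCount;

    PageSummary(String title, int wordCount) {
        this.title = title;
        this.wordCount = wordCount;
    }

    /** Encodes the record; this is the kind of byte[] processDocument() could return. */
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(title);     // length-prefixed string (limited to 65535 encoded bytes)
            out.writeInt(wordCount); // 4-byte big-endian integer
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes a result read back from a results file, mirroring toBytes(). */
    static PageSummary fromBytes(byte[] result) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(result));
            String title = in.readUTF();
            int wordCount = in.readInt();
            return new PageSummary(title, wordCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After decoding each url/result pair this way, the fields can be written to a database row keyed by the URL. A text format such as XML or delimited UTF-8 lines would work just as well, as long as the reader mirrors the writer.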
What format do I need to return from processDocument() in my 80app?
See answer below in FAQ - Custom Code.
Custom Code
How do I get started writing custom code to analyze content in 80legs?
You need to create your own 80app. See the Writing Your Code section to get started with a sample. From our Google code repository, see a simple sample or a complete 80app implementing our own custom regex/string matching.
How can I just get the content of all the pages I crawl?
We limit the results from any given page (see current limits). The reason we limit the results is that we want our users to get accustomed to doing their processing inside 80legs, since that is the best way to use the system. Pulling back all of the contents from the pages ends up using lots of our bandwidth and your bandwidth. As an example, if you want to parse out information from inside the pages you crawl on yourdomain.com, just create the code necessary to do the parsing using our I80App interface. You would write a processDocument() method for your I80App class that pulls out what you need from the page and returns it. Something like this:
public byte[] processDocument(byte[] documentContent, String url, Map<String, String> headers, String statusCodeLine) {
    return extractWhatIsNeeded(documentContent); // you will write extractWhatIsNeeded() to do what you want
}
That said, you can simply write a processDocument() method for your 80app that does the following and it will return as much of the page as we allow. It is currently limited, but we'll let you pay the bandwidth costs for extra results later if that's the way you need to go. (Please see the question about result file sizes in the Results section.)
public byte[] processDocument(byte[] documentContent, String url, Map<String, String> headers, String statusCodeLine) {
    return documentContent;
}
What libraries do I have access to when writing an 80app?
Generally, any non-platform specific libraries that are included in the JRE are available to you. If you need to include other libraries, we recommend that you include the source code for them in your jar. We currently do not support loading 3rd-party jars from your 80app, but we are developing a way to wrap up 3rd party functionality using our 80Lib concept.
What format do I need to return from processDocument() in my 80app?
The return value from processDocument() is completely free-form. You can return whatever data you want in a byte[] in whatever format you want. This can be structured data like XML, binary data like image information, or unstructured text information. Your final results will be presented to you in .80 files that contain URL/Result pairs where the Result is the byte[] you returned for the given URL.
Do I have access to a common store from my 80app?
No, any given instance of your 80app knows nothing about any other documents being processed. It should be written solely to process the single document that it is given. Your 80app's lifecycle does not guarantee any order of pages or knowledge of anything else in the 80legs universe except what is passed into your I80App object.
How does 80legs "data" work?
If your 80app needs custom data that varies with each job, you can use 80legs data to avoid creating a new 80app for each variation. Just upload the data to the portal using the interface and select it when creating a new job. The contents of your data file will be passed into your 80app's initialize() method as a byte[].
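As an illustration, suppose the uploaded data file holds one keyword per line in UTF-8. The helper below shows how the byte[] received in initialize() might be turned into a list. The file format and the class name `JobDataSketch` are assumptions; the exact initialize() signature is defined by the I80App interface and is not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for turning an uploaded data file into a keyword list.
final class JobDataSketch {
    /** Parses a data file containing one keyword per line, encoded as UTF-8. */
    static List<String> parseKeywords(byte[] data) {
        List<String> keywords = new ArrayList<>();
        if (data == null) {
            return keywords; // no data file was selected for this job
        }
        String text = new String(data, StandardCharsets.UTF_8);
        for (String line : text.split("\\r?\\n")) {
            String keyword = line.trim();
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }
}
```

An 80app could call `JobDataSketch.parseKeywords(data)` once during initialization and keep the list in a field for use in processDocument().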
Why do I sometimes get strange characters in my strings? Do I need to take any special care with character encoding?
Your 80app executes across many different machines with potentially different character encodings than you are used to seeing. We recommend that you use the UTF-8 character encoding explicitly to ensure that you don't get any other character encodings mixed in.
To make a string from a byte[] use this:
String s = new String(myByteArray, 0, myByteArray.length, "UTF-8");
To get the UTF-8 bytes from a String, use this:
byte[] b = myString.getBytes("UTF-8");
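On a Java 7 or later runtime, the same conversions can also be written with the `StandardCharsets.UTF_8` constant, which avoids the checked UnsupportedEncodingException thrown by the string-named variants. This is a small sketch under that assumption; whether the grid machines run such a runtime is not stated here, and the class name `Utf8Sketch` is illustrative.

```java
import java.nio.charset.StandardCharsets;

// Sketch of explicit UTF-8 conversions using the charset constant (Java 7+).
final class Utf8Sketch {
    /** Decodes bytes as UTF-8 regardless of the machine's default encoding. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes a string as UTF-8 regardless of the machine's default encoding. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

A round trip such as `Utf8Sketch.decode(Utf8Sketch.encode(s))` returns the original string for any valid Unicode text, which is not guaranteed when the platform default encoding is used.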
Should I compress the results from processDocument()?
No, we compress the data internally and compress your results in large chunks, so you won't gain any size reduction by compressing inside processDocument() and you will spend a little more cpu-time to do it.
How can I modify the default 80legs links using my 80app's parseLinks()?
Our concept with the 80app when using parseLinks is that we want the app developer to decide how to parse the links if they want to. If they decide to use our default link parsing, that's easy. If they have their own link parsing, it's easy.
If you just want to edit our link list, it's still easy but a little more involved because we don't want to burn the CPU time running our parseLinks() method if you aren't going to use the results. So, the way to do it is to include our default link parsing code (the DefaultParseLinks class and dependencies from the Google repository: http://code.google.com/p/eightylegs/source/browse/#svn/trunk/80AppDefaultApp/src/com/eightylegs/customer/default80legs). Then your parseLinks() method can just call our DefaultParseLinks class and you can add or eliminate any urls.
Here's an example:
public Collection<String> parseLinks(byte[] documentContent, String url, Map<String, String> headers, String statusCodeLine) {
    try {
        boolean allowQueryStrings = Boolean.valueOf(properties.getProperty(ContentSelectUserParameterStrings.ALLOW_QUERY_STRINGS, "false"));
        String pageContents = new String(documentContent, 0, documentContent.length, "UTF-8");
        Collection<String> tmpLinks = DefaultParseLinks.parseLinks(pageContents, new URL(url), allowQueryStrings); // our default link parsing
        // insert your code here to modify tmpLinks as necessary
        return tmpLinks; // return the edited list so the crawl follows it
    }
    catch (Exception e) {
        // link parsing failed; fall through and return no links
    }
    return null;
}
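The step of modifying the link list could look something like the sketch below, which keeps only links on one domain and then appends extra links. The class name `LinkFilterSketch`, the method `editLinks()`, and the filtering rule are hypothetical examples and are not part of the 80legs code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper for editing a collection of parsed links.
final class LinkFilterSketch {
    /** Keeps links whose host is the given domain or a subdomain of it, then adds extras. */
    static Collection<String> editLinks(Collection<String> links,
                                        String allowedDomain,
                                        Collection<String> extraLinks) {
        List<String> edited = new ArrayList<>();
        for (String link : links) {
            if (isOnDomain(link, allowedDomain)) {
                edited.add(link);
            }
        }
        edited.addAll(extraLinks);
        return edited;
    }

    private static boolean isOnDomain(String link, String domain) {
        try {
            String host = new URI(link).getHost();
            if (host == null) {
                return false; // relative or malformed link with no host
            }
            host = host.toLowerCase();
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (URISyntaxException e) {
            return false; // drop links that cannot be parsed
        }
    }
}
```

Inside parseLinks(), the collection returned by the default link parser would be passed through a helper like this before being returned. Keep in mind that restricting a crawl to one domain makes it subject to that domain's rate limit.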
How can I get the outgoing links of the pages I crawl?
You can simply implement the processDocument() method for your 80app that returns outgoing links. You can use your own link parsing or our default link parsing.
Here's an example:
public byte[] processDocument(byte[] documentContent, String url, Map<String, String> headers, String statusCodeLine) {
    StringBuilder resultSb = new StringBuilder();
    try {
        boolean allowQueryStrings = Boolean.valueOf(properties.getProperty(ContentSelectUserParameterStrings.ALLOW_QUERY_STRINGS, "false"));
        String pageContents = new String(documentContent, 0, documentContent.length, "UTF-8");
        Collection<String> tmpLinks = DefaultParseLinks.parseLinks(pageContents, new URL(url), allowQueryStrings); // our default link parsing
        for (String tmp : tmpLinks) {
            resultSb.append(tmp).append("\r\n"); // it's up to you how you want to format the results
        }
        return resultSb.toString().getBytes("UTF-8");
    }
    catch (Exception e) {
        // link parsing failed; fall through and return no result
    }
    return null;
}
Security
Can other customers see my results?
No 80legs customer can see the results of any other customer.
Can other customers use my custom 80app?
No 80legs customer can use your custom 80app without your permission. In the future we will allow customers to share their custom 80legs code, but it will always be optional.
What additional steps can I take for security?
If you have a particularly sensitive code base or results information, we recommend taking the following steps:
- Obfuscate your jar.
- Encrypt your results inside your 80app's processDocument() method.
- Keep your ideas confidential and only tell them to people that need to know.
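One way to encrypt results inside processDocument() is sketched below using AES in GCM mode from the standard javax.crypto package. It assumes a Java 8 or later runtime where the "AES/GCM/NoPadding" transformation is available; the class name `ResultEncryptionSketch` is hypothetical, and key management (how the key reaches your 80app and your decryption tool) is outside the scope of this sketch.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of encrypting a result before returning it from processDocument().
final class ResultEncryptionSketch {
    private static final int IV_BYTES = 12;  // commonly recommended nonce size for GCM
    private static final int TAG_BITS = 128; // authentication tag length

    /** Returns IV followed by ciphertext, for a 16-, 24-, or 32-byte AES key. */
    static byte[] encrypt(byte[] plaintext, byte[] key) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv); // fresh random IV for every result
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                        new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(plaintext);
            byte[] out = Arrays.copyOf(iv, iv.length + ciphertext.length);
            System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    /** Reverses encrypt(): reads the IV from the front, then decrypts and verifies the rest. */
    static byte[] decrypt(byte[] ivAndCiphertext, byte[] key) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                        new GCMParameterSpec(TAG_BITS, ivAndCiphertext, 0, IV_BYTES));
            return cipher.doFinal(ivAndCiphertext, IV_BYTES, ivAndCiphertext.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}
```

Each encrypted result is 28 bytes longer than the plaintext (12 bytes of IV plus a 16-byte tag). Note that a key embedded in your jar is only as safe as the jar itself, which is one reason to combine this step with obfuscation.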
Can 80legs be used for a DDOS attack?
No. 80legs is tightly controlled by the 80legs team. Several measures have been put in place to protect against any unauthorized or abusive use of 80legs, including:
- Throttling crawls run by 80legs jobs on a global, system-wide level
- Controlling the processing and analysis code run by users
- Actively monitoring the jobs being run
Billing
How much does it cost?
See rates here.
Do you charge CPU time when we do not execute any custom code?
You will actually be charged a very small amount of cpu-time for the link parsing and custom string matching. This typically comes out to around $0.25 per million pages crawled, but it can be higher if you have some complex regex matching or lots of strings to match. We run our default link-parsing and regex/string matching through a custom 80app of our own. We start the clock on the cpu-time immediately before parseLinks() and stop it immediately after processDocument(). You are not charged cpu-time for any other job-related overheads.
Content
What type of content can I access from the web?
You can access anything that is accessible through a simple URL without session information as long as it is not larger than our maximum document size (see limits).
Can I access Flash videos?
Not currently, but this is in development for a future release.
Can I access content behind forms or use POST?
Not currently, but this is in development for a future release.
Webmasters
Why is 80legs crawling my site?
80legs is crawling your site because one of our users has set up a job with certain crawling specifications that led to the crawling of your site. We're very interested in proper crawling behavior; you can refer to the question on how 80legs crawls web pages for information on how we are crawling your site. There are actually some important benefits to allowing 80legs to crawl your site. If you have more questions about 80legs, please contact us.
What benefit does my site get from allowing 80legs to crawl it?
By allowing 80legs to crawl your site, you encourage developers that would otherwise use unregulated crawling tools to use a highly controlled and manageable crawling service. In other words, if your site is crawled by 80legs, you can be sure that it's crawled at a rate that your servers can handle. In fact, you can contact us and let us know exactly at what rate you would like your site to be crawled.
What is 008?
008 is the user-agent used by 80legs' crawler. You can learn more about 008 and how 80legs crawls sites by reading this FAQ.
How do I slow down 008 when crawling my site?
The easiest solution is to contact us and let us know at what rate you would like us to throttle crawls on your site (in terms of requests/second).
How do I stop 008 from crawling my site?
Rather than stopping 008 from crawling your site altogether, you can ask us to slow the rate at which 008 crawls it (see the previous question). If you feel you must stop 008 from crawling, you can do so by adding the following lines to your robots.txt file:
User-agent: 008
Disallow: /
To block access to all polite crawlers including 008:
User-agent: *
Disallow: /
If you don't have a robots.txt file on your server, simply add one to your site's root directory.