Introduction
80Apps are custom code written by you that allow you to process the content on crawled pages any way you want. This means that there are near-limitless ways to use 80legs to crawl and analyze the web. Possible applications include semantic analysis, video processing, information extraction, page scraping, and much more.
An 80App can be selected when creating a job. When selected for a job, the 80App will be run on the content of any page that is analyzed by the job. The basic process flow is shown below:
- Job crawls a page.
- Job pulls down content of page.
- Job runs the 80App's processDocument() function, which you write. It takes the page content as input and returns a result.
- If the 80App specifies it, the job runs the 80App's parseLinks() function, which you also write. It takes the page content as input and determines which pages to crawl next.
- Once the job completes, all results from processDocument are stored in a .80 file, available for download by you.
This simple process repeats over every page that your job crawls. To get started with writing your own 80Apps, just follow the instructions below.
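The flow above can be sketched in plain Java. This is only an illustration: the PageApp interface, its method signatures, and the TitleApp class are hypothetical stand-ins, and the real 80legs interface may differ. The sketch also uses java.nio.charset.StandardCharsets (Java 7+) for brevity, so adapt it if you target an older JVM.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the 80App contract; the real 80legs
// interface and its method signatures may differ.
interface PageApp {
    byte[] processDocument(byte[] content, String url);
    List<String> parseLinks(byte[] content, String url);
}

// Example app: records each page's title and follows absolute http(s) links.
class TitleApp implements PageApp {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HREF =
            Pattern.compile("href=\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Takes the page content as input and returns a result ("url<TAB>title").
    @Override
    public byte[] processDocument(byte[] content, String url) {
        String html = new String(content, StandardCharsets.UTF_8);
        Matcher m = TITLE.matcher(html);
        String title = m.find() ? m.group(1).trim() : "";
        return (url + "\t" + title).getBytes(StandardCharsets.UTF_8);
    }

    // Takes the page content as input and decides which pages to crawl next.
    @Override
    public List<String> parseLinks(byte[] content, String url) {
        String html = new String(content, StandardCharsets.UTF_8);
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

A regular expression is used here only to keep the sketch short; a real 80App would likely want a proper HTML parser bundled into its JAR.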
Setting Up Your Development Environment
You'll first need to set up your development environment. Two popular choices are Eclipse and NetBeans. Follow the instructions below to set up your preferred development environment:
Of course, you aren't required to use either of these IDEs and can use any development environment you wish.
Writing Your Code
After you've set up your development environment, it's time to start writing your first 80App. Follow the instructions below to write an 80App in the appropriate environment:
Exporting Your JAR
Now that your code is finished, you'll need to export it in the form of a JAR so you can use it in 80legs. Follow the instructions below to export your 80App in the appropriate environment:
Testing Your Code
To test your own 80legs analysis code in Java, follow these steps:
- Test your code on your local machine.
  - Download the latest 80legsprocesstest.jar from http://code.google.com/p/eightylegs/downloads/list.
  - Test your code locally (instructions on using 80legsprocesstest) until it works for your test cases.
- Test your code in the Sandbox using small live page crawls.
  - Upload your JAR file to 80legs in the Code section, which is accessible through the 80legs Portal.
  - Test your code by creating a job in the Sandbox environment (selectable when creating a job).
    - Instructions on creating a job are here.
  - Run your Sandbox job and check your results.
  - Debug your code as necessary.
Running Your Code
To run your own 80legs analysis code in Java, follow these steps:
- Upload your code (Note: We do not require you to upload source code, only JAR files).
  - Go to the Code section in the Web Portal and click 'Upload new code'.
  - Give your code a name and select your JAR file.
  - Click the 'Upload Code' button.
- Run the approval process.
  - Once your code is uploaded, you'll be asked to run the approval process.
  - If you want, you can select data to run with your code during approval.
  - Click the 'Run Approval Process' button.
  - If your code fails the approval process, you will be given an error code corresponding to the reason your code failed. See the Error Codes page for more information.
- After your code is approved, create an 80legs job, specify which JAR file you want to use for computations, and run the job in the Live environment.
- Retrieve the results from your job in the 80legs Portal.
  - Extract your results from the .80 files using the instructions here.
Limitations on Your Code
80legs runs your code on a distributed computing system, which consists of a wide variety of computers. Due to the heterogeneous nature of our infrastructure, we must impose a few limitations on your code.
JVM
Your code will only run on computers in this network that have Java 1.5+. At most 256MB of memory is available on these nodes, and in some cases less. 80legs runs code in a very limited Java sandbox, which prevents you from making any network connections. Here is the policy file used by 80legs that specifies the permissions granted to your code:
grant codeBase {
permission java.lang.RuntimePermission "stopThread";
permission java.net.SocketPermission "localhost:1024-", "listen";
permission java.util.PropertyPermission "java.version", "read";
permission java.util.PropertyPermission "java.vendor", "read";
permission java.util.PropertyPermission "java.vendor.url", "read";
permission java.util.PropertyPermission "java.class.version", "read";
permission java.util.PropertyPermission "os.name", "read";
permission java.util.PropertyPermission "os.version", "read";
permission java.util.PropertyPermission "os.arch", "read";
permission java.util.PropertyPermission "file.separator", "read";
permission java.util.PropertyPermission "path.separator", "read";
permission java.util.PropertyPermission "line.separator", "read";
permission java.util.PropertyPermission "java.specification.version", "read";
permission java.util.PropertyPermission "java.specification.vendor", "read";
permission java.util.PropertyPermission "java.specification.name", "read";
permission java.util.PropertyPermission "java.vm.specification.version", "read";
permission java.util.PropertyPermission "java.vm.specification.vendor", "read";
permission java.util.PropertyPermission "java.vm.specification.name", "read";
permission java.util.PropertyPermission "java.vm.version", "read";
permission java.util.PropertyPermission "java.vm.vendor", "read";
permission java.util.PropertyPermission "java.vm.name", "read";
};
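Because the policy grants read access only to the properties listed above, a call such as System.getProperty("user.home") would be refused with a SecurityException inside the sandbox. A minimal sketch of a defensive read follows; the SandboxProps class and its read() helper are hypothetical names introduced for illustration.

```java
// Hypothetical helper: reads a system property and falls back to a default
// when the property is unset or the sandbox policy denies access.
class SandboxProps {
    static String read(String key, String fallback) {
        try {
            String value = System.getProperty(key);
            return value != null ? value : fallback;
        } catch (SecurityException e) {
            // Thrown when a security manager is active and the policy
            // does not grant "read" on this property.
            return fallback;
        }
    }
}
```

For example, SandboxProps.read("java.version", "unknown") should succeed under the policy above, while a property that is not listed would return the fallback instead of aborting your code.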
Data Size
We require that data be less than 10MB in size. The smaller your data, the better it is for you and us.
Code Size
Your JAR file must be less than 10MB in size as well. Again, smaller is better.
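If you want to catch an oversized JAR or data file before uploading, a small local check is enough. The sketch below assumes "10MB" means 10 * 1024 * 1024 bytes, which this page does not confirm; SizeCheck and its methods are hypothetical names.

```java
import java.io.File;

// Hypothetical pre-upload check against the size limit.
class SizeCheck {
    // Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes.
    static final long LIMIT_BYTES = 10L * 1024 * 1024;

    // True when the size is strictly below the limit.
    static boolean withinLimit(long sizeBytes) {
        return sizeBytes >= 0 && sizeBytes < LIMIT_BYTES;
    }

    // True when the path is an existing regular file below the limit.
    static boolean fileWithinLimit(File file) {
        return file.isFile() && withinLimit(file.length());
    }
}
```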
Time Limits
80legs enforces time limits on all custom code. If a limit is exceeded for a given page, processing of that document is abandoned and a timeout error is logged for your job. If your job generates too many timeout errors, it will be stopped (reasons that jobs are stopped). The current limits are:
- Your constructor and initialize() method must complete within 60 seconds.
- Your parseLinks() and processDocument() methods must complete within a total of 10 seconds per document processed.
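When testing locally, it can help to check that your per-document work fits the budget before you upload. The following is a rough sketch for use on your own machine only: TimeBudget and finishesWithin() are hypothetical names, timings on your machine will differ from those on 80legs nodes, and 80legs enforces its limits itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical local test helper; not part of 80legs.
class TimeBudget {
    // Runs the task on a worker thread and reports whether it
    // completed within limitMillis.
    static <T> boolean finishesWithin(Callable<T> task, long limitMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = pool.submit(task);
            future.get(limitMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the task overran its budget
        } catch (ExecutionException e) {
            // The task itself failed; surface the underlying cause.
            throw new RuntimeException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt the worker if it is still running
        }
    }
}
```

To mirror the limits above, you would wrap your parseLinks() and processDocument() calls for one document in a single task and pass 10000 as the limit.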
Other Useful Resources
Serialization:
- Simple class serialization in Java here.
- Evolved class serialization in Java here.
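As a quick illustration of simple class serialization, the sketch below round-trips an object through a byte array using the standard java.io classes. PageResult and ResultCodec are hypothetical names, and the sketch assumes you want to hand results back as bytes; check the 80legs documentation for the format it actually expects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical result type, used only for illustration.
class PageResult implements Serializable {
    // An explicit serialVersionUID keeps old data readable after
    // compatible changes to the class (see evolved serialization).
    private static final long serialVersionUID = 1L;
    final String url;
    final int wordCount;

    PageResult(String url, int wordCount) {
        this.url = url;
        this.wordCount = wordCount;
    }
}

// Hypothetical helper that converts serializable objects to and from bytes.
class ResultCodec {
    static byte[] toBytes(Serializable value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(buffer);
            out.writeObject(value);
            out.close();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    static Object fromBytes(byte[] bytes) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
            try {
                return in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException("deserialization failed", e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class not on classpath", e);
        }
    }
}
```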