Here are the steps to achieve what you require and work around the issues other members have cited.
- Create a PHP crawler function, e.g. crawl_url() (avoid naming it parse_url(), since that is already a PHP built-in and redeclaring it is a fatal error).
- Access the URL using the file() function
- Download an existing HTML parser from phpclasses.org (in particular, one that converts HTML into a PHP array) and extract all the href attributes from the A tags.
- Store the HTML content and the URL in a database. Keep another column, "last_updated_on", set to NOW().
- Store ALL the extracted URLs with last_updated_on as NULL.
- Then use another function to select every URL from the database whose last_updated_on is NULL and run your crawler function on each of them.
- The above will not be able to follow links generated by JavaScript.
- To crawl images you will need to use the HTML2Array class again, this time extracting the src attribute of IMG tags.
- To overcome PHP's execution time limit, either call the set_time_limit() function OR run the PHP file from the Windows/Linux command line,
- e.g. c:\php>php.exe c:\phpwork\myapp.php
- To further improve the efficiency of your crawler, read about:
- MySQL full-text search with MATCH() ... AGAINST()
- the strip_tags() function
- how to store BLOBs in compressed form in MySQL
- storing an MD5 hash of each page's HTML in the database and comparing it against the live URL's content before performing updates.
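The link-extraction step above can be sketched with PHP's bundled DOMDocument class, standing in for the phpclasses.org HTML2Array parser mentioned above; the function name extract_links() is illustrative:

```php
<?php
// Sketch only: pull every href attribute out of the <a> tags in an HTML
// string. DOMDocument tolerates real-world, invalid markup once libxml
// error reporting is silenced.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // don't warn on messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}
```

The same loop with getElementsByTagName('img') and getAttribute('src') covers the image-crawling step.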
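Extracted hrefs are often relative, so before queueing them with last_updated_on = NULL they need to be made absolute. A simplified sketch using the built-in parse_url() follows; the function name absolutize() is an assumption, and a production crawler should implement full RFC 3986 resolution (e.g. ".." segments are not normalised here):

```php
<?php
// Sketch only: turn an extracted href into an absolute URL so it can be
// stored in the crawl queue. Handles absolute, scheme-relative,
// root-relative and simple path-relative links.
function absolutize(string $base, string $href): string
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;                               // already absolute
    }
    $p = parse_url($base);
    if (substr($href, 0, 2) === '//') {
        return $p['scheme'] . ':' . $href;          // scheme-relative
    }
    $origin = $p['scheme'] . '://' . $p['host']
            . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($href !== '' && $href[0] === '/') {
        return $origin . $href;                     // root-relative
    }
    // path-relative: replace the last path segment of the base URL
    $dir = isset($p['path']) ? preg_replace('#/[^/]*$#', '/', $p['path']) : '/';
    return $origin . $dir . $href;
}
```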
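The MD5-comparison and compressed-storage points can be combined as in this minimal sketch: hash the uncompressed HTML for change detection, and gzcompress() the copy destined for the BLOB column (the zlib extension provides both functions; page_changed() is an illustrative name):

```php
<?php
// Sketch only: skip the database UPDATE when a freshly fetched page is
// byte-identical to the stored copy, judged by MD5 of the raw HTML.
function page_changed(string $freshHtml, string $storedMd5): bool
{
    return md5($freshHtml) !== $storedMd5;
}

$html      = '<html><body>Hello</body></html>';
$blob      = gzcompress($html);   // what would live in the compressed cell
$storedMd5 = md5($html);          // hash of the *uncompressed* HTML

if (page_changed($html, $storedMd5)) {
    // recompress and UPDATE the row, refreshing last_updated_on = NOW()
} else {
    // unchanged: no write needed
}
```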
--
Eduardo R. Coll