Monday, April 6, 2009

PHP Crawler

Very late reply, but nonetheless!

Here are the steps to achieve what you require and overcome the issues cited by other members.
  1. Create a PHP function such as "crawl_url()" (avoid the name parse_url(), which is already a built-in PHP function and cannot be redeclared).
    1. Access the URL using the file() function.
    2. Download an existing HTML parser from phpclasses.org (particularly the one that converts HTML to a PHP array) and use it to extract the href attributes from all A tags.
  2. Store the content of the HTML and the URL in a database. Keep another column, "last_updated_on", set to NOW().
  3. Store ALL the extracted URLs with last_updated_on as NULL.
  4. Then use another function to fetch all URLs from the database that have last_updated_on as NULL and run crawl_url() on each of them.
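To make the loop in the steps above concrete, here is a minimal sketch. It makes a few assumptions that are not in the original post: PHP's built-in DOMDocument is swapped in for the phpclasses.org parser, a plain PHP array stands in for the database table, links are assumed to be absolute (relative URLs are not resolved), and the names extract_links() and crawl() are made up for illustration.

```php
<?php
// Sketch only: an in-memory array replaces the database table, and
// extract_links() / crawl() are hypothetical names, not from the post.

/** Return the unique, non-empty href values of all <a> tags in $html. */
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world pages are often malformed; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

/**
 * Crawl from $startUrl. $fetch is any callable that returns the HTML of a
 * URL, or false on failure. Returns url => ['html', 'last_updated_on'] rows.
 */
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $pages = [$startUrl => ['html' => null, 'last_updated_on' => null]];
    $processed = 0;

    while ($processed < $maxPages) {
        // Step 4: find the next URL whose last_updated_on is still NULL.
        $url = null;
        foreach ($pages as $candidate => $row) {
            if ($row['last_updated_on'] === null) {
                $url = (string) $candidate;
                break;
            }
        }
        if ($url === null) {
            break; // nothing left to crawl
        }

        // Steps 1 and 2: fetch the page and store it with a timestamp.
        $html = $fetch($url);
        $pages[$url] = [
            'html' => $html === false ? null : $html,
            'last_updated_on' => date('Y-m-d H:i:s'),
        ];
        $processed++;
        if ($html === false) {
            continue;
        }

        // Step 3: queue newly found URLs with last_updated_on = NULL.
        foreach (extract_links($html) as $link) {
            if (!array_key_exists($link, $pages)) {
                $pages[$link] = ['html' => null, 'last_updated_on' => null];
            }
        }
    }
    return $pages;
}
```

For a live run you could pass 'file_get_contents' as the $fetch argument; the $maxPages cap is there only so the sketch cannot run forever on a large site.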
Notes
  1. The above will not be able to crawl any JavaScript links.
  2. To crawl image maps you will need to use the HTML2Array class again to extract the SRC attribute of IMG tags.
  3. To overcome the time limit issues, either use the set_time_limit() function OR run the PHP file from the Windows/Linux command line
    1. e.g. c:\php>php.exe c:\phpwork\myapp.php
  4. To further improve the efficiency of your crawler, read about
    1. Full-text search with MATCH()
    2. The strip_tags() function (PHP has no strip_html())
    3. How to store BLOBs in compressed cells in MySQL.
    4. Storing the MD5 hash of the HTML in the database and comparing it against the live URL before performing updates.
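Two of the efficiency tips above are easy to show in code. This is a rough sketch, not a definitive implementation: page_text() and has_changed() are hypothetical helper names, and the stored hash is assumed to come from a column you add to your own table.

```php
<?php
// Hypothetical helpers illustrating the strip_tags() and MD5 tips.

/** Reduce a page to plain text, e.g. for a full-text index. */
function page_text(string $html): string
{
    // Remove script/style bodies first; strip_tags() alone keeps their text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

/** True when the live HTML differs from the MD5 hash stored earlier. */
function has_changed(string $liveHtml, ?string $storedMd5): bool
{
    // A NULL hash means the page was never stored, so treat it as changed.
    return $storedMd5 === null || md5($liveHtml) !== $storedMd5;
}
```

With this, the crawler only needs to rewrite a row (and re-index its text) when has_changed() returns true, which saves database writes on pages that have not moved since the last visit.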
Link: http://www.webmaster-talk.com/php-forum/36717-php-crawler.html

--
Eduardo R. Coll
