
Crawl websites

Website plugin

The Website plugin is responsible for managing drivers and orchestrating the crawl process for websites.

Directory Structure

.
├── driver_plugin
│   ├── manage
│   │   └── driver_manage.py
│   └── operators
│       └── websites
│           ├── article_comment.py
│           ├── article.py
│           └── review.py

Operators

Operators represent tasks in a DAG, encapsulating the logic for each step of the workflow.
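
The operators below are intended to be wired into a DAG like ordinary tasks. A minimal sketch, assuming an Airflow 2.x setup; the import paths follow the directory structure above, but the mapping of classes to files and the constructor arguments (beyond task_id) are assumptions, not the plugin's actual API.

```python
# Hypothetical DAG wiring for the website crawl operators.
from datetime import datetime

from airflow import DAG
from driver_plugin.operators.websites.article import WebsiteOperator
from driver_plugin.operators.websites.article_comment import CommentOperator
from driver_plugin.operators.websites.review import ReviewOperator

with DAG(
    dag_id="website_crawl",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    crawl_articles = WebsiteOperator(task_id="crawl_articles")
    crawl_comments = CommentOperator(task_id="crawl_comments")
    crawl_reviews = ReviewOperator(task_id="crawl_reviews")

    # articles must be crawled before their comments can be refreshed
    crawl_articles >> crawl_comments
```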

1. WebsiteOperator

Handles the collection of articles from websites and associated comments.

Workflow (see the sketch after this list):

  • Extract Job Parameters:
    • Retrieves parameters such as website_id, last_date, cities, keywords, and crawl_limit.
    • Parses and URL-encodes keywords and cities if necessary.

  • Crawl Articles and Comments:
    • Uses the driver's get_publications_from_search method to retrieve articles and their associated comments.
    • The crawl process iterates through all combinations of cities and keywords. For each city and each keyword in the provided lists:
      • Performs a targeted search using the city and keyword pair.
      • Collects up to crawl_limit publications per city-keyword pair.
      • For each collected article:
        • Saves it to MongoDB using ArticleHook.
        • Prepares related media (e.g., article images and videos).
        • Processes comments associated with the article:
          • Saves each comment using CommentHook.
          • Prepares media such as author profile pictures and comment images.
        • If the article is recent and the system allows it, triggers the creation of a new comment job using prepare_comment_job, allowing for future updates on the article's comments.

  • Comment Job Creation:
    • If the article is recent and the operator is available, prepares a new_article_comment_job in MongoDB for future comment updates.

  • Error Handling:
    • Raises PublicationCrawlException if any error occurs during the crawl process.
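
A minimal sketch of the crawl loop described above. The driver and hook objects are assumed to expose get_publications_from_search and save methods; these signatures, and passing prepare_comment_job in as a callable, are illustrative assumptions rather than the plugin's actual implementation.

```python
from urllib.parse import quote_plus


class PublicationCrawlException(Exception):
    """Raised when any step of the article crawl fails (name taken from this doc)."""


def crawl_articles(driver, article_hook, comment_hook, prepare_comment_job, params):
    """Iterate over every city/keyword pair and persist what is found."""
    crawl_limit = params.get("crawl_limit", 100)
    try:
        for city in params.get("cities", []):
            for keyword in params.get("keywords", []):
                publications = driver.get_publications_from_search(
                    keyword=quote_plus(keyword),   # URL-encode when needed
                    city=quote_plus(city),
                    last_date=params.get("last_date"),
                    limit=crawl_limit,             # limit applies per city-keyword pair
                )
                for article in publications:
                    article_hook.save(article)     # persist the article to MongoDB
                    # media preparation (article images, videos) would happen here
                    for comment in article.get("comments", []):
                        comment_hook.save(comment)  # persist each comment to MongoDB
                        # media preparation (author picture, comment images) goes here
                    if article.get("is_recent"):
                        # schedule a follow-up job to refresh this article's comments
                        prepare_comment_job(article, params["website_id"])
    except Exception as exc:
        raise PublicationCrawlException(str(exc)) from exc
```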

2. CommentOperator

Collects new comments for a previously crawled article. Note that, for now, none of the websites we handle expose comments.

Workflow (see the sketch after this list):

  • Extract Job Parameters:
    • Extracts fields such as article_id, article_url, website_id, last_date, and crawl_limit.

  • Retrieve Comments:
    • Uses the driver's get_comments_from_post method to retrieve comments since the last date.

  • Process Comments:
    • For each comment:
      • Saves the comment to MongoDB using CommentHook.
      • Prepares media:
        • Author profile pictures.
        • Images related to the comment.

  • Quota Management:
    • Updates API quotas if using an API-based tool.

  • Logging and Error Handling:
    • Logs the number of comments crawled.
    • Raises an exception if any error occurs during crawling.
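
A rough sketch of the comment-refresh flow, assuming the same hook and driver conventions as above; the method signatures and the job dictionary keys are assumptions based on the parameter names listed in this section.

```python
import logging

logger = logging.getLogger(__name__)


def crawl_comments(driver, comment_hook, job):
    """Fetch and persist new comments for one previously crawled article."""
    try:
        comments = driver.get_comments_from_post(
            article_url=job["article_url"],
            last_date=job.get("last_date"),
            limit=job.get("crawl_limit"),
        )
        for comment in comments:
            comment_hook.save(comment)  # persist the comment to MongoDB
            # media preparation (author profile picture, comment images) goes here
        # API quota updates would be applied here when an API-based tool is used
        logger.info(
            "Crawled %d comments for article %s", len(comments), job["article_id"]
        )
        return len(comments)
    except Exception as exc:
        raise RuntimeError(f"Comment crawl failed: {exc}") from exc
```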

3. ReviewOperator

Handles the collection of reviews for websites (e.g., schools or colleges) using the driver.

Workflow (see the sketch after this list):

  • Extract Job Parameters:
    • website_id, last_date, crawl_limit, initial_crawl, type, nom, ville, secteur, and an optional fiche_url.

  • Retrieve Reviews:
    • Uses the driver's get_reviews method with parameters such as school type, name, city, sector, and date.
    • Limits the number of reviews fetched according to crawl_limit.

  • Process Reviews:
    • Saves each review to MongoDB via ReviewHook.
    • Supports both required fields (author, comment, rating_overall, date) and optional numeric ratings (rating_ambiance_eleves, rating_enseignement_professeurs, rating_soutien_eleves, rating_exigence_scolaire, rating_sport_culture).

  • Logging and Error Handling:
    • Logs the number of reviews crawled and saved.
    • Raises exceptions in case of crawl failures.
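
A sketch of how a review document might be assembled and saved, assuming a ReviewHook with a save method and a driver exposing get_reviews; the field names follow the list above, while the call signatures are illustrative.

```python
def crawl_reviews(driver, review_hook, job):
    """Fetch reviews for one establishment and persist them field by field."""
    reviews = driver.get_reviews(
        type=job["type"],        # e.g., school or college
        nom=job["nom"],
        ville=job["ville"],
        secteur=job["secteur"],
        last_date=job.get("last_date"),
    )[: job.get("crawl_limit", 100)]   # cap at crawl_limit

    for review in reviews:
        document = {
            # required fields
            "author": review["author"],
            "comment": review["comment"],
            "rating_overall": review["rating_overall"],
            "date": review["date"],
        }
        # optional numeric ratings are copied only when present
        for key in (
            "rating_ambiance_eleves",
            "rating_enseignement_professeurs",
            "rating_soutien_eleves",
            "rating_exigence_scolaire",
            "rating_sport_culture",
        ):
            if review.get(key) is not None:
                document[key] = review[key]
        review_hook.save(document)  # persist the review to MongoDB

    return len(reviews)
```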