
Crawl websites

Website plugin

The Website plugin is responsible for managing drivers and orchestrating the crawl process for websites.

Directory Structure

.
├── driver_plugin
│   ├── manage
│   │   └── driver_manage.py
│   └── operators
│       └── websites
│           ├── article_comment.py
│           ├── article.py
│           └── review.py

Operators

Operators represent tasks in a DAG, encapsulating the logic for each step of the workflow.
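
The operators below are intended to be wired into a DAG like ordinary tasks. A minimal sketch, assuming an Airflow 2.x setup; the import paths follow the directory structure above, but the mapping of classes to files and the constructor arguments (beyond task_id) are assumptions, not the plugin's actual API.

```python
# Hypothetical DAG wiring for the website crawl operators.
from datetime import datetime

from airflow import DAG
from driver_plugin.operators.websites.article import WebsiteOperator
from driver_plugin.operators.websites.article_comment import CommentOperator
from driver_plugin.operators.websites.review import ReviewOperator

with DAG(
    dag_id="website_crawl",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    crawl_articles = WebsiteOperator(task_id="crawl_articles")
    crawl_comments = CommentOperator(task_id="crawl_comments")
    crawl_reviews = ReviewOperator(task_id="crawl_reviews")

    # articles must be crawled before their comments can be refreshed
    crawl_articles >> crawl_comments
```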

1. WebsiteOperator

Handles the collection of articles from websites and associated comments.

Workflow (see the sketch after this list):

  • Extract Job Parameters:
    • Retrieves parameters such as website_id, last_date, cities, keywords, and crawl_limit.
    • Parses and URL-encodes keywords and cities if necessary.

  • Crawl Articles and Comments:
    • Uses the driver's get_publications_from_search method to retrieve articles and their associated comments.
    • The crawl process iterates through all combinations of cities and keywords. For each city and each keyword in the provided lists:
      • Performs a targeted search using the city and keyword pair.
      • Collects up to crawl_limit publications per city-keyword pair.
      • For each collected article:
        • Saves it to MongoDB using ArticleHook.
        • Prepares related media (e.g., article images and videos).
        • Processes comments associated with the article:
          • Saves each comment using CommentHook.
          • Prepares media such as author profile pictures and comment images.
        • If the article is recent and the system allows it, triggers the creation of a new comment job using prepare_comment_job, allowing for future updates on the article's comments.

  • Comment Job Creation:
    • If the article is recent and the operator is available, prepares a new_article_comment_job in MongoDB for future comment updates.

  • Error Handling:
    • Raises PublicationCrawlException if any error occurs during the crawl process.
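
A minimal sketch of the crawl loop described above. The driver and hook objects are assumed to expose get_publications_from_search and save methods; these signatures, and passing prepare_comment_job in as a callable, are illustrative assumptions rather than the plugin's actual implementation.

```python
from urllib.parse import quote_plus


class PublicationCrawlException(Exception):
    """Raised when any step of the article crawl fails (name taken from this doc)."""


def crawl_articles(driver, article_hook, comment_hook, prepare_comment_job, params):
    """Iterate over every city/keyword pair and persist what is found."""
    crawl_limit = params.get("crawl_limit", 100)
    try:
        for city in params.get("cities", []):
            for keyword in params.get("keywords", []):
                publications = driver.get_publications_from_search(
                    keyword=quote_plus(keyword),   # URL-encode when needed
                    city=quote_plus(city),
                    last_date=params.get("last_date"),
                    limit=crawl_limit,             # limit applies per city-keyword pair
                )
                for article in publications:
                    article_hook.save(article)     # persist the article to MongoDB
                    # media preparation (article images, videos) would happen here
                    for comment in article.get("comments", []):
                        comment_hook.save(comment)  # persist each comment to MongoDB
                        # media preparation (author picture, comment images) goes here
                    if article.get("is_recent"):
                        # schedule a follow-up job to refresh this article's comments
                        prepare_comment_job(article, params["website_id"])
    except Exception as exc:
        raise PublicationCrawlException(str(exc)) from exc
```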

2. CommentOperator

Collects new comments for a previously crawled article. Note that, for now, none of the websites we handle expose comments.

Workflow (see the sketch after this list):

  • Extract Job Parameters:
    • Extracts fields such as article_id, article_url, website_id, last_date, and crawl_limit.

  • Retrieve Comments:
    • Uses the driver's get_comments_from_post method to retrieve comments since the last date.

  • Process Comments:
    • For each comment:
      • Saves the comment to MongoDB using CommentHook.
      • Prepares media:
        • Author profile pictures.
        • Images related to the comment.

  • Quota Management:
    • Updates API quotas if using an API-based tool.

  • Logging and Error Handling:
    • Logs the number of comments crawled.
    • Raises an exception if any error occurs during crawling.
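
A rough sketch of the comment-refresh flow, assuming the same hook and driver conventions as above; the method signatures and the job dictionary keys are assumptions based on the parameter names listed in this section.

```python
import logging

logger = logging.getLogger(__name__)


def crawl_comments(driver, comment_hook, job):
    """Fetch and persist new comments for one previously crawled article."""
    try:
        comments = driver.get_comments_from_post(
            article_url=job["article_url"],
            last_date=job.get("last_date"),
            limit=job.get("crawl_limit"),
        )
        for comment in comments:
            comment_hook.save(comment)  # persist the comment to MongoDB
            # media preparation (author profile picture, comment images) goes here
        # API quota updates would be applied here when an API-based tool is used
        logger.info(
            "Crawled %d comments for article %s", len(comments), job["article_id"]
        )
        return len(comments)
    except Exception as exc:
        raise RuntimeError(f"Comment crawl failed: {exc}") from exc
```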

3. ReviewOperator

Handles the collection of reviews for websites (e.g., schools or colleges) using the driver.

Workflow (see the sketch after this list):

  • Extract Job Parameters:
    • website_id, last_date, crawl_limit, initial_crawl, type, nom, ville, secteur, and an optional fiche_url.

  • Retrieve Reviews:
    • Uses the driver's get_reviews method with parameters such as school type, name, city, sector, and date.
    • Limits the number of reviews fetched according to crawl_limit.

  • Process Reviews:
    • Saves each review to MongoDB via ReviewHook.
    • Supports both required fields (author, comment, rating_overall, date) and optional numeric ratings (rating_ambiance_eleves, rating_enseignement_professeurs, rating_soutien_eleves, rating_exigence_scolaire, rating_sport_culture).

  • Logging and Error Handling:
    • Logs the number of reviews crawled and saved.
    • Raises exceptions in case of crawl failures.
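
A sketch of how a review document might be assembled and saved, assuming a ReviewHook with a save method and a driver exposing get_reviews; the field names follow the list above, while the call signatures are illustrative.

```python
def crawl_reviews(driver, review_hook, job):
    """Fetch reviews for one establishment and persist them field by field."""
    reviews = driver.get_reviews(
        type=job["type"],        # e.g., school or college
        nom=job["nom"],
        ville=job["ville"],
        secteur=job["secteur"],
        last_date=job.get("last_date"),
    )[: job.get("crawl_limit", 100)]   # cap at crawl_limit

    for review in reviews:
        document = {
            # required fields
            "author": review["author"],
            "comment": review["comment"],
            "rating_overall": review["rating_overall"],
            "date": review["date"],
        }
        # optional numeric ratings are copied only when present
        for key in (
            "rating_ambiance_eleves",
            "rating_enseignement_professeurs",
            "rating_soutien_eleves",
            "rating_exigence_scolaire",
            "rating_sport_culture",
        ):
            if review.get(key) is not None:
                document[key] = review[key]
        review_hook.save(document)  # persist the review to MongoDB

    return len(reviews)
```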