# Website plugin
The Website plugin is responsible for managing drivers and orchestrating the crawl process for websites.
## Directory Structure
```
.
├── driver_plugin
│   ├── manage
│   │   └── driver_manage.py
│   ├── operators
│   │   └── websites
│   │       ├── article_comment.py
│   │       ├── article.py
│   │       └── review.py
```
## Operators
Operators represent tasks in a DAG, encapsulating the logic for each step of the workflow.
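As an illustration, here is a minimal sketch of what such an operator could look like. The class name `BaseCrawlOperator`, the `execute` method, and the `job` dict are assumptions for illustration, not the plugin's actual interface.

```python
# Hypothetical sketch only: `BaseCrawlOperator`, `execute`, and the `job`
# dict are illustrative assumptions, not the plugin's real API.
from abc import ABC, abstractmethod


class BaseCrawlOperator(ABC):
    """One DAG task: receives job parameters and performs one crawl step."""

    def __init__(self, driver):
        self.driver = driver  # crawl driver, managed by driver_manage.py

    @abstractmethod
    def execute(self, job: dict) -> None:
        """Run the step using the parameters carried by `job`."""
```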
### 1. WebsiteOperator
Handles the collection of articles from websites and associated comments.
Workflow:

- Extract Job Parameters:
    - Retrieves parameters such as `website_id`, `last_date`, `cities`, `keywords`, and `crawl_limit`.
    - Parses and URL-encodes keywords and cities if necessary.
- Crawl Articles and Comments (see the sketch after this list):
    - Uses the driver's `get_publications_from_search` method to retrieve articles and their associated comments.
    - The crawl process iterates through all combinations of cities and keywords:
        - For each `city` in the provided list of cities, and for each `keyword` in the provided list of keywords, performs a targeted search using the city-keyword pair.
        - Collects up to `crawl_limit` publications per combination (i.e., per city-keyword pair).
    - For each collected article:
        - Saves it to MongoDB using `ArticleHook`.
        - Prepares related media (e.g., article images and videos).
        - Processes the comments associated with the article: saves each comment using `CommentHook` and prepares media such as author profile pictures and comment images.
        - If the article is recent and the system allows it, triggers the creation of a new comment job using `prepare_comment_job`, allowing for future updates on the article's comments.
- Comment Job Creation:
    - If the article is recent and the operator is available, prepares a `new_article_comment_job` in MongoDB for future comment updates.
- Error Handling:
    - Raises `PublicationCrawlException` if any error occurs during the crawl process.
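The loop below is a minimal sketch of this workflow. `get_publications_from_search`, `ArticleHook`, `CommentHook`, `prepare_comment_job`, and `PublicationCrawlException` come from the description above; their exact signatures, along with the helpers `prepare_media` and `is_recent`, are assumptions for illustration.

```python
# Minimal sketch of the WebsiteOperator crawl loop; all signatures are assumed.
from urllib.parse import quote_plus


class PublicationCrawlException(Exception):
    """Raised if any error occurs during the crawl process."""


def prepare_media(item: dict) -> None:
    """Hypothetical helper: queue images/videos attached to `item`."""


def is_recent(article: dict) -> bool:
    """Hypothetical helper: does the article still warrant comment updates?"""
    return True


def crawl_articles(driver, article_hook, comment_hook, prepare_comment_job, params):
    cities = [quote_plus(c) for c in params["cities"]]      # URL-encode
    keywords = [quote_plus(k) for k in params["keywords"]]  # URL-encode
    try:
        for city in cities:
            for keyword in keywords:
                # Up to `crawl_limit` publications per city-keyword pair.
                publications = driver.get_publications_from_search(
                    city=city,
                    keyword=keyword,
                    since=params["last_date"],
                    limit=params["crawl_limit"],
                )
                for article in publications:
                    article_hook.save(article)          # persist to MongoDB
                    prepare_media(article)              # article images, videos
                    for comment in article.get("comments", []):
                        comment_hook.save(comment)      # persist to MongoDB
                        prepare_media(comment)          # avatars, comment images
                    if is_recent(article):
                        prepare_comment_job(article)    # schedule future updates
    except Exception as exc:
        raise PublicationCrawlException(str(exc)) from exc
```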
### 2. CommentOperator
Collects new comments for a previously crawled article. Note that, for now, no comments are available on the websites we currently handle.
Workflow:

- Extract Job Parameters:
    - Extracts fields such as `article_id`, `article_url`, `website_id`, `last_date`, and `crawl_limit`.
- Retrieve Comments:
    - Uses the driver's `get_comments_from_post` method to retrieve comments posted since the last date (see the sketch after this list).
- Process Comments:
    - For each comment:
        - Saves the comment to MongoDB using `CommentHook`.
        - Prepares media: author profile pictures and images related to the comment.
- Quota Management:
    - Updates API quotas if using an API-based tool.
- Logging and Error Handling:
    - Logs the number of comments crawled.
    - Raises an exception if any error occurs during crawling.
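Under the same assumptions, the comment-update step might look like the sketch below, which reuses `prepare_media` and `PublicationCrawlException` from the WebsiteOperator sketch above; the parameter names and hook signature are hypothetical.

```python
# Minimal sketch of the CommentOperator step; reuses `prepare_media` and
# `PublicationCrawlException` from the sketch above. Signatures are assumed.
import logging

logger = logging.getLogger(__name__)


def crawl_comments(driver, comment_hook, params):
    try:
        comments = driver.get_comments_from_post(
            article_id=params["article_id"],
            article_url=params["article_url"],
            since=params["last_date"],
            limit=params["crawl_limit"],
        )
        for comment in comments:
            comment_hook.save(comment)  # persist to MongoDB
            prepare_media(comment)      # author avatar, comment images
        logger.info("Crawled %d comments", len(comments))
    except Exception as exc:
        raise PublicationCrawlException(str(exc)) from exc
```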
### 3. ReviewOperator
Handles the collection of reviews for websites (e.g., schools or colleges) using the driver.
Workflow:

- Extract Job Parameters:
    - `website_id`, `last_date`, `crawl_limit`, `initial_crawl`, `type`, `nom`, `ville`, `secteur`, and an optional `fiche_url`.
- Retrieve Reviews:
    - Uses the driver's `get_reviews` method with parameters such as school type, name, city, sector, and date (see the sketch after this list).
    - Limits the number of reviews fetched according to `crawl_limit`.
- Process Reviews:
    - Saves each review to MongoDB via `ReviewHook`.
    - Supports both required fields (`author`, `comment`, `rating_overall`, `date`) and optional numeric ratings (`rating_ambiance_eleves`, `rating_enseignement_professeurs`, `rating_soutien_eleves`, `rating_exigence_scolaire`, `rating_sport_culture`).
- Logging and Error Handling:
    - Logs the number of reviews crawled and saved.
    - Raises exceptions in case of crawl failures.
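As a sketch of this step: `get_reviews`, `ReviewHook`, and the field names are taken from the description above, while the dictionary layout, parameter names, and hook signature are assumptions for illustration.

```python
# Minimal sketch of the ReviewOperator step; signatures are assumed.
import logging

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ("author", "comment", "rating_overall", "date")
OPTIONAL_RATINGS = (
    "rating_ambiance_eleves",
    "rating_enseignement_professeurs",
    "rating_soutien_eleves",
    "rating_exigence_scolaire",
    "rating_sport_culture",
)


def crawl_reviews(driver, review_hook, params):
    reviews = list(driver.get_reviews(
        type=params["type"],        # e.g. school or college
        nom=params["nom"],          # establishment name
        ville=params["ville"],      # city
        secteur=params["secteur"],  # sector
        since=params["last_date"],
    ))[: params["crawl_limit"]]     # respect the crawl limit
    for review in reviews:
        # Required fields first, then whichever optional ratings are present.
        doc = {f: review[f] for f in REQUIRED_FIELDS}
        doc.update({f: review[f] for f in OPTIONAL_RATINGS if f in review})
        review_hook.save(doc)       # persist to MongoDB
    logger.info("Crawled and saved %d reviews", len(reviews))
```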