Crawl social media profiles

Social Media Driver plugin

The Driver plugin is responsible for managing drivers and establishing the crawl process for social media platforms.

Directory Structure

.
├── driver_plugin
│   ├── manage
│   │   └── driver_manage.py
│   │
│   ├── operators
│   │   └── social_media_profiles
│   │       ├── newcomment.py
│   │       ├── friend.py
│   │       ├── profile.py
│   │       ├── profile_replies.py
│   │       ├── publication.py
│   │       └── stories.py

Operators

Operators represent tasks in a DAG, encapsulating the logic for each step of the workflow.

1. ProfileOperator

Responsible for collecting and updating profile information.

Workflow:

  • Check Pseudo Existence:
    • Verifies if the profile exists using the driver's pseudo_exists method.
    • Updates API quota if using an API-based tool.
    • If the profile exists:
      • Checks if it's private and updates the profile document accordingly.
      • Toggles associated jobs (enables/disables) based on existence.
    • If the profile doesn't exist:
      • Deletes related jobs.

  • Crawl Profile:
    • Retrieves user info using the driver's get_user_info method.
    • Updates the profile in MongoDB with details like description, name, followers, and followings.
    • Prepares media (cover and profile pictures) for the profile.
    • Updates API quota if using an API-based tool.

  • Error Handling:
    • Raises ProfileCrawlException or ToggleProfileJobError if errors occur during crawling or job toggling.
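
The two phases above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: FakeDriver, run_profile_job, and the in-memory dicts standing in for MongoDB and the job store are hypothetical, and the argument lists of pseudo_exists and get_user_info are assumed. Only the method and exception names come from the description.

```python
from dataclasses import dataclass, field


class ProfileCrawlException(Exception):
    """Raised when crawling a profile fails (name taken from the text above)."""


@dataclass
class FakeDriver:
    """Hypothetical stand-in for a platform driver."""
    known: dict = field(default_factory=dict)

    def pseudo_exists(self, pseudo: str) -> bool:
        return pseudo in self.known

    def get_user_info(self, pseudo: str) -> dict:
        return self.known[pseudo]


def run_profile_job(driver, profiles: dict, jobs: dict, pseudo: str) -> str:
    """Run the existence check, then the crawl; return a short status string."""
    try:
        if not driver.pseudo_exists(pseudo):
            jobs.pop(pseudo, None)  # profile is gone: delete its related jobs
            return "deleted"

        info = driver.get_user_info(pseudo)
        # A plain dict stands in for the MongoDB profile document.
        profiles[pseudo] = {
            "name": info.get("name"),
            "description": info.get("description"),
            "followers": info.get("followers"),
            "followings": info.get("followings"),
            "private": bool(info.get("private")),
        }
        jobs[pseudo] = {"enabled": True}  # assumption: an existing profile enables its jobs
        return "crawled"
    except Exception as exc:
        raise ProfileCrawlException(f"profile crawl failed for {pseudo}") from exc
```

Calling run_profile_job with a pseudo the driver does not know removes that pseudo's entry from jobs and returns "deleted"; any unexpected failure surfaces as ProfileCrawlException with the original error chained.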

2. FriendOperator

Collects followers and followings for a given profile.

Workflow:

  • Extracts job parameters like pseudo, profile_id, and crawl limits (nb_followers, nb_followings).
  • Uses the driver's get_followers_followings method to retrieve followers and followings.
  • For each follower/following:
    • Inserts the friend into MongoDB using FriendHook.
    • Prepares media (profile picture) for the friend.
  • Logs the number of followers and followings crawled.
  • Raises FriendCrawlException if an error occurs during crawling.
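
As an illustration of this flow, the sketch below assumes get_followers_followings takes the pseudo plus the two limits and returns a pair of lists; that signature, the FakeDriver class, and the plain list used in place of FriendHook are all hypothetical.

```python
import logging

logger = logging.getLogger("friend_sketch")


class FriendCrawlException(Exception):
    """Raised when the friend crawl fails (name taken from the text above)."""


def crawl_friends(driver, friends_store, pseudo, profile_id, nb_followers, nb_followings):
    """Fetch followers and followings, then store one record per friend."""
    try:
        # Assumed return shape: (list of followers, list of followings).
        followers, followings = driver.get_followers_followings(
            pseudo, nb_followers, nb_followings
        )
    except Exception as exc:
        raise FriendCrawlException(f"friend crawl failed for {pseudo}") from exc

    for relation, friends in (("follower", followers), ("following", followings)):
        for friend in friends:
            # A plain list stands in for the FriendHook insert into MongoDB.
            friends_store.append(
                {"profile_id": profile_id, "relation": relation, **friend}
            )

    logger.info("%d followers, %d followings crawled", len(followers), len(followings))
    return len(followers), len(followings)


class FakeDriver:
    """Hypothetical driver returning canned data, capped at the crawl limits."""

    def get_followers_followings(self, pseudo, nb_followers, nb_followings):
        followers = [{"pseudo": f"fan{i}"} for i in range(5)]
        followings = [{"pseudo": f"idol{i}"} for i in range(5)]
        return followers[:nb_followers], followings[:nb_followings]
```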

3. PublicationOperator

Collects posts and associated comments for a given profile.

Workflow:

  • Extracts job parameters like pseudo, profile_id, last_date, and crawl limits (nb_publications, nb_comments).
  • Retrieves publications and comments using the driver's get_publications_from_search method.
  • For each publication:
    • Saves the publication to MongoDB using PublicationHook.
    • Prepares media (images and videos) for the publication.
    • Processes associated comments using CommentHook, saving them to MongoDB and preparing media (author picture, comment images).
    • Prepares a new_comment_job for recent publications.
  • Updates API quota if using an API-based tool.
  • Enforces a time limit (16,000 seconds) to prevent excessive runtime.
  • Logs the number of publications crawled.
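
One way the time limit might be enforced is to check elapsed time before handling each publication. The sketch below is an assumption about the mechanism, not the operator's actual code: save_publications, the save callbacks (standing in for PublicationHook and CommentHook), and the injectable clock are hypothetical names; only the 16,000-second value comes from the text.

```python
import time

TIME_LIMIT_SECONDS = 16_000  # limit quoted in the text above


def save_publications(publications, save_publication, save_comment,
                      time_limit=TIME_LIMIT_SECONDS, clock=time.monotonic):
    """Save publications and their comments until the time budget runs out."""
    started = clock()
    saved = 0
    for publication in publications:
        if clock() - started > time_limit:
            break  # stop cleanly; everything saved so far is kept
        save_publication(publication)
        for comment in publication.get("comments", []):
            save_comment(comment)
        saved += 1
    return saved
```

Passing the clock as a parameter keeps the loop testable without waiting: a fake clock that advances one unit per call can exercise the cutoff with a small time_limit.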

4. StoriesOperator

Collects stories for a given profile.

Workflow:

  • Extracts job parameters like pseudo and profile_id.
  • Retrieves stories using the driver's get_stories method.
  • For each story:
    • Marks the story with the profile's pseudo as the author.
    • Saves the story to MongoDB using PublicationHook.
    • Prepares media (images and videos) for the story.
  • Logs errors if crawling fails.
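
A rough sketch of this workflow follows. The get_stories argument list, the save_story callback standing in for PublicationHook, and the choice to return 0 after logging a failure are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("stories_sketch")


def crawl_stories(driver, save_story, pseudo, profile_id):
    """Fetch stories, tag each with its author, and hand it to save_story."""
    try:
        stories = driver.get_stories(pseudo)
    except Exception:
        # The text says errors are logged; returning 0 rather than re-raising is an assumption.
        logger.exception("story crawl failed for %s", pseudo)
        return 0

    for story in stories:
        story["author"] = pseudo  # the profile is the author of its own stories
        story["profile_id"] = profile_id
        save_story(story)  # stand-in for the PublicationHook save
    return len(stories)
```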

5. CommentOperator

Collects new comments for a given post.

Workflow:

  • Extracts job parameters like pseudo, profile_id, publication_id, publication_url, last_date, and crawl_limit.
  • Retrieves comments using the driver's get_comments_from_post method, filtering by the since timestamp.
  • For each comment:
    • Saves the comment to MongoDB using CommentHook.
    • Prepares media (author picture, comment images) for the comment.
  • Updates API quota if using an API-based tool.
  • Logs the number of comments crawled.
  • Raises an exception if crawling fails.
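
The filtering itself happens inside the driver, so the following is only a sketch of what a since filter combined with a crawl limit could look like. The created_at field name and the select_new_comments helper are hypothetical.

```python
from datetime import datetime, timezone


def select_new_comments(comments, since, crawl_limit):
    """Keep comments newer than `since`, oldest first, up to `crawl_limit`."""
    fresh = [c for c in comments if c["created_at"] > since]
    fresh.sort(key=lambda c: c["created_at"])
    return fresh[:crawl_limit]
```

Sorting oldest first before applying the limit means a run that hits crawl_limit leaves only newer comments behind, which a later run with an updated since value can pick up.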

6. ProfileRepliesOperator

Collects comments made by a profile on various publications.

Workflow:

  • Extracts job parameters like pseudo, profile_id, last_date, and crawl_limit.
  • Retrieves profile comments using the driver's get_profiles_replies method, filtering by the since timestamp.
  • For each comment:
    • Associates the comment with the publication URL and marks the author as the profile's pseudo.
    • Saves the comment to MongoDB in the profile_reply collection, skipping duplicates.
    • Prepares media (images and videos) for the comment.
  • Updates API quota if using an API-based tool.
  • Enforces a time limit (16,000 seconds) to prevent excessive runtime.
  • Logs the number of profile replies crawled.
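
Duplicate skipping can be illustrated with an in-memory dict in place of the profile_reply collection. The uniqueness key used here (publication URL plus a comment_id field) is an assumption, as is the save_profile_replies helper; the real collection may identify duplicates differently.

```python
def save_profile_replies(replies, collection, pseudo):
    """Insert replies into `collection`, skipping ones already stored."""
    inserted = 0
    for reply in replies:
        key = (reply["publication_url"], reply["comment_id"])  # assumed uniqueness key
        if key in collection:
            continue  # duplicate: already saved by an earlier run
        collection[key] = {**reply, "author": pseudo}  # the profile authored the reply
        inserted += 1
    return inserted
```

Running the same batch twice inserts nothing the second time, which is the property that makes repeated crawls of the same profile safe.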