BaseSocialOperator

Overview

The BaseSocialOperator is a foundational Airflow operator designed for crawling and processing social media data.
It abstracts common tasks such as driver setup, job execution management, and cleanup logic.
This operator standardizes workflows across platforms and ensures that authentication, job state updates, and error handling are consistently managed.

It works in conjunction with the BaseDAGSocialMediaProfiles DAG, which handles DAG-level operations such as context logging and XCom cleanup.


Workflow

The BaseSocialOperator typically follows this process:

  1. Extract parameters and context from Airflow's execution context.
  2. Initialize the web driver (e.g., Selenium) and authentication credentials.
  3. Update job state to "running".
  4. Execute job-specific crawling (via the crawl(job) method).
  5. Update job state to "pending" (ready for future re-execution) if successful, or "failed" upon error.
  6. Retrieve and update cookies if authentication was used.
  7. Close the driver session and perform cleanup.
  8. Finalize the task by cleaning up DAG mappings and writing logs.
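
The ordering of these steps can be illustrated with a minimal sketch. SketchSocialOperator and DemoOperator are hypothetical stand-ins, not the real classes, and the try/finally placement is an assumption about how cleanup is guaranteed. Each step only records its name so that the sequence is visible.

```python
from abc import ABC, abstractmethod


class SketchSocialOperator(ABC):
    """Hypothetical stand-in for BaseSocialOperator; records step order only."""

    def __init__(self):
        self.calls = []

    def execute(self, context):
        params = context["params"]                       # step 1
        self.calls.append("initialize_driver")           # step 2
        try:
            for job in params.get("jobs", []):
                self.calls.append(f"running:{job}")      # step 3
                try:
                    self.crawl(job)                      # step 4
                except Exception:
                    self.calls.append(f"failed:{job}")   # step 5 (error)
                else:
                    self.calls.append(f"pending:{job}")  # step 5 (success)
        finally:
            # Step 6 applies only when authentication was used (not modelled here).
            self.calls.append("update_cookies")          # step 6
            self.calls.append("close_driver")            # step 7
            self.calls.append("finish_task")             # step 8

    @abstractmethod
    def crawl(self, job):
        """Platform-specific crawling, supplied by subclasses."""


class DemoOperator(SketchSocialOperator):
    def crawl(self, job):
        if job == "bad":
            raise RuntimeError("crawl failed")


op = DemoOperator()
op.execute({"params": {"jobs": ["ok", "bad"]}})
print(op.calls)
```

A failing job is marked "failed" without stopping the loop, and the cleanup steps still run afterwards.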

Key Components

1. Parameter Parsing

  • Parameters are accessed through context["params"] and may include:
      • jobs: The list of crawl jobs to process.
      • media_id, used_tool, platform, etc.
  • The operator uses this metadata to determine the logic for crawling, authentication, and job processing.
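
As a rough sketch of this step, the parameters can be read into a small typed container. The CrawlParams class, its defaults, and the sample values are assumptions for illustration; only the key names (jobs, media_id, used_tool, platform) come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CrawlParams:
    """Hypothetical container for the values found in context["params"]."""

    jobs: list = field(default_factory=list)
    media_id: Any = None
    used_tool: Any = None
    platform: Any = None

    @classmethod
    def from_context(cls, context):
        # Missing keys fall back to None (or an empty job list).
        params = context.get("params") or {}
        return cls(
            jobs=list(params.get("jobs") or []),
            media_id=params.get("media_id"),
            used_tool=params.get("used_tool"),
            platform=params.get("platform"),
        )


# Sample values are made up for the example.
context = {"params": {"jobs": [{"id": 1}], "platform": "example", "used_tool": "selenium"}}
parsed = CrawlParams.from_context(context)
print(parsed.platform, len(parsed.jobs))
```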

2. Driver Initialization

The _initialize_driver() method handles:

  • Choosing between API-based or Selenium-based crawling.
  • Authenticating via the DAuthenticatorHook, which retrieves account details, cookies, and credentials.
  • Creating a driver session via DriverManage if Selenium is required.

If any step fails, a DriverInitException is raised to prevent faulty execution.
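
A minimal sketch of this behavior, assuming the tool choice is carried in a used_tool value of "selenium" or "api". The authenticate and create_driver callables are hypothetical stand-ins for DAuthenticatorHook and DriverManage, whose real interfaces are not shown in this document, and DriverInitException is redefined locally so the example is self-contained.

```python
class DriverInitException(Exception):
    """Local stand-in for the exception named in the text."""


def initialize_driver(used_tool, authenticate, create_driver):
    """Return (account, driver); driver is None for API-based crawling.

    `authenticate` and `create_driver` are hypothetical callables standing
    in for DAuthenticatorHook and DriverManage.
    """
    try:
        account = authenticate()
        driver = create_driver(account) if used_tool == "selenium" else None
    except Exception as exc:
        # Wrap any setup failure so callers handle a single exception type.
        raise DriverInitException(f"driver setup failed: {exc}") from exc
    return account, driver


account, driver = initialize_driver(
    "api",
    authenticate=lambda: {"username": "demo", "cookies": []},
    create_driver=lambda account: object(),
)
print(driver)  # None, because API-based crawling needs no browser session
```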


3. Job Processing

The _process_jobs() method is responsible for:

  • Iterating through each job.
  • Setting job state to "running" before execution.
  • Calling the crawl(job) method (to be implemented in subclasses).
  • On success:
      • Sets job state to "pending" for future re-execution.
  • On failure:
      • Logs the error.
      • Marks job state as "failed" for monitoring.
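
The state transitions above can be sketched as a plain loop. The set_state callback is a hypothetical placeholder; how the real operator persists job state is not described here.

```python
import logging

logger = logging.getLogger(__name__)


def process_jobs(jobs, crawl, set_state):
    """Run crawl(job) for each job and record its state transitions.

    `set_state(job, state)` is a hypothetical callback used for illustration.
    """
    for job in jobs:
        set_state(job, "running")
        try:
            crawl(job)
        except Exception:
            # Log with traceback, mark the job, and continue with the next one.
            logger.exception("crawl failed for job %r", job)
            set_state(job, "failed")
        else:
            set_state(job, "pending")


history = []


def crawl(job):
    if job == "job-2":
        raise RuntimeError("blocked")


process_jobs(["job-1", "job-2"], crawl, lambda job, state: history.append((job, state)))
print(history)
```

Catching the exception per job means one failed crawl does not prevent the remaining jobs from running.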

4. Finalization and Cleanup

After jobs are processed:

  • If login was performed:
      • Cookies are retrieved and optionally updated.
  • The _close() method is called to terminate the driver session safely.
  • The finish_task() method:
      • Updates stored cookies (if needed).
      • Deletes DAG run mappings from the authentication system.
      • Writes final logs.
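
One way to sketch this sequence is shown below. The finalize function and the finish_task callback signature are assumptions. The driver is any object exposing get_cookies() and quit(), the method names Selenium's WebDriver uses, and FakeDriver stands in for it so the example runs without a browser.

```python
def finalize(driver, logged_in, finish_task):
    """Collect cookies if a login happened, always close the driver, then finish.

    `finish_task` is a hypothetical callback that receives the cookies (or None).
    """
    cookies = None
    try:
        if logged_in and driver is not None:
            cookies = driver.get_cookies()
    finally:
        # Close the session even if reading cookies raised.
        if driver is not None:
            driver.quit()
    finish_task(cookies)
    return cookies


class FakeDriver:
    """Test double exposing the two WebDriver methods used above."""

    def __init__(self):
        self.closed = False

    def get_cookies(self):
        return [{"name": "session", "value": "abc"}]

    def quit(self):
        self.closed = True


driver = FakeDriver()
received = []
finalize(driver, logged_in=True, finish_task=received.append)
print(driver.closed, received)
```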

Integration with Base DAG

The BaseDAGSocialMediaProfiles class provides DAG-level utilities for social crawling pipelines:

  • Logs all parameters passed into the DAG.
  • Pushes account information to XCom using the DAG ID as the key.
  • Helps with debugging and tracing.
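
A hedged sketch of what such a utility might look like follows. The function name log_and_push is hypothetical. The context keys "ti" and "dag" and the xcom_push(key, value) call follow Airflow's task context and TaskInstance API, while FakeTI and the sample values are stand-ins so the example runs without Airflow.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def log_and_push(context, account):
    """Log each DAG parameter, then push account info to XCom under the DAG ID."""
    for name, value in context["params"].items():
        logger.info("param %s=%r", name, value)
    context["ti"].xcom_push(key=context["dag"].dag_id, value=account)


class FakeTI:
    """Stand-in for Airflow's TaskInstance; stores pushes in a dict."""

    def __init__(self):
        self.pushed = {}

    def xcom_push(self, key, value):
        self.pushed[key] = value


ti = FakeTI()
context = {
    "params": {"platform": "example"},
    "ti": ti,
    "dag": SimpleNamespace(dag_id="social_profiles"),  # made-up DAG ID
}
log_and_push(context, {"username": "demo"})
print(ti.pushed)
```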

cleanup_xcom(context)

  • Cleans up old XCom entries associated with the DAG.
  • Ensures clean runs for each DAG execution cycle.
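
The filtering idea can be shown with an in-memory stand-in. This is not the real cleanup_xcom(context) implementation: the extra store argument, a list of dicts with a "dag_id" field in place of Airflow's XCom table, is an assumption, and the sketch removes every entry for the DAG rather than guessing at what counts as "old".

```python
from types import SimpleNamespace


def cleanup_xcom_sketch(context, store):
    """Remove every entry in `store` that belongs to the current DAG.

    `store` is a hypothetical list of dicts standing in for the XCom table.
    Returns the number of removed entries.
    """
    dag_id = context["dag"].dag_id
    before = len(store)
    # Mutate in place so callers holding a reference see the cleaned list.
    store[:] = [row for row in store if row["dag_id"] != dag_id]
    return before - len(store)


store = [
    {"dag_id": "social_profiles", "key": "social_profiles", "value": {"username": "demo"}},
    {"dag_id": "other_dag", "key": "result", "value": 1},
]
context = {"dag": SimpleNamespace(dag_id="social_profiles")}  # made-up DAG ID
removed = cleanup_xcom_sketch(context, store)
print(removed, [row["dag_id"] for row in store])
```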