Retrieve and add domains

Retrieve Domains DAG

The retrieve_domains_dag is responsible for retrieving the domains associated with a specific school or group of schools from the Webserver and storing them in the Crawlserver MongoDB database.

Parameters

The DAG accepts two parameters:

  • school_identifier (int): School ID for domain retrieval
  • group_school_identifier (int): Group school ID for domain retrieval

Tasks

The DAG has three PythonOperator tasks:

1. validate_params

  • Ensures that the input parameters are set correctly before domain retrieval proceeds. It checks that exactly one of school_identifier or group_school_identifier is provided, not both.

2. get_auth_token

  • Retrieves an authentication token from the Webserver, which is needed to access protected API endpoints, and stores it in XCom for use by the subsequent task.

3. retrieve_domains

  • Fetches domains from the Webserver and stores them in the domain MongoDB collection. It uses the access token obtained by the get_auth_token task to send a request to the get-domains-for-crawlserver API, passing either the school identifier or the group school identifier as the parameter. The task also checks whether each retrieved domain already exists in the collection and stores only those that do not.
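
The two checks described in the task list (exactly one identifier in validate_params, and storing only missing domains in retrieve_domains) can be sketched in plain Python. This is a minimal illustration, not the DAG's actual code: it assumes that an omitted parameter arrives as None and that domains are plain strings, and the helper name select_new_domains is hypothetical.

```python
from typing import Iterable, List, Optional


def validate_params(
    school_identifier: Optional[int],
    group_school_identifier: Optional[int],
) -> dict:
    """Return the single identifier to query with, or raise ValueError.

    Assumption: a parameter that was not provided is None.
    """
    has_school = school_identifier is not None
    has_group = group_school_identifier is not None
    # Equal flags mean both were provided or neither was; both are rejected.
    if has_school == has_group:
        raise ValueError(
            "Provide exactly one of school_identifier or group_school_identifier"
        )
    if has_school:
        return {"school_identifier": school_identifier}
    return {"group_school_identifier": group_school_identifier}


def select_new_domains(fetched: Iterable[str], existing: Iterable[str]) -> List[str]:
    """Keep only fetched domains that are not already stored, preserving order.

    Hypothetical helper: `existing` stands in for the domains already present
    in the MongoDB collection.
    """
    seen = set(existing)
    new_domains = []
    for domain in fetched:
        if domain not in seen:
            seen.add(domain)  # also drops duplicates within the fetched batch
            new_domains.append(domain)
    return new_domains
```

In a real task, the list returned by select_new_domains would be what gets written to the collection, while the existing domains would come from a query against it.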

Cleaning XCom

  • When the DAG run completes successfully, all XCom entries related to that run are deleted.