Execution date and time

Execution date

This section explains how the next execution date is calculated for Teams jobs.

To know how many days to wait before launching the jobs again for each task, we retrieve this information from the collection domain_setting in the Mongo database crawlserver. The picture below shows the parameters needed to configure the settings for a domain.

[Figure: settings]

The result of the Mongo request that retrieves the content of this collection is stored in Airflow as a model called Config.

Below is the explanation of each field:

_id: the object ID, generated automatically
domain_id: the ID of the Teams domain in the crawlserver database
contract_start_date: the date when the contract becomes effective, which marks the start of the data crawling process (datetime, format "%Y-%m-%d %H:%M:%S")
contract_end_date: the date when the contract ends, which marks the end of the data crawling process (datetime, format "%Y-%m-%d %H:%M:%S")
channel_message_initial_date: the starting date from which the system collects channel messages (datetime, format "%Y-%m-%d %H:%M:%S")
conversation_message_initial_date: the starting date from which the system collects conversation messages (datetime, format "%Y-%m-%d %H:%M:%S")
frequency_new_members: the interval, in days, at which we collect and update the list of members.
frequency_new_user_conversation: the interval, in days, at which we collect the new conversations of a user.
frequency_new_channel_message: the interval, in days, at which we collect the new messages in a channel.
frequency_new_conversation_message: the interval, in days, at which we collect the new messages in a conversation.
frequency_update_conversation_info: the interval, in days, at which we collect and update the information of conversations (name, members, descriptions, …).
nb_month_inactive_conversation: the number of months without activity after which a conversation is considered inactive.
nb_month_inactive_channel: the number of months without activity after which a channel is considered inactive.
nb_batch_channel: the number of channel tasks to launch in the same DAG run
nb_batch_member: the number of member tasks to launch in the same DAG run
nb_batch_conversation: the number of conversation tasks to launch in the same DAG run
nb_batch_conversation_info: the number of conversation_info tasks to launch in the same DAG run
nb_batch_conversation_message: the number of conversation message tasks to launch in the same DAG run
nb_batch_channel_message: the number of channel message tasks to launch in the same DAG run
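
As an illustration of the date format listed above, the following minimal sketch parses the date fields of a settings document. The helper name parse_config_dates and the sample values are hypothetical; the real Config model may store these fields as datetime objects already, which the sketch leaves untouched.

```python
from datetime import datetime

# Format documented for the date fields of domain_setting.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

# Date fields taken from the field descriptions above.
DATE_FIELDS = (
    "contract_start_date",
    "contract_end_date",
    "channel_message_initial_date",
    "conversation_message_initial_date",
)


def parse_config_dates(raw: dict) -> dict:
    """Return a copy of raw in which string date fields become datetime objects."""
    parsed = dict(raw)
    for field in DATE_FIELDS:
        value = parsed.get(field)
        # Values that are already datetime objects (or missing) are kept as-is.
        if isinstance(value, str):
            parsed[field] = datetime.strptime(value, DATE_FORMAT)
    return parsed


# Hypothetical sample values, for illustration only.
sample = {
    "contract_start_date": "2024-01-01 00:00:00",
    "contract_end_date": "2024-12-31 23:59:59",
    "frequency_new_members": 7,
}
config = parse_config_dates(sample)
```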

To calculate the next execution date for a job, we read the frequency value that corresponds to its task from the config and pass it to the function below:

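from datetime import timedelta

def calculate_next_datetime(current_date, task, days=1):
    # Channel jobs start at 22:20; every other task starts at 22:00.
    minutes = 20 if task == "channel" else 0
    ten_pm_next_date = current_date.replace(hour=22, minute=minutes, second=0, microsecond=0)

    # Shift by the task frequency, expressed in days.
    ten_pm_next_date += timedelta(days=days)
    return ten_pm_next_date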

The calculate_next_datetime function calculates the next execution date for the jobs. It takes as parameters the current date, the name of the task (member, messages, conversation, ...) and the number of days (the frequency) to add to the current date. It sets the time to 22:00 (22:20 for the channel task) and adds the required number of days, so that the recovery system is launched at night.

Note: jobs are launched in batches, so they are not all executed exactly at their execution date.
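
The lookup of the frequency and the call to calculate_next_datetime can be sketched as follows. The mapping FREQUENCY_FIELD_BY_TASK, its task names and the helper next_execution_date are assumptions made for illustration; only the frequency field names and calculate_next_datetime come from this page.

```python
from datetime import datetime, timedelta

# Assumed mapping from task name to frequency field (task names are hypothetical).
FREQUENCY_FIELD_BY_TASK = {
    "member": "frequency_new_members",
    "conversation": "frequency_new_user_conversation",
    "channel_message": "frequency_new_channel_message",
    "conversation_message": "frequency_new_conversation_message",
    "conversation_info": "frequency_update_conversation_info",
}


def calculate_next_datetime(current_date, task, days=1):
    # Same logic as the function documented on this page.
    minutes = 20 if task == "channel" else 0
    next_date = current_date.replace(hour=22, minute=minutes, second=0, microsecond=0)
    return next_date + timedelta(days=days)


def next_execution_date(config: dict, task: str, current_date: datetime) -> datetime:
    """Look up the frequency for task (falling back to 1 day) and compute the next date."""
    field = FREQUENCY_FIELD_BY_TASK.get(task)
    days = config.get(field, 1) if field else 1
    return calculate_next_datetime(current_date, task, days)


# Hypothetical config value: members are refreshed every 7 days.
config = {"frequency_new_members": 7}
member_next = next_execution_date(config, "member", datetime(2024, 3, 15, 22, 0, 0))
channel_next = next_execution_date(config, "channel", datetime(2024, 3, 15, 10, 5, 0))
```

With these inputs, member_next is 2024-03-22 22:00:00, and channel_next is 2024-03-16 22:20:00 because the channel task uses the 20-minute offset and the fallback frequency of 1 day.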

Example:

Task: message
Execution_date: 2024-03-15 22:00:00
frequency: 1
Number of message jobs to launch: 4000
Batch size for message (jobs per DAG run): 250

The scheduler launches the first batch (250 jobs) at 2024-03-15 22:00:00, then launches the next batch 15 minutes later, and so on. Some jobs are therefore launched several hours after their execution date: batch number 13, for example, is launched at 2024-03-16 01:00:00. Adding 1 day gives 2024-03-17 01:00:00, and because the recovery system must run at 22:00, 21 more hours are added, so the next execution date is 2024-03-17 22:00:00. The actual interval for this job is 1 day and 21 hours, which exceeds the configured frequency of 1 day.
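
The arithmetic of this example can be checked with a short sketch. The helper batch_launch_time is hypothetical and assumes that batches are numbered from 1 and launched exactly 15 minutes apart.

```python
from datetime import datetime, timedelta

BATCH_INTERVAL = timedelta(minutes=15)  # one batch per scheduler run


def batch_launch_time(first_launch: datetime, batch_number: int) -> datetime:
    """Launch time of a 1-based batch number, assuming a fixed 15-minute spacing."""
    return first_launch + (batch_number - 1) * BATCH_INTERVAL


first_launch = datetime(2024, 3, 15, 22, 0, 0)
total_jobs, batch_size = 4000, 250

# Ceiling division: number of batches needed to launch every job.
nb_batches = -(-total_jobs // batch_size)

launched_at = batch_launch_time(first_launch, 13)

# Same rule as calculate_next_datetime for a non-channel task with days=1.
next_run = launched_at.replace(hour=22, minute=0, second=0, microsecond=0) + timedelta(days=1)

# Real gap between the launch of this job and its next execution date.
actual_interval = next_run - launched_at
```

Here nb_batches is 16, launched_at is 2024-03-16 01:00:00, next_run is 2024-03-17 22:00:00, and actual_interval is 1 day and 21 hours.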

Execution Time

This part explains how the execution time of each task is calculated for each recovery system date.

As presented previously in the domain scheduler section, the data collection DAGs are launched through domain_scheduler_dag, and every 15 minutes a batch of jobs is executed for each task. Because the jobs do not all run in the same DAG run, the start and end dates of the crawl are not directly available. Instead, the execution date stored in each job is used as a reference: the execution times of the DAG runs are summed, grouped by execution date and task name. This information is saved with a Mongo request in the collection Teamsexecutiontime, which has the following fields:

domain_id: the ID of the Teams domain
task: the name of the task
date: the date of the recovery system
start_date: the start date of the crawl
end_date: the end date of the crawl
execution_time: the sum of the execution times of all the jobs for the same task and date
pipeline_duration: the duration of the pipeline, that is, the difference between the start date and the end date for the same task and date
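
The grouping described above can be sketched in plain Python. The function summarize_execution_times, the shape of the input records and the use of seconds as the unit are assumptions made for illustration; the actual Mongo request is not shown on this page.

```python
from collections import defaultdict
from datetime import datetime


def summarize_execution_times(domain_id, dag_runs):
    """Group DAG runs by (task, execution date) and derive the documented fields."""
    groups = defaultdict(list)
    for run in dag_runs:
        groups[(run["task"], run["execution_date"])].append(run)

    summaries = []
    for (task, date), runs in groups.items():
        start = min(r["start_date"] for r in runs)
        end = max(r["end_date"] for r in runs)
        summaries.append({
            "domain_id": domain_id,
            "task": task,
            "date": date,
            "start_date": start,
            "end_date": end,
            # Sum of the individual run durations, in seconds (assumed unit).
            "execution_time": sum(
                (r["end_date"] - r["start_date"]).total_seconds() for r in runs
            ),
            # Wall-clock span from the first start to the last end, in seconds.
            "pipeline_duration": (end - start).total_seconds(),
        })
    return summaries


# Hypothetical DAG runs: two batches of the same task and execution date.
execution_date = datetime(2024, 3, 15, 22, 0, 0)
runs = [
    {"task": "message", "execution_date": execution_date,
     "start_date": datetime(2024, 3, 15, 22, 0), "end_date": datetime(2024, 3, 15, 22, 10)},
    {"task": "message", "execution_date": execution_date,
     "start_date": datetime(2024, 3, 15, 22, 15), "end_date": datetime(2024, 3, 15, 22, 20)},
]
summary = summarize_execution_times("example-domain", runs)[0]
```

In this sample, execution_time is 900 seconds (10 minutes plus 5 minutes of actual work), while pipeline_duration is 1200 seconds because it also counts the idle gap between the two batches.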

Batches

Some tasks, such as conversation and message jobs, may have many pending jobs, so there is a risk of exceeding the number of requests allowed by the Graph API used to crawl the data. To avoid this, the tasks of a domain are launched in batches instead of all in the same DAG run. The batch size for each task is defined in the settings collection presented previously (the nb_batch_* fields).
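
A minimal sketch of this batching idea is shown below. The helper split_into_batches is hypothetical: how the scheduler actually selects pending jobs is not described here, and the sketch only shows how a list of pending jobs could be cut into groups of at most nb_batch_* items.

```python
def split_into_batches(jobs, batch_size):
    """Split jobs into consecutive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [jobs[i:i + batch_size] for i in range(0, len(jobs), batch_size)]


# Hypothetical pending jobs, using the figures from the example above.
pending_jobs = [f"message-job-{n}" for n in range(4000)]
batches = split_into_batches(pending_jobs, 250)
```

With 4000 pending jobs and a batch size of 250, this yields 16 batches, one for each DAG run.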