Execution date and time¶
Execution date¶
This section explains how the next execution date is calculated for Teams jobs.
To know how many days to wait before launching the jobs again for each task, we retrieve this information from the collection domain_setting in the Mongo database crawlserver. The picture below shows the parameters needed to configure the settings for a domain.

The result of the Mongo request that retrieves the content of this collection is saved in Airflow as a model called Config.
Below is the explanation of each field:
- _id: the object ID, generated automatically
- domain_id: the ID of the Teams domain in the crawlserver database
- contract_start_date: the date when the contract becomes effective, which marks the start of the data crawling process (datetime, format "%Y-%m-%d %H:%M:%S")
- contract_end_date: the date when the contract ends, which marks the end of the data crawling process (datetime, format "%Y-%m-%d %H:%M:%S")
- channel_message_initial_date: the starting date from which the system collects channel messages (datetime, format "%Y-%m-%d %H:%M:%S")
- conversation_message_initial_date: the starting date from which the system collects conversation messages (datetime, format "%Y-%m-%d %H:%M:%S")
- frequency_new_members: the interval, in days, at which we collect and update the list of members.
- frequency_new_user_conversation: the interval, in days, at which we collect the new conversations of a user.
- frequency_new_channel_message: the interval, in days, at which we collect the new messages in a channel.
- frequency_new_conversation_message: the interval, in days, at which we collect the new messages in a conversation.
- frequency_update_conversation_info: the interval, in days, at which we collect the information of conversations (name, members, descriptions,…) and update it.
- nb_month_inactive_conversation: the number of months without activity after which a conversation is considered inactive.
- nb_month_inactive_channel: the number of months without activity after which a channel is considered inactive.
- nb_batch_channel: the number of channel tasks to launch in the same DAG run
- nb_batch_member: the number of member tasks to launch in the same DAG run
- nb_batch_conversation: the number of conversation tasks to launch in the same DAG run
- nb_batch_conversation_info: the number of conversation_info tasks to launch in the same DAG run
- nb_batch_conversation_message: the number of conversation message tasks to launch in the same DAG run
- nb_batch_channel_message: the number of channel message tasks to launch in the same DAG run
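To make the shape of these settings more concrete, here is a minimal sketch of how a domain_setting document could be turned into a typed object. It is only an illustration: the actual Config model used in Airflow is not shown in this page, so the class layout, the helper names parse_date and config_from_document, the choice of a dataclass, and the sample values are all assumptions. Only a subset of the fields is included to keep the sketch short.

```python
from dataclasses import dataclass
from datetime import datetime

# Date format documented for the contract and initial-date fields.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class Config:
    """Hypothetical subset of a domain_setting document."""
    domain_id: str
    contract_start_date: datetime
    contract_end_date: datetime
    frequency_new_members: int
    nb_batch_member: int


def parse_date(value):
    """Return a datetime, parsing strings written in DATE_FORMAT."""
    if isinstance(value, datetime):
        return value
    return datetime.strptime(value, DATE_FORMAT)


def config_from_document(document):
    """Build a Config from a domain_setting document given as a dict."""
    return Config(
        domain_id=document["domain_id"],
        contract_start_date=parse_date(document["contract_start_date"]),
        contract_end_date=parse_date(document["contract_end_date"]),
        frequency_new_members=int(document["frequency_new_members"]),
        nb_batch_member=int(document["nb_batch_member"]),
    )


# Illustrative values only, not taken from a real domain.
sample_document = {
    "domain_id": "example-domain",
    "contract_start_date": "2024-01-01 00:00:00",
    "contract_end_date": "2024-12-31 23:59:59",
    "frequency_new_members": 7,
    "nb_batch_member": 100,
}

config = config_from_document(sample_document)
print(config.contract_start_date, config.frequency_new_members)
```

Parsing the dates once, when the document is loaded, means the rest of the code can compare and shift them as datetime objects without repeating the format string.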
To calculate the next execution date for each job, we read the frequency value for the task from the config and pass it to the function below:
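from datetime import timedelta

def calculate_next_datetime(current_date, task, days=1):
    # The channel task runs at 22:20; every other task runs at 22:00.
    minutes = 20 if task == "channel" else 0
    ten_pm_next_date = current_date.replace(hour=22, minute=minutes, second=0, microsecond=0)
    # Move forward by the task frequency, expressed in days.
    ten_pm_next_date += timedelta(days=days)
    return ten_pm_next_date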
The calculate_next_datetime function calculates the next execution date for the jobs. It takes as parameters the current date, the name of the task (member, messages, conversation,...) and the number of days (the frequency) to add to the current date to obtain the next execution date. It sets the time to 22:00 (22:20 for the channel task) and adds the required number of days, so that the recovery system is launched at night.
Note: The jobs are launched in batches, so they are not all executed at the execution date.
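As an illustration of this behaviour, the sketch below calls the function for two tasks. The function body repeats the one documented above so that the snippet runs on its own. The config dictionary, its values, and the choice of frequency_new_members as the frequency for the member task are assumptions made for the example.

```python
from datetime import datetime, timedelta


def calculate_next_datetime(current_date, task, days=1):
    # Same logic as the documented function: 22:20 for channel, 22:00 otherwise.
    minutes = 20 if task == "channel" else 0
    next_date = current_date.replace(hour=22, minute=minutes, second=0, microsecond=0)
    return next_date + timedelta(days=days)


# Hypothetical frequency value, standing in for the Config model.
config = {"frequency_new_members": 7}

current_date = datetime(2024, 3, 15, 14, 30, 0)

# Member task: 7 days later, at 22:00.
next_member = calculate_next_datetime(
    current_date, "member", days=config["frequency_new_members"]
)
print(next_member)  # 2024-03-22 22:00:00

# Channel task with the default frequency of 1 day: next day, at 22:20.
next_channel = calculate_next_datetime(current_date, "channel")
print(next_channel)  # 2024-03-16 22:20:00
```

Whatever the time of day in current_date, the result always falls on the night slot, because the hour and minute are overwritten before the days are added.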
Example:
- Task: message
- Execution_date: 2024-03-15 22:00:00
- frequency: 1
- Number of message jobs that need to be launched: 4000
- Number of message jobs per batch: 250
The scheduler starts collecting the first batch (250 jobs) at 2024-03-15 22:00:00, then launches the crawl for the next batch 15 minutes later, and so on. Some jobs are therefore launched several hours after their execution date. For example, batch number 13 is launched at 2024-03-16 01:00:00. Adding 1 day gives 2024-03-17 01:00:00, and because the recovery system must run at 22:00, 21 more hours are added, so the next execution date is 2024-03-17 22:00:00. For this job, the interval before the next execution is therefore longer than 1 day.
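The arithmetic of this example can be checked with a short sketch. It assumes that the function receives the actual launch time of the batch as current_date, as the example describes, and that batches are launched exactly 15 minutes apart. The helper batch_launch_time and the constant BATCH_INTERVAL are hypothetical names introduced for the illustration.

```python
import math
from datetime import datetime, timedelta


def calculate_next_datetime(current_date, task, days=1):
    # Same logic as the documented function.
    minutes = 20 if task == "channel" else 0
    next_date = current_date.replace(hour=22, minute=minutes, second=0, microsecond=0)
    return next_date + timedelta(days=days)


# Assumed delay between two consecutive batches.
BATCH_INTERVAL = timedelta(minutes=15)


def batch_launch_time(first_launch, batch_number):
    """Launch time of a batch, with batch numbers starting at 1."""
    return first_launch + (batch_number - 1) * BATCH_INTERVAL


execution_date = datetime(2024, 3, 15, 22, 0, 0)
total_jobs = 4000
jobs_per_batch = 250

number_of_batches = math.ceil(total_jobs / jobs_per_batch)
print(number_of_batches)  # 16

launch_13 = batch_launch_time(execution_date, 13)
print(launch_13)  # 2024-03-16 01:00:00

next_execution = calculate_next_datetime(launch_13, "message", days=1)
print(next_execution)  # 2024-03-17 22:00:00

# Time between the actual launch and the next execution.
print(next_execution - launch_13)  # 1 day, 21:00:00
```

Jobs launched before midnight (batches 1 to 8 here) get a next execution date of 2024-03-16 22:00:00, while jobs launched after midnight are pushed to 2024-03-17 22:00:00, which is where the extra delay comes from.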
Execution Time¶
This part explains how the execution time of each task is calculated for each recovery system date.
As presented previously in the domain scheduler, the data collection DAGs are launched through the domain_scheduler_dag, and every 15 minutes a batch of jobs is executed for each task. We therefore do not directly have the start date and the end date of the crawl, because the jobs are not all executed in the same DAG run. Instead, we use the execution date of each job as a reference, and we sum the execution time of the DAG runs grouped by execution date and task name. The information related to the execution time is saved with a Mongo request in the collection Teamsexecutiontime. This collection has the following fields:
- domain_id: the ID of the Teams domain
- task: the name of the task
- date: the date of the recovery system
- start_date: the start date of the crawl
- end_date: the end date of the crawl
- execution_time: the sum of the execution time for all the jobs for the same task and date
- pipeline_duration: the duration of the pipeline, that is, the difference between the start date and the end date for the same task and date
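The grouping described above can be sketched in plain Python as follows. This is not the production code: the function name summarize_execution_time, the shape of the input records, and the choice of seconds as the unit for execution_time and pipeline_duration are assumptions, and the Mongo request that saves the result is left out.

```python
from collections import defaultdict
from datetime import datetime


def summarize_execution_time(dag_runs):
    """Group DAG runs by (domain, task, execution date) and summarize them."""
    groups = defaultdict(list)
    for run in dag_runs:
        key = (run["domain_id"], run["task"], run["execution_date"])
        groups[key].append(run)

    summaries = []
    for (domain_id, task, date), runs in groups.items():
        start_date = min(run["start_date"] for run in runs)
        end_date = max(run["end_date"] for run in runs)
        # Sum of the individual run durations, in seconds (assumed unit).
        execution_time = sum(
            (run["end_date"] - run["start_date"]).total_seconds() for run in runs
        )
        summaries.append({
            "domain_id": domain_id,
            "task": task,
            "date": date,
            "start_date": start_date,
            "end_date": end_date,
            "execution_time": execution_time,
            # Wall-clock span from the first start to the last end, in seconds.
            "pipeline_duration": (end_date - start_date).total_seconds(),
        })
    return summaries


# Illustrative runs: two batches of the same task and execution date.
sample_runs = [
    {"domain_id": "example-domain", "task": "message",
     "execution_date": datetime(2024, 3, 15, 22, 0),
     "start_date": datetime(2024, 3, 15, 22, 0),
     "end_date": datetime(2024, 3, 15, 22, 5)},
    {"domain_id": "example-domain", "task": "message",
     "execution_date": datetime(2024, 3, 15, 22, 0),
     "start_date": datetime(2024, 3, 15, 22, 15),
     "end_date": datetime(2024, 3, 15, 22, 25)},
]

summaries = summarize_execution_time(sample_runs)
print(summaries[0]["execution_time"], summaries[0]["pipeline_duration"])  # 900.0 1500.0
```

In this sample the two values differ because pipeline_duration also counts the idle time between batches, whereas execution_time only adds up the time spent inside each run.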
Batches¶
Some tasks, such as conversation jobs and message jobs, may have a lot of pending jobs, so there is a risk of exceeding the number of requests allowed by the Graph API used to crawl the data. To solve this issue, we launch the tasks of a domain in batches instead of launching them all in the same DAG run. The number of tasks per batch is defined in the domain settings collection (domain_setting), as presented previously.
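A minimal sketch of this batching idea is shown below. It only illustrates how a list of pending jobs can be cut into groups of a fixed size; the function name split_into_batches and the job identifiers are hypothetical, and the way the scheduler actually selects pending jobs is not described in this page.

```python
def split_into_batches(pending_jobs, batch_size):
    """Split a list of pending jobs into consecutive batches of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [
        pending_jobs[start:start + batch_size]
        for start in range(0, len(pending_jobs), batch_size)
    ]


# Illustrative job identifiers; 250 stands in for a value such as
# nb_batch_channel_message read from the domain settings.
pending_jobs = [f"job-{index}" for index in range(4000)]
batches = split_into_batches(pending_jobs, 250)

print(len(batches))      # 16
print(len(batches[0]))   # 250
print(batches[-1][-1])   # job-3999
```

With one batch launched per scheduler run, the number of requests sent in a single DAG run is bounded by the batch size rather than by the total number of pending jobs.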