Crawlserver Periodic Tasks¶
This section lists the crawlserver periodic tasks. These tasks run recurring maintenance jobs and are not part of the steps of the data collection pipeline.
check_conversation_activity_dag :¶
This DAG is launched every month and has 2 tasks:
- Task1: process_inactive_conversations:
  It checks whether the conversations marked inactive are still inactive, and updates any that are not to active.
- Task2: process_active_conversations:
  It checks whether the conversations marked active are still active, and updates any that are not to inactive.
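The two tasks can be read as two passes over the stored conversations, one per current status. The following is a minimal sketch of that idea, not the crawlserver implementation: the `Conversation` record, the `is_active_now` predicate, and the 30-day rule in the usage part are all hypothetical, since this page does not say how activity is determined.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List


@dataclass
class Conversation:
    # Hypothetical record shape; the real model is not described on this page.
    id: int
    active: bool
    last_message_at: datetime


def process_inactive_conversations(
    conversations: Iterable[Conversation],
    is_active_now: Callable[[Conversation], bool],
) -> List[int]:
    """Task1 sketch: mark as active the inactive conversations that are active again."""
    updated = []
    for conv in conversations:
        if not conv.active and is_active_now(conv):
            conv.active = True
            updated.append(conv.id)
    return updated


def process_active_conversations(
    conversations: Iterable[Conversation],
    is_active_now: Callable[[Conversation], bool],
) -> List[int]:
    """Task2 sketch: mark as inactive the active conversations that are no longer active."""
    updated = []
    for conv in conversations:
        if conv.active and not is_active_now(conv):
            conv.active = False
            updated.append(conv.id)
    return updated


# Usage with an assumed rule: "active" means a message in the last 30 days.
now = datetime(2024, 1, 31)


def recent(conv: Conversation) -> bool:
    return now - conv.last_message_at <= timedelta(days=30)


conversations = [
    Conversation(1, active=False, last_message_at=datetime(2024, 1, 20)),
    Conversation(2, active=True, last_message_at=datetime(2023, 10, 1)),
    Conversation(3, active=True, last_message_at=datetime(2024, 1, 25)),
]
reactivated = process_inactive_conversations(conversations, recent)
deactivated = process_active_conversations(conversations, recent)
```

Passing the activity rule in as a predicate keeps the sketch independent of whatever criterion the real tasks apply.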
check_channel_activity_dag :¶
This DAG is launched every month and has 2 tasks:
- Task1: process_inactive_channels:
  It checks whether the channels marked inactive are still inactive, and updates any that are not to active.
- Task2: process_active_channels:
  It checks whether the channels marked active are still active, and updates any that are not to inactive.
crawlserver_manage_blocked_jobs_dag :¶
This DAG is launched every day at 6 am. It checks whether any jobs have been blocked in the running state for more than 3 hours and, if so, updates them.
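The selection step described above can be sketched as a simple filter. This is an illustration under assumptions, not the DAG's actual code: the job dictionaries with `state` and `started_at` keys are hypothetical, and measuring the 3 hours from the job's start time is also an assumption. The sketch only selects the blocked jobs, because this page does not state which state they are updated to.

```python
from datetime import datetime, timedelta

# Threshold taken from the description above.
MAX_RUNNING = timedelta(hours=3)


def find_blocked_jobs(jobs, now):
    """Return the jobs that have been in the 'running' state for more than 3 hours.

    Each job is a dict with hypothetical keys 'id', 'state' and 'started_at'.
    """
    return [
        job
        for job in jobs
        if job["state"] == "running" and now - job["started_at"] > MAX_RUNNING
    ]


# Usage: a run at 6 am, matching the schedule described above.
now = datetime(2024, 1, 1, 6, 0)
jobs = [
    {"id": 1, "state": "running", "started_at": datetime(2024, 1, 1, 2, 0)},
    {"id": 2, "state": "running", "started_at": datetime(2024, 1, 1, 4, 30)},
    {"id": 3, "state": "done", "started_at": datetime(2023, 12, 31, 20, 0)},
]
blocked_ids = [job["id"] for job in find_blocked_jobs(jobs, now)]
```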
crawlserver_push_file_minio_dag :¶
This DAG is launched every 3 days at 2 am. It checks whether any files of type image or document from the last 2 weeks have not been pushed to Minio. For each such file, it tries to retrieve it from the local path airflow/file_storage/domains_attachements and upload it to Minio. If the upload is successful, it sets the field pushed to True and deletes the file from the local path.
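The flow above (select, retrieve locally, upload, then mark and delete only on success) can be sketched as follows. Everything except that ordering is an assumption made for illustration: the record dictionaries and their keys, the `push_pending_files` function name, and the `upload` callable, which stands in for whatever Minio client call the real DAG uses.

```python
import os
import tempfile
from datetime import datetime, timedelta

LOOKBACK = timedelta(weeks=2)            # "during the last 2 weeks"
PUSHABLE_TYPES = {"image", "document"}   # file types named above


def push_pending_files(records, local_dir, upload, now):
    """Upload recent, not-yet-pushed files and clean up after each success.

    records: dicts with hypothetical keys 'name', 'type', 'pushed', 'created_at'.
    upload:  callable(path, object_name) that raises on failure.
    """
    pushed = []
    for rec in records:
        if rec["pushed"] or rec["type"] not in PUSHABLE_TYPES:
            continue
        if now - rec["created_at"] > LOOKBACK:
            continue
        path = os.path.join(local_dir, rec["name"])
        if not os.path.exists(path):
            continue  # nothing to retrieve from the local path
        try:
            upload(path, rec["name"])
        except Exception:
            continue  # keep the record and the local file for a later run
        rec["pushed"] = True  # only after a successful upload
        os.remove(path)
        pushed.append(rec["name"])
    return pushed


# Usage with a fake uploader and a temporary directory as the local path.
now = datetime(2024, 1, 15, 2, 0)
uploaded = []


def fake_upload(path, object_name):
    uploaded.append(object_name)


with tempfile.TemporaryDirectory() as local_dir:
    for name in ("a.png", "c.png"):
        with open(os.path.join(local_dir, name), "wb") as f:
            f.write(b"data")
    records = [
        {"name": "a.png", "type": "image", "pushed": False,
         "created_at": now - timedelta(days=3)},
        {"name": "b.pdf", "type": "document", "pushed": False,
         "created_at": now - timedelta(days=1)},   # missing locally
        {"name": "c.png", "type": "image", "pushed": False,
         "created_at": now - timedelta(days=30)},  # older than 2 weeks
    ]
    pushed = push_pending_files(records, local_dir, fake_upload, now)
    remaining = sorted(os.listdir(local_dir))
```

Setting `pushed` and deleting the local copy only after the upload returns means a failed upload leaves the file in place, so a later run can retry it.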