任务调度神器airflow怎么用呢

发布时间：2021-12-09 09:21:35 作者：柒染
来源：亿速云阅读：233

# 任务调度神器Airflow怎么用呢

## 一、什么是Airflow

Apache Airflow 是一个由Python编写的开源工作流管理平台，最初由Airbnb开发并于2016年开源。它通过**有向无环图(DAG)**的方式定义任务依赖关系，提供可视化界面监控任务执行状态，是现代数据工程领域的核心工具之一。

### 核心特性
1. **可视化工作流**：Web UI直观展示任务依赖和执行状态
2. **灵活调度**：支持基于时间、外部事件等多种触发方式
3. **可扩展性**：丰富的Operator和Hook支持各类系统集成
4. **编程式定义**：所有工作流通过Python代码定义，便于版本控制

## 二、核心概念解析

### 1. DAG (Directed Acyclic Graph)
```python
from airflow import DAG
from datetime import datetime

dag = DAG(
    'my_first_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily'
)

2. Operator

常见类型： - BashOperator：执行Shell命令 - PythonOperator：执行Python函数 - EmailOperator：发送邮件通知 - 数据库相关：PostgresOperator, MySqlOperator等

3. Task

from airflow.operators.bash import BashOperator

task1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)

4. Task Relationships

task1 >> task2  # 设置task2依赖task1
# 等价于
task2.set_upstream(task1)

三、完整安装指南

1. 基础环境准备

# Python 3.8+环境
sudo apt-get install python3 python3-pip

# 推荐使用虚拟环境
python3 -m venv airflow_env
source airflow_env/bin/activate

2. 安装Airflow

pip install "apache-airflow==2.6.1" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.1/constraints-3.8.txt"

3. 初始化数据库

airflow db init

4. 创建管理员用户

airflow users create \
    --username admin \
    --firstname Peter \
    --lastname Parker \
    --role Admin \
    --email spiderman@example.com

5. 启动服务

airflow webserver -p 8080
airflow scheduler

四、实战案例：电商数据处理流水线

案例背景

构建每日执行的电商数据处理流程： 1. 下载前日订单数据 2. 清洗转换数据 3. 加载到数据仓库 4. 生成销售报表 5. 发送通知邮件

完整DAG实现

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

default_args = {
    'owner': 'ecom_team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5)
}

def process_data(**context):
    import pandas as pd
    execution_date = context['execution_date']
    # 数据清洗逻辑
    print(f"Processing data for {execution_date}")

with DAG(
    'ecommerce_pipeline',
    default_args=default_args,
    start_date=datetime(2023, 6, 1),
    schedule_interval='0 3 * * *',  # 每天凌晨3点
    catchup=False
) as dag:
    
    download = BashOperator(
        task_id='download_orders',
        bash_command='wget https://example.com/orders/{{ ds }}.csv -O /tmp/orders.csv'
    )
    
    process = PythonOperator(
        task_id='process_data',
        python_callable=process_data,
        provide_context=True
    )
    
    load = BashOperator(
        task_id='load_to_dw',
        bash_command='psql -c "COPY orders FROM \'/tmp/orders.csv\' CSV HEADER"'
    )
    
    report = BashOperator(
        task_id='generate_report',
        bash_command='python /scripts/generate_report.py {{ ds }}'
    )
    
    notify = EmailOperator(
        task_id='send_email',
        to='team@example.com',
        subject='Daily Report {{ ds }}',
        html_content="<h1>Daily ETL Completed</h1>"
    )
    
    download >> process >> load >> [report, notify]

五、高级功能详解

1. 变量与连接管理

from airflow.models import Variable
from airflow.hooks.base import BaseHook

# 使用变量
api_key = Variable.get("MY_API_KEY")

# 获取数据库连接
conn = BaseHook.get_connection("postgres_default")
conn_string = f"postgresql://{conn.login}:{conn.password}@{conn.host}:{conn.port}/{conn.schema}"

2. XCom跨任务通信

# 生产者任务
def push_data(**context):
    context['ti'].xcom_push(key='my_key', value='secret_data')

# 消费者任务
def pull_data(**context):
    value = context['ti'].xcom_pull(key='my_key')

3. 动态DAG生成

for region in ['north', 'south', 'east', 'west']:
    dag_id = f'process_{region}_data'
    
    with DAG(dag_id, schedule_interval='@daily') as region_dag:
        start = DummyOperator(task_id='start')
        process = PythonOperator(
            task_id=f'process_{region}',
            python_callable=process_region,
            op_kwargs={'region': region}
        )
        start >> process

4. 自定义Operator

from airflow.models import BaseOperator

class MyCustomOperator(BaseOperator):
    def __init__(self, my_param, **kwargs):
        super().__init__(**kwargs)
        self.my_param = my_param
    
    def execute(self, context):
        print(f"Running with param: {self.my_param}")

六、最佳实践与常见问题

1. 性能优化建议

任务粒度：保持每个任务执行时间在1分钟到1小时之间
资源控制：合理设置pool和priority_weight
数据库优化：对MySQL/PostgreSQL进行适当配置
并行度：调整scheduler的parallelism参数

2. 常见错误排查

# 查看任务日志
airflow tasks log my_dag my_task 2023-07-01

# 测试单个任务
airflow tasks test my_dag my_task 2023-07-01

3. 安全配置要点

启用RBAC（基于角色的访问控制）
加密敏感变量
使用Fernet密钥加密数据库连接
定期审计用户权限

七、生产环境部署方案

1. 执行器选择对比

执行器类型	适用场景	优缺点
SequentialExecutor	开发测试	简单但无法并行
LocalExecutor	小型生产环境	支持并行但单点故障
CeleryExecutor	中大型生产环境	分布式但需要维护Redis
KubernetesExecutor	云原生环境	弹性伸缩但配置复杂

2. 高可用架构示例

                   +-----------------+
                   |  Load Balancer  |
                   +--------+--------+
                            |
           +----------------+----------------+
           |                                 |
+----------+---------+           +----------+---------+
|  Web Server 1      |           |  Web Server 2      |
+----------+---------+           +----------+---------+
           |                                 |
+----------+---------+           +----------+---------+
|  Scheduler 1       |           |  Scheduler 2       |
+--------------------+           +--------------------+
           |                                 |
           +----------------+----------------+
                            |
                   +--------+--------+
                   |  PostgreSQL HA |
                   +--------+--------+
                            |
                   +--------+--------+
                   |  Redis Cluster |
                   +-----------------+

八、生态整合与扩展

1. 常用Provider包

pip install apache-airflow-providers-google
pip install apache-airflow-providers-amazon
pip install apache-airflow-providers-microsoft-azure

2. 与大数据工具集成

Spark：使用SparkSubmitOperator
Hive：通过HiveOperator
Kafka：集成KafkaProducerOperator

3. 监控方案

Prometheus：使用airflow-exporter
Grafana：配置专用仪表盘
Sentry：错误日志收集

九、学习资源推荐

1. 官方文档

2. 社区资源

3. 进阶书籍

《Data Pipelines with Apache Airflow》
《Airflow in Action》

通过本文的全面介绍，您应该已经掌握了Airflow的核心概念和实战技巧。建议从简单DAG开始实践，逐步构建复杂的数据管道。Airflow的强大之处在于其灵活性，随着使用深入，您会发现它能解决各种自动化工作流需求。 “`