Implementing Hadoop Job Scheduling on Ubuntu
On Ubuntu, Hadoop job scheduling generally operates at two levels: cluster-side multi-tenant queue and resource scheduling (the YARN Scheduler), and time- or data-triggered workflow scheduling (e.g., Oozie or crontab). Below are implementation steps you can apply directly, with the key configuration for each.
1. Cluster-side scheduling: choosing and configuring the YARN Scheduler
In $HADOOP_HOME/etc/hadoop/yarn-site.xml, select the scheduler (CapacityScheduler is the default in stock Apache Hadoop 2.x and later):
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
Queues are then defined in capacity-scheduler.xml in the same directory; the minimal setup keeps a single default queue holding 100% of cluster capacity:
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>100</value>
</property>
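For real multi-tenancy you would split capacity across several queues rather than keep a single default. A minimal capacity-scheduler.xml sketch, assuming two hypothetical queues named etl and adhoc:
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>etl,adhoc</value>
</property>
<!-- capacities under one parent queue must sum to 100 -->
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>30</value>
</property>
<!-- let adhoc borrow idle capacity up to this ceiling -->
<property>
  <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
  <value>60</value>
</property>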
Restart YARN so the scheduler-class change takes effect:
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh
Submit a job to a specific queue (the -D option is picked up when the driver uses ToolRunner/GenericOptionsParser):
yarn jar /path/to/your-job.jar com.example.YourJobClass \
  -Dmapreduce.job.queuename=default input output
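If you later edit the queue definitions, they can be reloaded into a running ResourceManager without a restart, and queue state can be checked from the CLI (queue name taken from the config above):
# reload capacity-scheduler.xml into the running ResourceManager
yarn rmadmin -refreshQueues
# inspect a queue's configured capacity and state
yarn queue -status default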
2. Time- and data-triggered workflow scheduling
Orchestrating and timed runs with Oozie
A typical workflow application layout:
oozie-apps/
└─ mr-wordcount-wf/
├─ job.properties
├─ workflow.xml
└─ lib/ (dependency JARs)
# job.properties (jobTracker is the YARN ResourceManager address in Hadoop 2)
nameNode=hdfs://master:8020
jobTracker=master:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/oozie-apps/mr-wordcount-wf/workflow.xml
inputDir=mr-wordcount-wf/input
outputDir=mr-wordcount-wf/output
workflow.xml, as a skeleton (the MR parameters are expanded just after this block):
<workflow-app name="mr-wordcount-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property><name>mapreduce.job.queuename</name><value>${queueName}</value></property>
        <!-- other MR parameters: mapper/reducer classes, input/output paths -->
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed, error: ${wf:errorMessage(wf:lastErrorNode())}</message></kill>
  <end name="end"/>
</workflow-app>
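The placeholder above is where the concrete MR settings go: the new-API switches, mapper/reducer classes, and input/output paths. A sketch, in which the com.example class names are assumptions:
<property><name>mapred.mapper.new-api</name><value>true</value></property>
<property><name>mapred.reducer.new-api</name><value>true</value></property>
<property><name>mapreduce.job.map.class</name><value>com.example.WordCountMapper</value></property>
<property><name>mapreduce.job.reduce.class</name><value>com.example.WordCountReducer</value></property>
<property><name>mapreduce.input.fileinputformat.inputdir</name><value>/user/${wf:user()}/${inputDir}</value></property>
<property><name>mapreduce.output.fileoutputformat.outputdir</name><value>/user/${wf:user()}/${outputDir}</value></property>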
# Upload the application to HDFS
hdfs dfs -mkdir -p /user/${USER}/oozie-apps
hdfs dfs -put oozie-apps/mr-wordcount-wf /user/${USER}/oozie-apps/
# Run the workflow once
oozie job -oozie http://<oozie-host>:11000/oozie -config oozie-apps/mr-wordcount-wf/job.properties -run
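The -run command prints a job ID that the standard Oozie CLI calls can follow up on:
# check status and per-action progress
oozie job -oozie http://<oozie-host>:11000/oozie -info <job-id>
# fetch the job log
oozie job -oozie http://<oozie-host>:11000/oozie -log <job-id>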
# Create a time-triggered coordinator (this example fires daily at 02:00)
# In coordinator.xml, start/end/frequency are attributes of <coordinator-app>,
# and <datasets> is a child element used for data triggers; see the sketch below
oozie job -oozie http://<oozie-host>:11000/oozie -config coordinator.properties -run
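A minimal coordinator.xml sketch for the daily-02:00 case; the dates and the /user/hadoop directory are assumptions, and coordinator.properties must set oozie.coord.application.path to the HDFS directory holding this file:
<!-- Oozie expects UTC timestamps: 18:00Z is 02:00 the next day in UTC+8 -->
<coordinator-app name="mr-wordcount-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T18:00Z" end="2024-12-31T18:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/user/hadoop/oozie-apps/mr-wordcount-wf/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>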
Triggering Hadoop/Hive/Sqoop scripts with Linux crontab
An example ETL wrapper script (run_etl.sh):
#!/usr/bin/env bash
# cron runs with a minimal environment, so load Hadoop/Hive variables first
source /etc/profile
set -euo pipefail
mkdir -p ~/etl
LOG=~/etl/run_$(date +%F).log
hive -f /home/hadoop/sql/daily_import.hql >> "$LOG" 2>&1
Install the schedule:
crontab -e
# m h dom mon dow: run daily at 02:00
0 2 * * * /home/hadoop/bin/run_etl.sh
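To confirm the entry and watch a run:
crontab -l
tail -f ~/etl/run_$(date +%F).log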
3. Monitoring and operations
Track applications with the YARN CLI:
yarn application -list
yarn application -status <application_id>
yarn logs -applicationId <application_id>
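The ResourceManager web UI (default port 8088, i.e. http://master:8088/cluster/scheduler with the hostname used above) shows live per-queue usage, and a runaway job can be stopped from the CLI:
yarn application -kill <application_id>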
4. Choosing between the approaches
As a rule of thumb: the YARN scheduler configuration governs how tenants share cluster resources and applies regardless of how jobs are triggered; crontab is sufficient for simple, purely time-based runs of standalone scripts; Oozie earns its setup cost when you need multi-step workflows, data-availability triggers, or coordinator-level catch-up runs.