How Hive Implements DML Data Operations, Partitioned Tables, and Bucketed Tables

Published: 2021-12-16 14:12:28  Author: 小新
Source: 亿速云

This article walks through Hive's DML data operations (importing and exporting data) and then shows how to create and work with partitioned tables and bucketed tables.

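1. DML data operations

1.1 Data import

1. Import with load data
	load data [local] inpath 'path to the data' [overwrite] 
		-- [local]: with local, the path is on the local file system; without it, the path is on HDFS.
		-- [overwrite]: with overwrite, the load replaces the data already in the table; without it, the data is appended.
		
	into table table_name [partition (partcol1=val1,…)];
		-- [partition (partcol1=val1,…)]: the target partition (partitioned tables are covered in section 2.1).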
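
As a minimal sketch of both forms, assuming a hypothetical tab-delimited file student.txt and a table student(id, name); the table definition and both file paths are illustrative and do not come from the original:

```sql
-- Hypothetical table; the file paths below are assumptions for illustration.
create table if not exists student(id int, name string)
row format delimited fields terminated by '\t';

-- Local file: copied into the table directory and appended to the existing data.
load data local inpath '/opt/module/hive/datas/student.txt' into table student;

-- HDFS file: moved into the table directory, replacing the existing data.
load data inpath '/user/hive/student.txt' overwrite into table student;
```

Note that a load from HDFS moves the source file rather than copying it, so the file no longer exists at its original path afterwards.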
		
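Tip: set hive.exec.mode.local.auto=true; lets Hive run MapReduce jobs in local mode. Local mode is used only when the job meets certain conditions (for example, a small enough input); otherwise the job still runs on the cluster.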


-----------------------------------------------------------
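2. Insert data into a table with a query (Insert)

	2.1 Insert new rows directly
		insert into student values(1,'aa');

	2.2 Insert the result of a query (note: the query result must match the target table's columns in both number and type)
		-- insert overwrite replaces the existing data; use insert into table to append instead
		insert overwrite table table_name select_statement;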
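
The difference between appending and overwriting can be sketched as follows. The table student2 and the filter conditions are assumptions made for illustration; student is the table used in 2.1.

```sql
-- student2 is a hypothetical table with the same schema as student.
create table if not exists student2 like student;

-- Append several rows in one statement.
insert into table student values (2, 'bb'), (3, 'cc');

-- Replace the contents of student2 with a query result;
-- the select list matches student2's columns in number and type.
insert overwrite table student2
select id, name from student where id < 3;

-- Append to student2 instead of replacing its contents.
insert into table student2
select id, name from student where id >= 3;
```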


--------------------------------------------------------------
3. Create a table and load data from a query (As Select)
	create table if not exists table_name
	as select_statement;
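
A short sketch of this form, assuming the student table from earlier; the name student_ctas is hypothetical:

```sql
-- The new table's columns (id, name) and their types come from the select list.
create table if not exists student_ctas
as select id, name from student;
```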
	
	
	
----------------------------------------------------------------
4. Specify the data location with Location when creating the table
	create table if not exists student3(
	id int,
	name string
	)
	row format delimited fields terminated by '\t'
	location '/input';
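
With a table defined this way, putting a file into the directory is enough to make it queryable. A sketch for the Hive CLI, assuming a hypothetical tab-delimited local file student.txt:

```sql
-- Run in the Hive client; the local file name is an assumption.
dfs -mkdir -p /input;
dfs -put /opt/module/hive/datas/student.txt /input;

-- The table reads whatever files are in /input; no load step is needed.
select id, name from student3;
```

Because student3 is a managed table, dropping it also deletes the files under /input. Declaring the table as external would leave the files in place.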


--------------------------------------------------------------------
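5. Import data (only data produced by export can be imported)
	Note: the target table must not exist yet, or must exist with a matching schema and no data; otherwise the import fails.
	import table db_name.table_name  from 'HDFS path of the exported data';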

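1.2 Data export

1. Export with insert
	insert overwrite [local] directory 'path'
	row format delimited fields terminated by '\t' -- field delimiter
            select_statement;
	-- local: with local, the data is written to the local file system; without it, to HDFS
	-- overwrite: the existing contents of the target directory are replaced, so use a dedicated directory

    Example: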
	insert overwrite local directory '/opt/module/hive/datas2' 
	row format delimited fields terminated by '\t'
	select * from db4.student3;

	insert overwrite directory '/output' 
	row format delimited fields terminated by '\t'
	select * from db4.student3;


-------------------------------------------------------------------
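2. Export to the local file system with Hadoop commands

	hadoop fs -get 'path of the table data'  'local path'
	hdfs dfs -get 'path of the table data'  'local path'
	From the Hive client: dfs -get 'path of the table data'  'local path';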


--------------------------------------------------------------------
3. Export with the Hive shell
	bin/hive -e 'select * from table_name;' > local_file_path;


--------------------------------------------------------------------
4. Export to HDFS with export

	export table db_name.table_name to 'HDFS path';
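
Export and import are meant to be used as a pair. A minimal round trip might look like this, using the db4.student3 table from the examples above; the export directory and the name student3_copy are assumptions:

```sql
-- export writes both the table metadata and the data files to the target directory.
export table db4.student3 to '/user/hive/export/student3';

-- Import under a new table name that does not exist yet.
import table db4.student3_copy from '/user/hive/export/student3';
```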


--------------------------------------------------------------------
5. Export with Sqoop
	To be covered later.

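2. Partitioned tables and bucketed tables

2.1 Partitioned tables

I. Create a partitioned table
	create table table_name(
		deptno int, dname string, loc string
	)
	partitioned by (column_name column_type) -- the partition column; it must not repeat a column from the list above
	row format delimited fields terminated by '\t';

   Example: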
	create table dept_partition(
	deptno int, dname string, loc string
	)
	partitioned by (day string)
	row format delimited fields terminated by '\t';


---------------------------------------------------------------------------------
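II. Partition operations:

	1. Add partitions (separate multiple partitions with spaces)
	alter table table_name add partition(partition_column='value') partition(partition_column='value') ...
	
	2. List partitions
	show partitions table_name;
	
	3. Drop partitions (separate multiple partitions with commas)
	alter table table_name drop partition(partition_column='value'),partition(partition_column='value') ...
	
	4. Load data into a partition
	load data [local] inpath 'path' [overwrite] into table table_name partition(partition_column='value');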
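
Applied to the dept_partition table created above, these operations could look like the sketch below. The partition values 20200404 and 20200405 and the log file path are assumptions chosen for illustration.

```sql
-- Add two partitions at once (space-separated).
alter table dept_partition add partition(day='20200404') partition(day='20200405');

-- List the partitions known to the metastore.
show partitions dept_partition;

-- Load a local file into one partition; the partition is created if it does not exist.
load data local inpath '/opt/module/hive/datas/dept_20200401.log'
into table dept_partition partition(day='20200401');

-- Filtering on the partition column lets Hive read only that partition's directory.
select deptno, dname, loc from dept_partition where day='20200401';

-- Drop two partitions at once (comma-separated).
alter table dept_partition drop partition(day='20200404'), partition(day='20200405');
```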


---------------------------------------------------------------------------------------
III. Create a two-level partitioned table
	create table table_name(
	deptno int, dname string, loc string
	 )
	partitioned by (column_name1 column_type, column_name2 column_type, ...)
	row format delimited fields terminated by '\t';

   Example:
	create table dept_partition2(
	deptno int, dname string, loc string
	)
	partitioned by (day string, hour string)
	row format delimited fields terminated by '\t';


   Load data into the two-level partitioned table (if the partition does not exist, load creates it):
	load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table
	dept_partition2 partition(day='20200401', hour='12');

	load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table
	dept_partition2 partition(day='20200401', hour='13');
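
After the two loads, each partition corresponds to a nested directory of the form day=20200401/hour=12 under the table directory. A short sketch of inspecting and querying them:

```sql
-- Lists each partition in the form day=.../hour=...
show partitions dept_partition2;

-- Filtering on both partition columns reads a single hour directory.
select deptno, dname, loc
from dept_partition2
where day='20200401' and hour='12';
```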


---------------------------------------------------------------
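IV. Ways to associate data with partitions
	Files placed directly in a partition directory on HDFS are not queryable until the metastore has metadata for that partition. There are three ways to create the association.

	1. Method 1: upload the data, then run the repair command
		msck repair table table_name;

	2. Method 2: upload the data, then add the partition
		alter table table_name add partition(column_name='value');

	3. Method 3: create the directory, then load the data into the partition (load creates the partition directly)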
		load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table
		dept_partition2 partition(day='20200401', hour='13');
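
Methods 1 and 2 can be sketched as follows for the Hive client. The warehouse path assumes the default location /user/hive/warehouse and a database named db4, and the hour values 14 and 15 are made up for illustration; adjust all of them to your environment.

```sql
-- Method 1: upload files into a new partition directory, then repair.
dfs -mkdir -p /user/hive/warehouse/db4.db/dept_partition2/day=20200401/hour=14;
dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/db4.db/dept_partition2/day=20200401/hour=14;
-- msck scans the table directory and registers partitions missing from the metastore.
msck repair table dept_partition2;

-- Method 2: upload files, then register the partition explicitly.
dfs -mkdir -p /user/hive/warehouse/db4.db/dept_partition2/day=20200401/hour=15;
dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/db4.db/dept_partition2/day=20200401/hour=15;
alter table dept_partition2 add partition(day='20200401', hour='15');

-- Both partitions are now visible to queries.
select deptno, dname, loc from dept_partition2 where day='20200401' and hour='14';
```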

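2.2 Bucketed tables

I. Create a bucketed table:
	create table table_name(id int, name string)
	clustered by(id) -- id is the bucketing column; rows are assigned to buckets based on its value
	into number_of_buckets buckets
	row format delimited fields terminated by '\t';

   Example: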
	create table stu_buck(id int, name string)
	clustered by(id) 
	into 4 buckets
	row format delimited fields terminated by '\t';

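   Notes:
	 1. In recent Hive versions, loading data into a bucketed table with load runs a MapReduce job,
		so the source file is best placed on HDFS, where every task in the job can find it.

	 2. The number of ReduceTasks has to fit the number of buckets: set mapreduce.job.reduces to -1 so that Hive decides, or to a value no smaller than the number of buckets.

	 3. Bucketing rule: the hash code of the bucketing column's value % the number of buckets determines which bucket a row goes into.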
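
Putting these notes together, populating and sampling the stu_buck table might look like this sketch. The HDFS file path and the student source table are assumptions, not part of the original.

```sql
-- Let Hive choose the number of reducers (alternatively, set it to at least 4 here).
set mapreduce.job.reduces=-1;

-- Option 1: load a tab-delimited file that is already on HDFS (hypothetical path).
load data inpath '/student.txt' into table stu_buck;

-- Option 2: populate the bucketed table from a query on a hypothetical student table.
insert into table stu_buck select id, name from student;

-- Read only the first of the 4 buckets (bucket numbers start at 1).
select id, name from stu_buck tablesample(bucket 1 out of 4 on id);
```

Each bucket is stored as a separate file in the table directory, which is why sampling on the bucketing column can read a single file.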
