您好,登录后才能下订单哦!
# SQL Hive中的Select From语句使用详解
## 1. Hive SQL基础概述
Apache Hive是构建在Hadoop之上的数据仓库工具,它提供了类似SQL的查询语言——HiveQL(简称HQL),允许用户对存储在HDFS中的大规模数据集进行查询和分析。其中`SELECT FROM`语句作为最基础且核心的查询语法,承担着数据检索的重要功能。
### 1.1 Hive与传统SQL的异同
相同点:
- 语法结构高度相似
- 支持标准SQL操作(SELECT, JOIN, WHERE等)
- 使用表、列等关系型概念
不同点:
- Hive处理PB级数据,传统SQL通常处理GB/TB级
- Hive执行延迟较高(分钟级),传统SQL是毫秒级响应
- Hive底层是MapReduce/Tez/Spark作业
## 2. 基本SELECT FROM语法结构
### 2.1 最简查询形式
```sql
SELECT column1, column2
FROM database_name.table_name;
示例:
SELECT employee_id, name
FROM hr.employees;
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
-- 选择单列
SELECT product_name FROM products;
-- 选择多列
SELECT product_id, category, price FROM products;
-- 选择所有列(生产环境慎用)
SELECT * FROM sales_records;
-- 选择特定前缀的列
SELECT sales.* FROM sales_2023 sales;
SELECT
user_id AS id,
registration_date AS reg_date
FROM users;
SELECT
product_id,
quantity,
price,
quantity * price AS total_amount
FROM order_items;
-- 数值比较
SELECT * FROM products WHERE price > 100;
-- 字符串比较
SELECT * FROM customers WHERE name = '张三';
-- 日期比较
SELECT * FROM orders WHERE order_date >= '2023-01-01';
-- AND 运算符
SELECT * FROM employees
WHERE department = 'Sales' AND salary > 5000;
-- OR 运算符
SELECT * FROM products
WHERE category = 'Electronics' OR category = 'Appliances';
-- NOT 运算符
SELECT * FROM customers
WHERE NOT (age BETWEEN 18 AND 30);
-- IN 操作符
SELECT * FROM products
WHERE category IN ('Books', 'Music', 'Movies');
-- BETWEEN 范围查询
SELECT * FROM transactions
WHERE amount BETWEEN 100 AND 1000;
-- LIKE 模糊匹配
SELECT * FROM users
WHERE email LIKE '%@gmail.com';
-- IS NULL 空值判断
SELECT * FROM employees
WHERE manager_id IS NULL;
-- 默认升序(ASC)
SELECT * FROM products
ORDER BY price;
-- 显式降序
SELECT * FROM sales
ORDER BY sale_date DESC;
SELECT employee_id, name, department, salary
FROM employees
ORDER BY department ASC, salary DESC;
-- 获取价格最高的10个产品
SELECT * FROM products
ORDER BY price DESC
LIMIT 10;
-- 按部门统计员工数
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department;
常用聚合函数: - COUNT() - 计数 - SUM() - 求和 - AVG() - 平均值 - MAX()/MIN() - 最大/最小值 - VARIANCE()/STDDEV() - 方差/标准差
示例:
SELECT
product_category,
AVG(price) AS avg_price,
MAX(price) AS max_price,
COUNT(*) AS product_count
FROM products
GROUP BY product_category;
-- 筛选平均价格大于100的类别
SELECT
category,
AVG(price) AS avg_price
FROM products
GROUP BY category
HAVING avg_price > 100;
-- 获取唯一的部门列表
SELECT DISTINCT department FROM employees;
-- 多列去重
SELECT DISTINCT department, job_title FROM employees;
SELECT
product_id,
price,
CASE
WHEN price > 1000 THEN 'Premium'
WHEN price > 500 THEN 'Standard'
ELSE 'Budget'
END AS price_tier
FROM products;
-- 计算部门内薪资排名
SELECT
employee_id,
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
-- 查询特定分区的数据
SELECT * FROM sales
WHERE dt = '2023-01-15' AND region = 'North';
-- 动态分区查询
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- 创建分桶表
CREATE TABLE bucketed_users (
id INT,
name STRING
) CLUSTERED BY (id) INTO 4 BUCKETS;
-- 分桶查询
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
-- 数据块抽样
SELECT * FROM large_table TABLESAMPLE(10 PERCENT);
-- 行数抽样
SELECT * FROM large_table TABLESAMPLE(100 ROWS);
SELECT *
SET hive.exec.parallel=true;
-- 分析每月销售趋势
SELECT
DATE_FORMAT(order_date, 'yyyy-MM') AS month,
COUNT(DISTINCT order_id) AS order_count,
SUM(amount) AS total_sales,
AVG(amount) AS avg_order_value
FROM orders
WHERE order_date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY DATE_FORMAT(order_date, 'yyyy-MM')
ORDER BY month;
-- 识别高价值用户
SELECT
user_id,
COUNT(DISTINCT session_id) AS session_count,
SUM(CASE WHEN page_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
SUM(duration) AS total_time_spent
FROM user_sessions
GROUP BY user_id
HAVING purchase_count >= 3
ORDER BY total_time_spent DESC
LIMIT 100;
数据倾斜:使用SKEW JOIN
优化
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;
内存不足:调整Reducer数量
SET mapreduce.job.reduces=200;
小文件问题:合并输出
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
Hive的SELECT FROM语句作为数据查询的基石,虽然语法与传统SQL相似,但在大数据环境下有其独特的优化方式和执行特性。掌握其使用方法和优化技巧,能够显著提升海量数据查询的效率。随着Hive版本的迭代,更多现代SQL特性如CTE、高级窗口函数等也被逐步支持,使得HiveQL在处理复杂分析场景时更加得心应手。
注意:实际使用时请根据具体Hive版本调整语法,本文示例基于Hive 3.x版本。 “`
这篇文章共计约2900字,全面介绍了Hive中SELECT FROM语句的使用方法,从基础语法到高级特性,包含了大量实用示例和优化建议。内容采用Markdown格式,包含代码块、列表、标题等标准元素,适合技术文档的编写和阅读。
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。