SQL-Hive中Select From怎么用

发布时间:2021-12-10 14:58:29 作者:小新
来源:亿速云 阅读:369
# SQL-Hive中Select From怎么用

## 一、基础语法与概念

### 1.1 SELECT FROM语句的核心作用
SELECT FROM是HiveQL(Hive Query Language)中最基础且最常用的查询语句,主要用于:
- 从Hive表中提取特定列的数据
- 配合WHERE等子句实现条件过滤
- 作为复杂查询的基础构建块

基本语法结构:
```sql
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

1.2 Hive与传统SQL的差异

  1. 执行环境:运行在Hadoop集群而非传统数据库
  2. 延迟特性:高延迟(分钟级)而非交互式查询
  3. 数据规模:针对TB/PB级数据优化
  4. 扩展语法:支持分区、分桶等特有操作

二、基础查询示例

2.1 单表查询

-- 查询所有列
SELECT * FROM employees;

-- 查询特定列
SELECT employee_id, name, department 
FROM employees;

-- 使用列别名
SELECT employee_id AS id, 
       name AS "员工姓名",
       salary*12 AS annual_salary
FROM employees;

2.2 使用DISTINCT去重

-- 获取不重复的部门列表
SELECT DISTINCT department FROM employees;

-- 多列组合去重
SELECT DISTINCT department, job_title 
FROM employees;

三、条件过滤(WHERE子句)

3.1 比较运算符

-- 基本比较
SELECT * FROM sales 
WHERE amount > 1000;

-- 日期比较(Hive需使用特定格式)
SELECT * FROM orders 
WHERE order_date >= '2023-01-01';

-- NULL值判断
SELECT * FROM customers 
WHERE phone_number IS NOT NULL;

3.2 逻辑运算符

-- AND/OR组合
SELECT * FROM products 
WHERE category = 'Electronics' 
AND price BETWEEN 500 AND 2000;

-- NOT运算符
SELECT * FROM employees 
WHERE NOT department = 'HR';

3.3 特殊操作符

-- IN运算符
SELECT * FROM students 
WHERE grade IN ('A', 'B');

-- LIKE模糊匹配
SELECT * FROM products 
WHERE product_name LIKE '%Pro%';

-- RLIKE正则匹配
SELECT * FROM logs 
WHERE message RLIKE 'error[0-9]+';

四、结果排序与限制

4.1 ORDER BY排序

-- 单列排序
SELECT * FROM employees 
ORDER BY hire_date DESC;

-- 多列排序
SELECT * FROM sales 
ORDER BY region ASC, amount DESC;

4.2 LIMIT限制结果

-- 返回前10条记录
SELECT * FROM large_table 
LIMIT 10;

-- 分页模拟(Hive 2.0+)
SELECT * FROM table 
ORDER BY id 
LIMIT 10 OFFSET 20;

五、聚合函数与分组

5.1 常用聚合函数

SELECT 
  COUNT(*) AS total_employees,
  AVG(salary) AS avg_salary,
  MAX(salary) AS max_salary,
  MIN(salary) AS min_salary,
  SUM(salary) AS total_payroll
FROM employees;

5.2 GROUP BY分组

-- 按部门统计
SELECT 
  department,
  COUNT(*) AS emp_count,
  AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- 多列分组
SELECT 
  department, 
  job_level,
  COUNT(*) 
FROM employees
GROUP BY department, job_level;

5.3 HAVING子句

-- 筛选分组结果
SELECT 
  department,
  AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING avg_salary > 10000;

六、Hive特有功能

6.1 分区查询优化

-- 查询特定分区(显著提升性能)
SELECT * FROM sales 
WHERE dt='2023-01-15' 
AND region='APAC';

-- 显示分区信息
SHOW PARTITIONS sales;

6.2 分桶查询

-- 创建分桶表示例
CREATE TABLE bucketed_users (
  id INT,
  name STRING
) CLUSTERED BY (id) INTO 4 BUCKETS;

-- 对分桶列使用WHERE条件
SELECT * FROM bucketed_users 
WHERE id = 123;

6.3 抽样查询

-- 按百分比抽样
SELECT * FROM huge_table 
TABLESAMPLE(10 PERCENT);

-- 按数据量抽样
SELECT * FROM huge_table 
TABLESAMPLE(100 ROWS);

七、复杂查询示例

7.1 子查询

-- WHERE子句中的子查询
SELECT * FROM employees 
WHERE salary > (
  SELECT AVG(salary) FROM employees
);

-- FROM子句中的子查询
SELECT dept.avg_salary
FROM (
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY department
) dept
WHERE dept.avg_salary > 8000;

7.2 公用表表达式(CTE)

WITH 
dept_stats AS (
  SELECT department, COUNT(*) AS emp_count
  FROM employees
  GROUP BY department
),
high_emp_depts AS (
  SELECT department 
  FROM dept_stats 
  WHERE emp_count > 50
)
SELECT * FROM employees
WHERE department IN (
  SELECT department FROM high_emp_depts
);

八、性能优化建议

  1. 列裁剪:避免SELECT *,只查询必要列 “`sql – 不推荐 SELECT * FROM large_table;

– 推荐 SELECT id, name, status FROM large_table;


2. **分区裁剪**:充分利用分区字段过滤
   ```sql
   -- 高效查询
   SELECT * FROM partitioned_table 
   WHERE year=2023 AND month=6;
  1. 合理使用并行

    SET hive.exec.parallel=true;
    SET hive.exec.parallel.thread.number=8;
    
  2. JOIN优化

    -- 确保大表在JOIN右侧
    SELECT /*+ MAPJOIN(small) */ a.* 
    FROM big_table a JOIN small_table b 
    ON a.id = b.id;
    

九、常见问题排查

  1. 数据倾斜

    -- 检查键值分布
    SELECT key_column, COUNT(*) 
    FROM table 
    GROUP BY key_column 
    ORDER BY 2 DESC LIMIT 10;
    
  2. 内存配置

    SET hive.map.aggr.hash.percentmemory=0.5;
    SET mapreduce.reduce.memory.mb=4096;
    
  3. 执行计划分析

    EXPLN EXTENDED 
    SELECT count(*) FROM large_table;
    

十、最佳实践总结

  1. 查询设计原则

    • 先过滤后处理(WHERE在GROUP BY之前)
    • 减少中间数据量
    • 合理利用分区和索引
  2. 格式规范建议

    -- 良好的格式化示例
    SELECT 
     e.employee_id,
     e.name,
     d.department_name
    FROM 
     employees e
     JOIN departments d ON e.dept_id = d.dept_id
    WHERE 
     e.status = 'ACTIVE'
     AND e.hire_date > '2020-01-01'
    ORDER BY 
     e.salary DESC
    LIMIT 100;
    
  3. 监控与调优

    • 定期分析慢查询
    • 监控资源使用情况
    • 根据数据变化调整分区策略

通过掌握这些SELECT FROM的使用技巧,您将能够高效地从Hive中提取所需数据,为大数据分析奠定坚实基础。 “`

推荐阅读:
  1. select count(*) from v$lock 查询慢
  2. SQL中SELECT怎么用

免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。

sql hive select

上一篇:CSS如何显示HTML元素

下一篇:Hive基础操作的示例代码

相关阅读

您好,登录后才能下订单哦!

密码登录
登录注册
其他方式登录
点击 登录注册 即表示同意《亿速云用户服务条款》