您好,登录后才能下订单哦!
密码登录
登录注册
点击 登录注册 即表示同意《亿速云用户服务条款》
# SQL的数据清洗方法有哪些
## 引言
在数据分析与数据科学领域,数据清洗(Data Cleaning)是确保数据质量的关键步骤。据统计,数据科学家80%的时间都花在数据清洗上。SQL作为关系型数据库的标准查询语言,提供了丰富的功能来高效完成数据清洗任务。本文将系统介绍SQL中常用的数据清洗方法,涵盖缺失值处理、重复数据删除、数据类型转换、异常值检测等核心场景。
---
## 一、缺失值处理
### 1. 检测缺失值
```sql
-- 检查某列的缺失值数量
SELECT COUNT(*)
FROM table_name
WHERE column_name IS NULL;
-- 检查每列的缺失值比例
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN column1 IS NULL THEN 1 ELSE 0 END) AS null_column1,
SUM(CASE WHEN column2 IS NULL THEN 1 ELSE 0 END) AS null_column2
FROM table_name;
-- 删除包含缺失值的行
DELETE FROM table_name
WHERE column_name IS NULL;
-- 删除缺失率过高的列
ALTER TABLE table_name DROP COLUMN column_name;
-- 用固定值填充
UPDATE table_name
SET column_name = 'Unknown'
WHERE column_name IS NULL;
-- 用均值/中位数填充(需先计算)
UPDATE products
SET price = (SELECT AVG(price) FROM products WHERE price IS NOT NULL)
WHERE price IS NULL;
-- 用相邻值填充(需排序后使用窗口函数)
UPDATE table_name t1
SET column_name = (
SELECT LAST_VALUE(column_name IGNORE NULLS) OVER (
ORDER BY id ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
)
FROM table_name t2
WHERE t1.id = t2.id AND t1.column_name IS NULL;
-- 基于单列检测
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;
-- 基于多列检测
SELECT col1, col2, col3, COUNT(*)
FROM table_name
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1;
-- 使用DISTINCT创建新表
CREATE TABLE new_table AS
SELECT DISTINCT * FROM original_table;
-- 使用ROW_NUMBER()删除完全重复行
DELETE FROM table_name
WHERE id IN (
SELECT id FROM (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY col1, col2, col3 ORDER BY id) AS rn
FROM table_name
) t WHERE t.rn > 1
);
-- 使用GROUP BY保留最新记录
CREATE TABLE deduplicated AS
SELECT col1, col2, MAX(update_time) AS latest_time
FROM table_name
GROUP BY col1, col2;
-- 显式类型转换
SELECT CAST(column_name AS DATE) FROM table_name;
SELECT column_name::INTEGER FROM table_name; -- PostgreSQL语法
-- 隐式转换处理
SELECT TO_CHAR(date_column, 'YYYY-MM-DD') FROM table_name; -- Oracle
SELECT FORMAT(date_column, 'yyyy-MM-dd') FROM table_name; -- SQL Server
-- 大小写统一
UPDATE customers SET name = UPPER(name);
UPDATE customers SET email = LOWER(email);
-- 去除空格
UPDATE products
SET product_name = TRIM(BOTH ' ' FROM product_name);
-- 正则表达式清洗
-- PostgreSQL示例:提取纯数字
UPDATE contacts
SET phone = REGEXP_REPLACE(phone, '[^0-9]', '', 'g');
-- MySQL日期格式化
UPDATE orders
SET order_date = STR_TO_DATE(raw_date, '%m/%d/%Y');
-- SQL Server格式转换
UPDATE events
SET start_time = CONVERT(DATETIME, date_string, 103); -- 103表示dd/mm/yyyy格式
-- 使用Z-score检测
WITH stats AS (
SELECT
AVG(price) AS mean_price,
STDDEV(price) AS std_price
FROM products
)
SELECT product_id, price
FROM products, stats
WHERE ABS((price - mean_price) / std_price) > 3; -- 3σ原则
-- 使用IQR方法
WITH quartiles AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
FROM measurements
)
SELECT id, value
FROM measurements, quartiles
WHERE value < (q1 - 1.5*(q3-q1)) OR value > (q3 + 1.5*(q3-q1));
-- 年龄合理性检查
SELECT user_id, birth_date
FROM users
WHERE birth_date < '1900-01-01' OR birth_date > CURRENT_DATE;
-- 金额范围验证
SELECT transaction_id, amount
FROM transactions
WHERE amount < 0 OR amount > 1000000;
-- 替换为边界值
UPDATE sensor_data
SET reading_value =
CASE
WHEN reading_value > 100 THEN 100
WHEN reading_value < 0 THEN 0
ELSE reading_value
END;
-- 标记异常记录(不删除)
ALTER TABLE sales ADD COLUMN is_outlier BOOLEAN DEFAULT FALSE;
UPDATE sales
SET is_outlier = TRUE
WHERE amount > (SELECT AVG(amount) + 3*STDDEV(amount) FROM sales);
-- 查找违反外键约束的记录
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- 处理方案1:删除孤儿记录
DELETE FROM orders
WHERE customer_id NOT IN (SELECT customer_id FROM customers);
-- 处理方案2:设置为默认值
UPDATE orders
SET customer_id = 0 -- 假设0是默认客户ID
WHERE customer_id NOT IN (SELECT customer_id FROM customers);
-- 检查订单日期早于发货日期
SELECT order_id, order_date, ship_date
FROM orders
WHERE ship_date < order_date;
-- 修复不一致数据
UPDATE orders
SET ship_date = order_date + INTERVAL '1 day'
WHERE ship_date < order_date;
WITH cleaned_data AS (
-- 第一步:处理缺失值
SELECT
id,
COALESCE(name, 'Unknown') AS name,
CASE WHEN age < 0 THEN NULL ELSE age END AS age
FROM raw_users
),
normalized_data AS (
-- 第二步:标准化格式
SELECT
id,
UPPER(TRIM(name)) AS name,
age
FROM cleaned_data
)
-- 最终输出
SELECT * FROM normalized_data
WHERE age BETWEEN 18 AND 100;
-- PostgreSQL示例:创建电话号码清洗函数
CREATE OR REPLACE FUNCTION clean_phone(phone TEXT)
RETURNS TEXT AS $$
BEGIN
RETURN REGEXP_REPLACE(phone, '[^0-9]', '', 'g');
END;
$$ LANGUAGE plpgsql;
-- 使用函数批量清洗
UPDATE contacts SET phone = clean_phone(phone);
备份原始数据:始终在清洗前创建数据备份
CREATE TABLE customers_backup AS SELECT * FROM customers;
分阶段验证:通过中间表逐步验证清洗效果
CREATE TABLE step1_cleaned AS SELECT * FROM raw_data WHERE ...;
文档记录:记录所有清洗规则和业务决策
自动化流程:使用存储过程实现可重复的清洗流程
CREATE PROCEDURE clean_database() AS
BEGIN
-- 清洗步骤1
-- 清洗步骤2
-- ...
END;
SQL提供了强大而灵活的数据清洗能力,从基础的缺失值处理到复杂的异常检测,都能高效完成。掌握这些技术可以显著提高数据质量,为后续分析建模奠定坚实基础。实际工作中应根据数据特性和业务需求,灵活组合使用这些方法,并建立系统化的数据质量管理流程。
注意:不同数据库系统(MySQL、PostgreSQL、SQL Server等)的语法可能有差异,本文示例以标准SQL为主,部分展示了特定数据库的语法扩展。 “`
该文档共约3100字,采用Markdown格式编写,包含代码块、多级标题和结构化内容,可直接用于技术文档发布或博客文章。需要调整具体数据库语法或补充特定案例时,可进一步扩展相应章节。
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。