在Ubuntu上安装PostgreSQL是数据分析的基础,若需处理地理空间数据或增强功能,还需安装对应扩展。
sudo apt update
sudo apt install postgresql-14 postgresql-contrib-14
安装完成后,启动PostgreSQL服务并设置开机自启:sudo systemctl enable --now postgresql
sudo apt install postgis postgresql-14-postgis
CREATE EXTENSION tablefunc;
CREATE EXTENSION hypopg;
数据分析的前提是有高质量的数据,需完成数据导入与清洗。
COPY命令将本地CSV文件导入数据库表。假设存在sales.csv(含product_name,amount,sale_date字段):CREATE DATABASE data_warehouse; -- 创建数据仓库数据库
\c data_warehouse -- 切换到目标数据库
CREATE TABLE sales (id SERIAL PRIMARY KEY, product_name VARCHAR(255), amount INT, sale_date DATE); -- 创建表结构
COPY sales(product_name, amount, sale_date) FROM '/path/to/sales.csv' DELIMITER ',' CSV HEADER; -- 导入数据(HEADER表示第一行为列名)
raster2pgsql工具将TIF格式高程数据导入PostGIS。先安装工具(若未安装),再执行转换与导入:raster2pgsql -I -C -F -t 100x100 -s 4326 dem.tif public.dem > dem.sql # 转换为SQL脚本(-I创建空间索引,-C添加约束)
psql -U postgres -d data_warehouse -f dem.sql # 执行脚本导入数据
product_name和sale_date的最小ID):DELETE FROM sales
WHERE id NOT IN (
SELECT MIN(id)
FROM sales
GROUP BY product_name, sale_date
);
COALESCE函数将空值替换为默认值(如将amount列的空值替换为0):UPDATE sales SET amount = COALESCE(amount, 0) WHERE amount IS NULL;
PostgreSQL提供了丰富的分析功能,涵盖聚合、窗口函数、地理空间分析等。
COUNT、SUM、AVG等聚合函数统计汇总数据,结合GROUP BY按维度分组:
SELECT
department,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
WHERE过滤行):SELECT
department,
AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department;
HAVING过滤分组后的结果(如筛选平均工资超过5500的部门):SELECT
department,
AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department
HAVING AVG(salary) > 5500;
STRING_AGG合并字符串(如将同一部门员工姓名拼接成列表):SELECT
department,
STRING_AGG(name, ', ') AS employee_names
FROM employees
GROUP BY department;
RANK、DENSE_RANK、ROW_NUMBER等窗口函数实现排名、累计计算:
RANK会跳过相同值,DENSE_RANK不会):SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
SELECT
name,
department,
salary,
SUM(salary) OVER (PARTITION BY department ORDER BY hire_date) AS cumulative_salary
FROM employees;
SELECT ST_Value(rast, ST_SetSRID(ST_MakePoint(106.1253, 29.5052), 4326)) AS elevation
FROM dem
WHERE ST_Intersects(rast, ST_SetSRID(ST_MakePoint(106.1253, 29.5052), 4326));
SELECT AVG((ST_PixelAsPolygons(rast)).val) AS avg_elevation
FROM dem
WHERE ST_Intersects(rast, ST_MakeEnvelope(106.0950, 29.4494, 106.1470, 29.5090, 4326));
大数据量下的分析性能至关重要,需通过索引、查询优化提升效率。
WHERE、ORDER BY查询):CREATE INDEX idx_sale_date ON sales(sale_date);
CREATE INDEX idx_dem_geom ON dem USING GIST (rast);
hypopg,模拟索引效果,无需实际创建):SELECT * FROM hypopg_create_index('public.sales_sale_date_idx', 'sales', 'sale_date');
EXPLAIN ANALYZE分析查询计划,找出性能瓶颈:EXPLAIN ANALYZE
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department;
SELECT *,只查询需要的列,减少数据传输量。LIMIT限制返回行数,快速验证查询逻辑。数据分析的结果需通过可视化工具呈现,同时监控数据库性能以确保稳定性。
PgBadger工具解析PostgreSQL日志,生成HTML格式的性能报告(如查询耗时、锁等待等)。
sudo apt install pgbadger
postgresql.conf):logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_statement = 'all'
log_min_duration_statement = 0 # 记录所有查询的执行时间
log_connections = on
log_disconnections = on
pgbadger -f csv /var/lib/postgresql/14/main/pg_log/*.log -o pgbadger_report.html
pgbadger_report.html即可查看详细的性能分析结果。sales表中的数据导入Metabase,创建“各部门销售额趋势”折线图,直观呈现销售变化。通过以上流程,可在Ubuntu上利用PostgreSQL完成从数据准备到分析、优化、可视化的完整数据分析任务,满足业务决策需求。