
How to Compress HDFS Data on Ubuntu

小樊
2025-12-08 22:18:01
Category: Intelligent Operations

A Practical Guide to HDFS Data Compression on Ubuntu

1 Common Compression Formats and When to Use Them

2 Prerequisite Checks and Cluster Configuration

<configuration>
  <!-- core-site.xml: register the codecs that clients and jobs may use -->
  <property>
    <name>io.compression.codecs</name>
    <value>
      org.apache.hadoop.io.compress.GzipCodec,
      org.apache.hadoop.io.compress.DefaultCodec,
      org.apache.hadoop.io.compress.BZip2Codec,
      org.apache.hadoop.io.compress.SnappyCodec,
      org.apache.hadoop.io.compress.ZStandardCodec
    </value>
  </property>
</configuration>
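
Registering codecs matters because Hadoop chooses a decompressor from the file name suffix. The sketch below illustrates that idea with a plain lookup table; it is a simplified stand-in, not Hadoop's `CompressionCodecFactory`, and the class name `CodecLookupSketch` and the sample paths are made up for illustration. The suffixes are the default extensions of the codecs listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class CodecLookupSketch {

    // Default file extension of each registered codec, mapped to its class name.
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_SUFFIX.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        CODEC_BY_SUFFIX.put(".zst", "org.apache.hadoop.io.compress.ZStandardCodec");
    }

    /** Returns the codec class name whose extension matches the file name, if any. */
    public static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty(); // no known suffix: treat the file as uncompressed
    }

    public static void main(String[] args) {
        // Hypothetical HDFS paths, used only to exercise the lookup.
        System.out.println(codecFor("/data/logs/part-00000.gz").orElse("uncompressed"));
        System.out.println(codecFor("/data/logs/part-00000.txt").orElse("uncompressed"));
    }
}
```

A file whose suffix does not match any registered codec is read as plain bytes, which is why a codec missing from `io.compression.codecs` tends to show up as unreadable output rather than as an explicit error.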
<configuration>
  <!-- mapred-site.xml: compress intermediate map output (shuffle data) -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <!-- Compress the final job (reduce) output -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>
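<!-- Gzip compression level; place inside <configuration> in core-site.xml.
     Verify this key against core-default.xml for your Hadoop version: Hadoop's ZlibFactory
     reads zlib.compress.level (for example BEST_SPEED, DEFAULT_COMPRESSION, BEST_COMPRESSION). -->
<property>
  <name>io.compression.codec.gzip.level</name>
  <value>6</value>
</property>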
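
To get a feel for what a gzip level trades off, the following sketch compresses the same data at levels 1, 6, and 9. It uses the JDK's `java.util.zip` classes rather than Hadoop's `GzipCodec`, so it only illustrates the level setting itself; the class name `GzipLevelDemo` and the synthetic log lines are assumptions made for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLevelDemo {

    /** GZIPOutputStream has no level parameter, so set it on the inherited Deflater. */
    private static final class LeveledGzipOutputStream extends GZIPOutputStream {
        LeveledGzipOutputStream(OutputStream out, int level) throws IOException {
            super(out);
            def.setLevel(level); // 1 = fastest, 9 = smallest, 6 = zlib's default
        }
    }

    /** Compresses the input with gzip at the given level (1 to 9). */
    public static byte[] gzip(byte[] input, int level) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream gz = new LeveledGzipOutputStream(buffer, level)) {
            gz.write(input);
        }
        return buffer.toByteArray();
    }

    /** Decompresses gzip data back to the original bytes. */
    public static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    /** Synthetic, log-like sample data (made up for this demo). */
    public static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("INFO block_id=").append(i).append(" status=OK\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = sampleData();
        for (int level : new int[] {1, 6, 9}) {
            System.out.printf("level %d: %d -> %d bytes%n",
                    level, data.length, gzip(data, level).length);
        }
    }
}
```

Higher levels usually produce smaller output at the cost of more CPU time, so a mid-range level is a common compromise for data that is written once and read many times. Measure on a sample of your own data before settling on a value.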

3 Three Common Ways to Apply Compression

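// MapReduce driver (Java): write Snappy-compressed job output
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.set("mapreduce.output.fileoutputformat.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
Job job = Job.getInstance(conf);
// ... remaining job configuration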

-- Hive: store the table as ORC with Snappy compression
CREATE TABLE my_table (
  id INT,
  name STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

// Spark (Scala): write an existing DataFrame `df` as Snappy-compressed Parquet
df.write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("/hdfs/path/to/dest")

4 Small Files and Storage Optimization

5 Quick Selection Guidelines
