Introduction

Hadoop is an open-source big data framework for distributed storage and distributed computation. This article is a detailed getting-started guide for beginners: it walks through setting up a Hadoop environment on a single CentOS machine and shares some practical tips.

1. Hadoop Overview

1.1 What Is Hadoop

Hadoop is an open-source distributed computing framework developed by the Apache Software Foundation for processing large-scale data sets. It is built around three core components: the Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Yet Another Resource Negotiator (YARN).

1.2 Core Components

1.2.1 Hadoop Distributed File System (HDFS)

HDFS is a distributed file system designed for storing very large amounts of data. It splits files into blocks and stores replicated copies of them across multiple nodes, which improves both data reliability and access speed.

1.2.2 Hadoop MapReduce

MapReduce is a programming model for processing large data sets. The input data is split into chunks that are processed in parallel on multiple nodes, and the intermediate results are then merged into the final output.

1.2.3 YARN (Yet Another Resource Negotiator)

YARN is the cluster resource manager: it allocates resources such as CPU and memory to applications and monitors their execution.

2. Setting Up Hadoop on a Single CentOS Machine

2.1 Software Requirements

  • Operating system: CentOS 7
  • Java development kit: JDK 1.8
  • Hadoop version: Hadoop 3.x

2.2 Installation Steps

2.2.1 Installing Java

  1. Install the OpenJDK 1.8 development package:
sudo yum install java-1.8.0-openjdk-devel
  2. Configure the environment variables. The commands below append to /etc/profile, so run them as root; the JDK directory name under /usr/lib/jvm/ varies by build, so adjust the path to match what is actually installed (check with ls /usr/lib/jvm/):
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.x86_" >> /etc/profile
echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /etc/profile
source /etc/profile
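
Before moving on, a quick sanity check that the JDK and the environment variables are in place:

java -version     # should report an OpenJDK 1.8.x runtime
echo $JAVA_HOME   # should print the JDK directory configured above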

2.2.2 Installing Hadoop

  1. Download the Hadoop release (the URL below points to one mirror; any Apache mirror will do):
wget http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
  2. Create the target directory and unpack the archive:
sudo mkdir -p /opt/hadoop
tar -zxvf hadoop-3.3.4.tar.gz -C /opt/hadoop
  3. Configure the environment variables (again appending to /etc/profile as root; sbin is added to the PATH so that the start/stop scripts used later are found):
echo "export HADOOP_HOME=/opt/hadoop/hadoop-3.3.4" >> /etc/profile
echo "export PATH=\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin:\$PATH" >> /etc/profile
source /etc/profile
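
To confirm the installation, the hadoop command should now be on the PATH:

hadoop version   # should print Hadoop 3.3.4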

2.3 Configuring Hadoop

  1. Configure hadoop-env.sh:
cd $HADOOP_HOME/etc/hadoop
vi hadoop-env.sh

Add the following line (use the same JAVA_HOME path configured in section 2.2.1):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.x86_
  2. Configure core-site.xml:
vi core-site.xml

Add the following:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
  3. Configure hdfs-site.xml:
vi hdfs-site.xml

Add the following (replication is set to 1 because a single node has only one DataNode):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
  4. Configure mapred-site.xml:
vi mapred-site.xml

Add the following:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- On Hadoop 3.x, MapReduce jobs submitted to YARN also need the MapReduce jars on their classpath -->
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
  5. Configure yarn-site.xml:
vi yarn-site.xml

Add the following:

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
    </property>
    <!-- Required so that MapReduce jobs can run their shuffle phase on YARN -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
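
One more prerequisite: even on a single node, the start-dfs.sh and start-yarn.sh scripts used in section 2.5 launch the daemons over SSH, so the current user needs passwordless SSH access to localhost. A minimal sketch, assuming sshd is already installed and running:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa         # generate a key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # authorize the key for local logins
chmod 600 ~/.ssh/authorized_keys
ssh localhost true                               # should return without prompting for a password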

2.4 Formatting HDFS

On a fresh installation, format the NameNode once (re-running this later will wipe the HDFS metadata):

hdfs namenode -format

2.5 Starting the Hadoop Services

start-dfs.sh
start-yarn.sh
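
If everything started correctly, jps (shipped with the JDK) should list the Hadoop daemons, and the web UIs should be reachable on their default Hadoop 3.x ports:

jps   # expect roughly: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, Jps
# HDFS NameNode web UI:        http://localhost:9870
# YARN ResourceManager web UI: http://localhost:8088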

3. Practical Tips

3.1 Using the Hadoop Command-Line Tools

  1. List files in HDFS (a short session combining these commands follows the list):
hdfs dfs -ls /
  2. Upload a file to HDFS:
hdfs dfs -put /local/path/to/file /hdfs/path/to/file
  3. Download a file from HDFS:
hdfs dfs -get /hdfs/path/to/file /local/path/to/file
  4. Delete a file from HDFS:
hdfs dfs -rm /hdfs/path/to/file
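
Putting these together, a typical session might look like the following (the file name notes.txt is just an example):

hdfs dfs -mkdir -p /user/$(whoami)/demo        # create a working directory in HDFS
hdfs dfs -put notes.txt /user/$(whoami)/demo   # upload a local file
hdfs dfs -ls /user/$(whoami)/demo              # confirm it is there
hdfs dfs -cat /user/$(whoami)/demo/notes.txt   # print its contents
hdfs dfs -rm /user/$(whoami)/demo/notes.txt    # clean up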

3.2 Writing a MapReduce Program

  1. Write the MapReduce program (the classic WordCount example, shown here with the imports it needs):
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
  2. Compile the program and package it into a jar (hadoop classpath prints the classpath containing all installed Hadoop jars):
javac -classpath "$(hadoop classpath)" WordCount.java
jar cf WordCount.jar WordCount*.class
  3. Run the job (the second argument is the main class name; input and output are HDFS paths, and output must not exist yet):
hadoop jar WordCount.jar WordCount input output
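
A quick way to exercise the job end to end, assuming HDFS and YARN are running (the sample file and its contents are just an example):

echo "hello hadoop hello world" > sample.txt
hdfs dfs -mkdir -p input
hdfs dfs -put sample.txt input
hadoop jar WordCount.jar WordCount input output
hdfs dfs -cat output/part-r-00000   # one word and its count per line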

3.3 Using Hive and Spark

  1. Install Hive (Hive is not in the base CentOS repositories, so this assumes a vendor repository such as HDP or CDH is configured; alternatively, download the binary release from the Apache Hive website and unpack it):
sudo yum install hive
  2. Configure Hive:
cd /etc/hive
vi hive-site.xml

Add the following. The connection URL assumes a local MySQL database named hive; the MySQL JDBC driver, user name, and password also need to be configured. Note that recent Hive versions ignore hive.metastore.local and use hive.metastore.uris to decide between a local and a remote metastore:

<configuration>
    <property>
        <name>hive.metastore.local</name>
        <value>true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive</value>
    </property>
</configuration>
  3. Install Spark (as with Hive, this assumes a repository that provides Spark packages; otherwise download a pre-built release from the Apache Spark website):
sudo yum install spark
  4. Configure Spark (the configuration directory depends on how Spark was installed; for a tarball installation the file is $SPARK_HOME/conf/spark-defaults.conf):
cd /etc/spark2
vi spark-defaults.conf

Add the following:

spark.master yarn
spark.executor.memory 1g
  5. Submit a Spark application (the class com.example.WordCount and WordCount.jar are placeholders for your own application):
spark-submit --class com.example.WordCount --master yarn WordCount.jar input output
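
To verify that Spark can talk to YARN without writing any code, the SparkPi example that ships with Spark can be submitted directly (the exact path and name of the examples jar depend on the Spark version and installation directory):

spark-submit --class org.apache.spark.examples.SparkPi --master yarn $SPARK_HOME/examples/jars/spark-examples_*.jar 100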

4. Summary

With this article you should be able to set up a working Hadoop environment on a single CentOS machine. In real projects, adjust the configuration and tune performance to match your workload. I hope it helps!