Introduction
Hadoop is an open-source big data framework for distributed storage and distributed computing. This article is a detailed getting-started guide for beginners: it walks through setting up a Hadoop environment on a single CentOS machine and shares a few practical tips.
1. Hadoop Overview
1.1 What Is Hadoop
Hadoop is an open-source distributed computing framework developed by the Apache Software Foundation for processing large-scale datasets. It is built around three core components: the Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Yet Another Resource Negotiator (YARN).
1.2 Core Components
1.2.1 Hadoop Distributed File System (HDFS)
HDFS is a distributed file system for storing very large datasets. It splits data into blocks and stores (and replicates) them across multiple nodes, which improves both reliability and read throughput.
1.2.2 Hadoop MapReduce
MapReduce is a programming model for processing large datasets: the input is split into chunks that are processed in parallel by multiple nodes (the map phase), and the partial results are then merged into a final result (the reduce phase).
1.2.3 YARN (Yet Another Resource Negotiator)
YARN is the cluster resource manager. It allocates resources such as CPU and memory to applications and monitors their execution.
2. Setting Up Hadoop on a Single CentOS Machine
2.1 Hardware and Software Requirements
- Operating system: CentOS 7
- Java development kit: JDK 1.8
- Hadoop version: Hadoop 3.x
2.2 Installation Steps
2.2.1 Install the Java Environment
- Install the OpenJDK development package:
sudo yum install java-1.8.0-openjdk-devel
- Configure the environment variables (run as root; adjust JAVA_HOME to the actual JDK directory under /usr/lib/jvm on your system):
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.x86_" >> /etc/profile
echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /etc/profile
source /etc/profile
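To confirm that the Java environment is in place, you can check the version and the variable (the exact output depends on the JDK build installed):
java -version
echo $JAVA_HOME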
2.2.2 Install Hadoop
- Download the Hadoop release:
wget http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
- Create the target directory and unpack the archive into it:
sudo mkdir -p /opt/hadoop
sudo tar -zxvf hadoop-3.3.4.tar.gz -C /opt/hadoop
- Configure the environment variables (sbin is included so that the start-dfs.sh/start-yarn.sh scripts used later are on the PATH):
echo "export HADOOP_HOME=/opt/hadoop/hadoop-3.3.4" >> /etc/profile
echo "export PATH=\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin:\$PATH" >> /etc/profile
source /etc/profile
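If the variables took effect, the hadoop command should now resolve; a quick check:
hadoop version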
2.3 Configure Hadoop
- Configure hadoop-env.sh:
cd $HADOOP_HOME/etc/hadoop
vi hadoop-env.sh
Add the following (use the same JAVA_HOME path configured earlier):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.x86_
- Configure core-site.xml:
vi core-site.xml
Add the following:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
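Optionally (this is an addition beyond the minimal configuration above), you can point hadoop.tmp.dir at a persistent directory, since the default lives under /tmp and may be cleared on reboot. A sketch, with /opt/hadoop/tmp chosen purely for illustration, to be added inside the same <configuration> element:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/tmp</value>
  </property>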
- Configure hdfs-site.xml (a single node can only hold one copy of each block, so set the replication factor to 1):
vi hdfs-site.xml
Add the following:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
- Configure mapred-site.xml (run MapReduce jobs on YARN):
vi mapred-site.xml
Add the following:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
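On Hadoop 3.x, MapReduce jobs submitted to YARN often fail to locate the MRAppMaster class unless the MapReduce classpath is spelled out; the official single-node guide suggests a property along these lines (added inside the same <configuration> element):
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>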
- Configure yarn-site.xml (note that the property name is yarn.resourcemanager.hostname):
vi yarn-site.xml
Add the following:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
</configuration>
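For the MapReduce shuffle phase to work on YARN, the NodeManager also needs the mapreduce_shuffle auxiliary service, and on Hadoop 3.x an environment whitelist is commonly added as well. A typical addition to the same <configuration> element:
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_MAPRED_HOME,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ</value>
  </property>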
2.4 Format HDFS
hdfs namenode -format
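The start-dfs.sh and start-yarn.sh scripts in the next step launch the daemons over SSH, so the current user generally needs passwordless SSH to localhost. A typical setup, assuming sshd is already running:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost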
2.5 Start the Hadoop Services
start-dfs.sh
start-yarn.sh
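If everything started cleanly, jps should list the HDFS and YARN daemons, and the web UIs should be reachable (the ports below are the Hadoop 3.x defaults):
jps
# Expect roughly: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
# NameNode web UI:        http://localhost:9870
# ResourceManager web UI: http://localhost:8088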
3. Practical Tips
3.1 Using the Hadoop Command-Line Tools
- List files in HDFS:
hdfs dfs -ls /
- Upload a file to HDFS:
hdfs dfs -put /local/path/to/file /hdfs/path/to/file
- Download a file from HDFS:
hdfs dfs -get /hdfs/path/to/file /local/path/to/file
- Delete a file in HDFS:
hdfs dfs -rm /hdfs/path/to/file
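As a small end-to-end example (directory and file names here are just for illustration), you can create your HDFS home directory and stage some input for the WordCount job in the next subsection:
hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -mkdir input
echo "hello hadoop hello hdfs" > words.txt
hdfs dfs -put words.txt input/
hdfs dfs -ls input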
3.2 Writing a MapReduce Program
- Create a MapReduce program (the classic WordCount example; the imports are required for it to compile):
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
- Compile the program and package it into a jar (hadoop classpath prints the jars needed for compilation):
javac -classpath $(hadoop classpath) WordCount.java
jar cf WordCount.jar WordCount*.class
- Run the program (the input directory must already exist in HDFS, and the output directory must not exist yet):
hadoop jar WordCount.jar WordCount input output
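When the job finishes, the results are written as part files under the output directory; a quick way to inspect them (the part file name may vary):
hdfs dfs -cat output/part-r-00000
hdfs dfs -cat output/*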
3.3 Using Hive and Spark
- Install Hive (Hive is not in the stock CentOS repositories, so yum install hive only works with a third-party repository such as Apache Bigtop or a vendor distribution; alternatively, download the binary release from the Apache Hive website):
sudo yum install hive
- Configure Hive (the configuration directory depends on how Hive was installed; for a tarball install it is $HIVE_HOME/conf rather than /etc/hive):
cd /etc/hive
vi hive-site.xml
Add the following:
<configuration>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
  </property>
</configuration>
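Two caveats: recent Hive releases ignore hive.metastore.local (the metastore mode is inferred from whether a remote URI is configured), and a MySQL-backed metastore also needs the MySQL JDBC driver jar on Hive's classpath plus the ConnectionDriverName, ConnectionUserName, and ConnectionPassword properties. Once that is in place, a minimal smoke test from the Hive CLI might look like this (the table name is just an example; on Hive 2.x and later the metastore schema usually has to be initialized first):
schematool -dbType mysql -initSchema
hive -e "CREATE TABLE IF NOT EXISTS demo (id INT, name STRING); SHOW TABLES;"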
- Install Spark (as with Hive, this requires a third-party repository, or you can download the binary release from the Apache Spark website):
sudo yum install spark
- Configure Spark (the configuration directory depends on the installation; the defaults file is conventionally named spark-defaults.conf):
cd /etc/spark2
vi spark-defaults.conf
Add the following:
spark.master yarn
spark.executor.memory 1g
- Run a Spark program on YARN:
spark-submit --class com.example.WordCount --master yarn WordCount.jar input output
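The com.example.WordCount class in the command above is not defined in this article; a minimal sketch of what it could look like, using Spark's Java RDD API (it reads args[0] and writes word counts to args[1]; names and structure are illustrative only):
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
  public static void main(String[] args) {
    // The master is supplied by spark-defaults.conf / spark-submit (--master yarn)
    SparkConf conf = new SparkConf().setAppName("WordCount");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);
      // Split lines into words, pair each word with 1, then sum per word
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile(args[1]);
    }
  }
}
(The class would live in the com.example package and be built into WordCount.jar before running spark-submit.)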
4. Summary
This article walked through setting up a single-node Hadoop environment on CentOS. In real deployments you will want to adjust the configuration and tune performance to match your workload. Hopefully this guide gives you a solid starting point.