Contents
Environment Setup
Testing
Apache Spark is a unified analytics engine for large-scale data processing.
Spark only replaces Hadoop's MapReduce compute layer; it does not replace HDFS or YARN.
For in-memory workloads Spark can run up to 100x faster than Hadoop MapReduce (the commonly cited figure).
Environment Setup
1. Extract the Spark archive;
2. Configure the Spark environment variables:
vim /etc/profile
export SPARK_HOME=/opt/module/spark
export PYSPARK_PYTHON=/opt/module/anacond3/envs/pyspark/bin/python3.8
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
Edit:
vim ~/.bashrc
export JAVA_HOME=/opt/module/jdk
export PYSPARK_PYTHON=/opt/module/anacond3/envs/pyspark/bin/python3.8
Test:
spark-submit --version
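As an extra sanity check not in the original steps, the bundled SparkPi example can be run in local mode before involving YARN (the jar version below is assumed to match the one used in step 4):
spark-submit --master local[2] --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.1.jar 10
# the output should contain a line like "Pi is roughly 3.14..."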
3. Configure Spark for YARN. YARN is part of Hadoop and only runs once Hadoop has been started, so the Hadoop-related settings need to be added to Spark's configuration:
cp spark-env.sh.template spark-env.sh
HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
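Putting it together, a minimal $SPARK_HOME/conf/spark-env.sh for this setup might look like the sketch below; JAVA_HOME and YARN_CONF_DIR are assumptions based on the paths used earlier, only HADOOP_CONF_DIR comes from the original step:
# $SPARK_HOME/conf/spark-env.sh (assumed additions beyond the original step)
export JAVA_HOME=/opt/module/jdk
export HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop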
4. Test Spark:
With the YARN-related configuration done, submit $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.1.jar in Spark on YARN mode.
The main class to run is org.apache.spark.examples.SparkPi.
The command is:
spark-submit --master yarn --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.1.jar
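Since PYSPARK_PYTHON was configured above, it is also worth submitting one of the bundled Python examples on YARN; this is an additional check, with pi.py at its standard location inside the Spark distribution:
spark-submit --master yarn $SPARK_HOME/examples/src/main/python/pi.py 10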
YARN also needs the following configuration:
yarn-site.xml:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
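These two properties disable the NodeManager's physical and virtual memory checks, which otherwise tend to kill Spark containers on small test machines. After editing yarn-site.xml, restart YARN (and copy the file to every NodeManager host on a multi-node cluster); a sketch assuming the standard Hadoop sbin scripts are on the PATH:
stop-yarn.sh
start-yarn.sh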
Hadoop must be started!
If a safe mode error is reported, run:
hdfs dfsadmin -safemode leave
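A minimal start-and-verify sequence, assuming the standard Hadoop scripts are on the PATH (the safe mode check is an extra step to confirm HDFS has actually left safe mode):
start-dfs.sh
start-yarn.sh
jps                             # NameNode, DataNode, ResourceManager, NodeManager should appear
hdfs dfsadmin -safemode get     # should report "Safe mode is OFF"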