Base environment, for the record: JDK 8 and Python 3.7.
Setting up Spark on Windows
Install JDK 8 and Python 3 first; that part is not covered here.
Install Hadoop 2.7
Download: http://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Unpack it.
Download the Hadoop winutils: https://github.com/steveloughran/winutils
Copy the downloaded winutils into the bin directory of the unpacked Hadoop directory.
Set JAVA_HOME for Hadoop
Edit etc/hadoop/hadoop-env.cmd under the Hadoop directory and point it at the actual Java installation directory:
set JAVA_HOME=%JAVA_HOME%
change it to your actual JDK path, for example
set JAVA_HOME=E:\study\jdk1.8.0_144
Set the HADOOP environment variables
Done the same way as the JDK environment variables:
create a HADOOP_HOME variable whose value is the root of the unpacked Hadoop directory;
add %HADOOP_HOME%\bin to Path.
Test from cmd that Hadoop is installed correctly
cmd --> run hadoop and hadoop version
C:\Users\明柯>hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /F:/ITInstall/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar
If you see Error: JAVA_HOME is incorrectly set, the usual cause is a JDK installed on the C: drive under a path with spaces (e.g. Program Files); moving it to another drive (or any path without spaces) fixes it.
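Before moving on to Spark, a quick way to confirm the variables are visible is to check them from Python. This is only a minimal sanity-check sketch for this walkthrough, not part of the official setup:

import os

# HADOOP_HOME and JAVA_HOME should both be set at this point.
hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)
print("JAVA_HOME   =", os.environ.get("JAVA_HOME"))

# On Windows, winutils.exe must sit in %HADOOP_HOME%\bin,
# otherwise Spark's Hadoop calls will fail later.
if hadoop_home:
    print("winutils.exe present:",
          os.path.exists(os.path.join(hadoop_home, "bin", "winutils.exe")))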
Install Spark 2.4.x
I am installing version 2.4.8 here.
Download it: https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
Unpack it to a directory of your choice.
Set the SPARK_HOME environment variable and add %SPARK_HOME%\bin to Path.
Test from cmd
cmd --> run pyspark
C:\Users\明柯>pyspark
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
22/02/11 17:21:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018 04:59:51)
SparkSession available as 'spark'.
>>>
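Inside the shell, the SparkSession is already bound to spark and the SparkContext to sc, so a quick sanity check can be run before exiting. These two lines are just an illustration, not part of the official steps:

spark.range(5).show()                    # prints a small DataFrame with ids 0..4
print(sc.parallelize(range(100)).sum())  # should print 4950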
Type quit() to exit.
Test a Spark job
cmd --> spark-submit %SPARK_HOME%/examples/src/main/python/pi.py
The computed result can be seen in the log output:
Pi is roughly 3.142780
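For reference, the bundled pi.py estimates Pi with a Monte Carlo simulation. Below is a simplified sketch of the same idea, not the exact file shipped with Spark:

from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()
n = 100000  # number of random points to sample

def inside(_):
    # Draw a point in the square [-1, 1] x [-1, 1] and count it
    # if it falls inside the unit circle.
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n)).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()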
Setting up Spark on Linux
JDK and Python 3.7 also need to be installed in advance.
Only a single node is demonstrated here; a multi-node setup works the same way. For a Hadoop cluster installation see: https://blog.csdn.net/maoyuanming0806/article/details/81177338
Download it; the URL is the same as above.
You can wget it directly, or download it on Windows and copy it to the Linux machine with a file-transfer tool.
Unpack it
tar -zvxf spark-2.4.8-bin-hadoop2.7.tgz -C /opt/module
cd /opt/module
mv spark-2.4.8-bin-hadoop2.7 spark-2.4.8
Test it
cd /opt/module/spark-2.4.8
bin/spark-submit examples/src/main/python/pi.py
The result can be found in the printed log: Pi is roughly 3.137780
Set the environment variables
vi /etc/profile and add:
#==================spark====================
export SPARK_HOME=/opt/module/spark-2.4.8
export PATH=$PATH:$SPARK_HOME/bin
Save with :wq!, then run source /etc/profile.
Adjust the log level
In the conf directory, copy log4j.properties.template to log4j.properties,
then change the level on the log4j.rootCategory=INFO, console line (to WARN, for example).
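The log level can also be changed per application at runtime instead of editing log4j.properties; a small sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("logLevelDemo").getOrCreate()
# Overrides the log4j default, but only for this application.
spark.sparkContext.setLogLevel("WARN")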
Jupyter Notebook setup
Integrating Jupyter Notebook with pyspark on Linux
Install Jupyter Notebook
pip3 install jupyter
Install findspark
Jupyter needs the findspark package to access Spark: it locates the Spark installation (via SPARK_HOME) and makes pyspark importable from the notebook kernel.
pip3 install findspark
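findspark works by reading SPARK_HOME (or a path you pass in) and adding Spark's Python libraries to sys.path. A minimal sketch; the explicit path is just the directory used earlier in this walkthrough:

import findspark

# With no arguments it uses the SPARK_HOME environment variable;
# a Spark home directory can also be passed explicitly.
findspark.init("/opt/module/spark-2.4.8")

import pyspark  # importable only after findspark.init()
print(pyspark.__version__)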
Start Jupyter
If you don't know where the jupyter command was installed, find it first:
find / -name jupyter
or
cd /usr/local/python3/bin
The jupyter command lives in this directory; if the directory is not on your PATH, start it like this:
./jupyter notebook --allow-root
Open the Jupyter Notebook web page and test it
Create a new notebook
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])
rdd2 = rdd.flatMap(lambda line: line.split(" "))  # split each line into words
rdd3 = rdd2.map(lambda word: (word, 1))           # pair each word with a count of 1
rdd5 = rdd3.reduceByKey(lambda a, b: a + b)       # sum the counts per word
print(rdd5.collect())
sc.stop()
Output:
[('hello', 2), ('spark', 1), ('world', 1)]
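The same word count can also be written with the DataFrame API, which is what most newer pyspark code uses. A sketch; the names here are only illustrative:

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("wordCountDF").getOrCreate()

df = spark.createDataFrame([("hello world",), ("hello spark",)], ["line"])
words = df.select(F.explode(F.split(F.col("line"), " ")).alias("word"))
words.groupBy("word").count().show()

spark.stop()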
Integrating Jupyter Notebook with pyspark on Windows
First install Anaconda; search for the installer, download and install it, there are no special steps. Anaconda installs Jupyter Notebook along with it. Anaconda is an integrated environment that also makes installing other tools and Python packages convenient, so it is recommended.
Install Anaconda
Go to the Anaconda directory,
then into its Scripts directory.
Open a cmd window in the Scripts directory. You must be in this directory, otherwise the packages you install and Jupyter Notebook will not be found; this is a pitfall of the Windows environment.
Install findspark
pip3 install findspark
If the install takes a long time, consider switching to the Aliyun mirror, which is faster.
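For example, pip's -i flag points it at an alternative index; the URL below is the commonly used Aliyun PyPI mirror, so verify it matches your environment before relying on it:

pip3 install findspark -i https://mirrors.aliyun.com/pypi/simple/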
Test: start Jupyter Notebook, then open the web page in a browser.
Create a new Python 3 notebook
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])
rdd2 = rdd.flatMap(lambda line: line.split(" "))
rdd3 = rdd2.map(lambda word: (word, 1))
rdd5 = rdd3.reduceByKey(lambda a, b: a + b)
print(rdd5.collect())
sc.stop()
Output:
[('hello', 2), ('spark', 1), ('world', 1)]