
Spark: Integrating PySpark with Jupyter Notebook as a Development Environment

Date: 2023-06-06

A quick write-up for future reference.

Basic environment

JDK 8, Python 3.7

Setting up Spark on Windows

Install JDK 8 and Python 3 first; that is not covered here.

Install Hadoop 2.7

Download: http://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz

Unpack it.

Download the Hadoop winutils: https://github.com/steveloughran/winutils

Put the downloaded winutils into the bin directory of the unpacked Hadoop directory.

Set Hadoop's JAVA_HOME

Edit the etc/hadoop/hadoop-env.cmd file under the Hadoop directory so that JAVA_HOME points at the actual Java installation directory: change

set JAVA_HOME=%JAVA_HOME%

to

set JAVA_HOME=E:\study\jdk1.8.0_144

Set the HADOOP environment variables

The procedure is the same as configuring the JDK environment variables:

Create a HADOOP_HOME variable whose value is the root of the unpacked Hadoop directory;

add %HADOOP_HOME%\bin to Path.

Test in cmd whether Hadoop is installed correctly

In cmd, run hadoop and hadoop version

C:\Users\明柯>hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /F:/ITInstall/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar

If you see Error: JAVA_HOME is incorrectly set, your JDK is usually installed on the C drive (typically under C:\Program Files, whose space hadoop-env.cmd does not handle well); moving it to another drive fixes it.

Install Spark 2.4.x

I install version 2.4.8 here.

Download: https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz

Unpack it to a directory of your choice.

Set the SPARK_HOME environment variable and add %SPARK_HOME%\bin to Path.

Test in cmd

In cmd, run pyspark

C:\Users\明柯>pyspark
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
22/02/11 17:21:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018 04:59:51)
SparkSession available as 'spark'.
>>>

quit() exits the shell.

Test a Spark job

In cmd, run spark-submit %SPARK_HOME%/examples/src/main/python/pi.py

The computed result can be seen in the log:

Pi is roughly 3.142780
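
The bundled pi.py is a Monte Carlo estimate. For reference, here is a minimal sketch in the same spirit; the app name, sample count, and helper function are illustrative, not the exact bundled script:

from operator import add
from random import random

from pyspark.sql import SparkSession

# Minimal Monte Carlo pi estimate, in the same spirit as examples/src/main/python/pi.py
spark = SparkSession.builder.master("local[*]").appName("piSketch").getOrCreate()
sc = spark.sparkContext

n = 100000  # number of random samples; illustrative value

def inside(_):
    # Draw a point in the unit square and test whether it falls inside the quarter circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(n), 2).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()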

Setting up Spark on Linux

JDK and Python 3.7 also need to be installed beforehand.

Only a single-node setup is demonstrated here; multi-node setups work the same way. For Hadoop cluster installation, see: https://blog.csdn.net/maoyuanming0806/article/details/81177338

Download from the same URL as above.

You can wget it, or download it on Windows and copy it to the Linux machine with a transfer tool.

Unpack

tar -zvxf spark-2.4.8-bin-hadoop2.7.tgz -C /opt/module
mv spark-2.4.8-bin-hadoop2.7 spark-2.4.8

Test

cd /opt/module/spark-2.4.8
bin/spark-submit examples/src/main/python/pi.py

The result can be seen in the printed log:

Pi is roughly 3.137780

Set environment variables

vi /etc/profile

Add:

#==================spark====================
export SPARK_HOME=/opt/module/spark-2.4.8
export PATH=$PATH:$SPARK_HOME/bin

Save with :wq!, then:

source /etc/profile

Change the log level

In the conf directory, copy log4j.properties.template to log4j.properties.

Then change the log4j.rootCategory=INFO, console line in it to the level you want, e.g. WARN.
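
Editing log4j.properties changes the default for every run; as the pyspark banner above notes, the level can also be adjusted per session with sc.setLogLevel. A minimal sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("logLevelDemo").getOrCreate()
sc = spark.sparkContext

# Raise the logging threshold for this session only, without touching log4j.properties
sc.setLogLevel("WARN")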

Jupyter Notebook installation

Integrating Jupyter Notebook with pyspark on Linux

Install Jupyter Notebook

pip3 install jupyter

Install findspark

Jupyter needs the findspark package in order to access Spark.

pip3 install findspark
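
findspark locates the Spark installation (via SPARK_HOME by default) and adds pyspark to sys.path, so a plain Python kernel can import it. A minimal sketch; the explicit path argument is optional and shown here only to match the install directory used above:

import findspark

# With SPARK_HOME exported as above, a bare findspark.init() is enough;
# the explicit path is just an illustration matching /opt/module/spark-2.4.8 from this guide.
findspark.init("/opt/module/spark-2.4.8")

import pyspark
print(pyspark.__version__)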

Start Jupyter

If you don't know where the jupyter command was installed, locate it first:

find / -name jupyter

Or:

cd /usr/local/python3/bin

The jupyter command lives in this directory; if the directory is not on PATH, start it like this:

./jupyter notebook --allow-root

Open the Jupyter Notebook web page and test it.

Create a new notebook file

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a local SparkSession and grab its SparkContext
spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
sc = spark.sparkContext

# Classic word count on a tiny in-memory RDD
rdd = sc.parallelize(["hello world", "hello spark"])
rdd2 = rdd.flatMap(lambda line: line.split(" "))
rdd3 = rdd2.map(lambda word: (word, 1))
rdd5 = rdd3.reduceByKey(lambda a, b: a + b)
print(rdd5.collect())

sc.stop()

Output

[('hello', 2), ('spark', 1), ('world', 1)]
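
If the word-count RDD test works, the SparkSession in the notebook can also be used to check the DataFrame API. A minimal sketch; the column names and rows are made up for illustration:

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("dfTest").getOrCreate()

# Build a tiny DataFrame from in-memory rows (illustrative data only)
df = spark.createDataFrame([("hello", 2), ("spark", 1), ("world", 1)], ["word", "count"])
df.filter(df["count"] > 1).show()

spark.stop()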

Integrating Jupyter Notebook with pyspark on Windows

First install Anaconda: search for the installer yourself and install it, there are no special steps. Anaconda installs Jupyter Notebook for you, and as an integrated environment it also makes installing other tools and Python packages easier, so it is recommended.

Install Anaconda

Go to the Anaconda directory.

Go into the Scripts directory.

Open a cmd prompt in the Scripts directory. You must be in this directory, otherwise Jupyter Notebook won't find the packages you install; this is a pitfall of the Windows environment.

Install findspark

pip3 install findspark

If the install takes a long time, consider switching to the Aliyun mirror, which is faster.

Test: start Jupyter Notebook, then open the web page in a browser.

Create a new Python 3 notebook

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a local SparkSession and grab its SparkContext
spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
sc = spark.sparkContext

# Classic word count on a tiny in-memory RDD
rdd = sc.parallelize(["hello world", "hello spark"])
rdd2 = rdd.flatMap(lambda line: line.split(" "))
rdd3 = rdd2.map(lambda word: (word, 1))
rdd5 = rdd3.reduceByKey(lambda a, b: a + b)
print(rdd5.collect())

sc.stop()

Execution result

[('hello', 2), ('spark', 1), ('world', 1)]
