Apache Spark

Hirsun (HAN Jiaming)

Offers over 80 high-level operators.
Language bindings: Scala, Java, SQL, Python (PySpark), R (SparkR).
RDD: in-memory caching, the basis of the claim "Up to 100x faster than MapReduce".
Deployment: standalone, YARN, Mesos, Kubernetes (containers).
External storage systems: HDFS, HBase, Amazon S3, Azure Storage, Azure Data Lake, Google Cloud Storage, Cassandra, Alluxio, ...

Built-in Libraries

Spark SQL: processing structured data with relational queries (the DataFrame API, newer than the RDD API)
Spark Structured Streaming: processing structured data streams with relational queries
MLlib: Spark's machine learning (ML) library
- (new) DataFrame-based API in the spark.ml package
GraphX: distributed graph processing
- PageRank, recommendation systems, financial fraud detection, geographic information systems, ...

Architecture

One Master node + multiple Worker nodes
Equivalent to Hadoop's Master and Slave nodes.

CleanShot 2024-05-07 at 23.28.23@2x.png

Key Elements of a Spark Cluster

Spark Driver: the process that runs the main method of your Spark application
Cluster Manager: manages the resources of a cluster
- Supports YARN, Kubernetes (K8S), Mesos, or Spark Standalone
Workers: any node in the cluster that can run application code.
Executors: JVM processes on worker nodes in charge of running individual tasks in a given Spark job.

Runs on Kubernetes

Each Spark app is fully isolated from the others and packages its own version of Spark and dependencies within a Docker image.

1715096544157.png

Runs on Yarn

1715096652353.png

Spark Executors Run on Yarn

1715097532139.png

Run Scheduled Tasks

Schedule Tasks to run on Executors

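Executors are launched once and are reused by multiple tasks, including all subsequent tasks.
The total number of tasks depends on the number of RDD partitions.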
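
The relationship between partitions, tasks, and executors can be sketched in plain Python, without Spark itself. This is only a simplified model: it assumes one task per partition and one core per task, and the helper names `split_into_partitions` and `plan_stage` are made up for illustration, not part of any Spark API.

```python
import math


def split_into_partitions(records, num_partitions):
    """Split a list into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        partitions.append(records[start:start + size])
        start += size
    return partitions


def plan_stage(num_partitions, num_executors, cores_per_executor):
    """Model one stage: one task per partition, run in waves on long-lived executors."""
    slots = num_executors * cores_per_executor  # tasks that can run at the same time
    return {
        "tasks": num_partitions,
        "slots": slots,
        "waves": math.ceil(num_partitions / slots),
    }


# Hypothetical sizes, chosen only for illustration.
partitions = split_into_partitions(list(range(10)), 4)
plan = plan_stage(len(partitions), num_executors=2, cores_per_executor=1)
print(plan)
```

With 4 partitions and 2 single-core executors, the same two executors run the 4 tasks in two waves; no executor is restarted between waves.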

1715101694531.png

Spark Driver

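The most important step in any Spark driver application is creating the SparkContext.

The Spark driver program uses the SparkContext to connect to the cluster through a resource manager (e.g. Yarn).
The SparkContext stores configuration parameters:
- For example, the application name, the cluster's master URL, resource requests (number of executors, executor memory/cores), ...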

1715101780851.png

SparkContext: PySpark Example

1715101875252.png
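
As a text companion to the screenshot, here is a minimal sketch of how the configuration parameters held by a SparkContext might be assembled. The property keys are standard Spark property names; the values (`demo-app`, 4 executors, `2g`, 2 cores) and the helper names `build_conf_pairs` and `create_context` are assumptions for illustration. `create_context` needs `pyspark` installed and a reachable cluster, so it is defined but not called here.

```python
def build_conf_pairs(app_name, master, num_executors, executor_memory, executor_cores):
    """Return (key, value) pairs using standard Spark property names."""
    return [
        ("spark.app.name", app_name),
        ("spark.master", master),
        ("spark.executor.instances", str(num_executors)),
        ("spark.executor.memory", executor_memory),
        ("spark.executor.cores", str(executor_cores)),
    ]


def create_context(pairs):
    """Create a SparkContext from the pairs (requires pyspark and a cluster)."""
    # Imported lazily so build_conf_pairs works on machines without pyspark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAll(pairs)
    return SparkContext(conf=conf)


# Hypothetical values, for illustration only.
pairs = build_conf_pairs("demo-app", "yarn", 4, "2g", 2)
for key, value in pairs:
    print(f"{key}={value}")

# On a machine with pyspark and a YARN cluster, one could then run:
# sc = create_context(pairs)
```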

Cluster Managers

Spark Supported Cluster Managers

Spark Standalone Mode
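- Uses the cluster manager that ships with Spark.
YARN - the resource manager since Hadoop 2.X.
- Richer scheduling capabilities: FIFO, Capacity, and Fair schedulers.
Kubernetes (Spark 2.3 and later)
- K8S creates executor pods to run the Spark application, one pod per executor!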
Mesos - Deprecated as of Apache Spark 3.2.0
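
Each cluster manager is selected through the master URL given to Spark. The sketch below maps the managers listed above to the URL formats Spark accepts. The function name `master_url` and the host names are hypothetical; 7077 and 5050 are the usual default ports for a standalone master and a Mesos master, and the Kubernetes API server port is always given explicitly.

```python
def master_url(manager, host=None, port=None):
    """Build a Spark master URL for the given cluster manager."""
    if manager == "yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        return "yarn"
    if host is None:
        raise ValueError(f"{manager} needs a host")
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if manager == "kubernetes":
        if port is None:
            raise ValueError("kubernetes needs an explicit API server port")
        return f"k8s://https://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {manager}")


# Hypothetical hosts, for illustration only.
print(master_url("yarn"))
print(master_url("standalone", "master.example.com"))
print(master_url("kubernetes", "k8s.example.com", 6443))
```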

How to run

1715102039480.png

1715102086137.png

Method 1: Spark-Submit

1715143599474.png

1715143627926.png

Use spark-submit to run PySpark Application

1715143707486.png

1715144729961.png

1715144790262.png
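
Since the commands above survive only as screenshots, here is a hedged sketch of how a `spark-submit` command line for a PySpark application could be assembled. The flags used (`--master`, `--deploy-mode`, `--num-executors`, `--executor-memory`, `--executor-cores`) are real `spark-submit` options; the script name `wordcount.py`, its argument, and the helper `build_spark_submit` are hypothetical. Note that `--num-executors` applies to YARN and Kubernetes, not to standalone mode.

```python
import shlex


def build_spark_submit(app_path, master="yarn", deploy_mode="client",
                       num_executors=None, executor_memory=None,
                       executor_cores=None, app_args=()):
    """Return the spark-submit command as a list of arguments."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    if executor_cores is not None:
        cmd += ["--executor-cores", str(executor_cores)]
    cmd.append(app_path)   # the application comes after all spark-submit options
    cmd += list(app_args)  # anything after the application is passed to it
    return cmd


# Hypothetical application and resource sizes.
cmd = build_spark_submit("wordcount.py", deploy_mode="cluster",
                         num_executors=4, executor_memory="2g",
                         executor_cores=2, app_args=["input.txt"])
print(shlex.join(cmd))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting issues.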

Method 2: spark-shell

1715144935943.png

Use PySpark as a Python shell

1715144958293.png

Deploy Modes

Spark Execution with Yarn: Cluster Mode

1715145158487.png

Client Mode

The Spark driver runs on the host from which the job is submitted.

1715145210037.png

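The ApplicationMaster is only responsible for requesting executor containers from YARN. Once the containers have started, the client communicates with them directly to schedule work.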

1715145317387.png

Cluster Mode vs. Client Mode

Client mode: (Interactive)

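Used for debugging, or when you want to see output quickly and interactively.
If the client is outside the cluster, it suffers higher latency.
An ApplicationMaster is still required (it occupies one Yarn container, but the driver code does not run in it).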

Cluster mode: (Non-interactive)

Used for applications in production.
The Spark Driver and Spark Executors are supervised by YARN's automatic failure recovery.
Not supported for spark-shell and the PySpark shell.
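
The rules in this comparison can be condensed into a small helper. It is a sketch, not Spark's own logic: the name `resolve_deploy_mode` is hypothetical, and the only facts it encodes are that the interactive shells run in client mode only and that `spark-submit` defaults to client mode when no deploy mode is given.

```python
INTERACTIVE_SHELLS = {"spark-shell", "pyspark"}


def resolve_deploy_mode(launcher, requested=None):
    """Return the deploy mode to use for a launcher, or raise if it is unsupported."""
    if launcher in INTERACTIVE_SHELLS:
        if requested == "cluster":
            # The driver of an interactive shell must stay on the user's machine.
            raise ValueError(f"{launcher} is interactive; cluster mode is not supported")
        return "client"
    # spark-submit: client unless the caller explicitly asks for cluster.
    return requested or "client"


print(resolve_deploy_mode("pyspark"))
print(resolve_deploy_mode("spark-submit", "cluster"))
```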

View ApplicationMaster & Executor Processes at a Worker Node

The ApplicationMaster's process name in Spark is ExecutorLauncher.

1715145527742.png