Introduction to Spark
Spark is a fast and general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, as well as an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Spark on Volcano
Currently, there are two ways to integrate Spark with Volcano:
- Spark on Kubernetes native support: maintained jointly by the Apache Spark community and the Volcano community.
- Spark Operator support: maintained jointly by the GoogleCloudPlatform community and the Volcano community.
Spark on Kubernetes native support (spark-submit)
Starting from Apache Spark v3.3.0 and Volcano v1.5.1, Spark supports Volcano as a customized scheduler; see the link for more details.
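A minimal sketch of such a submission, following the scheduler configuration keys documented for Spark 3.3+; the API server address, container image, and jar path below are placeholders for your environment:

# Submit the SparkPi example, asking Spark to schedule the driver and
# executor pods through Volcano instead of the default scheduler.
$ spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<port> \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    --conf spark.kubernetes.scheduler.name=volcano \
    --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar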
Spark Operator support (spark-operator)
Install spark-operator via Helm.
$ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
$ helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace
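Note that for the batchScheduler field used in the manifest below to take effect, the operator's batch scheduler integration usually has to be enabled at install time. A sketch, assuming your chart version exposes the batchScheduler.enable value (check helm show values spark-operator/spark-operator for your version):

# Enable the operator's batch scheduler integration so that
# SparkApplications with batchScheduler set are handed to Volcano.
$ helm install my-release spark-operator/spark-operator \
    --namespace spark-operator --create-namespace \
    --set batchScheduler.enable=true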
To confirm that spark-operator is running properly, check with the following command.
$ kubectl get po -n spark-operator
Here we use the official spark-pi.yaml example:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.0.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
  sparkVersion: "3.0.0"
  batchScheduler: "volcano"   # Note: the batch scheduler name must be specified as `volcano`
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.0.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.0.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
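If you want to bind the application to a specific Volcano queue, the SparkApplication spec also accepts batch scheduler options. A sketch, assuming the batchSchedulerOptions field of the v1beta2 API and a pre-created Volcano queue named `default`:

spec:
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: "default"   # Volcano queue that the driver/executor pods are submitted to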
Deploy the Spark application and check its status.
$ kubectl apply -f spark-pi.yaml
$ kubectl get SparkApplication
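Since the pods are scheduled through Volcano, you can also inspect the PodGroup that Volcano creates for the application; a sketch, noting that the exact PodGroup name is derived from the application and may differ in your environment:

# List Volcano PodGroups in the application's namespace.
$ kubectl get podgroup -n default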