Skip to content

Kubernetes Deployment

Two deployment methods are available:

  • Kubernetes (production)
  • Docker (for local testing and configuration validation)

This guide focuses on Kubernetes deployment, as it best illustrates key challenges:

  • Continuous deployment
  • Logs & persistence
  • Security & role management
  • Monitoring and performance

Even if those are aspects are not adressed in this project, the context of kubernetes already expose those issues and which are necessary to go past a poc to production.


Resource Overview

The deployment includes the following resources and integrates with remote storage (S3):

Deployed resources

Namespaces

Resources are organized into the following namespaces:

  • Applicative namespaces: polaris , fluss , flink ,spark
  • Technical namespaces : technical-spark-operator , technical-flink-operator

Operators

The flink-kubernetes-operator is used for Flink deployments. The Helm chart is customized to pull .jar files from an S3 bucket, featuring:

  • A custom operator pod image with the S3 filesystem Hadoop plugin
  • AWS environment variables for S3 authentication

This allows loading JARs for FlinkDeployment or FlinkSessionJob using the URI:

job:
  jarURI: s3://<bucket>/<job_path>.jar
Which can be useful depending the CI used for a project.

Spark

Spark is deployed using a custom Helm chart based on the spark-on-k8s-operator.

No custom configurations were made at the helm chart level : different from flink's operator , to be able to download the .py or .jar job file from a s3 bucket, it is not the operator's controller or webhook image which is modified but the deployment's.

Hence the base spark image used has aws sdk baked into it and authentication is made with classic secrets as secrets.


Polaris

The Polaris Helm chart is customized with:

  • AWS secrets for S3 access
  • JDBC persistence for PostgreSQL integration

A PostgreSQL database is instantiated with default credentials.

OAuth2 is used for authentication, with the following resources created:

  • Catalog
  • Namespace
  • Catalog role
  • Principal role
  • Principal

Polaris resources created
Polaris resources

Client applications use principal credentials to obtain a token for catalog interaction.


Fluss

The Fluss Helm chart is customized with:

  • A custom image including S3 dependencies
  • AWS secrets for KV data persistence (independent of Iceberg)
  • REST Iceberg configuration with principal credentials

Since Raft is not available, ZooKeeper is deployed as the cluster manager, with a PVC for state persistence.


Flink is deployed in session mode to reduce workload on the local cluster.

Session jobs are submitted via:

  • Kubernetes CRDs: FlinkSessionJob and FlinkDeployment
  • SQL Gateway REST API: using a tool to execute SQL scripts

SQL catalogs are mounted to the flink-main-container via a ConfigMap and configured as a file catalog store.

The SQL Gateway is deployed separately and submits requests to the main cluster.

For further reading, see this article

Spark deployment

AWS Infrastructure: Storage Requirements

This project requires S3 remote storage for the following components:

  • Iceberg table data
  • Flink checkpoints
  • Flink JAR files
  • Fluss PK table snapshots and tiered log segments

The infrastructure is provisioned using Terraform.

Image title
Screenshot of Iceberg's parquet files