Skip to main content

Service Account Isolation from Syntasa System Folders

Scope

In: EMR + Kubernetes Spark steps, Athena, EnrichProcessor verification, notebook dependency-download flow, notebook Spark conf (for Spark history writes to system bucket).

Out: GCP / Azure / OnPrem (SA is AWS-only today), authz JAR distribution model, global init script sandboxing, backward compatibility (SA is unreleased — direct cutover).

Why

Service Accounts (SA) attach user AWS credentials to runtimes/workspaces. Today those credentials are injected globally into Spark conf (spark.hadoop.fs.s3a.access.key), so the worker uses the user SA for every S3 access — including Syntasa system folders (syntasa-config/, syntasa-logs/, syn-spark-history/, syn-workspace/, syn-file-uploads/, syn-volumes/).

Result: admins must grant every SA access to all Syntasa system folders. The list grows with the platform, it leaks internal implementation into the SA policy, and it makes the SA model leaky (if an admin narrows an SA to "only their data," the platform breaks).

SA is unreleased — fix the model now before customer policies congeal around the wrong shape.

The Idea

Invariant: The user's SA credentials may only exist inside syn-data/notebook kernel pods as inert data destined for one sink — the Spark job submission conf for customer-data access. They never become an active credential for any in-pod operation, and the cluster never uses them for Syntasa-owned storage.

Enforced at three independent layers (defense in depth):

LayerWhereWhat
1. Type-levelEmrStepJobRequestNew sparkJobCredentials field. getSecurity() always = system creds, never mutated by SA application.
2. Call-siteEmrCopyUtils, AthenaConnectionUtilsCode touching system folders derives system creds from the infra row, ignores caller-supplied security.
3. Cluster-levelSpark conf (syn-data batch + notebook Spark sessions)Per-bucket override pins the Syntasa bucket to the cluster's own identity. SA conf still injected, but never reaches the system bucket.

After this lands, an SA's IAM policy needs only customer-bucket grants. The Syntasa bucket is owned by the cluster's IAM role (EMR EC2 instance profile or EKS node role), full stop.

Two S3 schemes are in play and both must be covered. Credential conf is injected under both fs.s3a.* (S3A, used for s3a:// URIs) and fs.s3.* (EMRFS / legacy S3, used for s3:// URIs which EMR resolves via EmrFileSystem). The per-bucket override must be set for both schemes:

SchemePer-bucket keyProvider class (EMR and K8s)
S3A (s3a://)spark.hadoop.fs.s3a.bucket.<bucket>.aws.credentials.providercom.amazonaws.auth.InstanceProfileCredentialsProvider
EMRFS (s3://)spark.hadoop.fs.s3.bucket.<bucket>.customAWSCredentialsProvidercom.amazonaws.auth.InstanceProfileCredentialsProvider

One provider class works for both EMR and K8s because both inherit IAM via the EC2 instance metadata service:

  • EMR (EC2 workers): workers query IMDS directly and receive the EMR EC2 instance profile.
  • EKS / K8s (current Syntasa deployment): pods do not use IRSA — syntasa-sa has no eks.amazonaws.com/role-arn annotation (verified empirically on kqa, 2026-05-14). Pods inherit the EKS node IAM role via IMDS (http://169.254.169.254/...). InstanceProfileCredentialsProvider queries IMDS and resolves to the node role — exactly the system identity we want for system-bucket access.

Why this works (and why env vars don't shadow it): InstanceProfileCredentialsProvider queries IMDS directly and ignores the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars we inject for user-SA propagation. The per-bucket override therefore correctly resolves to cluster identity even though user-SA env vars are present on the same pod for other use.

Do NOT use DefaultAWSCredentialsProviderChain — that chain walks EnvironmentVariableCredentialsProvider first, and we inject AWS_ACCESS_KEY_ID env vars for the user SA. The chain would resolve to the user SA before reaching IMDS and silently defeat Layer 3.

Future-state note: if Syntasa migrates K8s to IRSA in a future release, InstanceProfileCredentialsProvider would stop working on K8s pods. At that time, switch to WebIdentityTokenCredentialsProvider for K8s. Not in scope for this work — kqa/kdev today use node-IAM, not IRSA.

The Spec

1. New model field — EmrStepJobRequest.sparkJobCredentials

Module: syntasa-core (com.syntasa.lib.infra.jobs)

Add to the EmrStepJobRequest subclass only (AWS-only scope visible in types):

private SparkJobAwsCredentials sparkJobCredentials; // null when no SA configured

New POJO SparkJobAwsCredentials { accessKey, secretKey, sessionToken, Source source } with Source ∈ { SA_IAM_USER, SA_IAM_ROLE_ASSUMED, SESSION_POLICY_ASSUMED } (telemetry).

In-memory only — no Kafka schema, no DB migration.

2. Credential services stop mutating Security

Module: syntasa-runtime-service

  • AwsServiceAccountService.applyAwsServiceAccountCredentials — resolve SA (existing priority: iamUser keys → iamRole.roleArn AssumeRole). Populate sparkJobCredentials. Remove mutations at lines 45-46, 58-60.
  • SessionPolicyService.applySessionCredentials — same shape. Remove mutations at lines 100-102.

StepJobService.executeJob flow is unchanged structurally.

Unit test (locks the invariant): assert request.getSecurity() is reference-equal and field-equal before vs. after SA application.

3. Step utils: read new field + per-bucket override (both schemes) — syn-data batch path

Module: syntasa-runtime-service

EmrStepJobUtils (syntasa-runtime-service/.../utils/spark/amazon/EmrStepJobUtils.java, lines 199-222):

  • Replace getSecurity() reads with getSparkJobCredentials(). Skip injection when null.
  • Unconditionally add per-bucket overrides for both schemes:
// S3A — covers any s3a:// paths
addSparkConf(args, "spark.hadoop.fs.s3a.bucket." + systemBucket + ".aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider");
// EMRFS — covers s3:// paths (the scheme EmrCopyUtils writes; the scheme EMR uses for system-folder reads at runtime)
addSparkConf(args, "spark.hadoop.fs.s3.bucket." + systemBucket + ".customAWSCredentialsProvider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider");

where systemBucket = amazonStorage.getBucket().

Line 110 (getSecurity() for abort step) stays unchanged — now correctly uses system identity for the EMR control-plane call. (Latent bug fix.)

KubernetesSparkStepJobUtils has two injection blocks — fix both:

  • Block 1 (lines 549-562): SparkConf. Read getSparkJobCredentials(). Emit per-bucket override for both S3A and EMRFS schemes, both pointing to com.amazonaws.auth.InstanceProfileCredentialsProvider (same as EMR — K8s pods inherit the EKS node IAM role via IMDS in the current Syntasa deployment; verified on kqa).
  • Block 2 (lines 869-877): env vars. Read getSparkJobCredentials(). No per-bucket override needed here.

4. EmrCopyUtils derives system creds internally — signature unchanged

Module: syntasa-runtime-service

Keep the existing signaturecopyDependencies(Security security, Storage storage, Config config, SparkCluster sparkCluster). Don't drop the Security parameter, because ClusterService.java:317-321 branches between cloud and OnPrem and passes different things:

if (providerType == ONPREM) {
copyUtils.copyDependencies(runTime.getRuntimeSecurity(), ...); // OnPrem HDFS/Kerberos creds
} else {
copyUtils.copyDependencies(request.getSecurity(), ...); // request Security (cloud)
}

OnPrem still needs its RuntimeSecurity path; cloud implementers should ignore the parameter.

Inside EmrCopyUtils.copyDependencies (and the Dataproc/Azure implementers when their specs follow): ignore the passed security parameter and build a fresh system AmazonSecurity from the infra row using the existing fallback chain (useInstanceProfile=true → IRSA / instance profile; else infra static keys — same as PR #2426). Document with a comment that the parameter is intentionally unused for cloud system-folder I/O.

Precondition (implementation note): at method entry, throw IllegalStateException if the passed AmazonSecurity has a non-empty sessionToken. A session token is a clear signal the caller passed STS- or SA-derived credentials (which are never valid for system-folder I/O). Today's call sites in ClusterService pass freshly-deserialized infra security with no session token, so the precondition does not fire on today's paths.

OnPremCopyUtils is unchanged — it continues to use the passed RuntimeSecurity to access on-prem HDFS.

5. Athena writes to system bucket use system creds

Module: syntasa-athena-executor

In AthenaConnectionUtils.init (lines 63-88): after determining S3OutputLocation, check if its bucket equals amazonStorage.getBucket() (the system bucket). If yes, force the driver's credentials to com.amazonaws.auth.InstanceProfileCredentialsProvider (same provider class as §3 — pods/workers inherit the cluster identity via IMDS in both EMR and K8s). If no (customer bucket), leave as today.

6. EnrichProcessor — verify Action serialization carries system creds

Module: syntasa-app-library

EnrichProcessor.scala:290-313 reads SystemUtils.infra.amazonCloudProvider.getSecurity(). SystemUtils.infra is freshly deserialized from the Action (SystemUtils.scala:167) — not the same object as StepJobRequest.getSecurity().

Once §1 + §2 hold (request.getSecurity() = system creds), the Action carries system creds → EnrichProcessor automatically inherits the right identity.

Implementation task: trace once where the Action is serialized and confirm it reads from request.getSecurity() (now system), not from the new sparkJobCredentials. Document the trace in the PR.

7. Notebook flow: dep downloads + Spark conf builder

Module: syntasa-notebooks

The notebook flow has two distinct system-folder access paths that need fixing — and a small set of paths that are already correct. Verified end-to-end by reading bootstrap-kernel.sh, dependency_utils.py, jupyterhub_config.py, the AWS kernel pod template, aws_credential_config.py, and the three resource managers.

Broken paths (fix needed)

(7a) Three places download from syntasa-config/deps/ via aws s3 cp with user-SA env vars active:

  1. Driver kernel pod, runtime-init script — sourced by Python/Scala observer at Spark-session creation time. Bootstrap line 222 (fetch_aws_credentials) has already exported AWS_ACCESS_KEY_ID=<user SA>.
  2. Executor pod, global-init script — sourced at executor bootstrap (line 381). Pod inherits spark.executorEnv.AWS_* = user SA from the driver's Spark conf.
  3. Executor pod, runtime-init script — sourced at executor bootstrap (line 382). Same env.

(7b) Notebook Spark sessions write event logs to syn-spark-history/ using user SA:

aws_credential_config.py:configure_aws_credentials (lines 6-51) is the notebook-side equivalent of EmrStepJobUtils/KubernetesSparkStepJobUtils. It's called by kubernetes_resource_manager.py:172, yarn_resource_manager.py:74, and ray_resource_manager.py:52 — and it injects user-SA credentials into the Spark session globally. The same managers later set spark.eventLog.dir = s3://<bucket>/syn-spark-history/ (e.g. kubernetes_resource_manager.py:277). Result: Spark event log writes go to the system bucket using user SA. §3's per-bucket override does not apply here because §3 only touches the syn-data batch step utils — the notebook Spark session is built by separate code in the notebook repo.

Already-correct paths (no change needed)

  • Driver kernel-pod global-init script (bootstrap line 215) runs before fetch_aws_credentials — no AWS_* env vars yet → falls through default chain → IMDS → node role.
  • Authz JAR download (line 220) uses S3FileSystem boto3 with Configuration.SYN_S3_* (sourced from KERNEL_S3_* env vars). Per jupyterhub_config.py:213-219, KERNEL_S3_* is propagated from the notebook-service pod's own AWS env — that pod runs as syntasa-sa with system-tier identity (or empty, falling through to IMDS). Never user-SA.
  • syn-workspace s3fs mount (line 245) uses -o iam_role=auto — s3fs queries IMDS directly and ignores AWS_* env vars. Even though fetch_aws_credentials runs before the mount and sets user-SA env, s3fs uses IMDS → node role.
  • filesystem.sh workspace sync — same source as authz JAR (system tier).
  • Standalone spark-history-server pod (history-server.sh) — uses S3A with default credential chain (no AWS_* in pod) → IMDS → node role. Reads syn-spark-history/ under system identity.
  • HTTP-only Flask CLI commands (download-global-init-script, notebook runtimes script) — no S3.

Fix A — dep downloads (_cloud_storage_download_cmd)

Modify _cloud_storage_download_cmd in dependency_utils.py:361-383 (only ever called for system-bucket paths {configFolder}/deps/...). Wrap the AWS branch in an env-stripping subshell:

return (
f'( unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; '
f'aws s3 cp "s3://{bucket}/{src_path}" "{dest_path}"{endpoint_opt} )'
)

The unset applies only to the subshell — parent shell's user-SA env preserved for subsequent customer-bucket downloads. Inside, aws s3 cp falls through default credential chain → IMDS → node role. One change covers all three broken cases in (7a).

Fix B — notebook Spark conf (configure_aws_credentials)

Modify aws_credential_config.py to accept a new system_bucket parameter; when supplied, emit the same per-bucket overrides as §3 (S3A + EMRFS, both InstanceProfileCredentialsProvider):

if system_bucket:
spark_conf = spark_conf.set(
f"spark.hadoop.fs.s3a.bucket.{system_bucket}.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider")
spark_conf = spark_conf.set(
f"spark.hadoop.fs.s3.bucket.{system_bucket}.customAWSCredentialsProvider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider")

Update all three callers (kubernetes_resource_manager.py, yarn_resource_manager.py, ray_resource_manager.py) to pass system_bucket=os.environ.get("KERNEL_SYN_BUCKET"). After this, notebook Spark sessions write spark.eventLog.dir = s3://<bucket>/syn-spark-history/ using the cluster identity, not user SA.

Why not a Flask CLI proxy (originally considered)

Executor pods don't have KERNEL_REFRESH_TOKEN in their env, so they can't auth to such an endpoint. The unset-subshell + Spark-conf approach is simpler, covers driver and executor uniformly, and adds no new server-side endpoint.

Known limitation (not in scope)

_cloud_url_download_cmd (lines 386-406) handles spark.jars URLs the user/admin configures directly. If those URLs point at the system bucket (e.g., s3://<system-bucket>/syntasa-config/...), the bootstrap aws s3 cp runs with whatever AWS env is active — user SA on driver post-fetch and on executor. The legitimate path for Syntasa-supplied jars is syntasa.jar.dependencies.names. Pointing spark.jars directly at the system bucket is going around the supported config; file a follow-up issue if it becomes a real concern.

Implementation Order

Each step is independently shippable and bisectable.

  1. §1 + invariant test — add field; no behavior change yet.
  2. §2 — credential services populate the field; remove mutations. Invariant test now passes.
  3. §3 — syn-data step utils read new field, emit per-bucket override (S3A + EMRFS). Batch job flow is fully fixed at this point.
  4. §4EmrCopyUtils derives system creds internally (signature kept; OnPrem unaffected).
  5. §5 — Athena.
  6. §6 — EnrichProcessor trace + PR docs.
  7. §7 Fix A_cloud_storage_download_cmd unset-subshell wrap (notebook dep downloads).
  8. §7 Fix Bconfigure_aws_credentials per-bucket override (notebook Spark eventLog writes).
  9. Ship reference IAM policy doc (minimal post-fix SA policy).

Pre-flight Check

Run on the target cluster before merging — verifies the IMDS-based pod IAM model the design assumes:

kubectl --context=<cluster> exec -n syntasa <any-pod> -- sh -c \
'wget -q -T 3 -O- http://169.254.169.254/latest/meta-data/iam/security-credentials/'

Expected output: the EKS node role name (e.g. EKS_EC2_DefaultRole). If this returns empty or times out, the cluster's pod IAM model differs from kqa/kdev — STOP and re-evaluate the provider class before deploying.

Testing

  • Unit: §1 invariant — getSecurity() unchanged through SA application
  • Unit: sparkJobCredentials populated with correct Source for all priority paths
  • Unit: EmrStepJobUtils emits per-bucket override unconditionally for both S3A and EMRFS; SA conf only when field non-null
  • Unit: KubernetesSparkStepJobUtils — both blocks, both schemes
  • Unit: EmrCopyUtils rejects Security with session token (precondition); resolves system creds correctly otherwise
  • Unit: AthenaConnectionUtils switches provider when output is on system bucket
  • Unit: _cloud_storage_download_cmd for AWS provider returns a subshell-wrapped command with unset AWS_* and aws s3 cp inside; GCP/Azure branches unchanged
  • Unit: configure_aws_credentials emits per-bucket override (S3A + EMRFS) for system bucket when system_bucket is supplied; preserves user-SA global conf for customer-bucket access
  • Integration: EMR step + SA → assert Spark args contain SA access.key (S3A and EMRFS) AND both per-bucket overrides (S3A and EMRFS)
  • Integration: K8s Spark step + SA → assert same on SparkConf builder, with InstanceProfileCredentialsProvider as the provider class
  • E2E (kqa): SA with minimal IAM policy (customer bucket only, zero syntasa-bucket grants) — submit Spark job, run notebook with runtime+deps + Spark session (Spark history writes succeed), run Enrich step with Athena view. All succeed.
  • Regression: existing tests pass with updated mock signatures.

Open Questions for Implementation

  1. System-bucket naming convention. The system bucket is configured per-infrastructure (not globally), so a single startup check doesn't fit. Trust the infra row at emission time.
  2. Action serialization trace (§6) — confirm at impl time; document in PR.

Rollback

Each step is independently revertible. No on-disk format changes. Revert §3 → restores prior Spark-conf behavior; revert §7 → restores direct aws s3 cp and previous notebook Spark conf.

Behavior Changes (Beneficial, Not Regressions)

After Phase A, request.getSecurity() returns infra credentials. Four control-plane sites read this value for EMR / EKS API calls:

FileLineOperation
EmrStepJobUtils.java110EMR abort step
EmrStepJobUtils.java329EMR cancel-steps API
KubernetesSparkStepJobUtils.java163EKS getKubernetesClusterDetails (executeJob)
KubernetesSparkStepJobUtils.java382EKS getKubernetesClusterDetails (abortJob)

Today these read SA-mutated security (user SA). After this fix they read infra security. This is structurally correct — control-plane operations should run as system identity. System identity always has those permissions; a narrowly-scoped SA may not. The plan does not modify these four lines.

Audit Appendix — Verified Clean, No Changes

ModuleWhy clean
syntasa-app-library (all 34+ processors incl. CustomSparkProcessor)No credential code; Layer-3 covers all Spark-conf-driven access.
EnrichProcessorFresh-deserialized SystemUtils.infra; inherits §1 fix.
ReportProcessor:226Reads isUseInstanceProfile boolean only.
syntasa-upload-servicePer-request connection creds; never on SA path; no system-folder writes.
syntasa-logging-serviceDefaultAWSCredentialsProviderChain only — server-side daemon, always cluster identity.
CloudStorageLogAggregatorSinkCluster identity.
syn-data-utils Spark Operator, Interactive Query, Log Agent, base imagesNo distinct credential flow.
syntasa-history-server standalone pod (history-server.sh)S3A with default credential chain — no AWS_* in pod env → IMDS → node role. Reads syn-spark-history/ as system identity.
syn-workspace s3fs mount (bootstrap-kernel.sh:245)-o iam_role=auto queries IMDS directly; s3fs ignores AWS_* env vars. System identity even after fetch_aws_credentials sets user-SA env.
S3FileSystem (s3_fs.py)boto3 client built from Configuration.SYN_S3_* which sources from KERNEL_S3_*, which are propagated from notebook-service pod's own AWS env (system tier per deployment convention, or empty → IMDS).
filesystem.sh workspace syncUses KERNEL_S3_* — same source as S3FileSystem, system tier.

Audit Appendix — Other AmazonSecurity Mutation Sites (Out of Scope)

Eight other sites mutate AmazonSecurity, but with infra-row creds, not user SA — not violating the invariant. Listed so future reviewers don't re-litigate: DataGatewayInfrastructureDeserializerUtils, InfrastructureSecurityUtils, ActionStoreService, ProcessStatusToMetricsConverter, ClusterService (multiple), plus three more found via grep.

Follow-up Specs (Not This Work)

  • GCP / Azure / OnPrem SA isolation (when SA expands to those clouds)
  • K8s migration to IRSA — would require switching the per-bucket provider class to WebIdentityTokenCredentialsProvider
  • Global init script sandboxing (privilege-escalation surface from user-supplied pre-SA scripts)
  • _cloud_url_download_cmd system-bucket detection (if user-supplied spark.jars pointing at the system bucket becomes a real concern)
  • Defensive refactor of the 8 AmazonSecurity mutation sites above