Recording: https://advancegroup.larksuite.com/drive/folder/fldusPHvn5UrWkSlZ67Ucbcqjne?from=space_persnoal_filelist

App Foundation

Kubernetes

  • Are all the Clusters created based on the GKE module?
  • kubectl_wrapper - Is it still being used? What is the current status of the kubectl-wrapper? Do we need to use this in the future?
    • no
  • Do we remove the default node pool?
    • yes
      • And then we add a custom node pool?
      • yes
    • Are we using preemptible instances?
      • yes
  • Can we have documentation on the argocd_cluster resource added in each *_gke.tf file? (See the sketch after this list.)
    • the proxy exists to create this resource
    • it works like an L4 load balancer; SSL termination happens at the API server, whose certificate does not match the load balancer IP address
    • an internal load balancer that points to nginx (gke)
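
For reference, a minimal sketch of what the argocd_cluster resource boils down to: a declarative Argo CD cluster secret pointing at the internal L4 load balancer in front of the GKE API server, with tlsClientConfig.serverName pinning TLS verification to a name the API server certificate actually contains (all names, the IP, and the CA below are placeholders, not the real values):

```yaml
# Hedged sketch of the Argo CD cluster secret behind argocd_cluster;
# names and addresses are hypothetical placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: dev-apps-cluster                 # placeholder
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: dev-apps                         # placeholder
  server: https://10.0.0.10              # internal L4 LB in front of the API server
  config: |
    {
      "tlsClientConfig": {
        "serverName": "kubernetes.default.svc",
        "caData": "<base64-encoded CA>"
      }
    }
```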

Cloudflare

  • Custom Rules are all written in code.

Tyk

  • Gnana and Bharath have full visibility on Tyk.

Okta

  • Aleksandr Kanaev (Unlicensed): in the Okta documentation, kindly add all integrations and which platforms are using Okta.
  • Since Okta is SaaS, the infra is hosted on the vendor's platform.
  • The configuration of groups, authentication, and other custom config is done in TF.
    • Aside from adding a new admin user, what else needs to be manually done in Okta?
      • only the mapping of users to groups is manual
    • Based on the current subscription, are we close to using up all the licenses? Are there any issues with the current subscription in terms of scaling?
      • unknown - will check with Ion
      • an add-on feature that we use for IAM - API Access Management - is in trial with 23 days left as of 11/24
  • There are some groups that were created manually by other admins.
  • SAML - what is the current implementation? Are there any reference docs on how Aleks did it?
  • Okta rate limit issue
    • Document (or point us to an existing document) on resolving the iam-manager microservice making too many API calls. SM-4781 - Bank User groups support in IAM library - Done

Monitoring

  • Diagram for Monitoring Stack/Platform
    • Documentation how each component is connected
  • Prometheus -
    • Metrics data retention?
      • 12 hrs stored in the PVC (see the Prometheus/Thanos sketch after this list)
  • Alertmanager
    • Where are the Prometheus rules defined? (See the PrometheusRule sketch after this list.)
  • Thanos -
    • How do we scrape historical metrics?
    • What are the lifecycle policies in place?
  • Grafana -
    • the rules, the dashboards, and the resources he used when creating them
      • charts/grafana-dashboards
      • argocd/environments/common/infra/prometheus-extra/base/rules
  • Loki
    • Promtail - daemonset
    • Promtail replaced fluentbit
  • Is there any security/regex in place for masking PII in Promtail?
    • it is better to have this masking on the application side, because the app will still send the logs to Loki (and if you access the logs directly from k8s, they are all there)
      • although the regex filter can be implemented on the Promtail side before sending to Loki (see the Promtail sketch after this list)
  • Logging retention - what is currently in place? (See the Loki sketch after this list.)
    • GCS bucket
    • PVC is used to cache
    • retention_period: 744h
  • Tempo
    • How is this deployed?
      • Tempo
      • build.gradle dependencies - specify the plugin for the Tempo Java agent
    • How is this scaling?
      • hpa is not yet in use
    • What are the possible issues we may face with the current setup?
      • only the stage environment, which we are not utilizing for staging but for EPFs
    • What were the issues you encountered during setup?
      • the current setup is fine-tuned; for a new env we can copy it
      • take note of the compactor persistence value
        • get alerts for the PVC from the node exporters
    • What lifecycle rules are in place? (See the Tempo compactor sketch after this list.)
      • block_retention: 336h
  • Cloudflare Exporter
    • This is a plugin that gathers metrics from Cloudflare and presents them in Grafana.
  • crds
    • Can you tell us about the custom CRD patches implemented in prometheus-extra?
      • argocd/environments/common/infra/prometheus-extra/base/crd-patch.yaml
    • What's the logic behind the rules and thanos.yaml files in prometheus-extra/overlays?
    • Were there any CRDs installed in the cluster for monitoring?
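
To make the Prometheus/Thanos answers above concrete, a hedged sketch of the prometheus-operator Prometheus spec: 12 h of local retention on the PVC, with the Thanos sidecar shipping blocks to object storage (the thanos.yaml config mentioned under crds) so historical metrics are read from GCS rather than re-scraped. The name, storage size, and secret name are placeholders:

```yaml
# Sketch of a prometheus-operator Prometheus spec: short local retention
# on the PVC, with the Thanos sidecar shipping blocks to object storage
# so historical metrics are queried from GCS instead of being re-scraped.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus                  # hypothetical name
spec:
  retention: 12h                    # the "12 hrs stored in the PVC" noted above
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 50Gi           # placeholder size
  thanos:
    objectStorageConfig:            # secret holding the thanos.yaml GCS config
      name: thanos-objstore         # hypothetical secret name
      key: thanos.yaml
```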
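
The rules under argocd/environments/common/infra/prometheus-extra/base/rules are PrometheusRule resources; an illustrative example of the shape (the alert itself is made up, not one of the real rules):

```yaml
# Illustrative PrometheusRule - the shape of the files kept under
# argocd/environments/common/infra/prometheus-extra/base/rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules               # hypothetical name
spec:
  groups:
    - name: example.rules
      rules:
        - alert: HighPodRestartRate
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
```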
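
If the Promtail-side regex filter mentioned above were implemented, it would be a replace pipeline stage; a hedged sketch (the job name and the card-number-like pattern are illustrative only):

```yaml
# Hedged sketch of a Promtail pipeline stage masking PII before logs are
# shipped to Loki; job name and regex are illustrative.
scrape_configs:
  - job_name: kubernetes-pods       # placeholder job
    pipeline_stages:
      - replace:
          # mask anything that looks like a 16-digit card number
          expression: '(\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b)'
          replace: '****'
```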
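
A hedged sketch of the Loki retention setup described above: chunks in a GCS bucket, the PVC acting as cache, and the 744h (31-day) window enforced by the compactor. The bucket name is a placeholder:

```yaml
# Sketch of the Loki retention config noted above; bucket name is hypothetical.
storage_config:
  gcs:
    bucket_name: safi-loki-chunks   # hypothetical bucket
compactor:
  retention_enabled: true
limits_config:
  retention_period: 744h            # value recorded above
```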
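
And the Tempo compactor retention noted above (336h = 14 days), the value to watch per the setup notes:

```yaml
# Sketch of the Tempo compactor retention recorded above.
compactor:
  compaction:
    block_retention: 336h           # 14 days
```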

Github Action

  • self hosted runners
    • devops/argocd/environments/common/infra/actions-runner-controller
    • devops/argocd/environments/common/infra/actions-runner
    • https://github.com/SafiBank/SaFiMono/settings/actions/runners
    • What issues are we encountering on self-hosted runners?
      • CPU, memory, heap size
      • How does he troubleshoot self-hosted runner performance problems?
    • How are we monitoring the GHA self-hosted runners? (See the runner sketch after this list.)
  • How is the promotion happening from dev to stage? Is it now purely branch-based? EPFs
  • TYK related GHA is documented
  • Go through the list of GH Secrets and identify which of them were manually added (e.g. the TYK GH secret).
  • For the CICD image tag, we store it in the CICD Vault (see the Vault step sketch after this list)
    • What else - what are the other use cases in GHAs where we use or pull secrets from the CICD Vault instead of GH Secrets?
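
A hedged sketch of the RunnerDeployment shape used by actions-runner-controller under devops/argocd/environments/common/infra/actions-runner; the runner name, replica count, and resource limits (the knobs behind the CPU/memory issues noted above) are placeholders:

```yaml
# Sketch of a summerwind/actions-runner-controller RunnerDeployment;
# name, replicas, and resource limits are placeholders.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: safimono-runner             # hypothetical name
spec:
  replicas: 3                       # placeholder
  template:
    spec:
      repository: SafiBank/SaFiMono
      resources:
        limits:
          cpu: "2"                  # placeholder sizing
          memory: 4Gi
```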
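
And a hedged sketch of a workflow step pulling the CICD image tag from the CICD Vault instead of GH Secrets; the Vault address, secret path, and key names are hypothetical:

```yaml
# Sketch of a GHA step reading a value from the CICD Vault;
# URL, path, and key are placeholders, not the real locations.
- name: Read image tag from CICD Vault
  uses: hashicorp/vault-action@v2
  with:
    url: https://vault-cicd.example.com   # hypothetical address
    token: ${{ secrets.VAULT_TOKEN }}     # hypothetical GH secret
    secrets: |
      secret/data/cicd/images tag | IMAGE_TAG
```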

Argo CD

CICD-Vault

  • Why is the CICD Vault public?
  • Root token - stored in the Vault cicd secrets
  • Data is stored in a PVC, with no backup
  • What are the plans for the CICD Vault? Are we going to keep using it, or do we plan to use HCV?
  • The list of manually added secrets (aside from the ones under the /manual directory)
    • We have to go through the list of all secrets in all paths and identify the manually added values.
  • Okta roles mapping to Vault policies.

Sonarqube

  • .github/workflows/_app-sonarqube-lib.yml (see the sketch after this list)
  • This one is deployed as a Helm chart in the CICD cluster, and it's the free open-source version.
    • No manual config?
      • only generated a token for the user and added it to GitHub secrets
      • the API token is manually created
  • No quality gates are set up yet; the deployed version has the default policies.
  • Authentication is not through Okta yet; LDAP is the only supported method.
  • SQ is already integrated in all microservices (common lib)
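
For reference, a hedged sketch of the kind of scan step _app-sonarqube-lib.yml would run; the action version, host URL, and secret names are assumptions, not the actual workflow contents:

```yaml
# Sketch of a SonarQube scan step; URL and secret names are placeholders.
- name: SonarQube scan
  uses: sonarsource/sonarqube-scan-action@v2
  env:
    SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}        # manually generated API token
    SONAR_HOST_URL: https://sonarqube.example.com  # hypothetical URL
```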

Istio

  • How is Istio deployed? In the apps clusters only (brave, dev, stage). devops/argocd/environments/common/infra/base/istiod.yaml
    • istio-base.yaml for the Istio CRDs
    • istiod is the controller
  • Are there any existing issues in Istio?
    • none, as we are not completely utilizing Istio yet
  • How is mTLS enabled?
    • enabled by default (microservice to microservice); see the PeerAuthentication sketch after this list
    • Are we using the default mTLS for certificates?
    • or is it from Let's Encrypt or ZeroSSL?
      • neither, as we are using the Istio-generated certs
  • Is Istio enabled in all microservice namespaces? It's enabled in the Kotlin chart, added as a pod annotation.
  • Are we going to use Istio for observability? Latency?
  • In the future, maybe we can use the traffic management feature in Istio for canary-based deployments.
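
To make the mTLS answer concrete: sidecars are enabled per pod via the sidecar.istio.io/inject: "true" annotation set by the Kotlin chart, and service-to-service mTLS uses Istio-generated certs. A hedged sketch of the mesh-wide policy that would pin this down (STRICT mode is an assumption; the notes only say mTLS is enabled by default):

```yaml
# Sketch of a mesh-wide mTLS policy; STRICT is an assumption - the notes
# above only say mTLS is on by default with Istio-generated certs.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace makes the policy mesh-wide
spec:
  mtls:
    mode: STRICT
```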

Traefik

  • This is the main ingress controller.
  • Each GKE cluster has its own Traefik. (internal: devops/argocd/environments/common/infra/base/traefik-internal.yaml)
  • .internal is still being used by the Prometheuses of all GKE clusters (for Thanos to pull data)
  • The external Traefik - devops/terraform/tf-cicd/traefik.tf
    • Do we have another external Traefik aside from this? - none
      • cicd, vault, and sonarqube all point to the same load balancer
    • A traffic/network diagram that gives us a picture of the different Traefik ingress controllers deployed, as well as the external load balancers, and how they are connected to each other
  • traefik metrics (see the internal LB sketch after this list)
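
A hedged sketch of how each per-cluster internal Traefik is typically exposed on GKE: a LoadBalancer Service annotated for an internal (VPC-only) load balancer. The name, selector, and ports are placeholders:

```yaml
# Sketch of an internal Traefik Service on GKE; the annotation requests
# an internal load balancer. Name, selector, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: traefik-internal            # hypothetical name
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: traefik
  ports:
    - name: websecure
      port: 443
      targetPort: 8443
```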

Certificates

  • Are we still using Let's Encrypt for certificate creation, or are all HTTPS endpoints now using ZeroSSL?
    • yes - Let's Encrypt is still used for cicd, vault, and sonarqube (HTTP verification)
    • all microservices are on ZeroSSL
    • internal tooling such as the monitoring endpoints is also using ZeroSSL
  • cert-manager is deployed in all GKE clusters https://github.com/SafiBank/SaFiMono/tree/main/devops/argocd/environments/common/infra/cert-manager/templates
  • What manual configuration was done when setting up ZeroSSL? Will there be any manual config to be done in the future? (See the ClusterIssuer sketch after this list.)
    • any secrets or keys? devops/argocd/environments/common/infra/cert-manager/templates/cluster-issuer-zerossl-dns.yaml
    • <secret:secret/data/cicd/zerossl~EABHMACKey>
    • the ZeroSSL account was created using Ion Mudreac's acct
  • Are we using Google Cloud Certificate Manager or just purely CertManager?
    • Using CertManager based on the code
  • Are we using the ssl proxy in Cloudflare? Do we need to use it?
  • Are we terminating all HTTPS traffic at the load balancer (traefik/zerossl)?
    • yes, for services using this LB
    • for microservices, it's Tyk
      • Cloudflare SSL is being used for SSL termination
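
A hedged sketch of the cluster-issuer-zerossl-dns.yaml shape: an ACME ClusterIssuer pointed at ZeroSSL with External Account Binding, where the EAB HMAC key is the manually provisioned secret referenced above. The keyID, secret names, and the Cloudflare DNS-01 solver are assumptions:

```yaml
# Sketch of a ZeroSSL ACME ClusterIssuer with External Account Binding;
# keyID, secret names, and the DNS-01 solver are assumptions.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: zerossl-dns
spec:
  acme:
    server: https://acme.zerossl.com/v2/DV90
    externalAccountBinding:
      keyID: <EAB key ID from the ZeroSSL account>
      keySecretRef:
        name: zerossl-eab                # hypothetical secret name
        key: EABHMACKey
    privateKeySecretRef:
      name: zerossl-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token # hypothetical secret name
              key: token
```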

Genesys

  • Is it going to be deployed in cicd cluster?
    • should be in the environment clusters (brave), but it's not deployed there - only in dev (terraform-dev branch)
  • What are the plans for Genesys?

BOFE

  • Mapping of users to groups is manual - how? Kindly add the link to the documentation.
    • How do we access the dashboard from which we can map users into groups?

Threat mapper
