SaFi Bank Space : Feb 2023 - Monthly Planning

February 2023 - DevOps/SRE Tickets

Epics (with Priority, Target Completion and Comments where known):

  • SAF-1520 - Priority 1

    • Stories for label reviews on GCP resources

    • Understand and create a design for how we can manage costs per project and per team (see the labeling sketch after this list)

    • Design how to label shared resources

  • SAF-1503

  • SAF-167

  • SAF-1678

  • SAF-1040

  • SAF-1687

  • SAF-1688

  • SAF-1699
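
A minimal Terraform sketch of the labeling approach behind SAF-1520, assuming we settle on team / environment / cost-center label keys (the resource, label keys and values below are placeholders, not an agreed convention). Billing export grouped by these labels would then give cost per project and per team.

    # Hypothetical example resource; label keys/values are placeholders
    # until the labeling convention is agreed.
    resource "google_compute_instance" "example" {
      name         = "example-vm"
      machine_type = "e2-small"
      zone         = "asia-southeast1-a"

      boot_disk {
        initialize_params {
          image = "debian-cloud/debian-11"
        }
      }

      network_interface {
        network = "default"
      }

      labels = {
        team        = "devops-sre"
        environment = "brave"
        cost_center = "platform"
      }
    }

Shared resources could carry an explicit shared label plus a cost_center label instead of a single owning team; how their cost is split across teams is the open design question in the epic.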

Draft of Epics and Stories for DevOps/SRE

Categories/Epics

Thought Machine Setup - (NOTE: Brave is not yet using the TM deployed for Brave - it is still pointed at the TM5 Sandbox 5 project. Tangled is the environment with a dedicated TM deployment, running TM 4.5.1.)

  • What are the other issues?

  • Kafka cluster issues

  • The TM / Confluent Cloud Kafka integration issue needs to be followed up with the TM team; currently we are still using the local, on-prem Kafka cluster.

Testing Tangled Environment

Create all the tasks required for testing all functionality within Tangled.

Find a way to test new environments in an automated way; we will most likely have additional short-lived environments (to test upgrades, maintenance, and big features impacting multiple services).

Cost Optimization - continuous effort (monthly review)

Data Team Setup (AlloyDB, Vertex, feature stores, other pending requests; migrate from data-test to brave-data/brave-infra)

AlloyDB for microservices

Testing of AlloyDB on all microservices? Check with Architecture what needs to be verified before migrating. How will we design HA and the Brave environment for AlloyDB?

Consider and understand the design of AlloyDB and the cost implications of choosing it versus CloudSQL. Pending call with Google.

AlloyDB as datastore for Tyk

Already tested, just not implemented yet.

Same as above - consider and understand the design of AlloyDB and the cost implications of choosing it versus CloudSQL. Pending call with Google.
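
A rough Terraform sketch of what an AlloyDB cluster for Brave could look like (names, region, sizing and the VPC path are placeholders; exact provider version and network wiring still need to be confirmed). A PRIMARY instance plus a READ_POOL is one possible answer to the HA question above.

    resource "google_alloydb_cluster" "brave" {
      cluster_id = "brave-alloydb"                                      # placeholder name
      location   = "asia-southeast1"                                    # placeholder region
      network    = "projects/brave-infra/global/networks/shared-vpc"    # placeholder VPC
    }

    resource "google_alloydb_instance" "primary" {
      cluster       = google_alloydb_cluster.brave.name
      instance_id   = "brave-alloydb-primary"
      instance_type = "PRIMARY"

      machine_config {
        cpu_count = 4
      }
    }

    resource "google_alloydb_instance" "read_pool" {
      cluster       = google_alloydb_cluster.brave.name
      instance_id   = "brave-alloydb-read-pool"
      instance_type = "READ_POOL"

      read_pool_config {
        node_count = 2
      }

      machine_config {
        cpu_count = 2
      }
    }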

Documentation, Knowledge Transfer - dedicated days (as a team)

Documentation is required as part of the Definition of Done, specifically for big stories and epics.

High-Level Architecture - further improve the GCP project and folder diagram, add 3rd-party connections, and show which security controls are in place.

Creation of dedicated GKE Cluster for Temporal (brave, tangled)
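
A minimal Terraform sketch for the dedicated Temporal cluster, assuming one cluster per environment (Brave shown; name, region, node sizing and count are placeholders, and private-cluster/network settings are omitted).

    resource "google_container_cluster" "temporal_brave" {
      name     = "temporal-brave"            # placeholder name
      location = "asia-southeast1"           # placeholder region

      # Manage the node pool separately instead of keeping the default one.
      remove_default_node_pool = true
      initial_node_count       = 1
    }

    resource "google_container_node_pool" "temporal_brave_nodes" {
      name       = "temporal-pool"
      cluster    = google_container_cluster.temporal_brave.name
      location   = google_container_cluster.temporal_brave.location
      node_count = 3                         # placeholder sizing

      node_config {
        machine_type = "e2-standard-4"       # placeholder machine type
        labels = {
          workload = "temporal"
        }
      }
    }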

Observability Stack (GKE apps including TM GKE)

  • Monitoring

  • Tracing

  • Alerting

  • Logging

  • TM Observability stack - can we point it to the same environments - brave, tangled?

    • Identify the base observability package and what's included

    • What other metrics are not available in this package?

    • Topic monitoring?

    • Is the database included in monitoring?

    • Identify/find a proper way of granting access to TM Support so they can help us debug problems in TM (probably a dormant Okta account?)

Monitoring of resources not managed by GKE - EPIC

  • Confluent Cloud Kafka -

  • GCP network

    • GCP vpc flow logs

    • Tyk (already in GKE?)

    • Istio (also in GKE?)

    • Traefik (also in GKE?)

  • Cloudflare

  • Meiro

    • Understand what they need to access and what they don't need to access.

    • We need someone from Meiro's team (no longer svynek, as svynek is going to be focused on Genesys) to drive the communication.

    • Find a way to manage/restrict Meiro's access to our GCP project - especially for Prod.

    • Monitor what they are doing and audit what they do in our project; based on this we should implement stricter permissions and/or firewall rules. This is CRITICAL as the Meiro platform contains PII data.

    • We need to figure out how to restrict their access to BQ - possibly via Kafka → BQ rather than direct BQ queries.

    • The Meiro team will be deploying into the Meiro GKE cluster, but we need to monitor what they are deploying as we won't be managing deployments; CI/CD and deployment are handled by them.

  • 3rd-party integration (external providers) monitoring - statuses - Vida, Cloudflare, 3rd-party APIs, Euronet (need to identify what other services outside our control need to be monitored); see the uptime-check sketch after this list.
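
For the external-provider statuses, one option is Cloud Monitoring uptime checks feeding the existing alerting. A hedged Terraform sketch follows; the hostnames, path and project are placeholders, and the real health/status endpoints per provider still need to be identified.

    resource "google_monitoring_uptime_check_config" "third_party_status" {
      for_each = {
        # Placeholder hostnames; replace with the real status/health endpoints.
        vida       = "status.vida.example.com"
        euronet    = "api.euronet.example.com"
        cloudflare = "www.cloudflarestatus.com"
      }

      display_name = "3rd-party-${each.key}"
      timeout      = "10s"
      period       = "300s"

      http_check {
        path    = "/"
        port    = 443
        use_ssl = true
      }

      monitored_resource {
        type = "uptime_url"
        labels = {
          project_id = "brave-infra"         # placeholder project
          host       = each.value
        }
      }
    }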

Identifying Business Requirements for metrics - SLOs

Thought Machine Automation - How we will be doing upgrades or deployments

Hashicorp Vault, Secrets Management

  • We have two types of Vault - the CICD Vault and the TM Vaults (one per cluster)

  • Back up the CICD Vault

  • Implement an environment-specific Vault (this is not in use yet); see the KV sketch after this list

    • Why this and not the CICD Vault? Why multiple Vaults?

    • What security layer does this give us? - integration of HashiCorp Vault with Micronaut

    • If it adds no security, i.e. secrets still end up as base64-encoded Kubernetes Secrets, maybe consider looking at other options (Mozilla SOPS, Bitnami Sealed Secrets, etc.)
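
A small sketch of what an environment-specific KV v2 mount could look like via the Vault Terraform provider (mount path, secret path and keys are placeholders, and this assumes we keep managing Vault configuration in Terraform). Whether this actually improves on plain Kubernetes Secrets depends on how services consume it, which is the Micronaut-integration question above.

    # One KV v2 mount per environment (brave shown); paths are placeholders.
    resource "vault_mount" "brave" {
      path = "brave"
      type = "kv"
      options = {
        version = "2"
      }
    }

    variable "payments_db_password" {
      type      = string
      sensitive = true
    }

    resource "vault_kv_secret_v2" "payments_db" {
      mount = vault_mount.brave.path
      name  = "payments-service/db"          # placeholder secret path
      data_json = jsonencode({
        username = "payments"
        password = var.payments_db_password
      })
    }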

Confluent Kafka

Backup and DR - what is included in our plan/package

Hashicorp Terraform Cloud - optimization, proper use of agents
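
Proper use of agents could itself be captured in code via the tfe provider; a sketch under the assumption that workspaces needing private GCP access run on a self-hosted agent pool (organization and workspace names are placeholders).

    resource "tfe_agent_pool" "gcp_private" {
      name         = "gcp-private-agents"
      organization = "safibank"              # placeholder org name
    }

    resource "tfe_workspace" "brave_infra" {
      name           = "brave-infra"         # placeholder workspace
      organization   = "safibank"
      execution_mode = "agent"
      agent_pool_id  = tfe_agent_pool.gcp_private.id
    }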

Cost Alerting per project

Production Environment Planning

Migrate resources (firebase) from Safi-dev to Brave, Tangled

Migrate resources (data-test) to Brave, Tangled (data)

Incident and Response Management - PagerDuty

Build automation for creating and granting access to devs/testers based on roles. Runbooks for low-level tasks.
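
A starting point for the access automation: role-to-group bindings defined declaratively in Terraform, so granting access becomes a pull request rather than a manual task (project ID, group addresses and roles below are placeholders, not the agreed role model).

    # Placeholder mapping of groups to roles for one environment project.
    locals {
      tangled_access = {
        "group:developers@safibank.example" = "roles/container.developer"
        "group:testers@safibank.example"    = "roles/viewer"
      }
    }

    resource "google_project_iam_member" "tangled" {
      for_each = local.tangled_access

      project = "safi-tangled"               # placeholder project ID
      member  = each.key
      role    = each.value
    }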

Roadmap for DevOps

Feb 2023

Thought Machine Setup - Brave is not yet using the Brave TM (still pointed at TM5 Sandbox 5); for Tangled, address the pending issues.

Testing Tangled Environment

Cost Optimization - continuous effort (monthly review)

Data Team - AlloyDB and migration of data-test resources to Brave (cost-impacting)

Data Team (Radney) - GitHub - move their repo to SafiMono (dedicated directory)

Risk Team (Joon Kiat) - infra and config can be managed and controlled in SafiMono,

but there are some Lambda functions and Python code that need to be managed from the Risk repo, as these are models maintained by their team (they contain proprietary logic).

Due to this restriction on access to the models, the Risk team's Terraform workspace and GitHub Actions code, as well as the code in their repo, will need to be restricted to 2 or 3 SRE team members; see the Terraform Cloud team-access sketch below.
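
One way to enforce that restriction is a dedicated Terraform Cloud team with access to the Risk workspace only; a hedged sketch via the tfe provider (organization, team, usernames and workspace names are placeholders).

    resource "tfe_team" "risk_sre" {
      name         = "risk-sre"
      organization = "safibank"              # placeholder org name
    }

    # Limit membership to the 2-3 SRE members allowed to see the Risk code.
    resource "tfe_team_members" "risk_sre" {
      team_id   = tfe_team.risk_sre.id
      usernames = ["sre-member-1", "sre-member-2"]   # placeholder usernames
    }

    resource "tfe_workspace" "risk" {
      name         = "risk-models"           # placeholder workspace name
      organization = "safibank"
    }

    resource "tfe_team_access" "risk_workspace" {
      team_id      = tfe_team.risk_sre.id
      workspace_id = tfe_workspace.risk.id
      access       = "write"
    }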

Data - Test

Check who is using data-test (Pete, Radney?)

Create a design for how we manage the Risk GCP projects, taking into account the proprietary code they would like to restrict to a few people.

TM Observability Stack

Integration with the current Grafana - Brave

Understand how we can expose the Prometheus and Thanos endpoints from the TMs in the shared_vpc so the main Grafana can scrape them as additional datasources.
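
Once those endpoints are reachable over the shared_vpc, adding them to the main Grafana could be as simple as extra datasources; a sketch using the Grafana Terraform provider (the URLs are placeholders for whatever internal exposure mechanism we end up designing).

    # Placeholder internal URLs; how the TM clusters expose these endpoints
    # (internal LB, gateway, etc.) is still to be designed.
    resource "grafana_data_source" "tm_thanos_brave" {
      type = "prometheus"
      name = "TM Thanos (brave)"
      url  = "http://tm-thanos-query.brave.internal:9090"
    }

    resource "grafana_data_source" "tm_prometheus_tangled" {
      type = "prometheus"
      name = "TM Prometheus (tangled)"
      url  = "http://tm-prometheus.tangled.internal:9090"
    }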

Monitoring of resources not managed by GKE

Cost Alerting per project

Understand what types of alerts we want and who the recipients are (per project, per label); see the budget sketch below.
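
A hedged sketch of what a per-project budget alert could look like with google_billing_budget (the billing account, project number, amount, thresholds and the notification channel are all placeholders; a label-based variant would filter on labels in budget_filter instead).

    resource "google_monitoring_notification_channel" "devops_email" {
      display_name = "DevOps cost alerts"
      type         = "email"
      labels = {
        email_address = "devops@safibank.example"   # placeholder recipient
      }
    }

    resource "google_billing_budget" "brave_monthly" {
      billing_account = "000000-000000-000000"      # placeholder billing account
      display_name    = "brave-monthly-budget"

      budget_filter {
        projects = ["projects/000000000000"]        # placeholder project number
      }

      amount {
        specified_amount {
          currency_code = "USD"
          units         = "1000"                    # placeholder monthly amount
        }
      }

      threshold_rules {
        threshold_percent = 0.8
      }
      threshold_rules {
        threshold_percent = 1.0
      }

      all_updates_rule {
        monitoring_notification_channels = [
          google_monitoring_notification_channel.devops_email.id,
        ]
        disable_default_iam_recipients = true
      }
    }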

For Requests and Incidents

  • Create runbooks and documentation for repetitive tasks as part of the Definition of Done

  • Identify root causes for repeating incidents

  • Conduct at least 2 knowledge-transfer sessions in February (SRE/DevOps team)

March 2023

April 2023

May 2023

June 2023

July 2023 - MVP

August 2023 - Go Live!