February 2023 - DevOps/SRE Tickets
| Epics | Story | Priority | Target Completion | Comments |
| --- | --- | --- | --- | --- |
| SAF-1520 | | 1 | | |
| SAF-1503 | | | | |
| SAF-167 | SAF-1678 | | | |
| SAF-1040 | | | | |
| SAF-1687 | | | | |
| SAF-1688 | | | | |
| SAF-1699 | | | | |
Draft of Epics and Stories for DevOps/SRE
Categories/Epics
Thought Machine Setup - (NOTE: Brave is not yet using the TM deployed for Brave - it is still pointed at the TM5 Sandbox 5 project. Tangled is the one where we have a dedicated TM deployed in the Tangled env, running TM 4.5.1.)
what are the other outstanding issues?
kafka cluster issues -
the TM / Confluent Cloud Kafka integration issue needs to be followed up with the TM team; we are currently still using the local, on-prem Kafka cluster
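While the TM/Confluent Cloud integration is pending, a quick way to confirm which Kafka cluster a service can actually reach is a plain TCP check against the bootstrap endpoints. A minimal sketch; the broker hostnames below are placeholders, not our real endpoints:

```python
import socket

def broker_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the broker endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, or timed out
        return False

# Hypothetical endpoints - replace with the real bootstrap servers.
BROKERS = [
    ("pkc-xxxxx.europe-west1.gcp.confluent.cloud", 9092),  # Confluent Cloud (assumed)
    ("kafka-0.kafka.svc.cluster.local", 9092),             # local/on-prem cluster (assumed)
]

if __name__ == "__main__":
    for host, port in BROKERS:
        print(f"{host}:{port} reachable={broker_reachable(host, port)}")
```

This only proves network reachability, not authentication; an SASL/TLS check with an actual Kafka client would be the follow-up.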
Testing Tangled Environment
create all the tasks required for testing all functionality within tangled
find a way to test new environments in an automated way; we will most likely have additional short-lived environments (to test upgrades, maintenance, and big features impacting multiple services)
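The automated environment testing above could start as a simple smoke-test runner that executes named health checks and reports the failures. A sketch; the check names and wiring are illustrative assumptions, not existing tooling:

```python
from typing import Callable, Dict, List

def run_smoke_tests(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check and return the list of failed check names.

    A check passes when it returns True; any raised exception counts as a failure.
    """
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures

# Example wiring for a short-lived environment (names are illustrative):
example_checks = {
    "tyk-gateway-health": lambda: True,  # would call the gateway health endpoint
    "kafka-reachable": lambda: True,     # would TCP-check the bootstrap servers
}
```

Running this in CI against a freshly created environment would give a pass/fail gate before handing it over for upgrade or feature testing.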
Cost Optimization - continuous effort (monthly review)
Data Team Setup (AlloyDB, Vertex, feature stores, other pending requests; migrate from data-test to brave-data, brave-infra)
Alloydb for microservices
Testing of AlloyDB on all microservices? Checking with Architecture what needs to be checked before migrating? How will we design HA and the Brave env for AlloyDB?
Consider and understand the design of AlloyDB and the cost implications of choosing it versus Cloud SQL. Pending call with Google.
Alloydb as datastore for Tyk
already tested, just haven't implemented it yet
same as above - consider and understand the design of AlloyDB and the cost implications of choosing it versus Cloud SQL. Pending call with Google.
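Ahead of the Google call, the cost comparison can at least be framed as a small calculation over instance shapes. A sketch; the hourly rates below are placeholders, NOT real GCP pricing - they must be filled in from the pricing calculator or the call:

```python
def monthly_cost(vcpus: int, memory_gb: int, vcpu_rate: float, mem_rate: float,
                 hours: float = 730.0) -> float:
    """Rough compute-only monthly cost: (vCPU + memory) hourly rates * hours.

    Ignores storage, networking, HA replicas, and committed-use discounts.
    """
    return (vcpus * vcpu_rate + memory_gb * mem_rate) * hours

# Placeholder hourly rates - assumptions for illustration only.
ALLOYDB_VCPU, ALLOYDB_MEM = 0.06, 0.01
CLOUDSQL_VCPU, CLOUDSQL_MEM = 0.04, 0.007

if __name__ == "__main__":
    shape = dict(vcpus=8, memory_gb=64)
    print("AlloyDB :", monthly_cost(**shape, vcpu_rate=ALLOYDB_VCPU, mem_rate=ALLOYDB_MEM))
    print("Cloud SQL:", monthly_cost(**shape, vcpu_rate=CLOUDSQL_VCPU, mem_rate=CLOUDSQL_MEM))
```

Even a rough model like this makes the HA question concrete: doubling the shape for a standby roughly doubles the compute line for both options.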
Documentation, Knowledge Transfer - dedicated days (as a team)
Documentation is required as part of the Definition of Done (DoD), specifically for big stories and epics
High Level Architecture - further improve the GCP project and folder diagram; add 3rd-party connections and which security measures are in place
Creation of dedicated GKE Cluster for Temporal (brave, tangled)
Observability Stack (GKE apps including TM GKE)
Monitoring -
Tracing -
Alerting
Logging
TM Observability stack - can we point it to the same environments (brave, tangled)?
identify the base observability package, what's included
what other metrics are not available in this package
topics monitoring?
is the database included in monitoring?
identify/find a proper way of granting access to TM Support so they can help us debug problems in TM (probably a dormant Okta acct?)
Monitoring of resources not managed by GKE - EPIC
Confluent Cloud Kafka -
GCP network
GCP vpc flow logs
Tyk (? - runs in GKE)
Istio (? - also in GKE)
Traefik (? - also in GKE)
Cloudflare
Meiro
Understand what they need to access and what they don't need to access.
We need someone from Meiro's team (no longer svynek, as svynek is going to be focused on Genesys) to drive the communication.
Find a way to manage/restrict Meiro's access to our GCP projects - especially for Prod.
Monitor what they are doing and audit what they do in our projects; based on this we should implement stricter permissions and/or firewall rules. This is CRITICAL as the Meiro platform contains PII data.
We need to figure out how to restrict their access to BQ - possibly via Kafka → BQ rather than direct BQ queries.
The Meiro team will be deploying into the Meiro GKE cluster, but we need to monitor what they're deploying as we won't be managing deployments. CI/CD/deployment is handled by them.
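Auditing what Meiro does in our projects could start from Cloud Audit Log entries filtered by principal. A sketch over exported log lines (one JSON object per line); the Meiro service-account email is a hypothetical placeholder that must be confirmed:

```python
import json

def actions_by_principals(log_lines, principals):
    """Extract (principal, method, resource) from Cloud Audit Log entries
    for the given principal emails.

    Relies on the standard protoPayload.authenticationInfo.principalEmail,
    methodName, and resourceName fields of AuditLog entries.
    """
    hits = []
    for line in log_lines:
        entry = json.loads(line)
        payload = entry.get("protoPayload", {})
        who = payload.get("authenticationInfo", {}).get("principalEmail", "")
        if who in principals:
            hits.append((who, payload.get("methodName"), payload.get("resourceName")))
    return hits

# Hypothetical Meiro service account - the real principal(s) must be confirmed.
MEIRO = {"meiro-deployer@example-project.iam.gserviceaccount.com"}
```

The resulting method/resource pairs are exactly what would feed the decision on stricter IAM permissions and firewall rules.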
3rd party integration (external providers) monitoring - statuses - vida, cloudflare, 3rd-party APIs, euronet (need to identify which other services that are not in our control need to be monitored)
Identifying Business Requirements for metrics - SLOs
Thought Machine Automation - How we will be doing upgrades or deployments
Hashicorp Vault, Secrets Management
we have two types of vault - CICD and TM (one vault per cluster)
back up the CICD vault
implement an env-specific vault (this is not in use yet)
why this and not the CICD vault? why multiple vaults?
what security layer does this provide us - integration of Hashicorp Vault with Micronaut
if there is no added security - i.e. secrets are still just base64-encoded k8s Secrets - maybe consider looking at other options (Mozilla SOPS, Bitnami Sealed Secrets, etc.)
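The concern above is that a Kubernetes Secret's `data` field is only base64-encoded, not encrypted - anyone who can read the manifest can recover the plaintext. A minimal demonstration:

```python
import base64

# The "data" field of a k8s Secret manifest is base64-encoded, not encrypted:
# no key material is required to recover the value.
secret_data = {"db-password": base64.b64encode(b"s3cr3t").decode()}

plaintext = base64.b64decode(secret_data["db-password"])
print(plaintext)  # b's3cr3t'
```

That is the bar that Vault integration (or SOPS/Sealed Secrets) needs to clearly exceed to justify the extra operational complexity.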
Confluent Kafka
Backup and DR - what is included in our plan/package
Hashicorp Terraform Cloud - optimization, proper use of agents
Cost Alerting per project
Production Environment Planning
Migrate resources (firebase) from Safi-dev to Brave, Tangled
Migrate resources (data-test) to Brave, Tangled (data)
Incident and Response Management - Pagerduty
Build automation for creating and granting access to devs/testers based on roles. Runbooks for low-level tasks.
Roadmap for DevOps
Feb 2023
Thought Machine Setup - Brave is not yet using the Brave TM (still pointed to TM5 Sandbox 5); Tangled - address pending issues
Testing Tangled Environment
Cost Optimization - continuous effort (monthly review)
Data Team - AlloyDB and migration of Data Test resources to Brave (cost impacting)
Data Team Radney - Github - move their repo to SafiMono (dedicated directory)
Risk Team (Joon Kiat) - infra and config can be managed and controlled in SafiMono,
but there are some lambda functions and Python code that need to be managed from the Risk repo, as these are models maintained by their teams (containing proprietary logic).
Due to this restriction on access to the models, the Risk team's Terraform workspace and GitHub Actions code, as well as the code in their repo, will need to be restricted to 2 or 3 SRE team members.
Data - Test
Check who is using Data-test (Pete, Radney?)
Create a design for how we manage the Risk GCP projects, taking into account the proprietary code they would like to restrict to a few people
TM Observability Stack
integration with the current Grafana - brave
understand how we can expose the Prometheus and Thanos endpoints from the TMs in the shared_vpc so the main Grafana can scrape them as an additional datasource
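Once those endpoints are reachable over the shared_vpc, wiring them into the main Grafana is a provisioning entry per endpoint (Thanos Query speaks the Prometheus query API, so the same datasource type works). A sketch that builds such entries; the internal URLs are assumptions:

```python
def prometheus_datasource(name: str, url: str) -> dict:
    """Build a Grafana provisioning entry for a Prometheus-compatible datasource."""
    return {"name": name, "type": "prometheus", "access": "proxy", "url": url}

# Hypothetical internal endpoints reachable over the shared_vpc.
provisioning = {
    "apiVersion": 1,
    "datasources": [
        prometheus_datasource("tm-brave-thanos", "http://thanos-query.tm-brave.internal:9090"),
        prometheus_datasource("tm-tangled-thanos", "http://thanos-query.tm-tangled.internal:9090"),
    ],
}
```

Serialized to YAML and dropped into Grafana's provisioning directory, this would make the TM metrics queryable alongside the existing brave datasources.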
Monitoring of resources not managed by GKE
Cost Alerting per project -
Understand what type of alerts we want and the recipients (project, label)
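Deciding recipients per project and label can be expressed as a small routing table so the choice is explicit and reviewable. A sketch; all team names and addresses below are placeholders:

```python
def alert_recipients(project: str, labels: dict, routing: dict) -> list:
    """Resolve cost-alert recipients: project-specific routes first,
    then any label-based routes, falling back to a default list."""
    recipients = list(routing.get("projects", {}).get(project, []))
    for key, value in labels.items():
        recipients += routing.get("labels", {}).get(f"{key}={value}", [])
    return recipients or routing.get("default", [])

# Illustrative routing table - teams/addresses are assumptions.
ROUTING = {
    "projects": {"brave-data": ["data-team@example.com"]},
    "labels": {"team=sre": ["sre@example.com"]},
    "default": ["devops@example.com"],
}
```

The same table could then drive the budget/alert definitions in Terraform so recipients stay consistent across projects.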
For Requests and Incidents
Create runbooks and documentation for repetitive tasks as definition of done
Identify root cause for repeating incidents
Conduct at least 2 knowledge transfer sessions in Feb (SRE/DevOps team)
March 2023
April 2023
May 2023
June 2023
July 2023 - MVP
August 2023 - Go Live!