Date: November 22, 2022 | 10:00 AM - 12:00 PM MNL/SGT (internal)
Attendees: (All SAFI SRE Team Members)
Reference:
- Completed
- FYI
- Refer to another SRE/DevOps engineer (missing doc, incomplete doc, may need a few more sessions)
Agenda
- What do we currently have? (infrastructure, CI/CD, IaC, security, documentation, etc.)
- What do we currently know? (in terms of what is already in place and deployed)
- What do we currently know? (in terms of process)
✅ Documentation Checklist ✅
What is currently documented?
Is it updated?
Is it complete?
Does it require further documentation?
Team Structure / Squads
This is the current team structure for DevOps as of this writing.
What’s the problem/challenge we want to address in this meeting?
We want to answer the questions listed in the Agenda above. We also want to establish a structure and process for working alongside the VacuumLabs team to identify the parts of our infrastructure that we either do not fully understand or have not been made aware of (in terms of planning and architecture).
What is/are the goal/s of this exercise?
- To have full visibility and knowledge of how each part of our infrastructure is currently implemented.
- To understand the reasoning behind each choice in the tech stack, and the future plans for it.
- To take note of the challenges and issues encountered while implementing solutions, and of possible problems that may arise with the current setup.
- To discuss future plans that cover scalability, resiliency and high availability of the implemented solution.
--- Actual Minutes and Notes ---
We started by going through different parts of each squad. Today we were able to cover the responsibilities under the Core Foundation Squad.
Core Foundation
Responsibilities
GCP Projects and Structure
- Org admin handover - who is currently the org admin and owner of the GCP account?
- Ion Mudreac is now the Org Admin.
- On permissions - all documented under terraform/_files
- are all gcp users in this directory?
- yes and no?
- there may be some users that were added before the _files yaml was created
- especially in the organizational view
- for improvements (Okta integration, RBAC based on groups in Okta)
- we use Okta in bofe, vault, cf, grafana, argo, genesys?, iam manager
- Dispatcher Documentation and Explanation
- Terraform Dispatchers
- Is the documentation complete and updated?
- Any manual configuration done? If yes, please document.
- Ondřej Wantula (Unlicensed) to add more details on this doc Terraform Dispatchers
- tf-dispatcher was created manually from tfc
- including the variables
- This is the starting point for automating the creation of the other workspaces (ws)
- GCP APIs manually enabled?
- SAFI DEV project - most likely all of the APIs are enabled
- This needs to be deleted
- Are there resources still being used in this project?
- Sergei Teteriukov (Unlicensed) started working on automating some of the manually created resources (e.g. firebase). Are these documented, Sergei Teteriukov (Unlicensed)?
- No deadline was set for the migration
- We may need to check with Andre Laksmana (Unlicensed) what else is running and being used in this project before deleting it.
- Management Project (safi-management) - one of the manually created projects
- This project had to be created in order to start the automation of the other GCP projects
Terraform and Terraform Cloud
- Terraform Workspaces - Different environments
- Terraform Owner/Admin handover
- Ion Mudreac is an owner. BharathKumar D and Lucky La Torre (Unlicensed) have been added as well
- Documentation - what goes to infra and config and the logic behind it
- What resources are created in each workspace?
- Terraform Workspaces - Ondřej Wantula (Unlicensed) will add more info on this doc
- On terraform variable values, kindly list down all that have been added manually.
- Can we get the values? (sensitive and non-sensitive)
- For the sensitive values, get them from terraform output and then transfer them to Vault
- On managing terraform state, please list down all manual configurations that had to be done (e.g. mapping workspaces on terraform cloud)
- Can we add this portion in the Confluence (on sharing states) Ondřej Wantula (Unlicensed)
- Kindly add as well how to fix/troubleshoot issues related to the above
- Terraform Agents
- Documentation and explanation of how they are deployed, how they work, their limitations (number of agents), how they scale and how the integration works
- Ondřej Wantula (Unlicensed) can create a starting doc and Peter Kmec (Unlicensed) to check and add anything that is missing.
- Explanation behind moving to terraform agent
- How does argocd connect via proxy, in relation to terraform agents?
- There used to be a proxyvm before in GKE?
- Peter Kmec (Unlicensed), can you document this part: whether we are still using a proxy for argocd and, if so, which one?
- What were the next plans for terraform agents when they were deployed (dev)?
- It was originally designed to have 1 for each env and 1 for CICD
- But due to the rapid changes in Dev, the number of agents per env changed (specifically for dev, we had to add another one on brave)
- one agent pool in cicd project, 4 vms in cicd project and all tf ws can use the same tfc agent pool
- Do we have to deploy separate agents for Production?
- This needs to be discussed as a terraform agent could cost as much as 10k USD/yr
- How will it scale and is it covered by the current terraform subscription package?
- It won't, as it's a limitation of the package
- Are all the tfc agents deployed in CICD Project?
- 1 for CICD Project? as a vm instance
- 2 for brave? 1 for dev? - How did we deploy the agents in the dev and brave project?
- What's our limit? What about stage and production?
- Terraform Custom Cloud Provider
- Documentation on custom cloud provider for Confluent Kafka
- Peter Kmec (Unlicensed) we need some form of documentation on how this is implemented.
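Illustrative sketch for the item above on getting sensitive values from terraform output and transferring them to Vault. It assumes the JSON shape printed by `terraform output -json`, where each output carries a `sensitive` flag and a `value`. The output names and values are made-up samples, the helper name `split_outputs` is hypothetical, and the actual write to Vault is left out because the target paths were not discussed.

```python
import json


def split_outputs(raw_json: str) -> tuple[dict, dict]:
    """Split `terraform output -json` text into (sensitive, non_sensitive) dicts."""
    outputs = json.loads(raw_json)
    sensitive, plain = {}, {}
    for name, entry in outputs.items():
        # Each entry looks like {"sensitive": bool, "type": ..., "value": ...}
        target = sensitive if entry.get("sensitive") else plain
        target[name] = entry["value"]
    return sensitive, plain


# Hypothetical sample standing in for real `terraform output -json` text.
SAMPLE = json.dumps(
    {
        "db_password": {"sensitive": True, "type": "string", "value": "example-only"},
        "cluster_name": {"sensitive": False, "type": "string", "value": "example-cluster"},
    }
)

if __name__ == "__main__":
    to_vault, to_docs = split_outputs(SAMPLE)
    print("move to Vault:", sorted(to_vault))
    print("safe to list in docs:", sorted(to_docs))
```

Only the names of the sensitive outputs are printed, so the listing itself can be pasted into the documentation without leaking values.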
Networking
- Brief explanation on the current network diagram/setup within GCP
- shared_vpc connectivity
- connectivity to other projects outside of shared_vpc
- connectivity with third party
- Euronet VPN Gateway
- How was the Euronet VPN set up (documentation + knowledge transfer sessions)
- What were the issues then?
- How do we reach out to Euronet? Kindly add it in the confluence. Ondřej Wantula (Unlicensed)
- Are there any issues that could probably arise with the current setup?
- scalability issues - it takes time to create new tunnels because of delays from Euronet's side.
- HA - we are using a Classic VPN gateway, not HA VPN tunneling
- Any backlog tasks relating to Euronet vpn setup?
- According to Andre L., Euronet is supposed to connect through mTLS, but we want to understand how exactly they will connect with us (Euronet to our microservices endpoints)
- (e.g.) As with Meiro, where we give them access through tyk to reach the output manager, we can probably do the same with Euronet
- Documentation on how we access the gke cluster from local.
- gcloud and VPN warp
- Cloudflare VPN Access Rules
- In terms of allowed networks in the cloudflare warp vpn client, are we able to reach any private resource that is hosted in shared vpc subnets?
- yes, as long as the subnets are allowed in the routing - this is manual as there's no cloudflare api for this yet, see https://safibank.atlassian.net/wiki/spaces/ITArch/pages/146571383/Cloudflare+VPN+Implementation#Manual-steps-necessary-to-be-taken%3A
- Confluent cloud networking in shared vpc, check with Peter Kmec (Unlicensed)
- How is confluent cloud connected to our gcp network - vpc peering - we want to understand how the routes are set up.
- Peter Kmec (Unlicensed) to answer on Data foundation handover
- Where is it in the code and how do we add more routes if we have to?
- How is shared_vpc set up in all projects, and how do the other projects outside of the shared_vpc (such as argocd) connect to it?
- Subnet planning
- cidr networks - what are the networks that we will be using in production, was it part of the initial architectural plan?
- In the network.yaml file, we noticed private-default. What is this for?
- We are still using this for VMs that are not part of the GKE network.
- Reserved for standalone VM instances
- Egress Traffic
- Cloud NAT - how is it deployed and how is it working right now.
- devops/terraform/tf-environments/shared_vpc.tf
- 1 per shared network, 1 per env (total of 3)
- are we restricting any sort of egress traffic?
- none
- are we planning to?
- depends on the security plans
- are there any policies set?
- none
- Ingress Traffic
- Network diagram and documentation, at a high level, of how our apps will be reachable from the wider internet.
- GCP Structure
- Are we only exposing Production endpoints?
- api.safibank.ph - is the public endpoint we should use for partners
- use the domain that is pointed to tyk and not the .safibank.online
- Are they all under cloudflare zero trust?
- From Cloudflare → Tyk → GCLB → Traefik?
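Illustrative sketch for the WARP reachability answer above: a private resource is reachable only if its IP falls inside one of the subnets allowed in the routing. The CIDR ranges and the helper name `is_routed` are hypothetical; the real list is the manually maintained Cloudflare configuration linked in the notes.

```python
import ipaddress


def is_routed(ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if `ip` falls inside any of the allowed subnets."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical subnets standing in for the manually maintained allow list.
ALLOWED_SUBNETS = ["10.10.0.0/16", "10.20.4.0/22"]

if __name__ == "__main__":
    for candidate in ("10.10.3.7", "10.20.8.1", "192.168.1.1"):
        status = "reachable" if is_routed(candidate, ALLOWED_SUBNETS) else "not routed"
        print(f"{candidate}: {status}")
```

A check like this could be run against the subnets in network.yaml to spot shared vpc ranges that still need to be added by hand.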
GCP IAM and Permissions
- For all GCP users, is it all defined in terraform/_files? (no one has been manually added)
- For service accounts, microservices.yaml is used to loop over the apps and create the respective sa
- Service accounts that are not for a microservice (e.g. thanos sa, bigquery sa) are defined in the -infra of each respective tf file
- Yes, should be in the *-infra workspace of the env
- Documentation on each component and its service acct & permissions
- Workload identity - service acct permissions - documentation on the implementation and how to utilize it - Aleksandr Kanaev (Unlicensed) can cover this.
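Illustrative sketch of the "loop over the apps and create the respective sa" pattern noted above. The real logic lives in Terraform; the `-sa` suffix, the app names, and the project ID here are hypothetical. The only outside fact relied on is GCP's rule that a service account ID is 6 to 30 characters of lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen.

```python
import re

# GCP service account ID rule: 6-30 chars, lowercase letters, digits, hyphens,
# starts with a letter and ends with a letter or digit.
SA_ID_PATTERN = re.compile(r"[a-z][-a-z0-9]{4,28}[a-z0-9]")


def service_account_email(app: str, project_id: str) -> str:
    """Derive a service account email for one app, validating the ID first."""
    account_id = f"{app}-sa"  # hypothetical naming convention
    if not SA_ID_PATTERN.fullmatch(account_id):
        raise ValueError(f"invalid service account id: {account_id!r}")
    return f"{account_id}@{project_id}.iam.gserviceaccount.com"


# Hypothetical entries standing in for the apps listed in microservices.yaml.
MICROSERVICES = ["account-manager", "output-manager"]
PROJECT_ID = "example-project"

emails = [service_account_email(app, PROJECT_ID) for app in MICROSERVICES]

if __name__ == "__main__":
    for email in emails:
        print(email)
```

Validating the ID up front surfaces over-long app names before a plan or apply fails on them.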
Cloudflare Zero Trust
- Cloudflare Zero Trust / VPN
- What are existing issues in Cloudflare Zero trust?
- No open support ticket
- China devs have slow connection when using CF Warp when working from home.
- All rules/policies/config are in terraform ?
- Yes except for below.
- If there is any manual config, is it documented?
- Split tunnels, config and access rules - https://safibank.atlassian.net/wiki/spaces/ITArch/pages/146571383/Cloudflare+VPN+Implementation#Manual-steps-necessary-to-be-taken%3A
- Documentation of user management with Okta
- Ondřej Wantula (Unlicensed) kindly add it in the comments - how is cloudflare zero trust integrated with Okta
- Documentation of all manual configurations done while setting up the Cloudflare Zero Trust.
- Traffic and route management
- Adding of new allowed networks (documentation)
- https://safibank.atlassian.net/wiki/spaces/ITArch/pages/146571383/Cloudflare+VPN+Implementation#Manual-steps-necessary-to-be-taken%3A
- sfdvwork.xyz - what is this for, what are the plans and what's currently being done for this domain?
- new stage env
- we are waiting for all issues in kafka to be done
- blueship.store - we cannot use this because we're not done with epfs
- What are the different features being utilized in Cloudflare, documentation
- Are they all in the code? Please also document which part of the code these features/resources have been added to.
- On waf security by Cloudflare
- what security policies are in place (is it in code and are there any manual config?)
- Do we have any Country blacklisting and whitelisting in place - documentation
- Gnanasekaran Gajendiran can add documentation for the blacklisting and whitelisting
- Documentation of cf tunnels - https://one.dash.cloudflare.com/8f541e26d66c440f775ecfdb37d6303d/access/tunnels
- What is the purpose of the resources defined in this file https://github.com/SafiBank/SaFiMono/blob/f38383f833b8d173f73c99e9e028b8d63dd0f559/devops/terraform/tf-cicd/cloudflare_tunnel.tf
- Ondřej Wantula (Unlicensed) to create a documentation on this
- Cloudflare monitoring and alert rules
- Documentation, Gnanasekaran Gajendiran will cover this one.
DNS
- DNS registrar. Where did we buy the domains? Who can access the portals from where we bought it?
- Purchased by Ion Mudreac.
- Google Cloud DNS
- safibank.internal - we don't use this anymore?
- Do we plan on using this in the future?
- We are not using this. We can double-check with Aleksandr Kanaev (Unlicensed)
- safibank.online (are all of the dns records pointing to this accessible through vpn only?)
- all endpoints (ms, devops tools) are under this domain
- the dns records in tf (mostly)
- Are all records in this folder - terraform/tf-dns-safibankonline?
- Yes, except for *.internal
- Documentation which tf file is used when creating dns record
- terraform/tf-dns-safibankonline
- argo and cicd vault are public, the rest are behind the vpn
- Why are these still public, Aleksandr Kanaev (Unlicensed) can answer this.
- What were the issues before on the public ip that we encountered when ms were still pointed to the gclb external ip?
- Mostly security.
- applications network diagram, what are the different load balancers and how are they connected?
- Aleksandr Kanaev (Unlicensed) may have better idea on this
- nginx proxy load balancer? is this still being used, for what purpose?
- Peter Kmec (Unlicensed) can give us some ideas on this.
- what are the different external and internal load balancers that are deployed and how are they connected and communicating? ask Aleksandr Kanaev (Unlicensed)
- how are we generating ssl certificates through zerossl for the private load balancers (will be covered in app foundation)
- Why are we using public/external ip (gclb lb ip) for cicd vault, argocd (unlike ms that are pointed to private ip) ?
- Aleksandr Kanaev (Unlicensed) can shed some light on this one
- Are we using any other domain aside from safibank.online in google cloud dns?
- No
- Are there any manually added dns records ?
- Cloudflare DNS
- smallog.tech and blueship.store
- are all records created in terraform?
- safibank.ph
- Are we going to use this for all ms in production? (we're not going to use safibank.online for prod?)
- only for api (tyk) since the ms should still be utilizing the safibank.online behind the cf
- Onboarding / Offboarding
- Adding of Okta users is done in code
- when adding admin, it has to be done manually (ask Aleksandr Kanaev (Unlicensed) for handover of admin permission)
- Justin, Aleks, Regin, Gnana (are already admins)
- For non-admins, the developer role can be assigned by code
- but for other roles like backoffice admins, due to api limitations, permissions are being set manually, check with Aleksandr Kanaev (Unlicensed)
- ask aleks if the api rate limiting is now solved (to be covered in app foundation meeting)
- Okta groups/roles are created elsewhere, another file is managing BOFE
- https://github.com/SafiBank/SaFiMono/blob/main/devops/terraform/tf-env-applications-config/okta_bofe.tf
- how does the mapping happen? how do we map users to groups? Ask Aleksandr Kanaev (Unlicensed)
Recording available in Lark Minutes. https://advancegroup.larksuite.com/minutes/obusqy9kw872r1z194946k69