SaFi Bank Space : Core Foundation - Handover Notes

Date: November 22, 2022 | 10:00 AM - 12:00 PM MNL/SGT (internal)

Attendees: (All SAFI SRE Team Members)

Reference:
  • Completed

(blue star) FYI

(blue star) Refer to another SRE/DevOps engineer (missing doc, incomplete doc, may need a few more sessions)


Agenda

  • What do we currently have? (infrastructure, CI/CD, IaC, security, documentation, etc.)
  • What do we currently know? (in terms of what is already in place and deployed)
  • What do we currently know? (in terms of process)
  • ✅ Documentation Checklist ✅

    • What is currently documented?

    • Is it updated?

    • Is it complete?

    • Does it require further documentation?

  • Team Structure / Squads

    • Team Structure

      • This is the current team structure for DevOps as of this writing.

What’s the problem/challenge we want to address in this meeting?

We want to answer the questions listed in the Agenda above. We also want to establish a structure and process for working alongside the VacuumLabs team to identify the parts of our infrastructure that we either do not fully understand or were not made aware of (in terms of planning and architecture).

What is/are the goal/s of this exercise?

  • To have full visibility and knowledge of how each part of our infrastructure is currently implemented.
  • To know the logic and reasoning behind each choice in the tech stack, and the future plans for it
  • To take note of the challenges and issues encountered while implementing solutions, and of possible problems that may arise with the current setup
  • To discuss future plans covering the scalability, resiliency and high availability of the implemented solutions

--- Actual Minutes and Notes ---

We started by going through different parts of each squad. Today we were able to cover the responsibilities under the Core Foundation Squad.

Core Foundation

Responsibilities

GCP Projects and Structure

  • Org admin handover - who is currently the org admin and owner of the GCP account?
  • On permissions - all documented under terraform/_files
    • are all gcp users in this directory?
      • yes and no?
      • there may be some users that were added before the _files yaml was created
      • especially in the organizational view
    • for improvements (okta integration, rbac based on groups in okta)
      • we use okta in bofe, vault, cf, grafana, argo, genesys?, iam manager
  • (blue star) Dispatcher Documentation and Explanation
    • Terraform Dispatchers
    • (blue star) Is the documentation complete and updated?
    • (blue star) Any manual configuration done? If yes, please document.
    • (blue star) Ondřej Wantula (Unlicensed) to add more details on this doc Terraform Dispatchers
    • tf-dispatcher was created manually from tfc
    • including the variables
    • This is the starting point for automating the creation of the remaining workspaces
      • GCP APIs manually enabled?
    • SAFI DEV project- most likely all of the apis enabled
    • Management Project - one of the manually created projects safi-management
      • This project had to be created first in order to start automating the creation of the other GCP projects
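
The open question above, whether every GCP user is declared under terraform/_files, could be checked with a small drift report. The following is a minimal sketch, assuming the IAM policy has been exported in the JSON shape that `gcloud projects get-iam-policy --format=json` produces, and that the declared emails have already been collected from the YAML files. The function name and the sample data are hypothetical.

```python
def undeclared_iam_users(iam_policy: dict, declared_emails: set[str]) -> set[str]:
    """Return human users found in the IAM policy but missing from the declared set."""
    found = set()
    for binding in iam_policy.get("bindings", []):
        for member in binding.get("members", []):
            # Members look like "user:someone@domain" or "serviceAccount:...".
            kind, _, email = member.partition(":")
            if kind == "user":
                found.add(email.lower())
    return found - {e.lower() for e in declared_emails}


# Hypothetical sample data, for illustration only.
sample_policy = {
    "bindings": [
        {
            "role": "roles/viewer",
            "members": [
                "user:alice@example.com",
                "serviceAccount:ci@example-project.iam.gserviceaccount.com",
            ],
        },
        {"role": "roles/owner", "members": ["user:bob@example.com"]},
    ]
}
declared = {"alice@example.com"}

print(sorted(undeclared_iam_users(sample_policy, declared)))
```

Any email this prints would be a candidate for a manually added user, for example one granted access at the organization level before the YAML files existed.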

Terraform and Terraform Cloud

  • Terraform Workspaces - Different environments
    • Terraform Owner/Admin handover
    • Documentation - what goes to infra and config and the logic behind it
    • What resources are created in each workspace?
    • On terraform variable values, kindly list down all that have been added manually.
      • Can we get the values? (sensitive and non-sensitive)
        • For the sensitive values, get them from terraform output and then transfer them to vault
    • (blue star) On managing terraform state, please list down all manual configurations that had to be done (e.g. mapping workspaces on terraform cloud)
      • (blue star) Can we add this portion in the Confluence (on sharing states) Ondřej Wantula (Unlicensed)
      • (blue star) Kindly add as well how to fix/troubleshoot issues related to the above
  • (blue star) Terraform Agents
    • (blue star) Documentation and explanation of how they are deployed, how they work, their limitations (number of agents), how they scale and how the integration works
    • (blue star) Ondřej Wantula (Unlicensed) can create a starting doc and Peter Kmec (Unlicensed) to check and add for anything that is missing.
    • Explanation behind moving to terraform agent
    • (blue star) How does argocd connect via proxy, in relation to terraform agents?
      • (blue star) There used to be a proxy VM in GKE before?
      • (blue star) Peter Kmec (Unlicensed) Can you document this part: whether we are still using a proxy for argocd and, if so, which one.
    • What were the next plans in terraform agents when they deployed it (dev)?
      • It was originally designed to have 1 for each env and 1 for CICD
        • But due to the rapid changes in Dev, the number of agents per env had to grow (specifically for dev, we had to add another one on brave)
        • one agent pool in cicd project, 4 vms in cicd project and all tf ws can use the same tfc agent pool
        • Do we have to deploy separate agents for Production?
          • This needs to be discussed as a terraform agent could cost as much as 10k USD/yr
      • How will it scale and is it covered by the current terraform subscription package?
        • It won't, as it's a limitation of the package
    • Are all the tfc agents deployed in CICD Project?
      • 1 for CICD Project? as a vm instance
      • 2 for brave? 1 for dev? - How did we deploy the agents in the dev and brave project?
      • What's our limit? What about stage and production?
  • (blue star) Terraform Custom Cloud Provider
    • (blue star) Documentation on custom cloud provider for Confluent Kafka
    • (blue star) Peter Kmec (Unlicensed) we need some form of documentation on how this is implemented.
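
For the item above about getting sensitive values from terraform output and transferring them to vault, here is a minimal sketch of the first half of that step. It assumes the outputs are available in the JSON shape printed by `terraform output -json`, where each output carries `sensitive`, `type` and `value` fields. The helper name and the sample outputs are hypothetical, and the actual write to Vault is deliberately left out.

```python
import json


def split_outputs(terraform_output_json: str) -> tuple[dict, dict]:
    """Split `terraform output -json` data into (sensitive, non_sensitive) values."""
    outputs = json.loads(terraform_output_json)
    sensitive, plain = {}, {}
    for name, meta in outputs.items():
        target = sensitive if meta.get("sensitive") else plain
        target[name] = meta["value"]
    return sensitive, plain


# Hypothetical sample, shaped like `terraform output -json`.
sample = json.dumps(
    {
        "db_password": {"sensitive": True, "type": "string", "value": "example-only"},
        "cluster_name": {"sensitive": False, "type": "string", "value": "example-cluster"},
    }
)

sensitive, plain = split_outputs(sample)
# Print only the names of sensitive outputs, never their values.
print("to move into Vault:", sorted(sensitive))
print("non-sensitive:", plain)
```

Keeping the split separate from the Vault write makes it easy to review which output names are about to be moved before anything is stored.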

Networking

  • Brief explanation on the current network diagram/setup within GCP
    • shared_vpc connectivity
    • connectivity to other projects outside of shared_vpc
    • connectivity with third party
  • Euronet VPN Gateway
    • How was the Euronet VPN set up? (documentation + knowledge transfer sessions)
    • What were the issues then?
    • (blue star) How do we reach out to Euronet? Kindly add it in the confluence. Ondřej Wantula (Unlicensed)
    • Are there any issues that could probably arise with the current setup?
      • scalability issues - it takes time to create new tunnels because of delays on Euronet's side.
      • HA - we are using a Classic VPN gateway, not HA VPN
    • Any backlog tasks relating to Euronet vpn setup?
      • (blue star) According to Andre L., Euronet is supposed to connect through mTLS, but we want to understand how exactly they will connect with us (Euronet to our microservice endpoints)
      • (e.g.) We give Meiro access through tyk to reach the output manager; we can probably do the same with Euronet
  • Documentation on how we access the GKE cluster from a local machine.
  • In terms of allowed networks in the cloudflare warp vpn client, are we able to reach any private resources that are hosted in shared vpc subnets?
  • (blue star) Confluent cloud networking in shared vpc, check with Peter Kmec (Unlicensed)
    • (blue star) How is confluent cloud connected to our gcp network? (vpc peering) We want to understand how the routes are set up.
      • (blue star) Peter Kmec (Unlicensed) to answer on Data foundation handover
      • (blue star) Where is it in the code and how do we add more routes if we have to?
  • How is shared_vpc set up in all projects, and how do the other projects outside of the shared_vpc (such as argocd) connect to it?
  • Subnet planning
  • Egress Traffic
    • Cloud NAT - how is it deployed and how is it working right now.
      • devops/terraform/tf-environments/shared_vpc.tf
        • 1 per shared network, 1 per env (total of 3)
      • are we restricting any sort of egress traffic?
        • none
      • are we planning to?
        • depends on the security plans
      • are there any policies set?
        • none
  • Ingress Traffic
    • Network diagram and high-level documentation on how our apps will be reachable from the wider internet.
      • GCP Structure
      • Are we only exposing Production endpoints?
        • api.safibank.ph is the public endpoint we should use for partners
        • use the domain that is pointed to tyk and not the .safibank.online
      • Are they all under cloudflare zero trust?
      • From Cloudflare → Tyk → GCLB → Traefik?
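
For the subnet planning item above, one check that is cheap to automate is confirming that planned ranges do not overlap before they are added to the shared VPC. This is a minimal sketch using Python's standard `ipaddress` module. The subnet names and CIDR ranges are hypothetical and are not taken from the actual shared_vpc.tf.

```python
import ipaddress
from itertools import combinations


def find_overlaps(subnets: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of subnet names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical plan, for illustration only.
planned = {
    "dev-nodes": "10.10.0.0/20",
    "stage-nodes": "10.20.0.0/20",
    "prod-nodes": "10.30.0.0/20",
    "dev-pods": "10.10.8.0/22",  # deliberately placed inside dev-nodes
}

print(find_overlaps(planned))
```

The same check could be extended to include the ranges used by peered networks (for example the Confluent Cloud peering), since overlapping ranges there would also cause routing problems.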

GCP IAM and Permissions

  • Are all GCP users defined in terraform/_files? (i.e. has no one been added manually?)
  • For Service Accts, microservices.yaml is used to loop over the apps and create the respective sa
  • For service accts that are not for a microservice (e.g. thanos sa, bigquery sa), they are defined in the -infra of each respective tf file
  • Yes, should be in the *-infra workspace of the env
  • Documentation on each component and its service acct & permissions
  • (blue star) Workload identity - service acct permissions - documentation on the implementation and how to utilize it - Aleksandr Kanaev (Unlicensed) can cover this.
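
The loop over microservices.yaml described above lives in Terraform. The sketch below only illustrates the idea in Python: derive one service account per app name and reject names that would not be valid service account IDs. It assumes the GCP rule that an account ID is 6 to 30 characters of lowercase letters, digits and hyphens, starting with a letter. The app names, the project ID and the underscore-to-hyphen normalization are hypothetical.

```python
import re

# 6-30 chars: a lowercase letter, then letters/digits/hyphens, ending in a letter or digit.
SA_ID_PATTERN = re.compile(r"^[a-z][-a-z0-9]{4,28}[a-z0-9]$")


def service_account_emails(microservices: list[str], project_id: str) -> dict[str, str]:
    """Map each microservice name to the service account email it would receive."""
    emails = {}
    for name in microservices:
        account_id = name.lower().replace("_", "-")  # assumed normalization
        if not SA_ID_PATTERN.match(account_id):
            raise ValueError(f"{name!r} does not yield a valid service account ID")
        emails[name] = f"{account_id}@{project_id}.iam.gserviceaccount.com"
    return emails


# Hypothetical app names, standing in for the entries of microservices.yaml.
apps = ["account-manager", "card_manager"]
for app, email in service_account_emails(apps, "example-project").items():
    print(app, "->", email)
```

A listing like this could also serve as a starting point for the per-component service account documentation requested above.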

Cloudflare Zero Trust

DNS

  • DNS registrar. Where did we buy the domains? Who can access the portals from where we bought it?
  • Google Cloud DNS
    • (blue star) safibank.internal we don't use this anymore? - do we plan on using this in the future?
    • safibank.online (are all of the dns records pointing to this accessible through vpn only?)
      • all endpoints (ms, devops tools) are under this domain
      • the dns records are in tf (mostly)
      • Are all records in this folder - terraform/tf-dns-safibankonline?
        • Yes, except for *.internal
        • Documentation which tf file is used when creating dns record
          • terraform/tf-dns-safibankonline
      • argo and cicd vault are public, the rest are behind the vpn
    • What were the issues before on the public ip that we encountered when ms were still pointed to the gclb external ip?
      • Mostly security.
    • (blue star) applications network diagram, what are the different load balancers and how are they connected?
      • Aleksandr Kanaev (Unlicensed) may have a better idea on this
        • nginx proxy load balancer? is this still being used, for what purpose?
        • Peter Kmec (Unlicensed) can give us some ideas on this.
        • (blue star) what are the different external and internal load balancers that are deployed and how are they connected and communicating? ask Aleksandr Kanaev (Unlicensed)
        • (blue star) how are we generating ssl certificates through zerossl for the private load balancers? (will be covered in app foundation)
    • Why are we using public/external ip (gclb lb ip) for cicd vault, argocd (unlike ms that are pointed to private ip) ?
    • Are we using any other domain aside from safibank.online in google cloud dns?
      • No
    • Are there any manually added dns records ?
  • Cloudflare DNS
    • smallog.tech and blueship.store
      • are all records created through terraform?
    • safibank.ph
      • Are we going to use this for all ms in production (i.e. we're not going to use safibank.online for prod)?
      • only for api (tyk) since the ms should still be utilizing the safibank.online behind the cf
  • (blue star) Onboarding / Offboarding
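
The notes above state that argo and cicd vault are public while the rest of safibank.online sits behind the vpn. A simple audit of exported DNS records could flag anything else that points to a non-private address. This is a minimal sketch, assuming the A records have already been exported into a hostname-to-IP mapping. The hostnames and addresses are hypothetical, and "private" here means the RFC 1918 ranges only.

```python
import ipaddress

# RFC 1918 private ranges; anything outside them is treated as public in this sketch.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(ip: str) -> bool:
    """Return True when the address falls inside an RFC 1918 range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RFC1918)


def unexpected_public(records: dict[str, str], allowed_public: set[str]) -> list[str]:
    """List hostnames with a public address that are not on the allow-list."""
    return sorted(
        host
        for host, ip in records.items()
        if not is_internal(ip) and host not in allowed_public
    )


# Hypothetical records, for illustration only.
records = {
    "argocd.example.test": "203.0.113.10",
    "vault-cicd.example.test": "203.0.113.11",
    "some-ms.example.test": "10.10.0.5",
    "grafana.example.test": "198.51.100.7",
}
allowed = {"argocd.example.test", "vault-cicd.example.test"}

print(unexpected_public(records, allowed))
```

The explicit range list is used instead of `ipaddress`'s `is_private` attribute because that attribute also covers documentation and other reserved ranges, which is broader than what this check intends.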

Recording available in Lark Minutes. https://advancegroup.larksuite.com/minutes/obusqy9kw872r1z194946k69

SAFI SRE - Transition Meeting #1.mp4
