SaFi Bank Space : FeatureStore Technical Documentation


This page details the code, frameworks, and schemas needed to implement, read from, and write to the FeatureStore.

At present, we use two main ways of reading and writing:

  • FeatureStore Client: Offline and Online FeatureStore

  • REST API: Online FeatureStore

Future:


Concepts underlying Vertex AI FeatureStore:

Vertex AI Feature Store uses a time series data model to store a series of values for features. This model enables Vertex AI Feature Store to maintain feature values as they change over time. Vertex AI Feature Store organizes resources hierarchically in the following order: Featurestore -> EntityType -> Feature. You must create these resources before you can ingest data into Vertex AI Feature Store.
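As a sketch, the hierarchy above can be created in order with the Vertex AI Python SDK (`google-cloud-aiplatform`); the featurestore, entity type, and feature IDs below are illustrative assumptions, not real resources:

```python
def create_feature_hierarchy(project: str, location: str) -> None:
    """Sketch: create Featurestore -> EntityType -> Feature, in that order."""
    # Deferred import so the sketch can be read without the SDK installed;
    # requires the google-cloud-aiplatform package.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)

    # 1. Featurestore: the top-level container (ID is illustrative).
    fs = aiplatform.Featurestore.create(
        featurestore_id="movie_prediction",
        online_store_fixed_node_count=1,
    )

    # 2. EntityType: a collection of semantically related features.
    movie = fs.create_entity_type(entity_type_id="movie")

    # 3. Feature: a measurable property; feature_id must be a lowercase string.
    movie.create_feature(feature_id="average_rating", value_type="DOUBLE")
    movie.create_feature(feature_id="title", value_type="STRING")
```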

feature_id should be a STRING in lower case only.

Entity type

An entity type is a collection of semantically related features. You define your own entity types entity_type_id, based on the concepts that are relevant to your use case. For example, a movie service might have the entity types movie and user, which group related features that correspond to movies or customers.

Entity

An entity is an instance of an entity type. For example, movie_01 and movie_02 are entities of the entity type movie. In a featurestore each entity must have a unique ID entity_id and must be of type STRING.

Feature - Feature Value Types Reference Documentation

A feature is a measurable property or attribute of an entity type. For example, the movie entity type has features such as average_rating and title that track various properties of movies. Features are associated with entity types. Features must be distinct within a given entity type, but they don't need to be globally unique. For example, if you use title for two different entity types, Vertex AI Feature Store interprets title as two different features. When reading feature values, you provide the feature and its entity type as part of the request.
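Since a read request names both the entity type and the feature, a lookup might be sketched as follows with the Python SDK (the featurestore, entity, and feature IDs are illustrative):

```python
def read_movie_features(featurestore_id: str, project: str, location: str):
    """Sketch: read feature values by (entity type, feature, entity ID)."""
    # Deferred import; requires the google-cloud-aiplatform package.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    fs = aiplatform.Featurestore(featurestore_name=featurestore_id)

    # The feature is scoped to its entity type, so "title" on "movie" is a
    # different feature from "title" on any other entity type.
    movie = fs.get_entity_type(entity_type_id="movie")
    return movie.read(
        entity_ids=["movie_01", "movie_02"],  # entity IDs are strings
        feature_ids=["average_rating", "title"],
    )
```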


Source Data Requirements

Vertex AI Feature Store can ingest data from tables in BigQuery or files in Cloud Storage. Files in Cloud Storage must be in Avro or CSV format.

Data Types and Requirements

Enums

  • VALUE_TYPE_UNSPECIFIED: The value type is unspecified.

  • BOOL: Used for a feature that is a boolean.

  • BOOL_ARRAY: Used for a feature that is a list of booleans.

  • DOUBLE: Used for a feature that is a double.

  • DOUBLE_ARRAY: Used for a feature that is a list of doubles.

  • INT64: Used for a feature that is an INT64.

  • INT64_ARRAY: Used for a feature that is a list of INT64 values.

  • STRING: Used for a feature that is a string.

  • STRING_ARRAY: Used for a feature that is a list of strings.

  • BYTES: Used for a feature that is bytes.

  • If you provide a column for feature generation timestamps, use one of the following timestamp formats:

    • For BigQuery tables, timestamps must be in the TIMESTAMP column.

    • For Avro, timestamps must be of type long and logical type timestamp-micros.

    • For CSV files, timestamps must be in RFC 3339 format. The timestamp attached to the feature must be a TIMESTAMP data type.

  • All columns must have a header of type STRING. There are no restrictions on header names.

    • For BigQuery tables, the column header is the column name.

    • For Avro, the column header is defined by the Avro schema that is associated with the binary data.

    • For CSV files, the column header is the first row.

FeatureStore Data Schema
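The RFC 3339 timestamp format required for CSV files can be produced with the Python standard library; a minimal sketch:

```python
from datetime import datetime, timezone
from typing import Optional

def rfc3339_timestamp(dt: Optional[datetime] = None) -> str:
    """Format a datetime as an RFC 3339 UTC timestamp, e.g. 2021-04-15T08:28:14Z."""
    dt = dt or datetime.now(timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```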


Offline FeatureStore

Feature Value Timestamps (Batch Ingestions)

For batch ingestions, Vertex AI Feature Store requires user-provided timestamps for the ingested feature values. You can specify a particular timestamp for each value or specify the same timestamp for all values:

  • If the timestamps for feature values are different, specify the timestamps in a column in your source data. Each row must have its own timestamp indicating when the feature value was generated. In your ingestion request, you specify the column name to identify the timestamp column.

  • If the timestamp for all feature values is the same, you can specify it as a parameter in your ingestion request. You can also specify the timestamp in a column in your source data, where each row has the same timestamp.

  • See also EntityType Class docs - https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.EntityType

FeatureStore FEATURE_CONFIG Schema

When ingesting from a Cloud Storage CSV/Avro file or a DataFrame (that is, from any source other than BigQuery), a FEATURE_CONFIG schema needs to be passed alongside the ingestion request.

  • FEATURE_CONFIG for GCS - Timestamp column is specified

{
    "id": {
        "value_type": "STRING",
        "description": "User ID"
    },
    "annual_inc": {
        "value_type": "DOUBLE",
        "description": "The self-reported annual income provided by the borrower during registration."
    },
    "dti": {
        "value_type": "DOUBLE",
        "description": "A ratio calculated using the borrower total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income."
    },
    "timestamp": {
        "value_type": "STRING",
        "description": "The timestamp of the entry"
    },
    "target": {
        "value_type":"DOUBLE",
        "description": "Indicates charged off status or not"
    }
}
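With the Python SDK, a config of this shape can be passed to EntityType.batch_create_features before ingesting. The sketch below abbreviates the config above and takes the GCS URI as a parameter (illustrative); the "timestamp" column is named via feature_time:

```python
# Abbreviated form of the FEATURE_CONFIG above (descriptions shortened).
FEATURE_CONFIG = {
    "annual_inc": {"value_type": "DOUBLE", "description": "Self-reported annual income."},
    "dti": {"value_type": "DOUBLE", "description": "Debt-to-income ratio."},
}

def create_and_ingest_from_gcs(entity_type, gcs_uri: str) -> None:
    """Sketch: create features from the config, then ingest from a GCS CSV
    whose 'timestamp' column carries the feature generation time."""
    entity_type.batch_create_features(feature_configs=FEATURE_CONFIG)
    entity_type.ingest_from_gcs(
        feature_ids=list(FEATURE_CONFIG),
        feature_time="timestamp",   # name of the timestamp column in the CSV
        gcs_source_uris=gcs_uri,
        gcs_source_type="csv",
    )
```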

  • FEATURE_CONFIG for DataFrames - Timestamp column is not specified, passed as a parameter in the ingestion function instead

{
    "id": {
        "value_type": "STRING",
        "description": "User ID"
    },
    "annual_inc": {
        "value_type": "DOUBLE",
        "description": "The self-reported annual income provided by the borrower during registration."
    },
    "dti": {
        "value_type": "DOUBLE",
        "description": "A ratio calculated using the borrower total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income."
    },
    "target": {
        "value_type":"DOUBLE",
        "description": "Indicates charged off status or not"
    }
}
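For DataFrame ingestion the timestamp is supplied as a parameter instead of a source column; a sketch assuming the Python SDK's EntityType.ingest_from_df:

```python
from datetime import datetime, timezone

def ingest_dataframe(entity_type, df) -> None:
    """Sketch: ingest a pandas DataFrame, applying one timestamp to all
    rows via the feature_time parameter rather than a timestamp column."""
    entity_type.ingest_from_df(
        feature_ids=["annual_inc", "dti", "target"],
        feature_time=datetime.now(timezone.utc),  # same timestamp for every value
        df_source=df,
    )
```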

  • BigQuery Ingestion - No explicit schema is needed, but the source table columns and the TIMESTAMP column must adhere to the requirements stated in Data Types and Requirements above.

    • If the source table has no timestamp column, pass feature_time as a datetime object (datetime.datetime, millisecond precision).

    • If the source table has a timestamp column, pass feature_time as the column name (str).
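Both cases can be sketched with the SDK's EntityType.ingest_from_bq; the feature IDs, table URI, and column name here are illustrative:

```python
from datetime import datetime, timezone
from typing import Optional

def ingest_from_bigquery(entity_type, bq_uri: str,
                         timestamp_column: Optional[str] = None) -> None:
    """Sketch: BigQuery ingestion needs no explicit schema. feature_time is
    either the table's TIMESTAMP column name (str) or a datetime applied
    to every row."""
    entity_type.ingest_from_bq(
        feature_ids=["annual_inc", "dti", "target"],
        feature_time=timestamp_column or datetime.now(timezone.utc),
        bq_source_uri=bq_uri,  # e.g. "bq://project.dataset.table" (illustrative)
    )
```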

Online FeatureStore

Currently, this is in preview access only. For more details, see the sample in GitHub - LINK
