Getting Started with ArangoGraphML Services and Packages
Interface to control all resources inside ArangoGraphML in a scriptable manner
ArangoGraphML is a set of services that provide an easy to use and scalable interface for graph machine learning. Since all of the orchestration and training logic is managed by ArangoGraph, all that is typically required is a specification outlining the data to be used to solve a task.
The following is a guide to show how to use the arangoml
package in order to:
- Manage projects
- Featurize data
- Submit training jobs
- Evaluate Metrics
- Generate predictions
arangoml
package.Specifications
The arangoml
package requires a properly formed specification. The specifications
describe the task being performed and the data being used. The ArangoGraphML services
work closely together, with one task providing inputs to another.
Import
The arangoml
package comes pre-loaded with every ArangoGraphML notebook environment.
To start using it, simply import it:
import arangoml
Projects
Projects are an important reference used throughout the entire ArangoGraphML lifecycle. All activities link back to a project. The creation of the project is very simple.
Create a project
project = arangoml.projects.create_project({"name":"ArangoGraphML_Project_Name"})
List projects
arangoml.projects.list_projects(limit=limit, offset=offset)
Lookup an existing project
project = arangoml.projects.get_project_by_name("ArangoGraphML_Project_Name")
# or by id
project = arangoml.projects.get_project("project_id")
Update an existing project
arangoml.projects.update_project(body={'id': 'project_id', 'name': 'Updated_ArangoGraphML_Project_Name'}, project_id='existing_project_id')
List models associated with project
arangoml.projects.list_models(project_name=project_name, project_id=project_id, job_id=job_id)
Delete an existing project
api_instance.delete_project(project_id)
Featurization
The featurization specification asks that you input the following:
featurization_name
: A name for the featurization task.project_name
: The associated project name. You can useproject.name
here if was created or retrieved as descried above.graph_name
: The associated graph name that exists within the database.default_config
Optional: The optional default configuration to be applied across all features. Individual collection feature settings override this option.dimensionality_reduction
: Object configuring dimensionality reduction.disabled
: Boolean for enabling or disabling dimensionality reduction.size
: The number of dimensions to reduce the feature length to.
vertexCollections
: The list of vertex collections to be featurized. Here you also need to detail the attributes to featurize and how. Supplying multiple attributes from a single collection results in a single concatenated feature.config
Optional: The configuration to apply to the feature output for this collection.dimensionality_reduction
: Object configuring dimensionality reduction.disabled
: Boolean for enabling or disabling dimensionality reduction.size
: The number of dimensions to reduce the feature length to.
output_name
: Adjust the default feature name. This can be any valid ArangoDB attribute name.
features
: A single feature or multiple features can be supplied per collection and they can all be featurized in different ways. Supplying multiple features results in a single concatenated feature.feature_type
: Provide the feature type. Currently the supported types includetext
,category
,numerical
.feature_generator
Optional: Adjust advanced feature generation parameters.feature_name
: The name of this Dict should match the attribute name of the document stored in ArangoDB. This overrides the name provided for the parent Dict.method
: Currently no additional options, leave as default.output_name
: Adjust the default feature name. This can be any valid ArangoDB attribute name."collectionName": { "features": { "attribute_name": { "feature_type": 'text' # Currently the supported types include text, category, numerical "feature_generator": { # this advanced option is optional. "method": "transformer_embeddings", "feature_name": "movie_title_embeddings", },
edgeCollections
: This is the list of edge collections associated with the vertex collections. There are no additional options."edgeCollections": { "edge_name_1", "edge_name_2 },
Once you have filled out the featurization specification you can pass it to
the featurizer
function.
from arangoml.featurizer import featurizer
featurizer(featurizer(_db, featurization_spec))
This is all you need to get started with featurization. The following also shows an example of using the featurization package with a movie dataset and some additional options available.
from arangoml.featurizer import featurizer
featurization_spec = {
"featurization_name": "Movie_Recommendation",
"project_name": "movie_recommendation_project",
"graph_name": "fake_m_hetero_2",
"vertexCollections": {
"v0": {
"features": {
"movie_title": {
"feature_type": "text",
"feature_generator": {
"method": "transformer_embeddings",
"feature_name": "movie_title_embeddings",
},
},
"genre": {
"feature_type": "category",
"feature_generator": {
"method": "one_hot_encoding",
"feature_name": "genre_embeddings",
},
},
}
},
"v1": {
"features": {
"actor_summary": {
"feature_type": "text",
"feature_generator": {
"method": "transformer_embeddings",
"feature_name": "actor_summary_embeddings",
},
},
"actor_country": {
"feature_type": "category",
"feature_generator": {
"method": "one_hot_encoding",
"feature_name": "actor_country_embedding",
},
},
}
},
"v2": {
"features": {
"director_summary": {
"feature_type": "text",
},
}
},
},
"edgeCollections": {
"e0"
},
}
output = featurizer(_db, featurization_spec, batch_size=32)
print(output)
# can also run without analysis
# featurizer(db, featurization_spec, run_analysis_checks=False)
# can also be run with a feature store
import arangomlFeatureStore
from arangomlFeatureStore.defaults import DEFAULT_FEATURE_STORE_DB
fs_db = client.db(DEFAULT_FEATURE_STORE_DB, username="root", password="") # nosec
admin = arangomlFeatureStore.FeatureStoreAdmin(
database_connection=fs_db,
)
feature_store = admin.get_feature_store()
output = featurizer(db, featurization_spec, feature_store=feature_store, batch_size=32)
print(output)
Experiment
Each experiment consists of three main phases:
- Training
- Model Selection
- Predictions
Training Specification
Training graph machine learning models with ArangoGraphML only requires two steps:
- Describe which data points should be included in the training job.
- Pass the training specification to the training service.
See below the different components of the training specification.
database_name
: The database name the source data is in.project_name
: The top-level project to which all the experiments will link back.metagraph
: This is the largest component that details the experiment objective and the associated data points.mlSpec
: Describes the desired machine learning task, input features, and the attribute label to be predicted.graph
: The ArangoDB graph name.vertexCollections
: Here, you can describe all the vertex collections and the features you would like to include in training. You must provide anx
for features, and the desired prediction label is supplied asy
.edgeCollections
: Here, you describe the relevant edge collections and any relevant attributes or features that should be considered when training.
A training specification allows for concisely defining your training task in a single object and then passing that object to the training service using the Python API client, as shown below.
Create a training job
job = arangoml.training.train(training_spec)
print(job)
job_id = job.job_id
{'job_id': 'f09bd4a0-d2f3-5dd6-80b1-a84602732d61'}
Get status of a training job
arangoml.training.get_job(job_id)
{'database_name': 'db_name',
'job_id': 'efac147a-3654-4866-88fe-03866d0d40a5',
'job_state': None,
'job_status': 'COMPLETED',
'metagraph': {'edgeCollections': {...},
'graph': 'graph_name',
'mlSpec': {'classification': {'inputFeatures': 'x',
'labelField': 'label_field',
'targetCollection': 'target_collection_name'}},
'vertexCollections': {...}
},
'project_id': 'project_id',
'project_name': 'project_name',
'time_ended': '2023-09-01T17:32:05.899493',
'time_started': '2023-09-01T17:04:01.616354',
'time_submitted': '2023-09-01T16:58:43.374269'}
Cancel a running training job
arangoml.training.cancel_job(job_id)
'OK'
Model Selection
Once the training is complete you can review the model statistics. The training completes 12 training jobs using grid search parameter optimization.
To select a model, use the projects API to gather all relevant models and choose the one you prefer for the next step.
The following examples uses the model with the highest validation accuracy, but there may be other factors that motivate you to choose another model.
models = arangoml.projects.list_models(project_name=project_name, project_id=project_id, job_id=job_id)
# Tip: Sort by accuracy
models.models.sort(key=(lambda model : model.model_statistics["test"]["accuracy"]) , reverse=True)
# The most accurate model is the first in the list
model = models.models[0]
print(model)
{'job_id': 'f09bd4a0-d2f3-5dd6-80b1-a84602732d61',
'model_display_name': 'Node Classification Model',
'model_id': '123',
'model_name': 'Node Classification Model '
'123',
'model_statistics': {'_id': 'devperf/123',
'_key': '123',
'_rev': '_gkUc8By--_',
'run_id': '123',
'test': {'accuracy': 0.8891242216547955,
'confusion_matrix': [[13271, 2092],
[1276, 5684]],
'f1': 0.9,
'loss': 0.1,
'precision': 0.9,
'recall': 0.8,
'roc_auc': 0.8},
'timestamp': '2023-09-01T17:32:05.899493',
'validation': {'accuracy': 0.9,
'confusion_matrix': [[13271, 2092],
[1276, 5684]],
'f1': 0.85,
'loss': 0.1,
'precision': 0.86,
'recall': 0.85,
'roc_auc': 0.85}},
'target_collection': 'target_collection_name',
'target_field': 'label_field'}
Prediction
After selecting a model, it is time to persist the results to a collection
using the predict
function.
prediction_spec = {
"project_name": project.name,
"database_name": db.name,
"model_id": model.model_id
}
prediction_job = arangoml.prediction.predict(prediction_spec)
print(prediction_job)
This creates a prediction job that grabs data and generates inferences using the
selected model. Then, by default, it writes the predictions to a new collection
connected via an edge to the original source document.
You may choose to specify instead a collection
.
Get prediction job status
prediction_status = arangoml.prediction.get_job(prediction_job.job_id)
{'database_name': 'db_name',
'job_id': '123-ee43-4106-99e7-123',
'job_state_information': {'outputAttribute': 'label_field_predicted',
'outputCollectionName': 'collectionName_predicted_123',
'outputGraphName': 'graph_name'},
'job_status': 'COMPLETED',
'model_id': '123',
'project_id': '123456',
'project_name': 'project_name',
'time_ended': '2023-09-05T15:23:01.595214',
'time_started': '2023-09-05T15:13:51.034780',
'time_submitted': '2023-09-05T15:09:02.768518'}
Access the predictions
You can now directly access your predictions in your application. If you left the default option you can access them via the dynamically created collection with a query such as the following:
## Query to return results
query = f"""
FOR prediction IN {prediction_status.job_state_information['outputCollectionName']}
RETURN prediction
"""
results = db.aql.execute(query)
for r in results:
for key in r:
print(key +": "+ str(r[key]))
print(" ")