Notebooks
1 - Accessing Data for Machine Learning Models
Data - sequences
Retrieving Sequences in Parquet Format
Parquet is a columnar storage format optimized for large-scale data processing. It is widely used in machine learning pipelines due to its efficiency and compatibility with distributed systems.
To access a sequence in Parquet format, you only need to construct the correct URL pointing to the resource. The general pattern is:
http://localhost/v1/data/sequences/<sequenceId>.parquet?dataProjectId=<projectId>
- sequenceId - The unique identifier of the sequence. You can obtain this ID from the Data Workspace.
- dataProjectId - The identifier of the project in which the sequence resides.
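The pattern above is plain string interpolation, so it can be wrapped in a small helper (the name build_sequence_url is hypothetical, not part of the platform API):

```python
def build_sequence_url(base, sequence_id, project_id):
    # Follows the documented pattern:
    # <base>/v1/data/sequences/<sequenceId>.parquet?dataProjectId=<projectId>
    return f"{base}/v1/data/sequences/{sequence_id}.parquet?dataProjectId={project_id}"

url = build_sequence_url(
    "http://localhost",
    "0add4bdc-cff6-4f26-a904-c38b5956e60b",
    "680b61b0aedd6f9e639d8699",
)
print(url)
```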
Sequences can be retrieved using standard ML tools like Pandas, simply by constructing the correct URL with the sequenceId and dataProjectId.
%pip install pyarrow
import pandas as pd
df = pd.read_parquet("http://localhost/v1/data/sequences/0add4bdc-cff6-4f26-a904-c38b5956e60b.parquet?dataProjectId=680b61b0aedd6f9e639d8699")
df.head(10)
2 - Model training
Building effective machine learning models requires not only robust algorithms but also a well-structured workflow for experimentation, tracking, and reproducibility. Python has become the de facto language for machine learning due to its rich ecosystem of libraries such as scikit-learn, TensorFlow, PyTorch, and XGBoost, which provide powerful tools for model development across classical and deep learning tasks.
To complement these libraries, MLflow provides an open-source platform and SDK for managing the end-to-end machine learning lifecycle. It enables:
- Experiment tracking: Logging parameters, metrics, and artifacts for each run.
- Model management: Packaging models in a standardized format for deployment.
- Reproducibility: Ensuring experiments can be replicated across environments.
- Collaboration: Sharing results and models across teams.
Important
The platform has been designed so that you can use MLflow natively, without additional configuration. MLflow integrates seamlessly and transparently with our services.
Dependencies
Before starting model training, ensure that the required Python libraries are installed. These dependencies provide the core functionality for building and tracking machine learning experiments.
Scikit-learn
Run the following command in your notebook:
pip install scikit-learn mlflow==3.5.1
Use the following code snippet as a template for training a machine learning model with Scikit-learn:
# Original source code and more details can be found in:
# https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
# The data set used in this example is from
# http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties.
# In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
import warnings
import sys
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
import logging
logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2

if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Read the wine-quality csv file from the URL
    csv_url = (
        "http://archive.ics.uci.edu/ml"
        "/machine-learning-databases/wine-quality/winequality-red.csv"
    )
    try:
        data = pd.read_csv(csv_url, sep=";")
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, "
            "check your internet connection. Error: %s",
            e,
        )

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    alpha = 0.5
    l1_ratio = 0.5

    experiment_name = "wine-classification"
    existing_experiment = mlflow.get_experiment_by_name(experiment_name)
    if existing_experiment is None:
        experiment_id = mlflow.create_experiment(name=experiment_name)
    else:
        experiment_id = existing_experiment.experiment_id
    mlflow.set_experiment(experiment_name)

    # Add or update tags to the created experiment.
    mlflow.set_experiment_tags({
        "project_name": "Fraud Prevention",
        "team": "Data Science Core",
        "priority": "High",
    })

    with mlflow.start_run(experiment_id=experiment_id):
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        # You can tag each run under an experiment independently.
        mlflow.set_tag("version", "1.0")

        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
        model_signature = infer_signature(train_x, train_y)

        if tracking_url_type_store != "file":
            mlflow.sklearn.log_model(
                lr,
                "my-new-model",
                registered_model_name="ElasticnetWineModel",
                input_example=train_x.head(1),
                signature=model_signature,
            )
        else:
            mlflow.sklearn.log_model(lr, "model", signature=model_signature)

    print("done.")
Script Breakdown
- Dataset: Wine Quality dataset from UCI ML repository.
- Model: ElasticNet regression (combines L1 and L2 regularization).
- Metrics logged: RMSE, MAE, R².
- Parameters, metrics, and the trained model are logged automatically.
- Creates or reuses an experiment (wine-classification) and stores results there.
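As a sanity check, the three logged metrics can be reproduced with plain NumPy on toy values, without scikit-learn (a sketch; the function mirrors the script's eval_metrics):

```python
import numpy as np

def eval_metrics(actual, pred):
    # The same three metrics the training script logs, in plain NumPy.
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    rmse = float(np.sqrt(np.mean((actual - pred) ** 2)))
    mae = float(np.mean(np.abs(actual - pred)))
    ss_res = float(np.sum((actual - pred) ** 2))
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Perfect predictions give RMSE 0, MAE 0, R² 1.
print(eval_metrics([3, 4, 5, 6], [3, 4, 5, 6]))  # (0.0, 0.0, 1.0)
```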
PyTorch
Run the following command in your notebook:
pip install torch torchvision mlflow==3.5.1
Use the following code snippet as a template for training a machine learning model with PyTorch:
import mlflow
import mlflow.pytorch
from mlflow.types import Schema, TensorSpec
from mlflow.models import ModelSignature
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import numpy as np
import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)
# Transformations for MNIST images
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)
# Simple neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Define input schema
inputs_schema = Schema([TensorSpec(type=np.dtype(np.float32), shape=(-1, 1, 28, 28))])
# Define output schema
outputs_schema = Schema([TensorSpec(type=np.dtype(np.float32), shape=(-1, 10))])
# Create the signature
model_signature = ModelSignature(inputs=inputs_schema, outputs=outputs_schema)
model = Net()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
experiment_name = "Testing"
existing_experiment = mlflow.get_experiment_by_name(experiment_name)
if existing_experiment is None:
    experiment_id = mlflow.create_experiment(
        name=experiment_name,
        artifact_location="mlflow-artifacts:/pytorch-artifacts"
    )
else:
    experiment_id = existing_experiment.experiment_id
mlflow.set_experiment(experiment_name)

# Add or update tags to the created experiment.
mlflow.set_experiment_tags({
    "project_name": "Fraud Prevention",
    "team": "Data Science Core",
    "priority": "High"
})
input_example = None  # will hold a sample batch for logging the model
with mlflow.start_run():
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("batch_size", 64)
    for epoch in range(5):  # train for 5 epochs
        model.train()
        train_loss = 0
        correct = 0
        total = 0
        for data, target in train_loader:
            input_example = data.numpy()
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            _, predicted = torch.max(output.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
        epoch_loss = train_loss / len(train_loader)
        epoch_acc = correct / total

        # Log metrics per epoch
        mlflow.log_metric("train_loss", epoch_loss, step=epoch)
        mlflow.log_metric("train_accuracy", epoch_acc, step=epoch)

        # You can tag each run under an experiment independently.
        mlflow.set_tag("version", "1.0")
        print(f"Epoch {epoch+1}, Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.4f}")

    # Save trained model
    mlflow.pytorch.log_model(model, "mnist_model", signature=model_signature, input_example=input_example)
Real-Time Feedback During Training
When training a model inside a notebook, the platform keeps you informed in real time about:
- Training progress: logs and outputs displayed directly in the notebook cells and UI.
- Evaluation results: metrics such as RMSE, MAE, or accuracy printed immediately after each run.
- Trained model artifacts: confirmation that the model has been saved and registered.
Model deployment
If your model is ready and you would like to deploy it so it can be used in other parts of the platform, proceed to the Model Deployments section of the Getting Started guide.
3 - Working with LLMs in Notebooks
This documentation explains how to communicate with Large Language Models (LLMs) directly from a notebook environment.
Requirements
Before you start, make sure you have the necessary dependencies installed in your notebook environment.
Install the OpenAI SDK
The OpenAI SDK (openai) is only required if you want to write code in Python and communicate with LLMs via the SDK.
pip install openai
Listing models
Before starting, you may want to see which models are available in your environment. This helps you choose the right model for your task.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1",
    api_key="",
    default_headers={
        "Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")
    },
)
models = client.models.list()
print(models)
import requests
import os
import json
response = requests.get(
    os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1/models",
    headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
)
if response.status_code == 200:
    print("Models list:")
    print(json.dumps(response.json(), indent=4))
else:
    print("Error:", response)

This will output a list of model identifiers (e.g., gpt-4.1, gpt-4o-mini) that you can use in subsequent calls.
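The response follows the OpenAI-style shape with a data array of model objects, so the identifiers can be pulled out with a comprehension. The payload below is an illustrative stand-in, not a real response; field names may differ per deployment:

```python
# Illustrative payload shaped like a typical models response.
payload = {"data": [{"id": "gpt-4.1"}, {"id": "gpt-4o-mini"}]}

# Collect just the model identifiers for later chat calls.
model_ids = [model["id"] for model in payload["data"]]
print(model_ids)  # ['gpt-4.1', 'gpt-4o-mini']
```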
Creating chats
Chats allow you to interact with an LLM in a conversational style. You can provide a sequence of messages, and the model will respond accordingly.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1",
    api_key="",
    default_headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    messages=[
        {"role": "developer", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)
print(response)
import requests
import os
import json
data = {
    "model": "google/gemma-3-12b-it",
    "messages": [{"role": "user", "content": "What time is it in Poland"}],
}
response = requests.post(
    os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1/chat/completions",
    headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
    json=data,
)
if response.status_code == 200:
    print("Success:")
    print(json.dumps(response.json(), indent=4))
else:
    print("Error:", response)

Direct Communication with a Custom LLM Endpoint
In some cases, you may want to communicate with an LLM that is not OpenAI-compatible. This usually means the model is hosted on a custom server or API endpoint. Instead of using the built-in chat.completions.create or completions.create methods, you can send requests directly to your endpoint using standard HTTP libraries such as requests.
Important
When listing models, each model also contains a property uris.base. Example value:
/v1/backends/gemini/
This property is the base path you must use to construct the URL for direct communication with the backend. It is only relevant when you want to bypass the SDK and talk directly to the LLM server.
import requests
import os
import json
backend_uri = "/v1/backends/gemini/"  # uri retrieved from the models list
data = {
    "model": "models/gemini-2.5-flash",
    "messages": [
        {"role": "user", "content": "What time is it in Poland"}
    ]
}
response = requests.post(
    os.environ.get("FATHOM_SDK_BASE_URL") + "/llms" + backend_uri + "chat/completions",
    headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
    json=data,
)
if response.status_code == 200:
    print("Success:")
    print(json.dumps(response.json(), indent=4))
else:
    print("Error:", response)
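Concatenating the base URL, the /llms prefix, and uris.base by hand is easy to get wrong around slashes. A small helper (the name backend_url is hypothetical) normalizes the pieces:

```python
def backend_url(base, backend_uri, endpoint):
    # Joins FATHOM_SDK_BASE_URL + "/llms" + uris.base + endpoint path,
    # stripping duplicate slashes at each seam.
    return "/".join([base.rstrip("/"), "llms", backend_uri.strip("/"), endpoint.lstrip("/")])

print(backend_url("http://localhost", "/v1/backends/gemini/", "chat/completions"))
# http://localhost/llms/v1/backends/gemini/chat/completions
```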
4 - Working with Databases in Notebooks
Deprecation Notice
Warning: Support for custom databases may be removed in the future.
Qdrant
Qdrant is a vector database designed for storing and searching embeddings, making it a powerful tool in machine learning workflows. In a notebook context, it allows you to seamlessly manage collections of vectors generated by LLMs, enabling tasks like semantic search or similarity matching. By integrating Qdrant with LLM outputs, you can build intelligent applications that combine natural language understanding with efficient vector-based retrieval.
Requirements
Before you start, make sure you have the necessary dependencies installed in your notebook environment.
Python - Qdrant SDK
pip install qdrant_client
Listing collections
You can list all collections available in your Qdrant instance. This is useful to check which datasets are already stored.
from qdrant_client.async_qdrant_client import AsyncQdrantClient
import os

q = AsyncQdrantClient(
    url=os.environ.get("FATHOM_SDK_BASE_URL"),
    check_compatibility=False,
    prefix=os.environ.get("FATHOM_SDK_SERVICE_PATH_VECTOR_DATABASE").rstrip("/"),
    timeout=30,
    headers={
        "authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")
    },
)

all_collections = await q.get_collections()
print(all_collections)

This will return metadata about all collections currently stored in Qdrant.
Creating a collection
You can create a new collection to store vectors. When creating a collection, you need to specify the vector size and distance metric.
from qdrant_client.async_qdrant_client import AsyncQdrantClient
from qdrant_client.http.models import VectorParams
import os

q = AsyncQdrantClient(
    url=os.environ.get("FATHOM_SDK_BASE_URL"),
    check_compatibility=False,
    prefix=os.environ.get("FATHOM_SDK_SERVICE_PATH_VECTOR_DATABASE").rstrip("/"),
    timeout=30,
    headers={
        "authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")
    },
)

result = await q.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(
        size=128,
        distance="Cosine"
    ),
)
print(result)

This example creates a collection named my_collection with vectors of size 128 and cosine similarity as the distance metric.
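Cosine distance, as configured here, compares vectors by angle rather than magnitude. A quick NumPy illustration of the underlying similarity measure:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 for vectors pointing the same way,
    # 0.0 for orthogonal ones, independent of vector length.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction, different length)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```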