Notebooks

This section explains how to run and test code directly in Notebooks, making it easy to experiment, visualize, and prototype.

1 - Accessing Data for Machine Learning Models

This document describes different approaches to accessing and preparing data required for training machine learning models.

Data - sequences

Retrieving Sequences in Parquet Format

Parquet is a columnar storage format optimized for large-scale data processing. It is widely used in machine learning pipelines due to its efficiency and compatibility with distributed systems.

To access a sequence in Parquet format, you only need to construct the correct URL pointing to the resource. The general pattern is:

http://localhost/v1/data/sequences/<sequenceId>.parquet?dataProjectId=<projectId>
  • sequenceId - The unique identifier of the sequence. You can obtain this ID from the Data Workspace.
  • dataProjectId - The identifier of the project in which the sequence resides.
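The URL can also be assembled programmatically. A minimal sketch (the `build_sequence_url` helper and the `BASE_URL` value are illustrative, not part of the platform API):

```python
# Illustrative helper for assembling the Parquet download URL.
# BASE_URL and build_sequence_url are example names, not platform APIs.
BASE_URL = "http://localhost"

def build_sequence_url(sequence_id: str, data_project_id: str) -> str:
    """Return the Parquet endpoint URL for the given sequence and project."""
    return (
        f"{BASE_URL}/v1/data/sequences/{sequence_id}.parquet"
        f"?dataProjectId={data_project_id}"
    )

url = build_sequence_url(
    "0add4bdc-cff6-4f26-a904-c38b5956e60b", "680b61b0aedd6f9e639d8699"
)
print(url)
```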

Sequences can be retrieved using standard ML tools like Pandas, simply by constructing the correct URL with the sequenceId and dataProjectId.

%pip install pandas pyarrow

import pandas as pd
df = pd.read_parquet("http://localhost/v1/data/sequences/0add4bdc-cff6-4f26-a904-c38b5956e60b.parquet?dataProjectId=680b61b0aedd6f9e639d8699")
df.head(10)

2 - Model Training

The model training section demonstrates how to build, track, and manage machine learning experiments in Python.

Building effective machine learning models requires not only robust algorithms but also a well-structured workflow for experimentation, tracking, and reproducibility. Python has become the de facto language for machine learning due to its rich ecosystem of libraries such as scikit-learn, TensorFlow, PyTorch, and XGBoost, which provide powerful tools for model development across classical and deep learning tasks.

To complement these libraries, MLflow offers an open-source platform SDK to manage the end-to-end machine learning lifecycle. It enables:

  • Experiment tracking: Logging parameters, metrics, and artifacts for each run.
  • Model management: Packaging models in a standardized format for deployment.
  • Reproducibility: Ensuring experiments can be replicated across environments.
  • Collaboration: Sharing results and models across teams.

Dependencies

Before starting model training, ensure that the required Python libraries are installed. These dependencies provide the core functionality for building and tracking machine learning experiments.

Scikit-learn

Run the following command in your notebook:

pip install scikit-learn mlflow==3.5.1

Use the following code snippet as a template for training a machine learning model with Scikit-learn:

# Original source code and more details can be found in:
# https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html

# The data set used in this example is from
# http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties.
# In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Read the wine-quality csv file from the URL
    csv_url = (
        "http://archive.ics.uci.edu/ml"
        "/machine-learning-databases/wine-quality/winequality-red.csv"
    )
    try:
        data = pd.read_csv(csv_url, sep=";")
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, "
            "check your internet connection. Error: %s",
            e,
        )
        sys.exit(1)  # stop here; `data` is undefined if the download failed

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    alpha = 0.5
    l1_ratio = 0.5

    experiment_name = "wine-classification"

    existing_experiment = mlflow.get_experiment_by_name(experiment_name)
    if existing_experiment is None:
        experiment_id = mlflow.create_experiment(
            name=experiment_name
        )
    else:
        experiment_id = existing_experiment.experiment_id

    mlflow.set_experiment(experiment_name)

    # Add or update tags to the created experiment.
    mlflow.set_experiment_tags({
        "project_name": "Fraud Prevention",
        "team": "Data Science Core",
        "priority": "High"
    })

    with mlflow.start_run(experiment_id=experiment_id):
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)
        # You can tag each run under an experiment independently.
        mlflow.set_tag("version", "1.0")

        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
        model_signature = infer_signature(train_x, train_y)

        if tracking_url_type_store != "file":
            mlflow.sklearn.log_model(
                lr,
                "my-new-model",
                registered_model_name="ElasticnetWineModel",
                input_example=train_x.head(1),
                signature=model_signature,
            )
        else:
            mlflow.sklearn.log_model(lr, "model", signature=model_signature)
        
print("done.")

Script Breakdown

  • Dataset: Wine Quality dataset from UCI ML repository.
  • Model: ElasticNet regression (combines L1 and L2 regularization).
  • Metrics logged: RMSE, MAE, R².
  • Parameters, metrics, and the trained model are logged automatically.
  • Creates or reuses an experiment (wine-classification) and stores results there.
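The three metrics computed in eval_metrics can be reproduced with plain NumPy, which makes the underlying formulas explicit. A small sketch with toy values (the numbers are illustrative, not from the wine dataset):

```python
import numpy as np

# Toy ground-truth values and predictions, purely illustrative.
actual = np.array([3.0, 5.0, 7.0, 9.0])
pred = np.array([2.5, 5.0, 7.5, 9.0])

# Same formulas that mean_squared_error, mean_absolute_error,
# and r2_score apply under the hood.
rmse = np.sqrt(np.mean((actual - pred) ** 2))
mae = np.mean(np.abs(actual - pred))
ss_res = np.sum((actual - pred) ** 2)               # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)      # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}")
```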

PyTorch

Run the following command in your notebook:

pip install torch torchvision mlflow==3.5.1

Use the following code snippet as a template for training a machine learning model with PyTorch:

from mlflow.types import Schema, TensorSpec
import mlflow
import mlflow.pytorch
from mlflow.models import ModelSignature
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import numpy as np
import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

# Transformations for MNIST images
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Simple neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Define input schema
inputs_schema = Schema([TensorSpec(type=np.dtype(np.float32), shape=(-1, 1, 28, 28))])

# Define output schema
outputs_schema = Schema([TensorSpec(type=np.dtype(np.float32), shape=(-1, 10))])

# Create the signature
model_signature = ModelSignature(inputs=inputs_schema, outputs=outputs_schema)

model = Net()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

experiment_name = "Testing"

existing_experiment = mlflow.get_experiment_by_name(experiment_name)
if existing_experiment is None:
    experiment_id = mlflow.create_experiment(
        name=experiment_name,
        artifact_location="mlflow-artifacts:/pytorch-artifacts"
    )
else:
    experiment_id = existing_experiment.experiment_id

mlflow.set_experiment(experiment_name)

# Add or update tags to the created experiment.
mlflow.set_experiment_tags({
    "project_name": "Fraud Prevention",
    "team": "Data Science Core",
    "priority": "High"
})

input_example = None  # will hold one sample batch for the logged model

with mlflow.start_run():
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("batch_size", 64)

    for epoch in range(5):  # train for 5 epochs
        model.train()
        train_loss = 0
        correct = 0
        total = 0

        for data, target in train_loader:
            if input_example is None:
                input_example = data.numpy()  # capture one batch as the input example
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            _, predicted = torch.max(output.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

        epoch_loss = train_loss / len(train_loader)
        epoch_acc = correct / total

        # Log metrics per epoch
        mlflow.log_metric("train_loss", epoch_loss, step=epoch)
        mlflow.log_metric("train_accuracy", epoch_acc, step=epoch)
        # You can tag each run under an experiment independently.
        mlflow.set_tag("version", "1.0")

        print(f"Epoch {epoch+1}, Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.4f}")

    # Save trained model
    mlflow.pytorch.log_model(model, "mnist_model", signature=model_signature, input_example=input_example)

Real-Time Feedback During Training

When training a model inside a notebook, you receive real-time feedback on:

  • Training progress: logs and outputs displayed directly in the notebook cells and UI.
  • Evaluation results: metrics such as RMSE, MAE, or accuracy printed immediately after each run.
  • Trained model artifacts: confirmation that the model has been saved and registered.

Model deployment

If your model is ready and you would like to deploy it so it can be used in other parts of the platform, proceed to the Model Deployments section of the Getting Started guide.

3 - Working with LLMs in Notebooks

Getting started guide for integrating Large Language Models (LLMs) into your notebook workflows.

This documentation explains how to communicate with Large Language Models (LLMs) directly from a notebook environment.

Requirements

Before you start, make sure you have the necessary dependencies installed in your notebook environment.

Install the OpenAI SDK

The OpenAI SDK (openai) is only required if you want to write code in Python and communicate with LLMs via the SDK.

pip install openai

Listing models

Before starting, you may want to see which models are available in your environment. This helps you choose the right model for your task.

import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1",
    api_key="",
    default_headers={
        "Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")
    },
)

models = client.models.list()

print(models)

Alternatively, you can query the same endpoint with a plain HTTP request:

import requests
import os
import json

response = requests.get(
    os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1/models",
    headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
)

if response.status_code == 200:
    print("Models list:")
    print(json.dumps(response.json(), indent=4))

else:
    print("Error:", response)

This will output a list of model identifiers (e.g., gpt-4.1, gpt-4o-mini, etc.) that you can use in subsequent calls.
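The exact response shape depends on the backend, but an OpenAI-style /models payload can be reduced to a plain list of identifiers. A sketch using an illustrative sample payload (real data comes from the API call above):

```python
import json

# Illustrative OpenAI-style /models response; real payloads come from the API.
payload = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "gpt-4.1", "object": "model"},
    {"id": "gpt-4o-mini", "object": "model"}
  ]
}
""")

# Each entry under "data" describes one model; "id" is what you pass
# as the `model` argument in subsequent chat or completion calls.
model_ids = [m["id"] for m in payload["data"]]
print(model_ids)
```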

Creating chats

Chats allow you to interact with an LLM in a conversational style. You can provide a sequence of messages, and the model will respond accordingly.

import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1",
    api_key="",
    default_headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    messages=[
        {"role": "system", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)

print(response)

Alternatively, the same request can be sent directly to the REST API:

import requests
import os
import json

data = {
    "model": "google/gemma-3-12b-it",
    "messages": [{"role": "user", "content": "What time is it in Poland"}],
}

response = requests.post(
    os.environ.get("FATHOM_SDK_BASE_URL") + "/llms/v1/chat/completions",
    headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
    json=data,
)

if response.status_code == 200:
    print("Success:")
    print(json.dumps(response.json(), indent=4))

else:
    print("Error:", response)

Direct Communication with a Custom LLM Endpoint

In some cases, you may want to communicate with an LLM that is not OpenAI-compatible. This usually means the model is hosted on a custom server or API endpoint. Instead of using the built-in chat.completions.create or completions.create methods, you can send requests directly to your endpoint using standard HTTP libraries such as requests.

import requests
import os
import json

backend_uri = "/v1/backends/gemini/"  # URI retrieved from the models list

data = {
    "model": "models/gemini-2.5-flash",
    "messages": [
        {"role": "user", "content": "What time is it in Poland"}
    ]
}

response = requests.post(
    os.environ.get("FATHOM_SDK_BASE_URL") + "/llms" + backend_uri + "chat/completions",
    headers={"Authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")},
    json=data,
)

if response.status_code == 200:
    print("Success:")
    print(json.dumps(response.json(), indent=4))

else:
    print("Error:", response)

4 - Working with Databases in Notebooks

This page explains how to integrate databases into your notebook workflows.

Qdrant

Qdrant is a vector database designed for storing and searching embeddings, making it a powerful tool in machine learning workflows. In a notebook context, it allows you to seamlessly manage collections of vectors generated by LLMs, enabling tasks like semantic search or similarity matching. By integrating Qdrant with LLM outputs, you can build intelligent applications that combine natural language understanding with efficient vector-based retrieval.

Requirements

Before you start, make sure you have the necessary dependencies installed in your notebook environment.

Python - Qdrant SDK

pip install qdrant_client

Listing collections

You can list all collections available in your Qdrant instance. This is useful to check which datasets are already stored.

from qdrant_client.async_qdrant_client import AsyncQdrantClient
import os

q = AsyncQdrantClient(
    url=os.environ.get("FATHOM_SDK_BASE_URL"),
    check_compatibility=False,
    prefix=os.environ.get("FATHOM_SDK_SERVICE_PATH_VECTOR_DATABASE").rstrip("/"),
    timeout=30,
    headers={
        "authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")
    },
)

all_collections = await q.get_collections()

print(all_collections)

This will return metadata about all collections currently stored in Qdrant.

Creating a collection

You can create a new collection to store vectors. When creating a collection, you need to specify the vector size and distance metric.

from qdrant_client.async_qdrant_client import AsyncQdrantClient
from qdrant_client.http.models import VectorParams
import os

q = AsyncQdrantClient(
    url=os.environ.get("FATHOM_SDK_BASE_URL"),
    check_compatibility=False,
    prefix=os.environ.get("FATHOM_SDK_SERVICE_PATH_VECTOR_DATABASE").rstrip("/"),
    timeout=30,
    headers={
        "authorization": os.environ.get("FATHOM_SDK_AUTHORIZATION")
    },
)

result = await q.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(
        size=128,
        distance="Cosine"
    )
)

print(result)

This example creates a collection named my_collection with vectors of size 128 and cosine similarity as the distance metric.
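To make the choice of metric concrete: cosine similarity compares the direction of two vectors, ignoring their magnitude, which is why it suits embeddings. A pure-Python sketch with toy 3-dimensional vectors (Qdrant computes this internally; you never call it yourself):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0 regardless of length; orthogonal vectors score 0.0.
print(cosine_similarity([1, 0, 0], [2, 0, 0]))  # 1.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0
```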