Accessing Data for Machine Learning Models

This document describes different approaches to accessing and preparing data required for training machine learning models.

Data - sequences

Retrieving Sequences in Parquet Format

Parquet is a columnar storage format optimized for large-scale data processing. It is widely used in machine learning pipelines due to its efficiency and compatibility with distributed systems.

To access a sequence in Parquet format, you only need to construct the correct URL pointing to the resource. The general pattern is:

http://localhost/v1/data/sequences/<sequenceId>.parquet?dataProjectId=<projectId>
  • sequenceId - The unique identifier of the sequence. You can obtain this ID from the Data Workspace.
  • dataProjectId - The query parameter that carries the identifier (<projectId>) of the project in which the sequence resides.
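
The URL pattern above can also be assembled programmatically, which avoids typos and escapes any unusual characters in the identifiers. The following is a minimal sketch, not part of the service itself: the function name build_sequence_url and the base_url parameter are hypothetical, and the default host http://localhost is simply taken from the pattern above.

```python
from urllib.parse import quote, urlencode


def build_sequence_url(
    sequence_id: str,
    data_project_id: str,
    base_url: str = "http://localhost",  # assumption: host from the pattern above
) -> str:
    """Build the Parquet download URL for one sequence."""
    # Escape the ID so it is safe to embed in the URL path.
    path = f"/v1/data/sequences/{quote(sequence_id, safe='')}.parquet"
    # Encode the project ID as the dataProjectId query parameter.
    query = urlencode({"dataProjectId": data_project_id})
    return f"{base_url.rstrip('/')}{path}?{query}"


url = build_sequence_url(
    "0add4bdc-cff6-4f26-a904-c38b5956e60b",
    "680b61b0aedd6f9e639d8699",
)
print(url)
```

The resulting string can be passed to any tool that accepts an HTTP URL.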

Because the sequence is served over plain HTTP, standard tools such as pandas can read it directly from the URL. The notebook cell below installs pyarrow, a Parquet engine that pandas can use, and loads an example sequence into a DataFrame.

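# Notebook cell: install pyarrow, a Parquet engine that pandas can use.
%pip install pyarrow

import pandas as pd

# Example sequenceId and dataProjectId; replace them with your own values.
url = "http://localhost/v1/data/sequences/0add4bdc-cff6-4f26-a904-c38b5956e60b.parquet?dataProjectId=680b61b0aedd6f9e639d8699"
df = pd.read_parquet(url)
df.head(10)  # preview the first 10 rows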