ML-Split Command Examples¶
Practical examples for using nbctl ml-split to create production ML pipelines.
Basic Usage¶
Split ML Notebook¶
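Assuming a notebook named `ml_model.ipynb` (as in later examples) and the default `ml_pipeline/` output directory, a basic split looks like:

```bash
# Split the notebook into a Python pipeline package
nbctl ml-split ml_model.ipynb
```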
Result:
```
ml_pipeline/
├── __init__.py
├── main.py
├── requirements.txt
├── data_collection.py
├── data_preprocessing.py
├── feature_engineering.py
├── model_training.py
└── model_evaluation.py
```
Custom Output Directory¶
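The `--output` flag (used throughout the later examples) selects the destination directory; the name here is illustrative:

```bash
# Write the generated pipeline to a custom directory
nbctl ml-split ml_model.ipynb --output my_pipeline/
```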
Run Generated Pipeline¶
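The generated pipeline is plain Python, so running it follows the usual steps (assuming the default `ml_pipeline/` output):

```bash
cd ml_pipeline/
pip install -r requirements.txt
python main.py
```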
Notebook Structure¶
Organize Your Notebook¶
Structure with clear headers for best results:
```markdown
# Data Collection
[code to load data]

# Data Preprocessing
[code to clean data]

# Feature Engineering
[code to create features]

# Model Training
[code to train model]

# Model Evaluation
[code to evaluate model]
```
Generated Code Examples¶
Module Structure¶
Each generated module follows the same pattern:

```python
# data_collection.py
def run(context=None):
    """Execute data collection step."""
    import pandas as pd

    # Load data
    df = pd.read_csv('data.csv')

    # Return variables for next step
    return locals()
```
Main Pipeline¶
Generated main.py:
```python
import data_collection
import data_preprocessing
import model_training

# Execute pipeline
context = data_collection.run()
context = data_preprocessing.run(context)
context = model_training.run(context)

print("Pipeline completed successfully!")
```
Complete Workflow¶
From Notebook to Production¶
```bash
#!/bin/bash
# notebook-to-production.sh

# 1. Develop in notebook
#    (work on ml_experiment.ipynb)

# 2. Split into pipeline
nbctl ml-split ml_experiment.ipynb --output ml_pipeline/

# 3. Set up Python package
cd ml_pipeline/
pip install -r requirements.txt

# 4. Test pipeline
python main.py

# 5. Deploy
#    Copy ml_pipeline/ to production server
```
Advanced Examples¶
Multiple ML Notebooks¶
Split multiple experiments:
```bash
#!/bin/bash
# split-all-experiments.sh

for nb in experiments/*.ipynb; do
    name=$(basename "$nb" .ipynb)
    nbctl ml-split "$nb" --output "pipelines/$name/"
done

echo "All notebooks split into pipelines"
```
Custom Pipeline Structure¶
```bash
# Split to specific directory structure
nbctl ml-split ml_model.ipynb --output src/models/classifier/

# Result:
# src/models/classifier/
# ├── data_collection.py
# ├── model_training.py
# └── ...
```
Testing Generated Pipeline¶
```bash
# Split notebook
nbctl ml-split ml_model.ipynb

# Test pipeline
cd ml_pipeline

# Install dependencies
pip install -r requirements.txt

# Run tests
python -m pytest tests/  # if you add tests

# Run pipeline
python main.py
```
Real-World Examples¶
Image Classification Pipeline¶
Notebook structure:
```markdown
# Data Collection
- Load image dataset
- Download from S3

# Data Preprocessing
- Resize images
- Normalize pixels

# Data Augmentation
- Random flips
- Color jittering

# Model Training
- Define CNN architecture
- Train with callbacks

# Model Evaluation
- Calculate accuracy
- Generate confusion matrix

# Model Saving
- Save model weights
- Export to ONNX
```
Split and run:
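For example (the notebook and output names are illustrative):

```bash
nbctl ml-split image_classification.ipynb --output image_pipeline/
cd image_pipeline/
pip install -r requirements.txt
python main.py
```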
NLP Pipeline¶
Notebook structure:
```markdown
# Data Collection
- Load text corpus
- Download pretrained model

# Data Preprocessing
- Tokenization
- Cleaning text

# Feature Engineering
- TF-IDF vectors
- Word embeddings

# Model Training
- Fine-tune BERT
- Train classifier

# Model Evaluation
- Calculate F1 score
- Classification report
```
Time Series Forecasting¶
Notebook structure:

```markdown
# Data Collection
- Load historical data
- Fetch from API

# Data Preprocessing
- Handle missing values
- Normalize features

# Feature Engineering
- Create lag features
- Rolling statistics

# Data Splitting
- Train/val/test split
- Time-based split

# Model Training
- Train LSTM
- Hyperparameter tuning

# Model Evaluation
- RMSE, MAE metrics
- Plot predictions
```
CI/CD Integration¶
GitHub Actions¶
.github/workflows/ml-pipeline.yml:
```yaml
name: ML Pipeline

on: [push]

jobs:
  test-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v2

      - name: Install nbctl
        run: pip install nbctl

      - name: Generate pipeline
        run: |
          nbctl ml-split notebooks/ml_model.ipynb --output ml_pipeline/

      - name: Test pipeline
        run: |
          cd ml_pipeline
          pip install -r requirements.txt
          python main.py
```
Docker Deployment¶
```dockerfile
FROM python:3.9-slim

WORKDIR /app

# Copy generated pipeline
COPY ml_pipeline/ /app/

# Install dependencies
RUN pip install -r requirements.txt

# Run pipeline
CMD ["python", "main.py"]
```
Build and run:
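With the Dockerfile above (the image tag is illustrative):

```bash
docker build -t ml-pipeline .
docker run --rm ml-pipeline
```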
Customization Examples¶
Add Logging¶
Modify generated main.py:
```python
import logging

import data_collection
import model_training

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Execute pipeline with logging
logger.info("Starting pipeline...")
context = data_collection.run()
logger.info("Data collected")

context = model_training.run(context)
logger.info("Model trained")

logger.info("Pipeline completed!")
```
Add Error Handling¶
```python
import sys

import data_collection
import model_training

try:
    context = data_collection.run()
    context = model_training.run(context)
    print("Pipeline completed successfully")
except Exception as e:
    print(f"Pipeline failed: {e}")
    sys.exit(1)
```
Add Configuration¶
Create config.py:
```python
# config.py
DATA_PATH = 'data/train.csv'
MODEL_PATH = 'models/classifier.pkl'
BATCH_SIZE = 32
EPOCHS = 10
```
Use in pipeline:
```python
import config
import data_collection

context = data_collection.run()
context['data_path'] = config.DATA_PATH
```
Tips & Best Practices¶
1. Organize Notebook Well¶
Use clear section headers:
```markdown
# Data Collection   (not "Load stuff")
# Model Training    (not "Training the model")
```
2. Test Before Splitting¶
```bash
# Run notebook to verify it works
nbctl run ml_model.ipynb

# Then split
nbctl ml-split ml_model.ipynb
```
3. Version Your Pipeline¶
```bash
# Version 1
nbctl ml-split ml_model.ipynb --output ml_pipeline_v1/

# Version 2 (after improvements)
nbctl ml-split ml_model.ipynb --output ml_pipeline_v2/
```
4. Document Generated Code¶
Add docstrings to generated modules:
```python
# In data_collection.py
def run(context=None):
    """
    Data Collection Step

    Loads training data from CSV files.

    Returns:
        dict: Contains 'df' (DataFrame) and 'labels'
    """
    # ... generated code ...
```
Troubleshooting¶
No Sections Detected¶
If the notebook has no markdown headers, add them and split again:

```bash
# Add headers to the notebook:
#   # Data Collection
#   # Data Preprocessing
#   # etc.

# Then split again
nbctl ml-split ml_model.ipynb
```
Module Import Errors¶
```bash
# Ensure all dependencies are installed
cd ml_pipeline/
pip install -r requirements.txt

# Or add missing dependencies
pip install scikit-learn pandas numpy
```
Context Passing Issues¶
Ensure variable names are consistent across steps:

```python
# In data_collection.py
df = pd.read_csv(...)
return locals()  # Returns {'df': ...}

# In data_preprocessing.py
if context:
    df = context['df']  # Use the same name as the previous step
```
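The context-passing pattern can be sketched end to end with plain functions. This is a minimal illustration of the idea, not actual generated code; the step names and data are assumptions:

```python
# Minimal sketch of the context-passing pattern used by generated modules.
# Each step returns locals(); the next step reads variables by name.

def data_collection(context=None):
    """Produce data and hand all local variables to the next step."""
    records = [1, 2, 3, 4]
    return locals()

def data_preprocessing(context=None):
    """Read 'records' from the previous step's context and transform it."""
    records = context['records']          # same variable name as upstream
    doubled = [r * 2 for r in records]
    return locals()

context = data_collection()
context = data_preprocessing(context)
print(context['doubled'])  # [2, 4, 6, 8]
```

If a step renames a variable, downstream steps raise `KeyError`, which is the most common context-passing failure.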
Related Examples¶
- Run Examples - Execute notebooks before splitting
- Info Examples - Analyze before splitting
- Extract Examples - Extract results