nbctl ml-split

Split ML notebooks into production-ready Python pipeline modules.

Description

The ml-split command transforms machine learning notebooks into well-structured Python packages. It detects common ML workflow sections (data collection, preprocessing, training, and so on) from the notebook's markdown headers and generates one module per section, along with a runner script and a requirements file.

Use this command to:

  • Convert ML experiments to production code
  • Create deployable ML pipelines
  • Maintain proper software engineering structure
  • Enable automated ML workflows
  • Facilitate testing and deployment

Usage

nbctl ml-split NOTEBOOK [OPTIONS]

Arguments

Argument    Description                             Required
NOTEBOOK    Path to the ML Jupyter notebook file    Yes

Options

Option          Short    Type    Default        Description
--output        -o       PATH    ml_pipeline/   Output directory for the generated pipeline
--create-main            Flag    True           Create a main.py runner script
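
Typical invocations (the notebook and directory names below are placeholders):

# Split a notebook into the default ml_pipeline/ directory
nbctl ml-split experiment.ipynb

# Write the generated pipeline to a custom directory
nbctl ml-split experiment.ipynb -o churn_pipeline/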

Detected Sections

The command recognizes 7 common ML workflow patterns based on markdown headers:

Section               Module Name               Triggers
Data Collection       data_collection.py        "data collection", "load data", "import data"
Data Preprocessing    data_preprocessing.py     "preprocessing", "data cleaning", "clean data"
Feature Engineering   feature_engineering.py    "feature engineering", "features", "feature extraction"
Data Splitting        data_splitting.py         "split", "train test split", "validation split"
Model Training        model_training.py         "train", "training", "fit model"
Model Evaluation      model_evaluation.py       "evaluation", "evaluate", "test", "metrics"
Model Saving          model_saving.py           "save", "export model", "serialize"
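
How headers are matched is internal to nbctl; the following is only a minimal sketch of case-insensitive trigger matching, with an abbreviated trigger table:

# Hypothetical sketch of section detection (not nbctl's actual implementation)
SECTION_TRIGGERS = {
    "data_collection": ("data collection", "load data", "import data"),
    "data_preprocessing": ("preprocessing", "data cleaning", "clean data"),
    "model_training": ("train", "training", "fit model"),
    # ... remaining sections follow the table above
}

def detect_section(header_text):
    """Return the module name whose triggers match the header, or None."""
    text = header_text.lower()  # detection is case-insensitive
    for module, triggers in SECTION_TRIGGERS.items():
        if any(trigger in text for trigger in triggers):
            return module
    return None

detect_section("## Load Data From S3")  # -> "data_collection"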

Generated Structure

ml_pipeline/
├── __init__.py                  # Package initialization
├── main.py                      # Pipeline runner
├── requirements.txt             # Auto-generated dependencies
├── data_collection.py           # Data loading module
├── data_preprocessing.py        # Cleaning and preprocessing
├── feature_engineering.py       # Feature creation
├── data_splitting.py            # Train/test split
├── model_training.py            # Model training
├── model_evaluation.py          # Evaluation and metrics
└── model_saving.py              # Model persistence

Module Structure

Each generated module follows this pattern:

def run(context=None):
    """
    Execute the [section name] step.

    Args:
        context: Dictionary containing variables from previous steps

    Returns:
        Dictionary containing all local variables for next steps
    """
    # If context provided, extract variables
    if context:
        # Variables from previous steps available here
        pass

    # Your notebook code here
    # ...

    # Return all variables for next step
    return locals()

Pipeline Runner (main.py)

The generated main.py orchestrates the entire pipeline:

# Import the generated step modules
import data_collection
import data_preprocessing
import feature_engineering
import data_splitting
import model_training
import model_evaluation
import model_saving

# Execute the pipeline in sequence, threading context between steps
context = data_collection.run()
context = data_preprocessing.run(context)
context = feature_engineering.run(context)
context = data_splitting.run(context)
context = model_training.run(context)
context = model_evaluation.run(context)
context = model_saving.run(context)

print("Pipeline completed successfully!")

Context Passing

Variables automatically flow between pipeline steps:

# data_collection.py (Step 1: load the raw data)
import pandas as pd

def run(context=None):
    df = pd.read_csv('data.csv')
    return locals()  # Returns {'df': DataFrame, ...}

# data_preprocessing.py (Step 2: preprocessing, receives df)
def run(context=None):
    if context:
        df = context['df']  # Access df from the previous step
    df_clean = preprocess(df)  # preprocess() is your notebook's cleaning code
    return locals()  # Returns {'df': ..., 'df_clean': ...}

# model_training.py (Step 3: training, receives df and df_clean)
def run(context=None):
    if context:
        df_clean = context['df_clean']  # Access the cleaned data
    model = train(df_clean)  # train() is your notebook's training code
    return locals()

Requirements Generation

Dependencies are automatically extracted from imports:

# Detected imports in notebook
import pandas
import numpy
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot

# Generated requirements.txt
pandas>=1.0.0
numpy>=1.19.0
scikit-learn>=0.24.0
matplotlib>=3.0.0
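
The mapping from import names to PyPI package names (sklearn to scikit-learn above) suggests an approach like the following sketch; this is illustrative only, not nbctl's actual implementation:

import ast

# Some import names differ from their PyPI package names
PYPI_NAMES = {"sklearn": "scikit-learn", "cv2": "opencv-python"}

def extract_packages(source):
    """Collect top-level package names from import statements in source code."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            packages.add(node.module.split(".")[0])
    return {PYPI_NAMES.get(p, p) for p in packages}

extract_packages("from sklearn.ensemble import RandomForestClassifier")
# -> {'scikit-learn'}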

Output Messages

Success

ML Pipeline created successfully!

Generated modules:
  - data_collection.py (15 lines)
  - data_preprocessing.py (42 lines)
  - feature_engineering.py (28 lines)
  - model_training.py (35 lines)
  - model_evaluation.py (22 lines)
  - main.py (runner)
  - requirements.txt (5 dependencies)

Output directory: ml_pipeline/

To run the pipeline:
  cd ml_pipeline
  pip install -r requirements.txt
  python main.py

Warnings

⚠ Warning: No section headers detected
  Consider adding markdown headers to organize your notebook:
  # Data Collection
  # Data Preprocessing
  # Model Training
  etc.

  All code will be placed in a single module.

Exit Codes

Code   Meaning
0      Success, pipeline generated
1      File not found or invalid notebook
2      No code cells found
3      Output directory creation failed
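
These codes can be branched on from a wrapper script; a minimal sketch in Python (the notebook name is a placeholder):

import subprocess

# Run ml-split and branch on its documented exit codes
result = subprocess.run(["nbctl", "ml-split", "experiment.ipynb"])
if result.returncode == 0:
    print("Pipeline generated")
elif result.returncode == 2:
    print("Notebook contains no code cells")
else:
    print(f"ml-split failed with exit code {result.returncode}")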

Notes

  • Section detection: Based on markdown headers (case-insensitive)
  • Code grouping: Code cells are grouped by their nearest preceding header (see the sketch after this list)
  • Import extraction: All import statements are collected for requirements.txt
  • Context preservation: Variables flow between steps automatically
  • Runnable immediately: Generated pipeline can be executed right away
  • No notebook dependency: Generated code doesn't require Jupyter
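
A minimal sketch of the nearest-preceding-header grouping rule, assuming each cell is a (cell_type, source) pair; again illustrative, not nbctl's actual code:

def group_cells(cells):
    """Group code-cell sources under the most recent markdown header."""
    groups = {"ungrouped": []}
    current = "ungrouped"
    for cell_type, source in cells:
        if cell_type == "markdown" and source.lstrip().startswith("#"):
            current = source.lstrip("# ").strip()
            groups.setdefault(current, [])
        elif cell_type == "code":
            groups[current].append(source)
    return groups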

Best Practices

Notebook Organization

Structure your notebook with clear markdown headers:

# Data Collection
[code cells for loading data]

# Data Preprocessing
[code cells for cleaning]

# Feature Engineering
[code cells for feature creation]

# Model Training
[code cells for training]

# Model Evaluation
[code cells for evaluation]

Variable Naming

Use consistent variable names that make sense across steps:

  • df for main DataFrame
  • X_train, X_test for features
  • y_train, y_test for labels
  • model for trained model
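
With these conventions, a splitting step might look like the following sketch (the 'target' column name and the use of scikit-learn's train_test_split are assumptions for illustration):

# data_splitting.py, using the conventional names end to end
from sklearn.model_selection import train_test_split

def run(context=None):
    if context:
        df = context['df']
    X = df.drop(columns=['target'])  # 'target' is a placeholder label column
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return locals()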

Module Independence

While modules can access context:

  • Minimize dependencies between steps
  • Make each step as self-contained as possible
  • Document expected inputs and outputs
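
One lightweight way to document expected inputs and outputs is in the run() docstring, as in this sketch (assuming a scikit-learn style estimator in context):

def run(context=None):
    """
    Execute the model evaluation step.

    Expects in context:
        model: trained estimator from model_training
        X_test, y_test: hold-out data from data_splitting

    Adds to the returned variables:
        accuracy: float, hold-out accuracy score
    """
    if context:
        model = context['model']
        X_test = context['X_test']
        y_test = context['y_test']
    accuracy = model.score(X_test, y_test)
    return locals()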

See Also

  • run - Execute the generated pipeline
  • info - Analyze notebook before splitting
  • extract - Extract outputs for analysis