Metaflow

How to integrate W&B with Metaflow.

Overview

Metaflow is a framework created by Netflix for building and running ML workflows.

This integration lets users apply decorators to Metaflow steps and flows to automatically log parameters and artifacts to W&B.

  • Decorating a step will turn logging off or on for certain types within that step.
  • Decorating the flow will turn logging off or on for every step in the flow.

Quickstart

Install W&B and log in

In a notebook:

!pip install -Uqqq metaflow fastcore wandb

import wandb
wandb.login()

From the command line:

pip install -Uqqq metaflow fastcore wandb
wandb login

Decorate your flows and steps

Decorating a step turns logging off or on for certain types within that step.

In this example, all datasets and models in start will be logged.

from metaflow import FlowSpec, step
import pandas as pd
import torch
import wandb

from wandb.integration.metaflow import wandb_log

class WandbExampleFlow(FlowSpec):
    @wandb_log(datasets=True, models=True, settings=wandb.Settings(...))
    @step
    def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.transform)

Decorating a flow is equivalent to decorating all the constituent steps with a default.

In this case, all steps in WandbExampleFlow default to logging datasets and models, just like decorating each step with @wandb_log(datasets=True, models=True).

from wandb.integration.metaflow import wandb_log

@wandb_log(datasets=True, models=True)  # decorate all @step functions
class WandbExampleFlow(FlowSpec):
    @step
    def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.transform)

Decorating the flow is equivalent to decorating all steps with a default. If you later decorate a step with another @wandb_log, it overrides the flow-level decoration.

In this example:

  • start and mid log both datasets and models.
  • end logs neither datasets nor models.

from wandb.integration.metaflow import wandb_log

@wandb_log(datasets=True, models=True)  # same as decorating start and mid
class WandbExampleFlow(FlowSpec):
  # this step will log datasets and models
  @step
  def start(self):
    self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
    self.model_file = torch.load(...)  # nn.Module    -> upload as model
    self.next(self.mid)

  # this step will also log datasets and models
  @step
  def mid(self):
    self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
    self.model_file = torch.load(...)  # nn.Module    -> upload as model
    self.next(self.end)

  # this step is overwritten and will NOT log datasets OR models
  @wandb_log(datasets=False, models=False)
  @step
  def end(self):
    self.raw_df = pd.read_csv(...)     # not logged (datasets=False)
    self.model_file = torch.load(...)  # not logged (models=False)

Access your data programmatically

You can access the information we’ve captured in three ways: inside the original Python process being logged using the wandb client library, with the web app UI, or programmatically using our Public API.

  • Parameters are saved to W&B’s config and can be found in the Overview tab.
  • datasets, models, and others are saved to W&B Artifacts and can be found in the Artifacts tab.
  • Base Python types are saved to W&B’s summary dict and can be found in the Overview tab.

See our guide to the Public API for details on using the API to get this information programmatically from outside the original process.

Cheat sheet

| Data | Client library | UI |
|---|---|---|
| Parameter(...) | wandb.config | Overview tab, Config |
| datasets, models, others | wandb.use_artifact("{var_name}:latest") | Artifacts tab |
| Base Python types (dict, list, str, etc.) | wandb.summary | Overview tab, Summary |

wandb_log kwargs

| kwarg | Options |
|---|---|
| datasets | True: Log instance variables that are a dataset. False: Don't log datasets. |
| models | True: Log instance variables that are a model. False: Don't log models. |
| others | True: Log anything else that is serializable as a pickle. False: Don't log other objects. |
| settings | wandb.Settings(…): Specify your own wandb settings for this step or flow. None: Equivalent to passing wandb.Settings(). |

By default:

  • If settings.run_group is None, it is set to {flow_name}/{run_id}.
  • If settings.run_job_type is None, it is set to {run_job_type}/{step_name}.
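
The defaults above can be sketched in plain Python. This is an illustration of the naming scheme as described, not the integration's actual implementation, and it uses a plain dict in place of wandb.Settings:

```python
# Sketch of how run_group and run_job_type default when left unset
# (assumption: mirrors the documented behavior, not the real code).
def resolve_defaults(settings: dict, flow_name: str, run_id: str,
                     run_job_type: str, step_name: str) -> dict:
    resolved = dict(settings)  # don't mutate the caller's settings
    if resolved.get("run_group") is None:
        resolved["run_group"] = f"{flow_name}/{run_id}"
    if resolved.get("run_job_type") is None:
        resolved["run_job_type"] = f"{run_job_type}/{step_name}"
    return resolved

print(resolve_defaults({}, "WandbExampleFlow", "1700000000", "train", "start"))
# run_group -> "WandbExampleFlow/1700000000", run_job_type -> "train/start"
```

Explicitly set values pass through untouched, so a flow- or step-level wandb.Settings always wins over these defaults.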

Frequently Asked Questions

What exactly do you log? Do you log all instance and local variables?

wandb_log only logs instance variables. Local variables are NEVER logged. This helps you avoid logging unnecessary data.

Which data types get logged?

We currently support these types:

| Logging setting | Type |
|---|---|
| default (always on) | dict, list, set, str, int, float, bool |
| datasets | pd.DataFrame, pathlib.Path |
| models | nn.Module, sklearn.base.BaseEstimator |
| others | Anything else that is serializable as a pickle |
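
The table implies a type dispatch along these lines. A simplified sketch, an assumption about behavior rather than the integration's actual code; pd.DataFrame, nn.Module, and BaseEstimator are omitted to keep the example dependency-free:

```python
from pathlib import Path

def classify(value):
    """Simplified sketch of which bucket a logged value falls into."""
    if isinstance(value, (dict, list, set, str, int, float, bool)):
        return "summary"   # base types: always logged to the summary dict
    if isinstance(value, Path):
        return "dataset"   # logged as an artifact when datasets=True
    return "other"         # logged when others=True, if pickle-able

print(classify(0.97))           # summary
print(classify(Path("a.csv")))  # dataset
print(classify(object()))       # other
```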

How can I configure logging behavior?

| Kind of variable | Behavior | Example | Data type |
|---|---|---|---|
| Instance | Auto-logged | self.accuracy | float |
| Instance | Logged if datasets=True | self.df | pd.DataFrame |
| Instance | Not logged if datasets=False | self.df | pd.DataFrame |
| Local | Never logged | accuracy | float |
| Local | Never logged | df | pd.DataFrame |
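
The instance/local distinction follows from how Python scoping works: only attributes set on self survive the step, while locals vanish with the function frame. A minimal sketch of that rule, not the integration's actual code:

```python
# Only instance attributes remain on the object after the step runs,
# so only they are candidates for logging (assumption: illustrates the
# documented rule, not the real integration).
class FakeStep:
    def start(self):
        self.accuracy = 0.97   # instance variable -> eligible for logging
        scratch = [1, 2, 3]    # local variable    -> never logged

step = FakeStep()
step.start()
print(vars(step))  # {'accuracy': 0.97} -- only instance variables remain
```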

Is artifact lineage tracked?

Yes. If you have an artifact that is an output of step A and an input to step B, we automatically construct the lineage DAG for you.

For an example of this behavior, see this notebook and its corresponding W&B Artifacts page.


Last modified January 29, 2025: d270eb0