Concepts
Dagster provides a variety of abstractions for building and orchestrating data pipelines. These concepts enable a modular, declarative approach to data engineering, making it easier to manage dependencies, monitor execution, and ensure data quality.
Asset
An asset represents a logical unit of data such as a table, dataset, or machine learning model. Assets can have dependencies on other assets, forming the data lineage for your pipelines. As the core abstraction in Dagster, assets can interact with many other Dagster entities to facilitate certain tasks. When you define an asset, either with the @dg.asset decorator or via a component, the definition is automatically added to a top-level Definitions object.
Concept | Relationship |
---|---|
asset check | asset may use an asset check |
asset spec | asset is described by an asset spec |
component | asset may be programmatically built by a component |
config | asset may use a config |
definitions | asset is added to a top-level Definitions object to be deployed |
io manager | asset may use an io manager |
partition | asset may use a partition |
resource | asset may use a resource |
job | asset may be used in a job |
schedule | asset may be used in a schedule |
sensor | asset may be used in a sensor |
Asset check
An asset_check is associated with an asset to ensure it meets certain expectations around data quality, freshness, or completeness. Asset checks run when the asset is executed and store metadata about the related run and whether all the conditions of the check were met.
Concept | Relationship |
---|---|
asset | asset check may be used by an asset |
definitions | asset check is added to a top-level Definitions object to be deployed |
Asset spec
Specs are standalone objects that describe the identity and metadata of Dagster entities without defining their behavior. For example, an AssetSpec contains essential information like the asset's key (its unique identifier) and tags (labels for organizing and annotating the asset), but it doesn't include the logic for materializing that asset.
Concept | Relationship |
---|---|
asset | asset spec may describe the identity and metadata of an asset |
Code location
A code location is a collection of Dagster entity definitions deployed in a specific environment. A code location determines the Python environment (including the version of Dagster being used as well as any other Python dependencies). A Dagster project can have multiple code locations, helping isolate dependencies.
Concept | Relationship |
---|---|
definitions | code location must contain at least one top-level Definitions object |
Component
Components are objects that programmatically build assets and other Dagster entity definitions, such as asset_checks, schedules, resources, and sensors. They accept schematized configuration parameters (which are specified using YAML or lightweight Python) and use them to build the actual definitions you need. Components are designed to help you quickly bootstrap parts of your Dagster project and serve as templates for repeatable patterns.
Concept | Relationship |
---|---|
asset | component builds assets and other definitions |
asset check | component builds asset_checks and other definitions |
definitions | component builds assets and other definitions |
job | component builds jobs and other definitions |
schedule | component builds schedules and other definitions |
sensor | component builds sensors and other definitions |
resource | component builds resources and other definitions |
Config
A config is used to specify config schema for assets, jobs, schedules, and sensors. A RunConfig is a container for all the configuration that can be passed to a run. This allows for parameterization and the reuse of pipelines to serve multiple purposes.
Concept | Relationship |
---|---|
asset | config may be used by an asset |
job | config may be used by a job |
schedule | config may be used by a schedule |
sensor | config may be used by a sensor |
Definitions
In Dagster, "definitions" means two things:
- The objects that combine metadata about Dagster entities with Python functions that define how they behave, for example, asset, ScheduleDefinition, and resource definitions.
- The top-level Definitions object that contains references to all the definitions in a Dagster project. Entities included in the Definitions object will be deployed and visible within the Dagster UI.
Concept | Relationship |
---|---|
asset | Top-level Definitions object may contain one or more asset definitions |
asset check | Top-level Definitions object may contain one or more asset check definitions |
io manager | Top-level Definitions object may contain one or more io manager definitions |
job | Top-level Definitions object may contain one or more job definitions |
resource | Top-level Definitions object may contain one or more resource definitions |
schedule | Top-level Definitions object may contain one or more schedule definitions |
sensor | Top-level Definitions object may contain one or more sensor definitions |
component | definition may be the output of a component |
code location | definitions must be deployed in a code location |
Graph
A GraphDefinition connects multiple ops together to form a DAG. If you are using assets, you will not need to use graphs directly.
Concept | Relationship |
---|---|
config | graph may use a config |
op | graph must include one or more ops |
job | graph must be part of a job to execute |
IO manager
An IOManager defines how data is stored and retrieved between the execution of assets and ops. This allows for customizable storage and format at any interaction in a pipeline.
Concept | Relationship |
---|---|
asset | io manager may be used by an asset |
definitions | io manager is added to a top-level Definitions object to be deployed |
Job
A job is a subset of assets or the GraphDefinition of ops. Jobs are the main form of execution in Dagster.
Concept | Relationship |
---|---|
asset | job may contain a selection of assets |
config | job may use a config |
graph | job may contain a graph |
schedule | job may be used by a schedule |
sensor | job may be used by a sensor |
definitions | job is added to a top-level Definitions object to be deployed |
Op
An op is a computational unit of work. Ops are arranged into a GraphDefinition to dictate their order. Ops have largely been replaced by assets.
Concept | Relationship |
---|---|
type | op may use a type |
graph | op must be contained in a graph to execute |
Partition
A PartitionsDefinition represents a logical slice of a dataset or computation mapped to certain segments (such as increments of time). Partitions enable incremental processing, making workflows more efficient by only running on relevant subsets of data.
Concept | Relationship |
---|---|
asset | partition may be used by an asset |
Resource
A ResourceDefinition is a way to make external resources (like database or API connections) available to Dagster entities (like assets, schedules, or sensors) during job execution, and to clean up after execution resolves. A ConfigurableResource is a resource that uses structured configuration. For more information, see Configuring resources.
Concept | Relationship |
---|---|
asset | resource may be used by an asset |
schedule | resource may be used by a schedule |
sensor | resource may be used by a sensor |
definitions | resource is added to a top-level Definitions object to be deployed |
Type
A type is a way to define and validate the data passed between ops.
Concept | Relationship |
---|---|
op | type may be used by an op |
Schedule
A ScheduleDefinition is a way to automate jobs or assets to run on a specified interval. If a job or asset is parameterized, the schedule can also be set with a run configuration (RunConfig) to match.
Concept | Relationship |
---|---|
asset | schedule may include a job or selection of assets |
config | schedule may include a config if the job or assets include a config |
job | schedule may include a job or selection of assets |
definitions | schedule is added to a top-level Definitions object to be deployed |
Sensor
A sensor is a way to trigger jobs or assets when an event occurs, such as a file being uploaded or a push notification. If a job or asset is parameterized, the sensor can also be set with a run configuration (RunConfig) to match.
Concept | Relationship |
---|---|
asset | sensor may include a job or selection of assets |
config | sensor may include a config if the job or assets include a config |
job | sensor may include a job or selection of assets |
definitions | sensor is added to a top-level Definitions object to be deployed |