104.2. Explore the Butler repository¶
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.1.1
Last verified to run: 2025-06-21
Repository: github.com/lsst/tutorial-notebooks
Learning objective: How to explore Butler collections, datasets, dimensions, and schema.
LSST data products: Butler dimension records (metadata)
Packages: lsst.daf.butler
Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.
Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.
1. Introduction¶
The Butler is the LSST Science Pipelines software for managing, reading, and writing datasets. Because it sits between the pipelines and the data, it is often referred to as "middleware".
This tutorial demonstrates how to explore the Butler: how to discover what kind of data is accessible and its schema, which dimensions to use to query and retrieve datasets, and how to visualize metadata.
Related tutorials: The first 100-level Butler tutorial in this series provides a basic introduction to how to use the Butler; later tutorials in this series demonstrate advanced Butler queries.
1.1. Import packages¶
Import the Butler class from the lsst.daf.butler package, the lsst.sphgeom package for spherical geometry functions, the lsst.geom package for coordinates, and the multi-band plotting helpers from lsst.utils.plotting. Also import numpy and modules from matplotlib.
from lsst.daf.butler import Butler
import lsst.sphgeom as sphgeom
import lsst.geom as geom
from lsst.utils.plotting import (get_multiband_plot_colors,
get_multiband_plot_symbols,
get_multiband_plot_linestyles)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from matplotlib.patches import Polygon
1.2. Define parameters and functions¶
Create an instance of the Butler with the repository and collection for DP1, and assert that it exists.
butler = Butler("dp1", collections="LSSTComCam/DP1")
assert butler is not None
Use the colorblind-friendly color table from the seaborn-v0_8-colorblind matplotlib style.
plt.style.use('seaborn-v0_8-colorblind')
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
Define the standard colors, symbols, and linestyles used to represent the LSST filters.
filter_names = ['u', 'g', 'r', 'i', 'z', 'y']
filter_colors = get_multiband_plot_colors()
filter_symbols = get_multiband_plot_symbols()
filter_linestyles = get_multiband_plot_linestyles()
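These helpers return dictionaries keyed by band name (the colors dictionary is used that way later in this notebook). As an added illustration, the short sketch below, which assumes the symbol and linestyle dictionaries are keyed the same way, prints the style assigned to each filter.
# Sketch (added): print the color, symbol, and linestyle assigned to each filter.
for band in filter_names:
    print(band, filter_colors[band], filter_symbols[band], filter_linestyles[band])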
2. Explore the Butler¶
Learn how to explore the datasets available via the Butler, and their properties.
2.1. Repositories and collections¶
Repositories are defined by data release (e.g., DP0.2, DP1).
In Section 1.2 above, the DP1 repository ("dp1") and collection ("LSSTComCam/DP1") were defined when the Butler was instantiated with butler = Butler("dp1", collections="LSSTComCam/DP1").
Any queries for data run with the Butler would look only in that collection.
Print the information about collection LSSTComCam/DP1.
butler.collections.query_info("LSSTComCam/DP1")
[CollectionInfo(name='LSSTComCam/DP1', type=<CollectionType.CHAINED: 3>, doc='', children=('LSSTComCam/runs/DRP/DP1/DM-51335', 'LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260'), parents=None, dataset_types=None)]
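The output shows that LSSTComCam/DP1 is a CHAINED collection with two child collections. As an added sketch, the fields of the returned CollectionInfo can also be accessed programmatically, for example to list the children.
# Sketch (added): access the children of the chained DP1 collection.
info = butler.collections.query_info("LSSTComCam/DP1")[0]
for child in info.children:
    print(child)
del info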
It is also possible to explore what other repositories and collections exist.
List all available repositories.
butler.get_known_repos()
{'dp02', 'dp02-remote', 'dp1'}
Collections are groups of self-consistent datasets, e.g., the outputs of a processing run that uses a single set of calibrations.
Store a list of all collection names in all_collections.
all_collections = butler.collections.query("*")
Option to print all collections.
# for collection in all_collections:
# print(collection)
del all_collections
2.2. Dataset types¶
Commonly-used image dataset types:
deep_coadd
visit_image
Commonly-used catalog dataset types:
object
source
Uncomment and execute the following cell to print a very long list of every possible dataset type.
Warning: The butler.registry module used below is being deprecated, but the queryDatasetTypes functionality will be made available elsewhere.
Warning: The queryDatasetTypes function returns every dataset type that has ever been named or defined; it is not a list of dataset types that are currently populated and available.
# butler.registry.queryDatasetTypes()
Option to print only the dataset types with coadd in their name.
# butler.registry.queryDatasetTypes('*coadd*')
Option to print the dataset types with map in their name.
# butler.registry.queryDatasetTypes('*map*')
Option to print the dataset types with object in their name.
# butler.registry.queryDatasetTypes('*object*')
2.3. Dimensions and elements¶
The dimensions and elements are the properties of a dataset type that can be constrained in a Butler query.
Commonly-used dimensions:
band
tract
patch
visit
detector
Commonly-used element:
visit_detector_region
Print the full lists of dimensions and elements.
print(butler.dimensions.getStaticDimensions())
{band, healpix1, healpix2, healpix3, healpix4, healpix5, healpix6, healpix7, healpix8, healpix9, healpix10, healpix11, healpix12, healpix13, healpix14, healpix15, healpix16, healpix17, htm1, htm2, htm3, htm4, htm5, htm6, htm7, htm8, htm9, htm10, htm11, htm12, htm13, htm14, htm15, htm16, htm17, htm18, htm19, htm20, htm21, htm22, htm23, htm24, instrument, skymap, day_obs, detector, group, physical_filter, subfilter, tract, visit_system, exposure, patch, visit}
print(butler.dimensions.getStaticElements())
{band, healpix1, healpix2, healpix3, healpix4, healpix5, healpix6, healpix7, healpix8, healpix9, healpix10, healpix11, healpix12, healpix13, healpix14, healpix15, healpix16, healpix17, htm1, htm2, htm3, htm4, htm5, htm6, htm7, htm8, htm9, htm10, htm11, htm12, htm13, htm14, htm15, htm16, htm17, htm18, htm19, htm20, htm21, htm22, htm23, htm24, instrument, skymap, day_obs, detector, group, physical_filter, subfilter, tract, visit_system, exposure, patch, visit, visit_definition, visit_detector_region, visit_system_membership}
Show that the dimensions for the deep_coadd dataset type are band, skymap, tract, and patch, and that the returned format for images is ExposureF.
butler.get_dataset_type('deep_coadd')
DatasetType('deep_coadd', {band, skymap, tract, patch}, ExposureF)
Print each dimension key using the dimensions.data_coordinate_keys attribute.
dataset_type = butler.get_dataset_type('deep_coadd')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print(dimension)
band
skymap
tract
patch
Print all dimension keys for the visit_image dataset type.
dataset_type = butler.get_dataset_type('visit_image')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print(dimension)
instrument
detector
visit
band
day_obs
physical_filter
2.3.1. Required dimensions¶
Some dimensions are redundant, like band and physical_filter, or day_obs and visit. Print only the subset of required dimensions for the visit_image dataset type.
dataset_type = butler.get_dataset_type('visit_image')
for dimension in dataset_type.dimensions.required:
print(dimension)
instrument
detector
visit
del dataset_type
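To make the distinction concrete, the following added sketch lists the dimensions of visit_image that are implied rather than required, by removing the required subset from the full set of dimension keys.
# Sketch (added): list the implied (non-required) dimensions of visit_image.
dataset_type = butler.get_dataset_type('visit_image')
required_names = set(str(d) for d in dataset_type.dimensions.required)
for dimension in dataset_type.dimensions.data_coordinate_keys:
    if str(dimension) not in required_names:
        print(dimension)
del dataset_type, required_names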
2.4. Schema¶
Each dimension of a given dataset type has a schema that defines how a Butler query for that dataset type can be constructed.
Print the schema for each dimension of the deep_coadd dataset type.
dataset_type = butler.get_dataset_type('deep_coadd')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print('dimension = ', dimension)
print(butler.dimensions[dimension].schema)
print(' ')
dimension =  band
band:
  name: string

dimension =  skymap
skymap:
  name: string
  hash: hash
    A hash of the skymap's parameters.
  tract_max: int
    Maximum ID for tracts in this skymap, exclusive.
  patch_nx_max: int
    Number of patches in the x direction in each tract.
  patch_ny_max: int
    Number of patches in the y direction in each tract.

dimension =  tract
tract:
  skymap: string
  id: int
  region: region

dimension =  patch
patch:
  skymap: string
  tract: int
  id: int
  cell_x: int
    Which column this patch occupies in the tract's grid of patches.
  cell_y: int
    Which row this patch occupies in the tract's grid of patches.
  region: region
The output above shows, for example, that a query for deep_coadd images can be constrained by the tract and patch id values (unique integers), or with region (see example in Section 3.1).
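As an added, hedged sketch, such a constraint can be written as a where expression. This assumes butler.query_datasets is available alongside the query_dimension_records and query_data_ids methods used elsewhere in this notebook, and it reuses the example tract and patch values that appear later in Section 2.6.
# Sketch (added): find deep_coadd datasets for one tract and patch, in all bands.
# Assumes tract 5063, patch 14 of skymap lsst_cells_v1 exist in the DP1 collection.
refs = butler.query_datasets("deep_coadd",
                             where="skymap = 'lsst_cells_v1' AND tract = 5063 AND patch = 14")
for ref in refs:
    print(ref.dataId)
del refs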
Option to print the schema for each dimension of the visit_image dataset type.
# dataset_type = butler.get_dataset_type('visit_image')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
Option to print the schema for the object table.
# dataset_type = butler.get_dataset_type('object')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
2.5. Dimension records¶
Return all records for a given dimension in the collection (i.e., all unique instruments, bands, tracts, patches, visits, detectors).
Print all unique instruments in the collection, and see that DP1 contains only LSSTComCam data.
butler.query_dimension_records('instrument')
[instrument.RecordClass(name='LSSTComCam', visit_max=7050123199999, visit_system=2, exposure_max=7050123199999, detector_max=1000, class_name='lsst.obs.lsst.LsstComCam')]
Print all unique bands (filters) in the collection, and see that the LSST filters ugrizy are all included.
butler.query_dimension_records('band')
[band.RecordClass(name='r'), band.RecordClass(name='z'), band.RecordClass(name='y'), band.RecordClass(name='g'), band.RecordClass(name='i'), band.RecordClass(name='unknown'), band.RecordClass(name='u'), band.RecordClass(name='OG590'), band.RecordClass(name='white')]
Print the unique tracts in the DP1 collection, and see that there are only 31 because DP1 covers a relatively small area.
all_tracts = butler.query_dimension_records('tract')
all_tract_ids = []
for tract in all_tracts:
all_tract_ids.append(tract.id)
print('Number of DP1 tracts: ', len(all_tract_ids))
print(all_tract_ids)
del all_tracts, all_tract_ids
Number of DP1 tracts:  31
[453, 454, 531, 532, 2234, 2235, 2393, 2394, 4016, 4017, 4217, 4218, 4848, 4849, 5062, 5063, 5064, 5305, 5306, 5525, 5526, 7610, 7611, 7849, 7850, 10221, 10222, 10463, 10464, 10704, 10705]
Options to get all visit or all detector ids.
# all_visits = butler.query_dimension_records('visit')
# all_visit_ids = []
# for visit in all_visits:
# all_visit_ids.append(visit.id)
# print('Number of DP1 visits: ', len(all_visit_ids))
# del all_visits, all_visit_ids
# all_detectors = butler.query_dimension_records('detector')
# all_detector_ids = []
# for detector in all_detectors:
# all_detector_ids.append(detector.id)
# print('Number of DP1 detectors: ', len(all_detector_ids))
# print(all_detector_ids)
# del all_detectors, all_detector_ids
2.6. DataIds¶
The dataId is a dictionary in which the keys are dimensions, and the values uniquely identify a data product of a given dataset type.
In Section 2.3, the dimensions of a deep_coadd were shown to be band, skymap, tract, and patch. The dataId for a unique deep_coadd image is a dictionary that contains the values for each dimension (key), such as:
dataId = {"band": "r", "skymap": "lsst_cells_v1", "tract": 5063, "patch": 14}
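As an added illustration (a minimal sketch, not part of the original tutorial), a fully specified dataId like the one above can be passed to butler.get to retrieve the corresponding image; it assumes this band, tract, and patch combination exists in the DP1 collection.
# Sketch (added): retrieve the deep_coadd identified by a fully specified dataId.
dataId = {"band": "r", "skymap": "lsst_cells_v1", "tract": 5063, "patch": 14}
coadd = butler.get("deep_coadd", dataId=dataId)
print(coadd.getBBox())
del dataId, coadd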
Option to print all unique dataId values for instrument, band, and tract. This returns lists similar to those in Section 2.5 above.
# butler.query_data_ids('instrument')
# butler.query_data_ids('band')
Tracts are defined for a particular skymap.
# butler.query_data_ids('tract')
2.7. Skymap, tract, and patch¶
The skymap defines the all-sky tessellation of tracts and patches for the deep coadd images.
Use the Butler's get_dataset_type function to show that the dimension for the skyMap dataset type is just skymap, and that the returned format for the skymap is SkyMap.
butler.get_dataset_type("skyMap")
DatasetType('skyMap', {skymap}, SkyMap)
Show the schema for the skymap.
dataset_type = butler.get_dataset_type('skyMap')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print('dimension = ', dimension)
print(butler.dimensions[dimension].schema)
print(' ')
dimension =  skymap
skymap:
  name: string
  hash: hash
    A hash of the skymap's parameters.
  tract_max: int
    Maximum ID for tracts in this skymap, exclusive.
  patch_nx_max: int
    Number of patches in the x direction in each tract.
  patch_ny_max: int
    Number of patches in the y direction in each tract.
Return all skymap records for DP1: there is only one, lsst_cells_v1.
butler.query_dimension_records('skymap')
[skymap.RecordClass(name='lsst_cells_v1', hash=b'\xe2snA$M\xf0\x0b\xbd\xb9:\xec\xac\n\xe4\xb2\xd1\x9d\xda\x91', tract_max=18938, patch_nx_max=10, patch_ny_max=10)]
Retrieve the skymap from the Butler.
skymap = butler.get("skyMap", skymap="lsst_cells_v1")
Option to print the help for skymap.
# help(skymap)
Print all methods available for the returned skymap.
for tmp in dir(skymap):
if tmp[0] != '_':
print(tmp)
ConfigClass
SKYMAP_DATASET_TYPE_NAME
SKYMAP_RUN_COLLECTION_NAME
config
findAllTracts
findClosestTractPatchList
findTract
findTractIdArray
findTractPatchList
generateTract
getRaDecRange
getRingIndices
getSha1
logSkyMapInfo
register
updateSha1
A few of these methods are particularly useful.
Print the minimum and maximum RA and Dec covered by a given tract.
tract = skymap.getRaDecRange(5063)
tract
(Angle(52.258064516129039, degrees), Angle(53.917050691244235, degrees), Angle(-28.264462809917354, degrees), Angle(-26.776859504132226, degrees))
Print the tract and patch for RA, Dec = $53.076, -28.110$ deg.
ra = 53.076
dec = -28.110
point = geom.SpherePoint(ra*geom.degrees, dec*geom.degrees)
tract = skymap.findTract(point)
patch = tract.findPatch(point)
print(tract.tract_id, patch.getSequentialIndex())
del ra, dec, tract, patch
5063 15
Adjacent tracts and patches overlap at their edges, so a coordinate can fall within more than one. Get all tracts and patches that overlap the point.
tplist = skymap.findTractPatchList([point])
Display the list of tracts and patches, tplist, and see that one tract and two patches overlap the coordinates.
tplist
[(TractInfo(id=5063, ctrCoord=[0.5326332312997893, 0.7090801040994212, -0.46206844394039615]), (PatchInfo(index=Index2D(x=4, y=1), innerBBox=(minimum=(12000, 3000), maximum=(14999, 5999)), outerBBox=(minimum=(11800, 2800), maximum=(15199, 6199)), cellInnerDimensions=(150, 150), cellBorder=50, numCellsPerPatchInner=22), PatchInfo(index=Index2D(x=5, y=1), innerBBox=(minimum=(15000, 3000), maximum=(17999, 5999)), outerBBox=(minimum=(14800, 2800), maximum=(18199, 6199)), cellInnerDimensions=(150, 150), cellBorder=50, numCellsPerPatchInner=22)))]
Print the tract and patch IDs for every tract-patch combination that overlaps the provided coordinates.
for tmp in tplist:
tract = tmp[0]
patches = tmp[1]
for patch in patches:
print(tract.getId(), patch.getSequentialIndex())
5063 14
5063 15
del point, tplist, skymap
3. Visualize Butler metadata¶
3.1. Tract and patch boundaries¶
Use the skymap and deep_coadd metadata to draw tract and patch boundaries.
Retrieve the tract and patch dimension records for all tracts and patches within 10 degrees of RA, Dec = $53.076, -28.110$ deg. Those coordinates are near the center of the Extended Chandra Deep Field South (ECDFS).
region = sphgeom.Region.from_ivoa_pos("CIRCLE 53.076 -28.110 10.0")
tract_dimrecs = butler.query_dimension_records("tract",
where="tract.region OVERLAPS :region",
bind={"region": region})
patch_dimrecs = butler.query_dimension_records("patch",
where="patch.region OVERLAPS :region",
bind={"region": region})
print('Number of overlapping patches: ', len(patch_dimrecs))
Number of overlapping patches: 93
Define a color dictionary that assigns a color from the colortable to each tract.
color_dict = {}
for r, rec in enumerate(tract_dimrecs):
color_dict[rec.dataId['tract']] = colors[r]
print(color_dict)
{4848: '#0072B2', 4849: '#009E73', 5062: '#D55E00', 5063: '#CC79A7', 5064: '#F0E442'}
The patch dimension records returned from the Butler include each patch's region as a spherical ConvexPolygon with four vertices (3-D unit vectors). Convert these vertices to 2D sky coordinates (RA and Dec), then use the matplotlib Polygon class and the add_patch method to draw each patch in the figure, using a different color for each tract.
fig, ax = plt.subplots(figsize=(6, 6))
for rec in patch_dimrecs:
vertices = rec.region.getVertices()
vertices_deg = []
for vertex in vertices:
vertices_deg.append([geom.SpherePoint(vertex).getRa().asDegrees(),
geom.SpherePoint(vertex).getDec().asDegrees()])
polygon = Polygon(vertices_deg, closed=True, facecolor='None',
edgecolor=color_dict[rec.dataId['tract']])
ax.add_patch(polygon)
ax.set_xlim(54.1, 52.1)
ax.set_ylim(-28.9, -27.1)
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
handles = []
labels = []
for clr in color_dict:
line = mlines.Line2D([], [], color=color_dict[clr], linestyle='-')
handles.append(line)
labels.append(clr)
plt.legend(loc='upper left', ncol=3, handles=handles, labels=labels)
plt.title('Tracts and patches that overlap ECDFS')
plt.show()
Figure 1: The boundaries of the 93 patches from the five different tracts that overlap the ECDFS DP1 field.
del region, tract_dimrecs, patch_dimrecs
3.2. Visit dates¶
Plot the cumulative histogram of the number of visits in the ECDFS field over time, by filter.
Define the search region as in Section 3.1, above.
region = sphgeom.Region.from_ivoa_pos("CIRCLE 53.076 -28.110 10.0")
Set the histogram bins to one-day bins spanning MJD 60622 to 60659, which covers the ECDFS observation dates plus a few days before and after.
use_bins = np.arange(60660-60622, dtype='int') + 60622
For each filter, return all visit dimension records that overlap the region. Create an array of the start times, in MJD, from every visit's timespan. Plot the cumulative distribution of start time to see how observations accumulated for ECDFS, per filter.
fig = plt.figure(figsize=(6, 4))
for band in filter_names:
visit_dimrecs = butler.query_dimension_records("visit",
where="band.name = :band AND \
visit.region OVERLAPS :region",
bind={"band": band, "region": region})
temp = []
for visit in visit_dimrecs:
temp.append(visit.timespan.begin.mjd)
all_visit_begin_mjd = np.sort(np.asarray(temp, dtype='float'))
temp_label = band + ' (' + str(len(visit_dimrecs)) + ')'
n, bins, patches = plt.hist(all_visit_begin_mjd, use_bins,
cumulative=True, histtype='step',
linewidth=2, alpha=0.7,
color=filter_colors[band],
label=temp_label)
for patch in patches:
patch.set_linestyle(filter_linestyles[band])
del temp, all_visit_begin_mjd, temp_label, n, bins, patches
plt.legend(loc='upper left')
plt.xlabel('MJD')
plt.ylabel('Cumulative number of visits')
plt.show()
Figure 2: The total, cumulative number of visits by Modified Julian Date (MJD) for each filter.
Exercises for the learner¶
- Print all dataset types with source in their name (Section 2.2).
- Print each dimension key for dataset type visit_image (Section 2.3).
- Show that seven patches from three tracts overlap coordinates RA, Dec = $53.84, -28.35$ deg.