104.2. Explore the Butler repository¶
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.1.1
Last verified to run: 2025-06-21
Repository: github.com/lsst/tutorial-notebooks
Learning objective: How to explore Butler collections, datasets, dimensions, and schema.
LSST data products: Butler dimension records (metadata)
Packages: lsst.daf.butler
Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.
Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.
1. Introduction¶
The Butler is the LSST Science Pipelines software for managing, reading, and writing datasets. Because it sits between the pipelines and the data, it is often referred to as "middleware".
This tutorial demonstrates how to explore the Butler: how to discover what kind of data is accessible and its schema, which dimensions to use to query and retrieve datasets, and how to visualize metadata.
Related tutorials: The first 100-level Butler tutorial in this series provides a basic introduction to how to use the Butler; later tutorials in this series demonstrate advanced Butler queries.
1.1. Import packages¶
Import the Butler class from the lsst.daf.butler package, the lsst.sphgeom package for spherical geometry functions, the lsst.geom package for coordinates, and the multi-band plotting helpers from lsst.utils.plotting. Also import numpy and modules from matplotlib.
from lsst.daf.butler import Butler
import lsst.sphgeom as sphgeom
import lsst.geom as geom
from lsst.utils.plotting import (get_multiband_plot_colors,
get_multiband_plot_symbols,
get_multiband_plot_linestyles)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from matplotlib.patches import Polygon
1.2. Define parameters and functions¶
Create an instance of the Butler with the repository and collection for DP1, and assert that it exists.
butler = Butler("dp1", collections="LSSTComCam/DP1")
assert butler is not None
Use the colorblind-friendly color table from the seaborn-v0_8-colorblind matplotlib style.
plt.style.use('seaborn-v0_8-colorblind')
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
Define the standard colors, symbols, and linestyles used to represent the LSST filters.
filter_names = ['u', 'g', 'r', 'i', 'z', 'y']
filter_colors = get_multiband_plot_colors()
filter_symbols = get_multiband_plot_symbols()
filter_linestyles = get_multiband_plot_linestyles()
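These helpers return dictionaries keyed by band name (the colors dictionary is used that way later in this notebook). As an added illustration, the short sketch below, which assumes the symbol and linestyle dictionaries are keyed the same way, prints the style assigned to each filter.
# Sketch (added): print the color, symbol, and linestyle assigned to each filter.
for band in filter_names:
    print(band, filter_colors[band], filter_symbols[band], filter_linestyles[band])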
2. Explore the Butler¶
Learn how to explore the datasets available via the Butler, and their properties.
2.1. Repositories and collections¶
Repositories are defined by data release (e.g., DP0.2, DP1).
In Section 1.2 above, the DP1 repository ("dp1") and collection ("LSSTComCam/DP1") were defined when the Butler was instantiated with butler = Butler("dp1", collections="LSSTComCam/DP1").
Any queries for data run with the Butler would look only in that collection.
Print the information about collection LSSTComCam/DP1.
butler.collections.query_info("LSSTComCam/DP1")
[CollectionInfo(name='LSSTComCam/DP1', type=<CollectionType.CHAINED: 3>, doc='', children=('LSSTComCam/runs/DRP/DP1/DM-51335', 'LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260'), parents=None, dataset_types=None)]
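The output shows that LSSTComCam/DP1 is a CHAINED collection with two child collections. As an added sketch, the fields of the returned CollectionInfo can also be accessed programmatically, for example to list the children.
# Sketch (added): access the children of the chained DP1 collection.
info = butler.collections.query_info("LSSTComCam/DP1")[0]
for child in info.children:
    print(child)
del info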
It is also possible to explore what other repositories and collections exist.
List all available repositories.
butler.get_known_repos()
{'dp02', 'dp02-remote', 'dp1'}
Collections are groups of self-consistent datasets, e.g., the outputs of a processing run that uses a single set of calibrations.
Store a list of all collection names in all_collections.
all_collections = butler.collections.query("*")
Option to print all collections.
# for collection in all_collections:
# print(collection)
del all_collections
2.2. Dataset types¶
Commonly-used image dataset types:
deep_coadd
visit_image
Commonly-used catalog dataset types:
object
source
Uncomment and execute the following cell to print a very long list of every possible dataset type.
Warning: The butler.registry module used below is being deprecated, but the queryDatasetTypes functionality will be made available elsewhere.
Warning: The queryDatasetTypes function returns every dataset type that has ever been named or defined; it is not a list of dataset types that are currently populated and available.
# butler.registry.queryDatasetTypes()
Option to print only the dataset types with coadd in their name.
# butler.registry.queryDatasetTypes('*coadd*')
Option to print the dataset types with map in their name.
# butler.registry.queryDatasetTypes('*map*')
Option to print the dataset types with object in their name.
# butler.registry.queryDatasetTypes('*object*')
2.3. Dimensions and elements¶
The dimensions and elements are the properties of a dataset type that can be constrained in a Butler query.
Commonly-used dimensions:
band
tract
patch
visit
detector
Commonly-used element:
visit_detector_region
Print the full lists of dimensions and elements.
print(butler.dimensions.getStaticDimensions())
{band, healpix1, healpix2, healpix3, healpix4, healpix5, healpix6, healpix7, healpix8, healpix9, healpix10, healpix11, healpix12, healpix13, healpix14, healpix15, healpix16, healpix17, htm1, htm2, htm3, htm4, htm5, htm6, htm7, htm8, htm9, htm10, htm11, htm12, htm13, htm14, htm15, htm16, htm17, htm18, htm19, htm20, htm21, htm22, htm23, htm24, instrument, skymap, day_obs, detector, group, physical_filter, subfilter, tract, visit_system, exposure, patch, visit}
print(butler.dimensions.getStaticElements())
{band, healpix1, healpix2, healpix3, healpix4, healpix5, healpix6, healpix7, healpix8, healpix9, healpix10, healpix11, healpix12, healpix13, healpix14, healpix15, healpix16, healpix17, htm1, htm2, htm3, htm4, htm5, htm6, htm7, htm8, htm9, htm10, htm11, htm12, htm13, htm14, htm15, htm16, htm17, htm18, htm19, htm20, htm21, htm22, htm23, htm24, instrument, skymap, day_obs, detector, group, physical_filter, subfilter, tract, visit_system, exposure, patch, visit, visit_definition, visit_detector_region, visit_system_membership}
Show that the dimensions for the deep_coadd dataset type are band, skymap, tract, and patch, and that the returned format for images is ExposureF.
butler.get_dataset_type('deep_coadd')
DatasetType('deep_coadd', {band, skymap, tract, patch}, ExposureF)
Print each dimension key using the dimensions.data_coordinate_keys attribute.
dataset_type = butler.get_dataset_type('deep_coadd')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print(dimension)
band
skymap
tract
patch
Print all dimension keys for the visit_image dataset type.
dataset_type = butler.get_dataset_type('visit_image')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print(dimension)
instrument
detector
visit
band
day_obs
physical_filter
2.3.1. Required dimensions¶
Some dimensions are redundant, like band and physical_filter, or day_obs and visit. Print only the subset of required dimensions for the visit_image dataset type.
dataset_type = butler.get_dataset_type('visit_image')
for dimension in dataset_type.dimensions.required:
print(dimension)
instrument
detector
visit
del dataset_type
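To make the distinction concrete, the following added sketch lists the dimensions of visit_image that are implied rather than required, by removing the required subset from the full set of dimension keys.
# Sketch (added): list the implied (non-required) dimensions of visit_image.
dataset_type = butler.get_dataset_type('visit_image')
required_names = set(str(d) for d in dataset_type.dimensions.required)
for dimension in dataset_type.dimensions.data_coordinate_keys:
    if str(dimension) not in required_names:
        print(dimension)
del dataset_type, required_names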
2.4. Schema¶
Each dimension of a given dataset type has a schema that defines how a Butler query for that dataset type can be constructed.
Print the schema for each dimension of the deep_coadd dataset type.
dataset_type = butler.get_dataset_type('deep_coadd')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print('dimension = ', dimension)
print(butler.dimensions[dimension].schema)
print(' ')
dimension =  band
band:
  name: string

dimension =  skymap
skymap:
  name: string
  hash: hash
    A hash of the skymap's parameters.
  tract_max: int
    Maximum ID for tracts in this skymap, exclusive.
  patch_nx_max: int
    Number of patches in the x direction in each tract.
  patch_ny_max: int
    Number of patches in the y direction in each tract.

dimension =  tract
tract:
  skymap: string
  id: int
  region: region

dimension =  patch
patch:
  skymap: string
  tract: int
  id: int
  cell_x: int
    Which column this patch occupies in the tract's grid of patches.
  cell_y: int
    Which row this patch occupies in the tract's grid of patches.
  region: region
The output above shows, for example, that a query for deep_coadd images can be constrained by the tract and patch id values (unique integers), or with region (see example in Section 3.1).
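As an added, hedged sketch, such a constraint can be written as a where expression. This assumes butler.query_datasets is available alongside the query_dimension_records and query_data_ids methods used elsewhere in this notebook, and it reuses the example tract and patch values that appear later in Section 2.6.
# Sketch (added): find deep_coadd datasets for one tract and patch, in all bands.
# Assumes tract 5063, patch 14 of skymap lsst_cells_v1 exist in the DP1 collection.
refs = butler.query_datasets("deep_coadd",
                             where="skymap = 'lsst_cells_v1' AND tract = 5063 AND patch = 14")
for ref in refs:
    print(ref.dataId)
del refs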
Option to print the schema for each dimension of the visit_image dataset type.
# dataset_type = butler.get_dataset_type('visit_image')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
Option to print the schema for the object table.
# dataset_type = butler.get_dataset_type('object')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
2.5. Dimension records¶
Return all records for a given dimension in the collection (i.e., all unique instruments, bands, tracts, patches, visits, detectors).
Print all unique instruments in the collection, and see that DP1 contains only LSSTComCam data.
butler.query_dimension_records('instrument')
[instrument.RecordClass(name='LSSTComCam', visit_max=7050123199999, visit_system=2, exposure_max=7050123199999, detector_max=1000, class_name='lsst.obs.lsst.LsstComCam')]
Print all unique bands (filters) in the collection, and see that the LSST filters ugrizy are all included.
butler.query_dimension_records('band')
[band.RecordClass(name='r'), band.RecordClass(name='z'), band.RecordClass(name='y'), band.RecordClass(name='g'), band.RecordClass(name='i'), band.RecordClass(name='unknown'), band.RecordClass(name='u'), band.RecordClass(name='OG590'), band.RecordClass(name='white')]
Print the unique tracts in the DP1 collection, and see that there are only 31 because DP1 covers a relatively small area.
all_tracts = butler.query_dimension_records('tract')
all_tract_ids = []
for tract in all_tracts:
all_tract_ids.append(tract.id)
print('Number of DP1 tracts: ', len(all_tract_ids))
print(all_tract_ids)
del all_tracts, all_tract_ids
Number of DP1 tracts:  31
[453, 454, 531, 532, 2234, 2235, 2393, 2394, 4016, 4017, 4217, 4218, 4848, 4849, 5062, 5063, 5064, 5305, 5306, 5525, 5526, 7610, 7611, 7849, 7850, 10221, 10222, 10463, 10464, 10704, 10705]
Options to get all visit or all detector ids.
# all_visits = butler.query_dimension_records('visit')
# all_visit_ids = []
# for visit in all_visits:
# all_visit_ids.append(visit.id)
# print('Number of DP1 visits: ', len(all_visit_ids))
# del all_visits, all_visit_ids
# all_detectors = butler.query_dimension_records('detector')
# all_detector_ids = []
# for detector in all_detectors:
# all_detector_ids.append(detector.id)
# print('Number of DP1 detectors: ', len(all_detector_ids))
# print(all_detector_ids)
# del all_detectors, all_detector_ids
2.6. DataIds¶
The dataId is a dictionary in which the keys are dimensions, and the values uniquely identify a data product of a given dataset type.
In Section 2.3, the dimensions of a deep_coadd were shown to be band, skymap, tract, and patch. The dataId for a unique deep_coadd image is a dictionary that contains the values for each dimension (key), such as:
dataId = {"band": "r", "skymap": "lsst_cells_v1", "tract": 5063, "patch": 14}
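As an added illustration (a minimal sketch, not part of the original tutorial), a fully specified dataId like the one above can be passed to butler.get to retrieve the corresponding image; it assumes this band, tract, and patch combination exists in the DP1 collection.
# Sketch (added): retrieve the deep_coadd identified by a fully specified dataId.
dataId = {"band": "r", "skymap": "lsst_cells_v1", "tract": 5063, "patch": 14}
coadd = butler.get("deep_coadd", dataId=dataId)
print(coadd.getBBox())
del dataId, coadd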
Option to print all unique dataId values for instrument, band, and tract. This returns lists similar to those in Section 2.5 above.
# butler.query_data_ids('instrument')
# butler.query_data_ids('band')
Tracts are defined for a particular skymap.
# butler.query_data_ids('tract')
2.7. Skymap, tract, and patch¶
The skymap defines the all-sky tessellation of tracts and patches for the deep coadd images.
Use the Butler's get_dataset_type function to show that the dimension for the skyMap dataset type is just skymap, and that the returned format for the skymap is SkyMap.
butler.get_dataset_type("skyMap")
DatasetType('skyMap', {skymap}, SkyMap)
Show the schema for the skymap.
dataset_type = butler.get_dataset_type('skyMap')
for dimension in dataset_type.dimensions.data_coordinate_keys:
print('dimension = ', dimension)
print(butler.dimensions[dimension].schema)
print(' ')
dimension =  skymap
skymap:
  name: string
  hash: hash
    A hash of the skymap's parameters.
  tract_max: int
    Maximum ID for tracts in this skymap, exclusive.
  patch_nx_max: int
    Number of patches in the x direction in each tract.
  patch_ny_max: int
    Number of patches in the y direction in each tract.
Return all skymap records for DP1: there is only one, lsst_cells_v1.
butler.query_dimension_records('skymap')
[skymap.RecordClass(name='lsst_cells_v1', hash=b'\xe2snA$M\xf0\x0b\xbd\xb9:\xec\xac\n\xe4\xb2\xd1\x9d\xda\x91', tract_max=18938, patch_nx_max=10, patch_ny_max=10)]
Retrieve the skymap from the Butler.
skymap = butler.get("skyMap", skymap="lsst_cells_v1")
Option to print the help for skymap.
# help(skymap)
Print all methods available for the returned skymap.
for tmp in dir(skymap):
if tmp[0] != '_':
print(tmp)
ConfigClass
SKYMAP_DATASET_TYPE_NAME
SKYMAP_RUN_COLLECTION_NAME
config
findAllTracts
findClosestTractPatchList
findTract
findTractIdArray
findTractPatchList
generateTract
getRaDecRange
getRingIndices
getSha1
logSkyMapInfo
register
updateSha1
A few of these methods are particularly useful.
Print the minimum and maximum RA and Dec covered by a given tract.
tract = skymap.getRaDecRange(5063)
tract
(Angle(52.258064516129039, degrees), Angle(53.917050691244235, degrees), Angle(-28.264462809917354, degrees), Angle(-26.776859504132226, degrees))
Print the tract and patch for RA, Dec = $53.076, -28.110$ deg.
ra = 53.076
dec = -28.110
point = geom.SpherePoint(ra*geom.degrees, dec*geom.degrees)
tract = skymap.findTract(point)
patch = tract.findPatch(point)
print(tract.tract_id, patch.getSequentialIndex())
del ra, dec, tract, patch
5063 15
Adjacent tracts and patches overlap at their edges, so a coordinate can fall within more than one. Get all tracts and patches that overlap the point.
tplist = skymap.findTractPatchList([point])
Display the list of tracts and patches, tplist, and see that one tract and two patches overlap the coordinates.
tplist
[(TractInfo(id=5063, ctrCoord=[0.5326332312997893, 0.7090801040994212, -0.46206844394039615]), (PatchInfo(index=Index2D(x=4, y=1), innerBBox=(minimum=(12000, 3000), maximum=(14999, 5999)), outerBBox=(minimum=(11800, 2800), maximum=(15199, 6199)), cellInnerDimensions=(150, 150), cellBorder=50, numCellsPerPatchInner=22), PatchInfo(index=Index2D(x=5, y=1), innerBBox=(minimum=(15000, 3000), maximum=(17999, 5999)), outerBBox=(minimum=(14800, 2800), maximum=(18199, 6199)), cellInnerDimensions=(150, 150), cellBorder=50, numCellsPerPatchInner=22)))]
Print the tract and patch IDs for every tract-patch combination that overlaps the provided coordinates.
for tmp in tplist:
tract = tmp[0]
patches = tmp[1]
for patch in patches:
print(tract.getId(), patch.getSequentialIndex())
5063 14
5063 15
del point, tplist, skymap
3. Visualize Butler metadata¶
3.1. Tract and patch boundaries¶
Use the skymap and deep_coadd metadata to draw tract and patch boundaries.
Retrieve the tract and patch dimension records for all tracts and patches within 10 degrees of RA, Dec = $53.076, -28.110$ deg. Those coordinates are near the center of the Extended Chandra Deep Field South (ECDFS).
region = sphgeom.Region.from_ivoa_pos("CIRCLE 53.076 -28.110 10.0")
tract_dimrecs = butler.query_dimension_records("tract",
where="tract.region OVERLAPS :region",
bind={"region": region})
patch_dimrecs = butler.query_dimension_records("patch",
where="patch.region OVERLAPS :region",
bind={"region": region})
print('Number of overlapping patches: ', len(patch_dimrecs))
Number of overlapping patches: 93
Define a color dictionary that assigns a color from the colortable to each tract.
color_dict = {}
for r, rec in enumerate(tract_dimrecs):
color_dict[rec.dataId['tract']] = colors[r]
print(color_dict)
{4848: '#0072B2', 4849: '#009E73', 5062: '#D55E00', 5063: '#CC79A7', 5064: '#F0E442'}
The patch dimension records returned from the Butler include each patch's region as a spherical ConvexPolygon with four vertices (3-D unit vectors). Convert these vertices to 2D sky coordinates (RA and Dec), then use the matplotlib Polygon class and the add_patch method to draw each patch in the figure, using a different color for each tract.
fig, ax = plt.subplots(figsize=(6, 6))
for rec in patch_dimrecs:
vertices = rec.region.getVertices()
vertices_deg = []
for vertex in vertices:
vertices_deg.append([geom.SpherePoint(vertex).getRa().asDegrees(),
geom.SpherePoint(vertex).getDec().asDegrees()])
polygon = Polygon(vertices_deg, closed=True, facecolor='None',
edgecolor=color_dict[rec.dataId['tract']])
ax.add_patch(polygon)
ax.set_xlim(54.1, 52.1)
ax.set_ylim(-28.9, -27.1)
ax.set_xlabel('RA')
ax.set_ylabel('Dec')
handles = []
labels = []
for clr in color_dict:
line = mlines.Line2D([], [], color=color_dict[clr], linestyle='-')
handles.append(line)
labels.append(clr)
plt.legend(loc='upper left', ncol=3, handles=handles, labels=labels)
plt.title('Tracts and patches that overlap ECDFS')
plt.show()
Figure 1: The boundaries of the 93 patches from the five different tracts that overlap the ECDFS DP1 field.
del region, tract_dimrecs, patch_dimrecs
3.2. Visit dates¶
Plot the cumulative histogram of the number of visits in the ECDFS field over time, by filter.
Define the search region as in Section 3.1, above.
region = sphgeom.Region.from_ivoa_pos("CIRCLE 53.076 -28.110 10.0")
Set the histogram bins to one-day bins spanning MJD 60622 to 60659, which covers the ECDFS observation dates plus a few days before and after.
use_bins = np.arange(60660-60622, dtype='int') + 60622
For each filter, return all visit dimension records that overlap the region. Create an array of the start times, in MJD, from every visit's timespan. Plot the cumulative distribution of start time to see how observations accumulated for ECDFS, per filter.
fig = plt.figure(figsize=(6, 4))
for band in filter_names:
visit_dimrecs = butler.query_dimension_records("visit",
where="band.name = :band AND \
visit.region OVERLAPS :region",
bind={"band": band, "region": region})
temp = []
for visit in visit_dimrecs:
temp.append(visit.timespan.begin.mjd)
all_visit_begin_mjd = np.sort(np.asarray(temp, dtype='float'))
temp_label = band + ' (' + str(len(visit_dimrecs)) + ')'
n, bins, patches = plt.hist(all_visit_begin_mjd, use_bins,
cumulative=True, histtype='step',
linewidth=2, alpha=0.7,
color=filter_colors[band],
label=temp_label)
for patch in patches:
patch.set_linestyle(filter_linestyles[band])
del temp, all_visit_begin_mjd, temp_label, n, bins, patches
plt.legend(loc='upper left')
plt.xlabel('MJD')
plt.ylabel('Cumulative number of visits')
plt.show()
Figure 2: The total, cumulative number of visits by Modified Julian Date (MJD) for each filter.
Exercises for the learner¶
- Print all dataset types with source in their name (Section 2.2).
- Print each dimension key for dataset type visit_image (Section 2.3).
- Show that seven patches from three tracts overlap coordinates RA, Dec = $53.84, -28.35$ deg.