104.4. Catalog queries with the Butler¶
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.1.1
Last verified to run: 2025-06-21
Repository: github.com/lsst/tutorial-notebooks
Learning objective: How to query and retrieve catalog data with the Butler.
LSST data products: source, object
Packages: lsst.daf.butler
Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.
Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.
1. Introduction¶
The Butler is the LSST Science Pipelines middleware for managing, reading, and writing datasets, and is the recommended tool for finding and retrieving images in the Notebook Aspect.
Warning: avoid returning all columns of a table.
Some tables have hundreds of columns (see the Science Data Model Schemas page), and returning more data than fits in the container size chosen at login (e.g., a large container has 16 GB of memory) will cause the Notebook's kernel to crash. It is best to retrieve from the Butler only the columns necessary for the analysis, as demonstrated in Section 6.
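As a rough illustration of why column selection matters, here is a back-of-the-envelope estimate of in-memory table size. This is a sketch assuming 8 bytes per numeric value; the function name and the row/column counts are illustrative, not DP1 figures.

```python
def estimate_gb(n_rows, n_columns, bytes_per_value=8):
    """Approximate in-memory size of an all-numeric table, in GB."""
    return n_rows * n_columns * bytes_per_value / 1024**3

# A few columns for one visit's worth of sources is tiny ...
print(estimate_gb(30_000, 5))
# ... but hundreds of columns across hundreds of visits is not
# (roughly 90 GB here, far exceeding a 16 GB container).
print(round(estimate_gb(30_000 * 800, 500), 1))
```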
Using Butler vs. TAP for catalog access.
For catalog access TAP (Table Access Protocol) is the recommended tool.
TAP queries can be cone or polygon spatial searches, and can include constraints on the rows (i.e., the measured values).
The Butler, on the other hand, offers access to tables as they are stored in parquet format.
Most of the catalogs are stored by entire tract (object tables) or visit (source tables).
For this reason, query and subsetting capabilities with the Butler are more limited.
Since parquet is column-oriented, individual columns can be returned by name, but all rows are always returned.
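Because every row is always returned, any row-level cut must be applied after retrieval. The sketch below uses a stand-in list of dictionaries for a retrieved catalog; with a real astropy table the equivalent is a boolean mask such as sources[sources['psfFlux'] > 100.0] (the column names and threshold here are illustrative).

```python
# Stand-in for a retrieved catalog; real Butler tables are astropy tables.
rows = [
    {"sourceId": 1, "psfFlux": 250.0},
    {"sourceId": 2, "psfFlux": 80.0},
    {"sourceId": 3, "psfFlux": 120.0},
]

# Client-side row cut, applied only after all rows have been retrieved.
bright = [row for row in rows if row["psfFlux"] > 100.0]
print([row["sourceId"] for row in bright])  # [1, 3]
```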
This tutorial demonstrates how to form queries for Butler-based catalog data, and how to retrieve tables of data.
Related tutorials: The earlier 100-level Butler tutorials in this series show how to retrieve image data with the Butler and how to explore and discover the dataset types and their properties. Other 100-level series demonstrate catalog access with TAP.
1.1. Import packages¶
Import the butler and sphgeom modules from the LSST Science Pipelines, and also the astropy module for working with times.
from lsst.daf.butler import Butler, Timespan
import lsst.sphgeom as sphgeom
from astropy.time import Time
1.2. Define parameters¶
Create an instance of the Butler with the repository and collection for DP1, and assert that it exists.
butler = Butler("dp1", collections="LSSTComCam/DP1")
assert butler is not None
Coordinates: Use coordinates RA, Dec = $53.076, -28.110$ deg, which are near the center of the Extended Chandra Deep Field South (ECDFS).
ra = 53.076
dec = -28.110
Region: Define a circle with a radius of 2 deg, centered on the coordinates.
region = sphgeom.Region.from_ivoa_pos("CIRCLE 53.076 -28.110 2.0")
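The POS string can also be assembled from the coordinate parameters defined above rather than hard-coded; a minimal pure-Python sketch (the radius variable is illustrative). The resulting string can be passed to sphgeom.Region.from_ivoa_pos as above.

```python
ra = 53.076
dec = -28.110
radius = 2.0  # degrees

# Assemble the IVOA POS string "CIRCLE <ra> <dec> <radius>".
# Note that the float repr drops the trailing zero of -28.110.
pos_string = f"CIRCLE {ra} {dec} {radius}"
print(pos_string)  # CIRCLE 53.076 -28.11 2.0
```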
Filter: Use the $r$-band.
band = 'r'
Time and timespan: Always use International Atomic Time (Temps Atomique International; TAI) times with the Butler.
Define a time that is the MJD at the midpoint of the visit image with visitId = 2024110800263. Define a timespan that is one night, early in the series of observations with LSSTComCam.
mjd = 60623.27103473955
time = Time(mjd, format="mjd", scale="tai")
time1 = Time(60623.0, format="mjd", scale="tai")
time2 = Time(60624.0, format="mjd", scale="tai")
timespan = Timespan(time1, time2)
timespan
del time1, time2
Visit list: Use a set of visit identifiers that are the first 10 $r$-band visits on MJD 60623.
visit_set = (2024110800246, 2024110800247, 2024110800250,
2024110800251, 2024110800254, 2024110800255,
2024110800258, 2024110800259, 2024110800262,
2024110800263)
2. Query formation¶
A call to the Butler's query_datasets function requires at least a dataset_type (the table name, e.g., source or object) and a where statement (a string expression resembling an SQL WHERE clause), which can use bind parameters to format the query. The order_by and limit parameters are optional.
butler.query_datasets(<dataset_type>,
where=<query>,
bind=<bind_dictionary>,
order_by=<dimension_list>,
limit=<integer>)
The query_datasets function returns the dataset references for all data that meet the query constraints. A dataset reference can then be used with the butler.get function to retrieve the data itself.
2.1. Dataset type¶
Dataset types for DP1 source and object catalogs:
object
source
object_forced_source
dia_object
dia_source
dia_object_forced_source
ss_object
ss_source
There are also the visit_table and visit_detector_table dataset types, which contain observation metadata.
Print the information for each table dataset type.
tables = ['object', 'source', 'object_forced_source',
'dia_object', 'dia_source', 'dia_object_forced_source',
'ss_object', 'ss_source',
'visit_table', 'visit_detector_table']
for table in tables:
print('')
print(table)
print(butler.get_dataset_type(table))
print('required: ', butler.get_dataset_type(table).dimensions.required)
object
DatasetType('object', {skymap, tract}, ArrowAstropy)
required:  {skymap, tract}

source
DatasetType('source', {band, instrument, day_obs, physical_filter, visit}, ArrowAstropy)
required:  {instrument, visit}

object_forced_source
DatasetType('object_forced_source', {skymap, tract, patch}, ArrowAstropy)
required:  {skymap, tract, patch}

dia_object
DatasetType('dia_object', {skymap, tract}, ArrowAstropy)
required:  {skymap, tract}

dia_source
DatasetType('dia_source', {skymap, tract}, ArrowAstropy)
required:  {skymap, tract}

dia_object_forced_source
DatasetType('dia_object_forced_source', {skymap, tract, patch}, ArrowAstropy)
required:  {skymap, tract, patch}

ss_object
DatasetType('ss_object', {}, ArrowAstropy)
required:  {}

ss_source
DatasetType('ss_source', {}, ArrowAstropy)
required:  {}

visit_table
DatasetType('visit_table', {instrument}, ArrowAstropy)
required:  {instrument}

visit_detector_table
DatasetType('visit_detector_table', {instrument}, ArrowAstropy)
required:  {instrument}
2.2. Where statement¶
The where statement can be used to place constraints on the dimensions and fields of a given dataset type.
For the object table, tract is the dimension. Unlike with the deep_coadd images, band is not a dimension, because the object table aggregates photometry from all six bands for a given object (for a given row of the table).
For the source table, visit is the dimension. Detector, band, and date are not required, as the source table aggregates over detectors in a visit, and the band and date are defined by the visit.
The following two cells provide the option to print the dimensions and schema for the object and source dataset types.
# dataset_type = butler.get_dataset_type('object')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
# dataset_type = butler.get_dataset_type('source')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
Execute a query for source tables that are associated with visits that were obtained in the $r$-band and overlap the defined coordinates.
query = f"band.name = '{band}' AND visit.region OVERLAPS POINT({ra}, {dec})"
print(query)
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
del query, dataset_refs
band.name = 'r' AND visit.region OVERLAPS POINT(53.076, -28.11)
237
2.3. Bind parameters¶
Recreate the same query as above, but in the query statement use bind parameters (placeholders) and define the dictionary bind_params to hold the values.
query = "band.name = :band AND " \
"visit.region OVERLAPS POINT(:ra, :dec)"
print(query)
bind_params = {"band": band, "ra": ra, "dec": dec}
print(bind_params)
band.name = :band AND visit.region OVERLAPS POINT(:ra, :dec)
{'band': 'r', 'ra': 53.076, 'dec': -28.11}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
del dataset_refs
237
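The query-plus-bind pattern can be wrapped in a small plain-Python helper. This is a sketch, not part of the Butler API; the function name is illustrative.

```python
def make_point_query(band, ra, dec):
    """Return a (where, bind_params) pair for a band + point-overlap query."""
    where = ("band.name = :band AND "
             "visit.region OVERLAPS POINT(:ra, :dec)")
    bind_params = {"band": band, "ra": ra, "dec": dec}
    return where, bind_params

where, bind_params = make_point_query("r", 53.076, -28.110)
print(where)
print(bind_params)
```

The returned pair can be passed as where= and bind= to butler.query_datasets, as in the cell above.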
2.5. Order by¶
The order_by parameter accepts a list of strings specifying the sort parameters. Results can be sorted by dimension name (e.g., band, visit, detector) or by dimension name plus schema field (e.g., band.name, visit.day_obs, visit.timespan.begin).
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"])
print(dataset_refs[0])
del dataset_refs
source@{instrument: 'LSSTComCam', visit: 2024110800246, band: 'r', day_obs: 20241108, physical_filter: 'r_03'} [sc=ArrowAstropy] (run=LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z id=3ef75473-b78a-4509-a24b-2afbb9e91d70)
Use a - in front of the string to reverse the sort order.
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["-visit.timespan.begin"])
print(dataset_refs[0])
del dataset_refs
source@{instrument: 'LSSTComCam', visit: 2024121000426, band: 'r', day_obs: 20241210, physical_filter: 'r_03'} [sc=ArrowAstropy] (run=LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z id=3a5f789b-17d3-4d4d-a828-835d54f8bb62)
2.6. Limit¶
Use the limit parameter to limit the number of results. This is recommended when testing queries.
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"],
limit=10)
print(len(dataset_refs))
del dataset_refs
10
Use a - in front of the limit value to have the query return a warning if more results existed than were returned.
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"],
limit=-10)
print(len(dataset_refs))
del dataset_refs
lsst.daf.butler._butler WARNING: More datasets are available than the requested limit of 10.
10
3.1. Overlaps point¶
Execute a query for all source tables, in any band, from visits that overlap the defined coordinates.
query = f"visit.region OVERLAPS POINT({ra}, {dec})"
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
del query, dataset_refs
855
Execute a query for object tables that overlap the defined coordinates.
query = f"patch.region OVERLAPS POINT({ra}, {dec})"
dataset_refs = butler.query_datasets("object",
where=query)
print(len(dataset_refs))
del query, dataset_refs
1
One result is returned from the above query on object tables.
3.2. Overlaps region¶
Execute a query for source tables that overlap the region defined in Section 1.2.
Print the region. It is a circle, centered on the ECDFS field, with a 2 deg radius that covers the entire area observed with LSSTComCam.
print(region)
Circle([0.5298928578176015, 0.7051356719745795, -0.47116583422703023], 0.03490658503988659)
query = "visit.region OVERLAPS :region"
bind_params = {'region': region}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
del query, bind_params, dataset_refs
855
This is the same number of dataset_refs as returned by the "overlaps point" query in Section 3.1, because every visit overlapped the center of the ECDFS field.
Try again with a larger region that encompasses several separated LSSTComCam fields, and show that many more source tables are returned.
string2 = "CIRCLE 50.0 -30 20.0"
region2 = sphgeom.Region.from_ivoa_pos(string2)
query = "visit.region OVERLAPS :region"
bind_params = {'region': region2}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
del query, bind_params, dataset_refs
1169
del string2, region2
4.1. Overlaps time¶
Execute a query for source tables with a visit.timespan that overlaps the time defined in Section 1.2. First, print the time.
print(time)
60623.27103473955
query = "visit.timespan OVERLAPS :time"
bind_params = {'time': time}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
1
It is the one source table from the single visit that was occurring at that time. Print the dataId.
for ref in dataset_refs:
print(ref.dataId)
{instrument: 'LSSTComCam', visit: 2024110800263, band: 'r', day_obs: 20241108, physical_filter: 'r_03'}
Store the visit identifier number in my_visit_id to use in Section 5.1.
ref0 = dataset_refs[0]
my_visit_id = ref0.dataId.get('visit')
print(my_visit_id)
2024110800263
del query, bind_params, dataset_refs, ref0
4.2. Overlaps timespan¶
Execute a query for all source tables that have a visit.timespan (the time between the start and the end of the exposure) that overlaps with the timespan defined in Section 1.2. Order the results by the start time of the visit.timespan.
print(timespan)
[2024-11-09T00:00:00, 2024-11-10T00:00:00)
query = "visit.timespan OVERLAPS :timespan"
bind_params = {'timespan': timespan}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"])
print(len(dataset_refs))
del query, bind_params, dataset_refs
69
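The printed Timespan above shows half-open intervals, [begin, end). A minimal pure-Python sketch of the same overlap test on MJD floats (the function and values are illustrative, not the Butler's implementation):

```python
def overlaps(begin1, end1, begin2, end2):
    """True if half-open intervals [begin1, end1) and [begin2, end2) intersect."""
    return begin1 < end2 and begin2 < end1

# A 30-second exposure starting at MJD 60623.27 overlaps the one-night
# timespan [60623.0, 60624.0); an exposure on the next night does not.
print(overlaps(60623.27, 60623.27 + 30 / 86400, 60623.0, 60624.0))  # True
print(overlaps(60624.5, 60624.5 + 30 / 86400, 60623.0, 60624.0))    # False
```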
5.1. Source table for a single visit¶
Find the source table associated with my_visit_id, the visit identified in Section 4.1.
query = f"visit.id = {my_visit_id}"
print(query)
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
del query, dataset_refs
visit.id = 2024110800263
1
5.2. Source tables for a list of visits¶
Print the set of visits defined in Section 1.2.
print(visit_set)
(2024110800246, 2024110800247, 2024110800250, 2024110800251, 2024110800254, 2024110800255, 2024110800258, 2024110800259, 2024110800262, 2024110800263)
Convert this to a string, so it can be embedded in a query statement.
string_visit_set = "(" + ", ".join(str(value) for value in visit_set) + ")"
print(string_visit_set)
(2024110800246, 2024110800247, 2024110800250, 2024110800251, 2024110800254, 2024110800255, 2024110800258, 2024110800259, 2024110800262, 2024110800263)
Find all source tables that are associated with these 10 visits.
query = f"visit.id IN {string_visit_set}"
print(query)
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
visit.id IN (2024110800246, 2024110800247, 2024110800250, 2024110800251, 2024110800254, 2024110800255, 2024110800258, 2024110800259, 2024110800262, 2024110800263)
10
6. Retrieve catalog data from the Butler¶
6.1. Results format¶
Although the Butler is accessing catalog data stored in parquet format (Section 1), it returns astropy tables; that is what the ArrowAstropy part of the returned dataset type means (Section 2.1). Pandas dataframes can be returned by the Butler by including storageClass='DataFrame' in a butler.get() statement, like the one in Section 6.2, but it is recommended to use the default astropy formatted tables.
6.2. Specify columns to be returned¶
When retrieving catalog data from the Butler, always specify the columns to be returned.
Some tables have hundreds of columns, and it is best to only retrieve the columns that are necessary for a given analysis.
All columns are listed in Rubin's Science Data Model Schemas page.
Always pass the desired column names to a butler.get() statement. Column names are passed with the parameters dictionary: requesting individual columns is done by passing, e.g., {'columns': ['col1', 'col2', 'col3']}, as in the example below.
Define a list of source table column names to retrieve.
column_names = ['sourceId', 'detector', 'coord_ra', 'coord_dec', 'psfFlux']
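The parameters argument built from this list is an ordinary dictionary; a minimal sketch of its shape:

```python
# The 'columns' key of the parameters dictionary holds the list of
# column names that butler.get() should retrieve.
column_names = ['sourceId', 'detector', 'coord_ra', 'coord_dec', 'psfFlux']
parameters = {'columns': column_names}
print(parameters)
```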
6.3. Return sources from a dataset reference¶
Query for the source table from the first visit in the timespan.
query = "visit.timespan OVERLAPS :timespan"
bind_params = {'timespan': timespan}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"],
limit=-1)
lsst.daf.butler._butler WARNING: More datasets are available than the requested limit of 1.
Retrieve the desired columns from the source table, and print the total number of rows (sources) returned.
sources = butler.get(dataset_refs[0],
parameters={'columns': column_names})
print(len(sources))
29534
Option to display the table.
# sources
del query, bind_params, dataset_refs, sources
6.4. Return sources for a visit id¶
Retrieve the source table for the visit defined as my_visit_id in Section 4.1.
print(my_visit_id)
sources = butler.get("source",
visit=my_visit_id,
parameters={'columns': column_names})
print(len(sources))
2024110800263
25922
Option to show the table.
# sources
del my_visit_id, sources
7. Appendix¶
7.1. Related tables and images¶
A note about the relationship between object & deep_coadd, and source & visit_image.
These "catalog:image" pairs do not have the same dimensions (are not 1:1) in the Butler, as might be assumed.
The object catalog contains forced photometry measurements made on the multi-band deep_coadd images. The deep_coadd images are stored in the Butler by patch (the skymap is divided into tracts, and the tracts into patches), but the object table is stored by tract.
The source catalog contains measurements for detected sources in visit_images. The visit_images are stored in the Butler by detector (a visit with LSSTComCam has 9 detector visit_images), but the source table is stored by visit.
Thus, in the use case where a single image (either a deep_coadd or a visit_image) is being examined, and the desire is to retrieve measurements for all the objects or sources in that image, the Butler will return all the objects from the same tract as that deep_coadd's patch, and all the sources from the same visit as that visit_image's detector.