104.4. Catalog queries with the Butler¶
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.1.1
Last verified to run: 2025-06-21
Repository: github.com/lsst/tutorial-notebooks
Learning objective: How to query and retrieve catalog data with the Butler.
LSST data products: source, object
Packages: lsst.daf.butler
Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.
Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.
1. Introduction¶
The Butler is the LSST Science Pipelines middleware for managing, reading, and writing datasets, and is the recommended tool for finding and retrieving images in the Notebook Aspect.
Warning: avoid returning all columns of a table.
Some tables have hundreds of columns (see the Science Data Model Schemas page), and returning more data than fits in the container size chosen at login (e.g., a large container has 16 GB of memory) will cause the Notebook's kernel to crash. It is best to retrieve from the Butler only the columns necessary for the analysis, as demonstrated in Section 6.
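As a rough illustration of why column selection matters, here is a back-of-the-envelope estimate of in-memory table size. This is a sketch assuming 8 bytes per numeric value; the function name and the row/column counts are illustrative, not DP1 figures.

```python
def estimate_gb(n_rows, n_columns, bytes_per_value=8):
    """Approximate in-memory size of an all-numeric table, in GB."""
    return n_rows * n_columns * bytes_per_value / 1024**3

# A few columns for one visit's worth of sources is tiny ...
print(estimate_gb(30_000, 5))
# ... but hundreds of columns across hundreds of visits is not
# (roughly 90 GB here, far exceeding a 16 GB container).
print(round(estimate_gb(30_000 * 800, 500), 1))
```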
Using Butler vs. TAP for catalog access.
For catalog access TAP (Table Access Protocol) is the recommended tool.
TAP queries can be cone or polygon spatial searches, and can include constraints on the rows (i.e., the measured values).
The Butler, on the other hand, offers access to tables as they are stored in parquet format.
Most of the catalogs are stored by entire tract (object tables) or visit (source tables).
For this reason, query and subsetting capabilities with the Butler are more limited.
Since parquet is column-oriented, individual columns can be returned by name, but all rows are always returned.
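Because every row is always returned, any row-level cut must be applied after retrieval. The sketch below uses a stand-in list of dictionaries for a retrieved catalog; with a real astropy table the equivalent is a boolean mask such as sources[sources['psfFlux'] > 100.0] (the column names and threshold here are illustrative).

```python
# Stand-in for a retrieved catalog; real Butler tables are astropy tables.
rows = [
    {"sourceId": 1, "psfFlux": 250.0},
    {"sourceId": 2, "psfFlux": 80.0},
    {"sourceId": 3, "psfFlux": 120.0},
]

# Client-side row cut, applied only after all rows have been retrieved.
bright = [row for row in rows if row["psfFlux"] > 100.0]
print([row["sourceId"] for row in bright])  # [1, 3]
```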
This tutorial demonstrates how to form queries for Butler-based catalog data, and how to retrieve tables of data.
Related tutorials: The earlier 100-level Butler tutorials in this series show how to retrieve image data with the Butler and how to explore and discover the dataset types and their properties. Other 100-level series demonstrate catalog access with TAP.
1.1. Import packages¶
Import the butler and sphgeom modules from the LSST Science Pipelines, and also the astropy module for working with times.
from lsst.daf.butler import Butler, Timespan
import lsst.sphgeom as sphgeom
from astropy.time import Time
1.2. Define parameters¶
Create an instance of the Butler with the repository and collection for DP1, and assert that it exists.
butler = Butler("dp1", collections="LSSTComCam/DP1")
assert butler is not None
Coordinates: Use coordinates RA, Dec = $53.076, -28.110$ deg, which are near the center of the Extended Chandra Deep Field South (ECDFS).
ra = 53.076
dec = -28.110
Region: Define a circle with a radius of 2 deg, centered on the coordinates.
region = sphgeom.Region.from_ivoa_pos("CIRCLE 53.076 -28.110 2.0")
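The POS string can also be assembled from the coordinate parameters defined above rather than hard-coded; a minimal pure-Python sketch (the radius variable is illustrative). The resulting string can be passed to sphgeom.Region.from_ivoa_pos as above.

```python
ra = 53.076
dec = -28.110
radius = 2.0  # degrees

# Assemble the IVOA POS string "CIRCLE <ra> <dec> <radius>".
# Note that the float repr drops the trailing zero of -28.110.
pos_string = f"CIRCLE {ra} {dec} {radius}"
print(pos_string)  # CIRCLE 53.076 -28.11 2.0
```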
Filter: Use the $r$-band.
band = 'r'
Time and timespan: Always use International Atomic Time (Temps Atomique International; TAI) times with the Butler.
Define a time that is the MJD at the midpoint of the visit image with visitId = 2024110800263. Define a timespan that is one night, early in the series of observations with LSSTComCam.
mjd = 60623.27103473955
time = Time(mjd, format="mjd", scale="tai")
time1 = Time(60623.0, format="mjd", scale="tai")
time2 = Time(60624.0, format="mjd", scale="tai")
timespan = Timespan(time1, time2)
timespan
del time1, time2
Visit list: Use a set of visit identifiers that are the first 10 $r$-band visits on MJD 60623.
visit_set = (2024110800246, 2024110800247, 2024110800250,
2024110800251, 2024110800254, 2024110800255,
2024110800258, 2024110800259, 2024110800262,
2024110800263)
2. Query formation¶
A call to the Butler's query_datasets function requires at least a dataset_type (the table name, e.g., source or object) and a where statement (a string expression resembling an SQL WHERE clause), which can use bind parameters to format the query. The order_by and limit parameters are optional.
butler.query_datasets(<dataset_type>,
where=<query>,
bind=<bind_dictionary>,
order_by=<dimension_list>,
limit=<integer>)
The query_datasets function returns the dataset references for all data that meet the query constraints. A dataset reference can then be used with the butler.get function to retrieve the data itself.
2.1. Dataset type¶
Dataset types for DP1 source and object catalogs:
object
source
object_forced_source
dia_object
dia_source
dia_object_forced_source
ss_object
ss_source
There are also the visit_table and visit_detector_table dataset types, which contain observation metadata.
Print the information for each table dataset type.
tables = ['object', 'source', 'object_forced_source',
'dia_object', 'dia_source', 'dia_object_forced_source',
'ss_object', 'ss_source',
'visit_table', 'visit_detector_table']
for table in tables:
print('')
print(table)
print(butler.get_dataset_type(table))
print('required: ', butler.get_dataset_type(table).dimensions.required)
object
DatasetType('object', {skymap, tract}, ArrowAstropy)
required:  {skymap, tract}

source
DatasetType('source', {band, instrument, day_obs, physical_filter, visit}, ArrowAstropy)
required:  {instrument, visit}

object_forced_source
DatasetType('object_forced_source', {skymap, tract, patch}, ArrowAstropy)
required:  {skymap, tract, patch}

dia_object
DatasetType('dia_object', {skymap, tract}, ArrowAstropy)
required:  {skymap, tract}

dia_source
DatasetType('dia_source', {skymap, tract}, ArrowAstropy)
required:  {skymap, tract}

dia_object_forced_source
DatasetType('dia_object_forced_source', {skymap, tract, patch}, ArrowAstropy)
required:  {skymap, tract, patch}

ss_object
DatasetType('ss_object', {}, ArrowAstropy)
required:  {}

ss_source
DatasetType('ss_source', {}, ArrowAstropy)
required:  {}

visit_table
DatasetType('visit_table', {instrument}, ArrowAstropy)
required:  {instrument}

visit_detector_table
DatasetType('visit_detector_table', {instrument}, ArrowAstropy)
required:  {instrument}
2.2. Where statement¶
The where statement can be used to place constraints on the dimensions and fields of a given dataset type.
For the object table, tract is the dimension. Unlike with the deep_coadd images, band is not a dimension, because the object table aggregates photometry from all six bands for a given object (for a given row of the table).
For the source table, visit is the dimension. Detector, band, and date are not required, as the source table aggregates over detectors in a visit, and the band and date are defined by the visit.
The following two cells provide the option to print the dimensions and schema for the object and source dataset types.
# dataset_type = butler.get_dataset_type('object')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
# dataset_type = butler.get_dataset_type('source')
# for dimension in dataset_type.dimensions.data_coordinate_keys:
# print('dimension = ', dimension)
# print(butler.dimensions[dimension].schema)
# print(' ')
Execute a query for source tables that are associated with visits that were obtained in the $r$-band and overlap the defined coordinates.
query = f"band.name = '{band}' AND visit.region OVERLAPS POINT({ra}, {dec})"
print(query)
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
del query, dataset_refs
band.name = 'r' AND visit.region OVERLAPS POINT(53.076, -28.11)
237
2.3. Bind parameters¶
Recreate the same query as above, but in the query statement use bind parameters (placeholders) and define the dictionary bind_params to hold the values.
query = "band.name = :band AND " \
"visit.region OVERLAPS POINT(:ra, :dec)"
print(query)
bind_params = {"band": band, "ra": ra, "dec": dec}
print(bind_params)
band.name = :band AND visit.region OVERLAPS POINT(:ra, :dec)
{'band': 'r', 'ra': 53.076, 'dec': -28.11}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
del dataset_refs
237
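The query-plus-bind pattern can be wrapped in a small plain-Python helper. This is a sketch, not part of the Butler API; the function name is illustrative.

```python
def make_point_query(band, ra, dec):
    """Return a (where, bind_params) pair for a band + point-overlap query."""
    where = ("band.name = :band AND "
             "visit.region OVERLAPS POINT(:ra, :dec)")
    bind_params = {"band": band, "ra": ra, "dec": dec}
    return where, bind_params

where, bind_params = make_point_query("r", 53.076, -28.110)
print(where)
print(bind_params)
```

The returned pair can be passed as where= and bind= to butler.query_datasets, as in the cell above.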
2.5. Order by¶
The order_by parameter accepts a list of strings specifying the sort parameters. Results can be sorted by dimension name (e.g., band, visit, detector) or by dimension name plus schema field (e.g., band.name, visit.day_obs, visit.timespan.begin).
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"])
print(dataset_refs[0])
del dataset_refs
source@{instrument: 'LSSTComCam', visit: 2024110800246, band: 'r', day_obs: 20241108, physical_filter: 'r_03'} [sc=ArrowAstropy] (run=LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z id=3ef75473-b78a-4509-a24b-2afbb9e91d70)
Use a - in front of the string to reverse the sort order.
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["-visit.timespan.begin"])
print(dataset_refs[0])
del dataset_refs
source@{instrument: 'LSSTComCam', visit: 2024121000426, band: 'r', day_obs: 20241210, physical_filter: 'r_03'} [sc=ArrowAstropy] (run=LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z id=3a5f789b-17d3-4d4d-a828-835d54f8bb62)
2.6. Limit¶
Use the limit parameter to limit the number of results. This is recommended when testing queries.
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"],
limit=10)
print(len(dataset_refs))
del dataset_refs
10
Use a - in front of the limit value to have the query return a warning if more results existed than were returned.
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"],
limit=-10)
print(len(dataset_refs))
del dataset_refs
lsst.daf.butler._butler WARNING: More datasets are available than the requested limit of 10.
10
3.1. Overlaps point¶
Execute a query for all source tables, in any band, from visits that overlap the defined coordinates.
query = f"visit.region OVERLAPS POINT({ra}, {dec})"
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
del query, dataset_refs
855
Execute a query for object tables that overlap the defined coordinates.
query = f"patch.region OVERLAPS POINT({ra}, {dec})"
dataset_refs = butler.query_datasets("object",
where=query)
print(len(dataset_refs))
del query, dataset_refs
1
One result is returned from the above query on object tables.
3.2. Overlaps region¶
Execute a query for source tables that overlap the region defined in Section 1.2.
Print the region. It is a circle, centered on the ECDFS field, with a 2 deg radius that covers the entire area observed with LSSTComCam.
print(region)
Circle([0.5298928578176015, 0.7051356719745795, -0.47116583422703023], 0.03490658503988659)
query = "visit.region OVERLAPS :region"
bind_params = {'region': region}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
del query, bind_params, dataset_refs
855
This is the same number of dataset_refs as returned by the "overlaps point" query in Section 3.1, because every visit overlapped the center of the ECDFS field.
Try again with a larger region that encompasses several separated LSSTComCam fields, and show that many more source tables are returned.
string2 = "CIRCLE 50.0 -30 20.0"
region2 = sphgeom.Region.from_ivoa_pos(string2)
query = "visit.region OVERLAPS :region"
bind_params = {'region': region2}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
del query, bind_params, dataset_refs
1169
del string2, region2
4.1. Overlaps time¶
Execute a query for source tables with a visit.timespan that overlaps the time defined in Section 1.2. First, print the time.
print(time)
60623.27103473955
query = "visit.timespan OVERLAPS :time"
bind_params = {'time': time}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params)
print(len(dataset_refs))
1
It is the one source table from the single visit that was occurring at that time. Print the dataId.
for ref in dataset_refs:
print(ref.dataId)
{instrument: 'LSSTComCam', visit: 2024110800263, band: 'r', day_obs: 20241108, physical_filter: 'r_03'}
Store the visit identifier number in my_visit_id to use in Section 5.1.
ref0 = dataset_refs[0]
my_visit_id = ref0.dataId.get('visit')
print(my_visit_id)
2024110800263
del query, bind_params, dataset_refs, ref0
4.2. Overlaps timespan¶
Execute a query for all source tables that have a visit.timespan (the time between the start and the end of the exposure) that overlaps with the timespan defined in Section 1.2. Order the results by the start time of the visit.timespan.
print(timespan)
[2024-11-09T00:00:00, 2024-11-10T00:00:00)
query = "visit.timespan OVERLAPS :timespan"
bind_params = {'timespan': timespan}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"])
print(len(dataset_refs))
del query, bind_params, dataset_refs
69
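The printed Timespan above shows half-open intervals, [begin, end). A minimal pure-Python sketch of the same overlap test on MJD floats (the function and values are illustrative, not the Butler's implementation):

```python
def overlaps(begin1, end1, begin2, end2):
    """True if half-open intervals [begin1, end1) and [begin2, end2) intersect."""
    return begin1 < end2 and begin2 < end1

# A 30-second exposure starting at MJD 60623.27 overlaps the one-night
# timespan [60623.0, 60624.0); an exposure on the next night does not.
print(overlaps(60623.27, 60623.27 + 30 / 86400, 60623.0, 60624.0))  # True
print(overlaps(60624.5, 60624.5 + 30 / 86400, 60623.0, 60624.0))    # False
```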
5.1. Source table for a single visit¶
Find the source table associated with my_visit_id, the visit identified in Section 4.1.
query = f"visit.id = {my_visit_id}"
print(query)
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
del query, dataset_refs
visit.id = 2024110800263
1
5.2. Source tables for a list of visits¶
Print the set of visits defined in Section 1.2.
print(visit_set)
(2024110800246, 2024110800247, 2024110800250, 2024110800251, 2024110800254, 2024110800255, 2024110800258, 2024110800259, 2024110800262, 2024110800263)
Convert this to a string, so it can be embedded in a query statement.
string_visit_set = "(" + ", ".join(str(value) for value in visit_set) + ")"
print(string_visit_set)
(2024110800246, 2024110800247, 2024110800250, 2024110800251, 2024110800254, 2024110800255, 2024110800258, 2024110800259, 2024110800262, 2024110800263)
Find all source tables that are associated with these 10 visits.
query = f"visit.id IN {string_visit_set}"
print(query)
dataset_refs = butler.query_datasets("source",
where=query)
print(len(dataset_refs))
visit.id IN (2024110800246, 2024110800247, 2024110800250, 2024110800251, 2024110800254, 2024110800255, 2024110800258, 2024110800259, 2024110800262, 2024110800263)
10
6. Retrieve catalog data from the Butler¶
6.1. Results format¶
Although the Butler is accessing catalog data stored in parquet format (Section 1), it returns astropy tables; that is what the ArrowAstropy part of the returned dataset type means (Section 2.1). Pandas dataframes can be returned by the Butler by including storageClass='DataFrame' in a butler.get() statement, like the one in Section 6.2, but it is recommended to use the default astropy formatted tables.
6.2. Specify columns to be returned¶
When retrieving catalog data from the Butler, always specify the columns to be returned.
Some tables have hundreds of columns, and it is best to only retrieve the columns that are necessary for a given analysis.
All columns are listed in Rubin's Science Data Model Schemas page.
Always pass the desired column names to a butler.get() statement. Column names are passed with the parameters dictionary: requesting individual columns is done by passing, e.g., {'columns': ['col1', 'col2', 'col3']}, as in the example below.
Define a list of source table column names to retrieve.
column_names = ['sourceId', 'detector', 'coord_ra', 'coord_dec', 'psfFlux']
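The parameters argument built from this list is an ordinary dictionary; a minimal sketch of its shape:

```python
# The 'columns' key of the parameters dictionary holds the list of
# column names that butler.get() should retrieve.
column_names = ['sourceId', 'detector', 'coord_ra', 'coord_dec', 'psfFlux']
parameters = {'columns': column_names}
print(parameters)
```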
6.3. Return sources from a dataset reference¶
Query for the source table from the first visit in the timespan.
query = "visit.timespan OVERLAPS :timespan"
bind_params = {'timespan': timespan}
dataset_refs = butler.query_datasets("source",
where=query,
bind=bind_params,
order_by=["visit.timespan.begin"],
limit=-1)
lsst.daf.butler._butler WARNING: More datasets are available than the requested limit of 1.
Retrieve the desired columns from the source table, and print the total number of rows (sources) returned.
sources = butler.get(dataset_refs[0],
parameters={'columns': column_names})
print(len(sources))
29534
Option to display the table.
# sources
del query, bind_params, dataset_refs, sources
6.4. Return sources for a visit id¶
Retrieve the source table for the visit defined as my_visit_id in Section 4.1.
print(my_visit_id)
sources = butler.get("source",
visit=my_visit_id,
parameters={'columns': column_names})
print(len(sources))
2024110800263
25922
Option to show the table.
# sources
del my_visit_id, sources
7. Appendix¶
7.1. Related tables and images¶
A note about the relationship between object & deep_coadd, and source & visit_image.
These "catalog:image" pairs do not have the same dimensions (are not 1:1) in the Butler, as might be assumed.
The object catalog contains forced photometry measurements made on the multi-band deep_coadd images. The deep_coadd images are stored in the Butler by patch (the skymap is divided into tracts, and the tracts into patches), but the object table is stored by tract.
The source catalog contains measurements for detected sources in visit_images. The visit_images are stored in the Butler by detector (a visit with LSSTComCam has 9 detector visit_images), but the source table is stored by visit.
Thus, in the use case where a single image (either a deep_coadd or a visit_image) is being examined, and the desire is to retrieve measurements for all the objects or sources in that image, the Butler will return all the objects from the same tract as that deep_coadd's patch, and all the sources from the same visit as that visit_image's detector.