102.1. Get started with TAP¶
Data Release: Data Preview 1
Container Size: large
LSST Science Pipelines version: r29.1.1
Last verified to run: 2025-06-25
Repository: github.com/lsst/tutorial-notebooks
Learning objective: How to use the TAP service efficiently.
LSST data products: Object table
Packages: lsst.rsp.get_tap_service
Credit: Originally developed by the Rubin Community Science team. Please consider acknowledging them if this notebook is used for the preparation of journal articles, software releases, or other notebooks.
Get Support: Everyone is encouraged to ask questions or raise issues in the Support Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there.
1. Introduction¶
TAP (Table Access Protocol) provides standardized access to catalog data for discovery, search, and retrieval. Full documentation for TAP is provided by the International Virtual Observatory Alliance (IVOA).
Related tutorials: The other 100-level tutorials in this series demonstrate advanced ADQL statements. The 200-level tutorials describe the contents of the catalog data and have example queries for each.
1.1. TAP glossary¶
- schema - Database terminology for the abstract design that represents the storage of data in a database.
- TAP schema - The set of tables that describe the data tables and their columns.
- table collection - A collection of tables (catalogs), e.g., for a given data release.
- table - A collection of related data held in a table format in a database.
- query - A string formatted in ADQL that selects data from a table, with constraints if desired.
- results - The output of the TAP service's search method when a query is passed.
1.2. ADQL statements¶
The documentation for ADQL includes more information about syntax, keywords, operators, functions, and so on. ADQL is similar to SQL (Structured Query Language).
A typical ADQL statement has at least three components:
SELECT <columns> FROM <catalog> WHERE <constraints>
where
- <columns> is a comma-separated list of the columns to return,
- <catalog> is the name of the catalog to retrieve data from, and
- <constraints> imposes a restriction that only rows with column values that meet the constraints are returned.
For example, say there is a catalog called "mysurveydata" with 5 columns, "col1", "col2", and so on. The ADQL statement:
SELECT col3, col4, col5 FROM mysurveydata WHERE col1 > 0.5 AND col5 < 10
would return a table that has three columns, and as many rows as meet both of the restrictions in the WHERE statement.
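Import the NumPy package and the Rubin Science Platform utility for creating an instance of the TAP service.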
import numpy as np
from lsst.rsp import get_tap_service
2. Instantiate the TAP service¶
Get an instance of the TAP service, and assert that it exists.
service = get_tap_service("tap")
assert service is not None
3. Explore the TAP schema¶
The browsable TAP schema is accessible at sdm-schemas.lsst.io and in the Portal Aspect's user interface.
In a Notebook, it is possible to interact programmatically with the TAP schema.
3.1. Use caution with synchronous queries¶
Caution:
Typically users should not make synchronous queries that retrieve all available columns (SELECT *). The only time this is acceptable and useful is when querying table schema.
A "synchronous" query is when the search is executed and the results are retrieved at the same time, and no other cells can be run while the search is executing.
Synchronous queries are best used sparingly, and only for retrieving table metadata; asynchronous queries should be used whenever table data is being returned.
Queries that use SELECT *, which will retrieve all columns, are only appropriate for schemas or tables with very few (<20) columns. Queries that return data should specify which of the tens to thousands of columns are necessary to return.
3.2. List all schemas¶
Pass a query for the TAP schemas to the TAP service, and store the results in results.
results = service.search('SELECT * FROM tap_schema.schemas')
Show the results.
results
<DALResultsTable length=4>
   schema_name    utype  ... schema_index
      str64       str512 ...    int32
----------------- ------ ... ------------
dp02_dc2_catalogs        ...            2
              dp1        ...            0
             ivoa        ...            1
       tap_schema        ...       100000
The to_table() method converts results into an Astropy table. Convert results to a table and display the table.
results.to_table()
schema_name | utype | description | schema_index
---|---|---|---
str64 | str512 | str512 | int32
dp02_dc2_catalogs | | Data Preview 0.2 contains the image and catalog products of the Rubin Science Pipelines v23 processing of the DESC Data Challenge 2 simulation, which covered 300 square degrees of the wide-fast-deep LSST survey region over 5 years. | 2
dp1 | | Data Preview 1 | 0
ivoa | | ObsCore v1.1 attributes in ObsTAP realization | 1
tap_schema | | A TAP-standard-mandated schema to describe tablesets in a TAP 1.1 service | 100000
del results
3.3. List all tables¶
Create a query that selects all table names from the TAP schema for DP1. Retrieve the results as an Astropy table and display it.
query = "SELECT * FROM tap_schema.tables " \
"WHERE tap_schema.tables.schema_name = 'dp1'" \
"ORDER BY table_index ASC"
results = service.search(query).to_table()
results
schema_name | table_name | table_type | utype | description | table_index
---|---|---|---|---|---
str512 | str64 | str8 | str512 | str512 | int32
dp1 | dp1.CcdVisit | table | | Defines a single detector of a visit | 0
dp1 | dp1.DiaObject | table | | this is diaObjectTable_tract in the butler repo | 0
dp1 | dp1.DiaSource | table | | this is diaSourceTable_tract in the butler repo | 0
dp1 | dp1.ForcedSource | table | | The forced source table contains forced-photometry source measurement on an individual Exposure based on a Multifit shape model derived from a deep detection. | 0
dp1 | dp1.ForcedSourceOnDiaObject | table | | this is forcedSourceOnDiaObjectTable_tract in the butler repo | 0
dp1 | dp1.MPCORB | table | | The orbit catalog produced by the Minor Planet Center. Ingested daily. O(10M) rows by survey end. The columns are described at https://minorplanetcenter.net//iau/info/MPOrbitFormat.html | 0
dp1 | dp1.Object | table | | The object table contains descriptions of the multi-epoch static astronomical objects, in particular their astrophysical properties as derived from analysis of the Sources that are associated with them. Note that fast moving objects are kept in the MovingObject tables. Note that less-frequently used columns are stored in a separate table called Object_Extra. | 0
dp1 | dp1.Source | table | | Table to store high signal-to-noise sources. A source is a measurement of Object's properties from a single image that contains its footprint on the sky. | 0
dp1 | dp1.SSObject | table | | LSST-computed per-object quantities. 1:1 relationship with MPCORB. Recomputed daily, upon MPCORB ingestion. | 0
dp1 | dp1.SSSource | table | | LSST-computed per-source quantities. 1:1 relationship with DIASource. Recomputed daily, upon MPCORB ingestion. | 0
dp1 | dp1.Visit | table | | Defines a single Visit. | 0
del query, results
3.4. List all columns¶
Create a query that selects all column names, data types, descriptions, and units from the DP1 Object table. Retrieve the results as an Astropy table.
query = "SELECT column_name, datatype, description, unit " \
"FROM tap_schema.columns " \
"WHERE table_name = 'dp1.Object'"
results = service.search(query).to_table()
Optionally, display the results. The table's display will automatically be truncated by the notebook because the Object table has 1296 columns.
# results
Use a for loop to look for the presence of coord in the column_name column for every row of the results table, and print the name when there is a match.
search_string = 'coord'
# search_string = 'g_psfFlux'
for cname in results['column_name']:
if cname.find(search_string) > -1:
print(cname)
coord_dec
coord_decErr
coord_ra
coord_ra_dec_Cov
coord_raErr
Uncomment the second line in the cell above and re-execute it to print all columns containing the string g_psfFlux.
Print the unique set of data types or units.
print(np.unique(results['datatype']))
# print(np.unique(results['unit']))
datatype
--------
 boolean
    char
  double
   float
     int
    long
del query, results
4. Best practices for TAP queries¶
The Object table used in this demonstration contains forced measurements in the multi-band deep_coadd images at the coordinates of all signal-to-noise ratio >5 detections in a deep_coadd of any filter. The 200-level tutorial for the Object catalog contains further detail.
4.1. Use asynchronous queries¶
For all TAP searches for data (not schema), one should use asynchronous queries.
Asynchronous means that the query is submitted as a job to the TAP service, and it can run in the background until it completes. The results can then be retrieved right away, or at a later time.
The step-by-step process for asynchronous queries starts by defining the job by passing the query (an ADQL string, like those constructed above) to the TAP service.
job = service.submit_job(query)
The next step is to run the job and, optionally, wait for it to return either a COMPLETED or ERROR result. While waiting is optional, tutorial notebooks always wait because the next steps of a tutorial depend on having the retrieved data.
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
Finally, display the error message if the query was not successful, or retrieve the results if the query was completed.
if job.phase == 'ERROR':
job.raise_if_error()
assert job.phase == 'COMPLETED'
results = job.fetch_result()
4.2. Use spatial constraints¶
All TAP searches on tables that include sky coordinates should include spatial constraints.
It is recommended to always start with spatial constraints for a small region and then expand the search area.
Spatial constraints should be implemented with either an ADQL cone or polygon search.
4.2.1. Table sharding by coordinate¶
On disk, the catalog tables are sharded by coordinate (RA, Dec). This can be thought of as the database being divided up by spatial region (shard) and distributed across multiple servers. Queries that include spatial constraints which minimize the number of shards that have to be searched through will be much faster than queries that have no (or very wide) spatial constraints.
4.2.2. Cone searches¶
A cone search means a catalog query that returns all objects within a given radius of a sky coordinate. Since a given angular diameter corresponds to a larger physical diameter at larger distances, the volume queried is a cone shape in 3D, not a cylinder.
In ADQL, a cone search is executed with WHERE CONTAINS(POINT(), CIRCLE()) = 1.
- POINT() passes the catalog's columns for sky coordinates.
- CIRCLE() defines the center and radius of the search circle.
- CONTAINS() = 1 constrains the query to only return rows for which the statement "this circle contains this point" is "True" (=1).
Column constraints on the coordinates, either with ADQL like -30 < coord_dec AND coord_dec < -25 or using a WHERE ... BETWEEN statement, are not recommended because the TAP service doesn't interpret them the same as a cone or polygon search.
Create a cone search on the Object table for objects within 0.01 degrees of coordinates near the center of the ECDFS field, RA, Dec = 53, -28 degrees.
query = """SELECT coord_ra, coord_dec
FROM dp1.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
CIRCLE('ICRS', 53, -28, 0.01)) = 1"""
print(query)
SELECT coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          CIRCLE('ICRS', 53, -28, 0.01)) = 1
Execute the query asynchronously, and print the number of rows returned.
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
job.raise_if_error()
assert job.phase == 'COMPLETED'
results = job.fetch_result()
print(len(results))
Job phase is COMPLETED
141
4.2.3. Polygon searches¶
For a polygon search, replace CIRCLE with POLYGON(ra1, dec1, ra2, dec2, ra3, dec3, ...), where each ra, dec pair is a vertex of the polygon.
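As a minimal sketch, a polygon query follows the same pattern as the cone search above. The three vertex coordinates here are illustrative assumptions, chosen to enclose a small triangular region near the same ECDFS coordinates; they are not values from the original tutorial.
# Define a triangular polygon search region near RA, Dec = 53, -28 degrees.
# The three vertex coordinates below are illustrative.
query = """SELECT coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          POLYGON('ICRS', 52.99, -28.01,
                                  53.01, -28.01,
                                  53.00, -27.99)) = 1"""
print(query)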
4.3. Sort results with ORDER BY¶
Add an ORDER BY statement to the TAP query to sort the results in ascending (ASC) or descending (DESC) order. Order the results by ascending coord_ra values.
query = """SELECT coord_ra, coord_dec
FROM dp1.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
CIRCLE('ICRS', 53, -28, 0.01)) = 1
ORDER BY coord_ra ASC"""
print(query)
SELECT coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          CIRCLE('ICRS', 53, -28, 0.01)) = 1
           ORDER BY coord_ra ASC
4.4. Limit rows with TOP¶
For debugging and testing queries, the recommended way to restrict the number of rows returned is to use very small spatial regions, not to use TOP.
The TAP service first applies the WHERE constraints, then ORDER BY, and then TOP. If the query is not well constrained, i.e., if thousands or more objects meet the WHERE constraints, then they all must first be sorted before the top N are returned. This is a waste of time and compute resources.
However, it can be useful to retrieve only a subset of the rows which meet the query constraints. To do this, add TOP N after SELECT to return only the top N rows.
query = """SELECT TOP 10 coord_ra, coord_dec
FROM dp1.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
CIRCLE('ICRS', 53, -28, 0.01)) = 1
ORDER BY coord_ra ASC"""
print(query)
SELECT TOP 10 coord_ra, coord_dec
           FROM dp1.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          CIRCLE('ICRS', 53, -28, 0.01)) = 1
           ORDER BY coord_ra ASC
Execute the query and confirm that only 10 rows are returned.
job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
if job.phase == 'ERROR':
job.raise_if_error()
assert job.phase == 'COMPLETED'
results = job.fetch_result()
print(len(results))
Job phase is COMPLETED
10
For users with TAP experience, including maxrec in the job definition will also work, but use of TOP is recommended.
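As a minimal sketch, assuming the service accepts the standard pyvo maxrec argument to submit_job, the job definition would look like:
# Submit the same query, capping the result at 10 rows via maxrec
# instead of TOP (assumes the standard pyvo submit_job signature).
job = service.submit_job(query, maxrec=10)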
4.4.1. Use caution with TOP and ORDER BY¶
Combined use of TOP and ORDER BY in ADQL queries can be dangerous: it may take an unexpectedly long time because the database is trying to first sort, and then extract the top N elements. It is best to only combine TOP and ORDER BY if the query's WHERE statements significantly cut down the number of objects that would need to be sorted.
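The query in Section 4.4 above is an example of safe combined use: its cone-search WHERE constraint limits the sort to the 141 objects within the 0.01-degree radius before the top 10 rows are returned.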