Leveraging Transparency in Coverage (TiC) Data for Health Care Price Analysis

The Transparency in Coverage (TiC) regulations have introduced unprecedented visibility into negotiated health care prices in the United States. By requiring insurers to publish machine-readable files containing payer–provider contracted rates starting in 2022, the policy has created a new data source for studying price variation. However, the scale, inconsistency, and missing information within the TiC data mean that rigorous methodological work is required before it can be used for research. This brief explores the nature of this data, how it is accessed and processed, and how it can be used for analysis, with a detailed walkthrough of a real example examining childbirth prices in Pennsylvania.

The Nature and Contents of TiC Data

TiC data consist of large-scale machine-readable files (MRFs), typically in JSON or CSV format, that disclose the contracted rates between payers and providers for covered health care services. These files contain detailed pricing information at the payer, plan, provider, and billing-code level. In theory, this allows researchers to observe the full distribution of negotiated prices for services across markets, payers, and providers.

In practice, however, these files are extremely large and complex. A single payer’s monthly release can span hundreds of gigabytes or more, and the data are structured for computational processing rather than human interpretation. As a result, TiC data are best understood not as a research-ready dataset, but as a raw input that must be transformed through a structured process before it becomes analytically useful.

Acquisition and the Access Burden

Although TiC data files are technically “public” and hosted on payer websites, the sheer volume of data creates a high barrier to entry. Payers do not host their files in a consistent way. Some provide thousands of separate download links and others provide single, multi-terabyte consolidated files. Researchers must, therefore, build systematic collection processes to identify, download, and track relevant files over time. Furthermore, because the TiC files are updated monthly, longitudinal research requires substantial storage infrastructure and version-control protocols to manage the constant influx of new data.

Given these demands, many researchers may find it more efficient to partner with organizations that are already doing the data processing work rather than attempting to manage the collection and cleaning processes themselves. For the analysis of childbirth prices in Pennsylvania, HCCI partnered with Gigasheet, a company that specializes in collecting and processing raw machine-readable files. Purchasing access or partnering with such organizations allows researchers to bypass the resource-intensive extraction and processing phases and focus on analysis.

From Raw Machine-Readable Files to Analytic Dataset

Transforming TiC data into a usable dataset involves a multi-stage process that includes collection, extraction, cleaning, enrichment, and analysis. After downloading the files, researchers must parse complex JSON structures into tabular formats and remove duplicate records, as payers often report the same rate several times across different plan identifiers.
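The parse-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the pipeline used in the analysis: the sample document is a hypothetical, heavily simplified excerpt of an in-network file, the field names follow our reading of the published TiC schema and should be checked against the actual files, and `json.loads` is only suitable for small excerpts (files of hundreds of gigabytes would need a streaming parser).

```python
import json

# Hypothetical, simplified excerpt of an in-network file (assumed layout).
SAMPLE = json.loads("""
{"in_network": [
  {"billing_code_type": "MS-DRG", "billing_code": "0807",
   "negotiated_rates": [
     {"provider_groups": [{"npi": [1111111111],
                           "tin": {"type": "ein", "value": "12-3456789"}}],
      "negotiated_prices": [{"negotiated_type": "negotiated",
                             "billing_class": "institutional",
                             "negotiated_rate": 9500.0}]},
     {"provider_groups": [{"npi": [1111111111],
                           "tin": {"type": "ein", "value": "12-3456789"}}],
      "negotiated_prices": [{"negotiated_type": "negotiated",
                             "billing_class": "institutional",
                             "negotiated_rate": 9500.0}]}
   ]}
]}
""")


def flatten_in_network(doc):
    """Yield one flat row per (billing code, NPI, negotiated price)."""
    for item in doc.get("in_network", []):
        for rate in item.get("negotiated_rates", []):
            for group in rate.get("provider_groups", []):
                tin = group.get("tin", {}).get("value")
                for npi in group.get("npi", []):
                    for price in rate.get("negotiated_prices", []):
                        yield {
                            "billing_code_type": item.get("billing_code_type"),
                            "billing_code": item.get("billing_code"),
                            "npi": str(npi),
                            "tin": tin,
                            "negotiated_type": price.get("negotiated_type"),
                            "billing_class": price.get("billing_class"),
                            "negotiated_rate": price.get("negotiated_rate"),
                        }


def dedupe(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


flat_rows = list(flatten_in_network(SAMPLE))
unique_rows = dedupe(flat_rows)
```

In the sample, the same rate appears twice, so two flattened rows collapse to one. In practice the deduplication key would likely exclude plan identifiers, since the same rate is often repeated across plans.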

The most critical stage is cleaning and validation. TiC data contain a large number of implausible or irrelevant records, which can constitute up to 90% of a raw file. The most common issue is “zombie rates”: prices assigned to provider–service combinations that would not occur in practice, such as a cardiologist having a negotiated rate for a chiropractic adjustment. These must be removed using clinical logic and provider data. Cleaning is not a one-time step; it is iterative, because duplicates and inconsistencies can reappear after transformations or joins.
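One way to express the zombie-rate check is an allowlist of provider taxonomies per billing code. The sketch below is illustrative only: the allowlist, the NPI-to-taxonomy lookup, and the sample rows are hypothetical, and a production rule set would be far larger and clinically reviewed.

```python
# Hypothetical allowlist: billing code -> taxonomy codes that plausibly bill it.
# 98940 is a chiropractic manipulation CPT code; 111N00000X is the
# chiropractor taxonomy and 207RC0000X is cardiovascular disease.
PLAUSIBLE_TAXONOMIES = {
    "98940": {"111N00000X"},
}


def drop_zombie_rates(rows, npi_taxonomy, plausible=PLAUSIBLE_TAXONOMIES):
    """Drop rows whose provider taxonomy is implausible for the billing code.

    Codes without a rule in the allowlist pass through unchanged.
    """
    kept = []
    for row in rows:
        allowed = plausible.get(row["billing_code"])
        taxonomy = npi_taxonomy.get(row["npi"])
        if allowed is None or taxonomy in allowed:
            kept.append(row)
    return kept


# Hypothetical provider lookup and rates.
npi_taxonomy = {"1000000001": "207RC0000X", "1000000002": "111N00000X"}
rows = [
    {"npi": "1000000001", "billing_code": "98940", "rate": 45.0},  # zombie
    {"npi": "1000000002", "billing_code": "98940", "rate": 42.0},  # plausible
]
clean_rows = drop_zombie_rates(rows, npi_taxonomy)
```

Here the cardiologist's chiropractic rate is dropped and the chiropractor's rate is kept.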

As part of the data cleaning process, the data may be enriched by linking to external sources. Provider identifiers can be matched to the National Provider Identifier (NPI) registry to obtain taxonomy/specialty, practice location, and other attributes. Organizational structures are often resolved using Employer Identification Numbers (EINs) and public datasets. This enrichment process is essential for turning raw data into meaningful observations about real-world health care systems.

Research Questions Define Analytic Datasets: An Example of Childbirth Prices in Pennsylvania

In our analysis of childbirth prices in Pennsylvania, we structured the methodology in a way that is broadly applicable to most TiC-based research questions. Using TiC data for research is not simply a matter of downloading files and calculating averages. Instead, the analysis begins by defining a research question: narrowing to a specific geography, a specific service, and a specific comparison framework. In this case, we asked how much negotiated prices for childbirth vary in Pennsylvania, and whether the patterns found in prior claims-based analysis could be replicated using Transparency in Coverage files. The research question and proposed analytic approach matter because they determine every downstream decision: which files to collect, which billing codes to keep, which providers to include, and which rates to exclude as not meaningfully comparable.

The first step in building an analytic dataset to answer our question was to limit TiC files to major payers operating in the state. This restriction left us with data from eight payers in Pennsylvania (Highmark, Capital Blue Cross, UPMC, Independence Blue Cross, Aetna, Geisinger, Cigna, and UnitedHealthcare), comprising more than 11,000 raw MRFs, 30 billion rows, and 1.6 terabytes of data. Zombie rates were removed as part of the initial data ingestion and processing.

We then defined the service of interest using MS-DRG codes for childbirth (e.g., 768, 783–788, 796–798, 805–807), ensuring codes were standardized across payers. For this analysis, we had to remove leading zeros from DRG values so that a code reported as "0768" in one file and "768" in another would be treated as the same code. Without this step, codes would not match consistently across payers and source files, and later aggregation and price comparisons would be inaccurate.
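The code standardization can be sketched in a few lines. The set below contains only the example MS-DRG ranges listed above, and the function name and sample values are hypothetical.

```python
# Example childbirth MS-DRGs from the text: 768, 783-788, 796-798, 805-807.
CHILDBIRTH_DRGS = {768, *range(783, 789), *range(796, 799), *range(805, 808)}


def normalize_drg(raw):
    """Strip whitespace and leading zeros; return an int, or None if not numeric."""
    stripped = str(raw).strip().lstrip("0")
    if stripped.isascii() and stripped.isdigit():
        return int(stripped)
    return None


# Hypothetical raw values as they might appear across payer files.
raw_codes = ["0768", "768", " 807", "0470", "N/A"]
normalized = [normalize_drg(code) for code in raw_codes]
childbirth = [code for code in normalized if code in CHILDBIRTH_DRGS]
```

After normalization, "0768" and "768" resolve to the same code, and non-childbirth or non-numeric values fall out of the filter.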

We then restricted the dataset by provider characteristics from NPPES data joined to the TiC data on NPI. We limited to acute inpatient facility taxonomies and institutional billing types. This step is crucial because childbirth hospital comparisons should be based on settings that plausibly provide inpatient delivery care. We also limited to provider ZIP codes in Pennsylvania and mapped ZIP codes to Core-Based Statistical Areas (CBSAs) using the provider’s primary practice location. This allows for within-market and across-market comparisons, which are more informative than a statewide average alone. Our analysis of childbirth prices in Pennsylvania included 35 CBSAs in the state.
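A simplified version of this provider restriction is sketched below. The NPPES extract, the ZIP-to-CBSA crosswalk, and the field names are hypothetical stand-ins; the sketch uses a single taxonomy code (282N00000X, general acute care hospital) where the actual analysis may have used a broader list, and it assumes the crosswalk contains only Pennsylvania ZIP codes.

```python
# Assumed taxonomy allowlist; 282N00000X is the general acute care hospital code.
ACUTE_CARE_TAXONOMIES = {"282N00000X"}

# Hypothetical crosswalk limited to Pennsylvania ZIP codes.
PA_ZIP_TO_CBSA = {"15213": "38300", "19104": "37980"}

# Hypothetical NPPES extract keyed by NPI.
nppes = {
    "1111111111": {"taxonomy": "282N00000X", "zip": "152132582"},
    "2222222222": {"taxonomy": "111N00000X", "zip": "19104"},
    "3333333333": {"taxonomy": "282N00000X", "zip": "44106"},
}


def enrich_and_restrict(rows, nppes, zip_to_cbsa=PA_ZIP_TO_CBSA):
    """Join rates to provider attributes on NPI and keep in-scope hospitals."""
    kept = []
    for row in rows:
        provider = nppes.get(row["npi"])
        if provider is None:
            continue  # NPI not found in the registry extract
        if provider["taxonomy"] not in ACUTE_CARE_TAXONOMIES:
            continue  # not an acute inpatient facility
        zip5 = provider["zip"][:5]  # NPPES ZIPs may be 9 digits
        if zip5 not in zip_to_cbsa:
            continue  # outside the Pennsylvania crosswalk
        kept.append({**row, "zip": zip5, "cbsa": zip_to_cbsa[zip5]})
    return kept


rates = [
    {"npi": "1111111111", "drg": 807, "rate": 9500.0},
    {"npi": "2222222222", "drg": 807, "rate": 400.0},
    {"npi": "3333333333", "drg": 807, "rate": 11000.0},
]
hospital_rates = enrich_and_restrict(rates, nppes)
```

Only the first provider survives: the second is not an acute care hospital, and the third is located outside the crosswalk.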

We then restricted the dataset to comparable rate types by including dollar-denominated negotiated rates and excluding per diem and percentage-based rates that could not be reliably converted into hospital admission prices. Per diem rates were excluded because TiC files do not include length-of-stay information, so a daily hospital rate cannot be converted into a comparable childbirth admission price. Percentage rates were excluded because the denominator or reference amount was unclear, making them uninterpretable in dollar terms.
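The rate-type restriction amounts to a simple filter. In this sketch, the `negotiated_type` labels ("per diem", "percentage", "negotiated") follow our understanding of the TiC schema and should be verified against the files being used; the rows are hypothetical.

```python
# Rate types that cannot be converted into a per-admission dollar price.
EXCLUDED_TYPES = {"per diem", "percentage"}


def comparable_rates(rows):
    """Keep rows whose rate type is dollar-denominated per admission."""
    return [
        row for row in rows
        if row["negotiated_type"].strip().lower() not in EXCLUDED_TYPES
    ]


# Hypothetical rows for one DRG at one hospital.
rows = [
    {"negotiated_type": "negotiated", "negotiated_rate": 9500.0},
    {"negotiated_type": "per diem", "negotiated_rate": 2100.0},
    {"negotiated_type": "Percentage", "negotiated_rate": 65.0},
]
dollar_rows = comparable_rates(rows)
```

The per diem value of 2,100 and the percentage value of 65 would badly distort a median if left alongside admission-level dollar rates, which is why they are dropped rather than converted.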

We then performed provider and payer entity resolution. For providers, we used EINs to define providers at the hospital or system level. For payers, we standardized payer and plan names to minimize duplication of provider-payer level negotiated rates. Entity resolution is important when using TiC data because variation can sometimes reflect reporting inconsistencies rather than true price differences. If one payer’s files list the same rate under several affiliate names, failing to normalize may overcount observations and miscalculate price variation.
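A minimal sketch of this normalization is shown below. The alias table, field names, and sample rows are hypothetical; real entity resolution typically needs a much larger, manually reviewed mapping.

```python
import re

# Hypothetical alias table mapping cleaned name variants to one payer label.
PAYER_ALIASES = {
    "unitedhealthcare": "UnitedHealthcare",
    "united healthcare": "UnitedHealthcare",
    "uhc": "UnitedHealthcare",
}


def normalize_payer(name):
    """Lowercase, drop punctuation, collapse spaces, then apply the alias table."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\s+", " ", key).strip()
    return PAYER_ALIASES.get(key, name.strip())


def normalize_ein(ein):
    """Keep digits only so '12-3456789' and '123456789' match."""
    return re.sub(r"\D", "", ein)


def resolve_entities(rows):
    """Standardize payer and EIN, then drop rows that become duplicates."""
    seen = set()
    resolved = []
    for row in rows:
        clean = {**row,
                 "payer": normalize_payer(row["payer"]),
                 "ein": normalize_ein(row["ein"])}
        key = (clean["payer"], clean["ein"], clean["drg"], clean["rate"])
        if key not in seen:
            seen.add(key)
            resolved.append(clean)
    return resolved


rows = [
    {"payer": "UHC", "ein": "12-3456789", "drg": 807, "rate": 9500.0},
    {"payer": "United  Healthcare", "ein": "123456789", "drg": 807, "rate": 9500.0},
    {"payer": "UnitedHealthcare", "ein": "12-3456789", "drg": 807, "rate": 9500.0},
]
resolved_rows = resolve_entities(rows)
```

The three rows describe the same rate under different affiliate spellings; after normalization they count as one observation instead of three.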

From this cleaned and enriched dataset, we constructed analytic files summarizing median negotiated rates at relevant levels (e.g., CBSA–payer–provider–DRG). Variation was assessed using percentile-based measures (such as 90th-to-10th percentile ratios), with minimum observation thresholds applied to ensure reliability. We examined variation across markets, within providers, and across payers.
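The summary measures can be sketched with the standard library. The grouping keys, the minimum-observation threshold of 10, and all rate values below are illustrative assumptions rather than the parameters used in the analysis.

```python
from collections import defaultdict
from statistics import median, quantiles


def median_rates(rows, keys=("cbsa", "payer", "ein", "drg")):
    """Median negotiated rate for each combination of the grouping keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["rate"])
    return {group: median(rates) for group, rates in groups.items()}


def p90_p10_ratio(values, min_obs=10):
    """Ratio of the 90th to the 10th percentile, or None if too few observations."""
    if len(values) < min_obs:
        return None
    deciles = quantiles(values, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    return p90 / p10 if p10 > 0 else None


# Hypothetical rates for one payer, hospital, and DRG in one market.
rows = [
    {"cbsa": "38300", "payer": "A", "ein": "123456789", "drg": 807, "rate": r}
    for r in (9000.0, 10000.0, 14000.0)
]
medians = median_rates(rows)

# Hypothetical provider-level medians across a market.
market_medians = [10000 + 2000 * i for i in range(11)]
ratio = p90_p10_ratio(market_medians)
too_few = p90_p10_ratio(market_medians[:5])
```

With these inputs the 10th and 90th percentiles are 12,000 and 28,000, so the highest-priced decile is roughly 2.3 times the lowest; the five-observation market returns `None` because it falls below the threshold.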

Finally, we validated results against prior claims-based findings from earlier HCCI research. We found that the childbirth price ranges and average levels in TiC data were broadly aligned with the claims-based estimates in Pennsylvania, which supports the use of TiC data for market analysis.

Limitations: A Lack of Utilization Data

Despite their depth, TiC data have significant caveats that must be managed to avoid misleading conclusions. The most notable limitation is the total absence of utilization or volume data. TiC data show that a rate exists in a contract but do not indicate whether any patient actually received that service at that price. Consequently, it is difficult to calculate an average price paid in a market or estimate spending without blending the TiC data with external claims datasets that show service frequency. Aggregated claims data can provide a utilization benchmark that, when combined with TiC data, yields more accurate market-level price and spending estimates.
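The effect of missing utilization can be illustrated with a small worked example. All rates and volumes below are hypothetical; the volumes stand in for admission counts that would come from an external claims source.

```python
def simple_mean(rates):
    """Unweighted mean of posted rates; every contract counts equally."""
    return sum(rates.values()) / len(rates)


def volume_weighted_mean(rates, volumes):
    """Mean price weighted by admissions for providers present in both sources."""
    shared = [p for p in rates if volumes.get(p, 0) > 0]
    total_volume = sum(volumes[p] for p in shared)
    if total_volume == 0:
        return None
    return sum(rates[p] * volumes[p] for p in shared) / total_volume


# Hypothetical TiC rates for a delivery DRG at two hospitals.
tic_rates = {"hospital_a": 10000.0, "hospital_b": 20000.0}
# Hypothetical admission counts from an external claims benchmark.
claims_volumes = {"hospital_a": 300, "hospital_b": 100}

unweighted = simple_mean(tic_rates)
weighted = volume_weighted_mean(tic_rates, claims_volumes)
```

The unweighted mean is 15,000, but because three quarters of deliveries occur at the lower-priced hospital, the volume-weighted average is 12,500. TiC data alone can only support the first figure.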

CMS is aware of the limitations created by the lack of utilization data in the TiC files and has proposed rules that would add a Utilization File to the TiC data. The addition of utilization data, along with the other proposed changes, may greatly improve the usability of TiC data in the future.

Conclusion

Transparency in Coverage data represent a significant advancement in the availability of information on negotiated health care prices, offering researchers a new lens into variation across payers, providers, and markets. As demonstrated in the childbirth analysis in Pennsylvania, TiC data can be used to replicate and extend findings from traditional claims-based research, particularly in understanding the range and distribution of negotiated rates across payers and providers.

At the same time, the value of TiC data depends heavily on the methods used to create an analytic dataset. The raw data are not inherently research-ready and require substantial processing, including careful service definition, data cleaning, provider and payer entity resolution, and restrictions to ensure comparability. Without these steps, analyses may not be replicable and risk reflecting the messiness of the raw data rather than meaningful differences in prices. Additionally, the absence of utilization data remains a fundamental limitation, requiring integration with external sources to fully assess spending and average prices.

Overall, TiC data should be viewed as a powerful but incomplete resource. When used appropriately, they can provide important insights into health care pricing dynamics and market structure. As data quality improves and methods continue to evolve, TiC data are likely to become an increasingly valuable complement to claims data in health services research.
