Data Linking Toolkit: Step 4 – Link Data

Figure: Step 4: Link Data

In Step 4, data linking partners begin the more technical activities. Whether Part C or Part B 619 program staff are linking data within their own program or with another agency or program, successful data linking requires a number of technical activities. Because some activities are iterative, decisions made during one may require data linking partners to return to one or more previous activities. For example, the partners may return to identifying data elements to improve the number of potential matching records. Additionally, some activities may not be required. For example, if data linking partners share a common unique ID, they should move directly to standardizing the data elements in Activity 4g. The table below displays the roles of team members potentially involved in Step 4 activities.

Figure: Step 4: Team Members

TIP: Although data can be linked manually, data linking partners usually use software to compare records and link the data. Step 4 describes the major activities required to link data without regard to specific software. No matter the software used, Part C and Part B 619 program staff are encouraged to access and use the suite of tools (e.g., Align, Connect) that the Common Education Data Standards (CEDS) initiative created to facilitate data linking.

Activity 4a: Select record-matching approach

What about linking non-child records?

Most Part C and Part B 619 data are focused on the provision of services to children. Therefore, examples in this toolkit focus on linking child-level records. However, in some cases, Part C or Part B 619 program staff may need to link workforce or program-level data to answer key questions about policy and program improvement. Regardless of whether linking is focused on child-level, workforce, or program-level data, or any combination of these, partners need to follow the same basic activities.

A shared unique ID across data sets greatly facilitates data linking activities. If data linking partners share a unique ID, they can skip Activities 4a–4e associated with the matching process. However, many data sets do not share a unique ID. In these instances, it is important to establish a record-matching approach. There are two primary approaches for matching records: deterministic and probabilistic.

Deterministic matching is a process that requires exact matching of all selected elements for two records to be considered a match. All selected elements (e.g., birthdates, names, sex) must be exact matches. If any element differs between two records, it is considered a nonmatch. However, methods can be used to account for “equivalencies” (e.g., Bill and William, Rd and road). Deterministic matching often uses a limited number of elements in the matching algorithm—generally only elements unlikely to change.
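
The deterministic approach can be illustrated with a minimal Python sketch. It is not a prescribed implementation: the element names, the nickname table, and the sample records are hypothetical and exist only to show exact matching after "equivalencies" are applied.

```python
# Hypothetical equivalency table; a real project would maintain its own list.
NAME_EQUIVALENTS = {"bill": "william", "will": "william", "liz": "elizabeth"}


def normalize_name(value: str) -> str:
    """Lowercase, trim, and map known nicknames to one canonical form."""
    cleaned = value.strip().lower()
    return NAME_EQUIVALENTS.get(cleaned, cleaned)


def match_key(record: dict) -> tuple:
    """Build the tuple of normalized values that must agree exactly."""
    return (
        normalize_name(record["first_name"]),
        record["last_name"].strip().lower(),
        record["birth_date"],
        record["sex"].strip().upper(),
    )


def is_deterministic_match(a: dict, b: dict) -> bool:
    """Two records match only if every selected element is identical."""
    return match_key(a) == match_key(b)


# Made-up records for illustration only.
part_c = {"first_name": "Bill", "last_name": "Rivera",
          "birth_date": "2019-03-14", "sex": "M"}
part_b = {"first_name": "William", "last_name": "rivera ",
          "birth_date": "2019-03-14", "sex": "m"}
other = {"first_name": "William", "last_name": "Rivera",
         "birth_date": "2019-03-15", "sex": "M"}

print(is_deterministic_match(part_c, part_b))  # True: nickname and case are normalized
print(is_deterministic_match(part_c, other))   # False: birth dates differ by one day
```

In this sketch, a single differing element (the birth date) is enough to make the pair a nonmatch, which is the defining property of deterministic matching.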

Activity 4b: Identify data elements for use case

Next, data linking partners must select data elements from both data sets that will enable matching (if needed) and support all the analyses. The partners must determine which elements one or both partners will contribute. Some elements will be strictly used for analysis, others for matching, and some for both.

TIP: If an element found in both data sets is not used for matching or analysis, it is not necessary to include it in the linked data set.
TIP: Data dictionaries are often the best source of information to understand details on elements within the data sets.

Activity 4c: Select data elements for the use case

Partners must identify and include in the linked data set all elements needed to answer the data linking use case. Each use case dictates the elements needed. Common Part C or Part B 619 elements include, but are not limited to, the following: [2]

  • Disability or reason eligible (HI, DD, established condition, etc.)
  • ICD-9/ICD-10 code
  • Service type (PT, OT, VI)
  • Setting or education environment
  • Service level (hours, frequency, intensity)
  • Dates to support service duration (service beginning and end dates)
  • Other program involvement data reported by Part C or Part B 619 families (foster care, CAPTA, EHDI)
  • Eligibility determination (date referred, delay reason)
  • Child outcomes
  • Family outcomes
  • Transition (dates, reasons, delays)
  • Exit (dates, reason)

Activity 4d: Select data elements for matching

Demographic elements used for disaggregation as part of the use case analysis are also frequently selected to match records. Depending on the matching approach, data linking partners may select primary and secondary elements for matching. Deterministic matching relies solely on primary matching elements, whereas probabilistic matching often includes both primary and secondary elements.

Primary matching elements are least likely to change and serve as strong differentiators in the matching process. Examples of frequently used primary matching elements are:

  • Child’s date of birth
  • Mother’s maiden name
  • Child’s last name
  • Child’s first name [3]
  • Sex [4]
TIP: Even if all the primary matching elements listed are not initially included in the matching algorithm, they may be useful later if the algorithm is modified.

Activity 4e: Create matching algorithm

Determining how many elements to match against, which elements to match, and what scores to assign those elements is part art and part science. It is an iterative process of adjusting the scores, and possibly the elements, during the matching process. DaSy recommends that data linking partners consider at least three primary matching elements and at least four secondary matching elements if they are using probabilistic matching. Then, the partners assign scores to each element. If the partners are using deterministic matching, they assign scores to only primary matching elements.
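
As a rough illustration of assigning scores to elements, the Python sketch below sums the scores of the elements on which two records agree. The score values, the secondary element names, and the sample records are all assumptions made for the example; the toolkit does not prescribe specific values.

```python
# Hypothetical scores: three primary and four secondary matching elements.
PRIMARY_SCORES = {"birth_date": 30, "last_name": 25, "first_name": 20}
SECONDARY_SCORES = {"zip_code": 10, "phone": 5, "city": 5, "street": 5}
ALL_SCORES = {**PRIMARY_SCORES, **SECONDARY_SCORES}


def match_score(a: dict, b: dict, scores: dict) -> int:
    """Sum the scores of elements that are present and equal in both records."""
    total = 0
    for element, points in scores.items():
        left, right = a.get(element), b.get(element)
        # A missing value never earns points, even if missing in both records.
        if left is not None and left == right:
            total += points
    return total


# Made-up, already standardized records for illustration only.
rec_a = {"birth_date": "2019-03-14", "last_name": "rivera", "first_name": "william",
         "zip_code": "27514", "phone": "5550100", "city": "chapel hill",
         "street": "12 oak st"}
rec_b = {"birth_date": "2019-03-14", "last_name": "rivera", "first_name": "liam",
         "zip_code": "27514", "phone": None, "city": "chapel hill",
         "street": "12 oak st"}

print(match_score(rec_a, rec_b, ALL_SCORES))      # 75: first name differs, phone missing
print(match_score(rec_a, rec_b, PRIMARY_SCORES))  # 55: primary elements only
```

Passing only `PRIMARY_SCORES` mirrors the deterministic case, where scores are assigned to primary matching elements alone. Adjusting the dictionaries and rerunning is one simple way to carry out the iterative tuning described above.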

Activity 4f: Establish business rules

Business rules are criteria used to make decisions about data. In data linking, business rules are needed for at least two purposes. The first is to resolve cases in which an important element differs within a confidently matched record pair. The second is to investigate unknown matches.

Business rules for addressing differences in confidently matched record pairs

When the two data sets have the same data for the same element for a matched record pair, all is well. But when the data for an important element for a confidently matched record pair differ across the data sets, data linking partners need to establish and activate business rules. The partners must determine which source should be captured for that one differing element. This is called the “source of authority/truth.”
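
A source-of-authority rule can be written down as a simple lookup. The Python sketch below is one possible way to do so; the source names, element names, and the fallback behavior when the authoritative value is missing are assumptions for illustration, not rules from the toolkit.

```python
# Hypothetical rule table: which data set is authoritative for each element.
SOURCE_OF_AUTHORITY = {
    "birth_date": "part_c",
    "address": "part_b_619",
}


def resolve_element(element: str, values: dict, default_source: str = "part_c"):
    """Pick the value for one element of a confidently matched record pair.

    `values` maps a source name to that source's value for the element.
    """
    source = SOURCE_OF_AUTHORITY.get(element, default_source)
    chosen = values.get(source)
    if chosen is None:
        # Assumed fallback: if the authority has no value, use any other one.
        chosen = next((v for v in values.values() if v is not None), None)
    return chosen


# Made-up differing values for one matched pair.
print(resolve_element("address", {"part_c": "12 Oak St", "part_b_619": "48 Elm Ave"}))
print(resolve_element("birth_date", {"part_c": "2019-03-14", "part_b_619": "2019-03-15"}))
```

Keeping the rules in one table makes them easy to document and to revisit when the partners change which source they trust for a given element.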

Activity 4g: Standardize data elements

Data elements for matching or analysis should be in a consistent format across data sets. For example, dates (e.g., date of birth, service start date, service end date) can be formatted as a single data field or multiple fields. Even in a single field, dates can be formatted differently (e.g., MM/DD/YYYY, MM/DD/YY, September 29, 2021). All date-related elements in both data sets that will be used for matching or analysis should have the same format. If the format differs, data linking partners will need to transform (modify) one data set to align with the other data set. The partners should document standardization rules to accommodate future linking of the data sets.
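
The date example above can be sketched in Python using only the standard library. The target format (ISO 8601, YYYY-MM-DD) and the function names are assumptions chosen for the illustration; partners may agree on any single format.

```python
from datetime import date, datetime

# The three single-field formats mentioned in the text.
# Note: "%B" parses English month names only under the default (C) locale.
KNOWN_FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%B %d, %Y")


def standardize_date(raw: str) -> str:
    """Convert a date string in any known format to YYYY-MM-DD."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {raw!r}")


def standardize_date_parts(year: int, month: int, day: int) -> str:
    """Convert a date stored in three separate fields to YYYY-MM-DD."""
    return date(year, month, day).isoformat()


print(standardize_date("09/29/2021"))          # 2021-09-29
print(standardize_date("09/29/21"))            # 2021-09-29
print(standardize_date("September 29, 2021"))  # 2021-09-29
print(standardize_date_parts(2021, 9, 29))     # 2021-09-29
```

Two-digit years are ambiguous by nature; Python's `%y` reads 00 through 68 as 2000 through 2068, which should be confirmed against the data before relying on it. Raising an error for unrecognized formats, rather than guessing, makes it easier to document each new standardization rule as it is discovered.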

Activity 4h: Match records

After establishing the initial matching algorithm, setting business rules, and standardizing data elements, data linking partners can begin matching records. This involves comparing the matching elements in each record in one data set with those in each record in the other data set, based on the predetermined matching algorithm.
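
The comparison of each record in one data set with each record in the other can be sketched as follows. This is a simplified illustration, not a description of any particular linking software: the scores, the two cutoff values, the "unknown" band between them, and the sample records are all hypothetical.

```python
from itertools import product

# Hypothetical scores and cutoffs; real values come from the matching algorithm.
SCORES = {"birth_date": 40, "last_name": 35, "first_name": 25}
MATCH_THRESHOLD = 80     # at or above: treated as a confident match
UNKNOWN_THRESHOLD = 40   # between the two cutoffs: needs investigation


def score_pair(a: dict, b: dict) -> int:
    """Sum the scores of elements that are present and equal in both records."""
    return sum(points for element, points in SCORES.items()
               if a.get(element) is not None and a.get(element) == b.get(element))


def classify(score: int) -> str:
    """Translate a score into match, unknown, or nonmatch."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= UNKNOWN_THRESHOLD:
        return "unknown"
    return "nonmatch"


def match_records(set_a: list, set_b: list) -> list:
    """Compare every record in set_a with every record in set_b."""
    results = []
    for a, b in product(set_a, set_b):
        status = classify(score_pair(a, b))
        if status != "nonmatch":
            results.append((a["id"], b["id"], status))
    return results


# Made-up, already standardized records.
set_a = [{"id": "C1", "first_name": "william", "last_name": "rivera", "birth_date": "2019-03-14"},
         {"id": "C2", "first_name": "ana", "last_name": "lopez", "birth_date": "2018-11-02"}]
set_b = [{"id": "B7", "first_name": "william", "last_name": "rivera", "birth_date": "2019-03-14"},
         {"id": "B8", "first_name": "anna", "last_name": "lopez", "birth_date": "2018-11-02"}]

print(match_records(set_a, set_b))
# [('C1', 'B7', 'match'), ('C2', 'B8', 'unknown')]
```

Pairs classified as "unknown" are the ones the business rules from Activity 4f are meant to help resolve.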

Activity 4i: Create joined data set and check data quality

Once record matching is final, partners can link the data from the two matched records. Linking combines all the selected elements from the two matched records into a single new record in a new data set.

After the data are linked, it is important to check the integrity of the technical linking (i.e., check that the records were accurately combined). The partners should spot-check an adequate number of records to make sure each linked element in the new data set is identical to the corresponding element in its original data source. This helps ensure that nothing was missed or performed incorrectly during linking.
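
Linking and spot-checking can be sketched together in Python. The record layout, the `a_`/`b_` prefixes on linked element names, and the sample data are assumptions made for this illustration; a fixed random seed is used only so the sample is repeatable.

```python
import random


def link_pair(rec_a: dict, rec_b: dict, elements_a: list, elements_b: list) -> dict:
    """Combine the selected elements of two matched records into one new record."""
    linked = {"a_id": rec_a["id"], "b_id": rec_b["id"]}
    linked.update({f"a_{e}": rec_a[e] for e in elements_a})
    linked.update({f"b_{e}": rec_b[e] for e in elements_b})
    return linked


def spot_check(linked_records: list, source_a: dict, source_b: dict,
               sample_size: int, seed: int = 0) -> list:
    """Return (a_id, b_id, element) for sampled values that differ from the source."""
    rng = random.Random(seed)
    sample = rng.sample(linked_records, min(sample_size, len(linked_records)))
    problems = []
    for linked in sample:
        originals = {"a": source_a[linked["a_id"]], "b": source_b[linked["b_id"]]}
        for key, value in linked.items():
            prefix, element = key.split("_", 1)
            if originals[prefix][element] != value:
                problems.append((linked["a_id"], linked["b_id"], element))
    return problems


# Made-up source data keyed by each data set's own record ID.
source_a = {"C1": {"id": "C1", "service_type": "PT", "exit_date": "2021-09-29"}}
source_b = {"B7": {"id": "B7", "setting": "home", "entry_date": "2021-10-04"}}
matched_pairs = [("C1", "B7")]

linked = [link_pair(source_a[a], source_b[b],
                    ["service_type", "exit_date"], ["setting", "entry_date"])
          for a, b in matched_pairs]

print(spot_check(linked, source_a, source_b, sample_size=1))  # [] means no discrepancies
```

An empty list indicates that every sampled value in the linked data set agrees with its original source; any tuple returned points to a specific record pair and element to investigate.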

Activity 4j: Certify joined data set

Program or agency staff and leadership should carefully review the results of the matching, linking, and analysis before signing off on the data linking work. In some cases, program or agency leadership will be comfortable with certifying the work before the analysis is conducted (Step 5) and the results made available. In other cases, especially when interested stakeholders are awaiting the results of the linked data, certification may need to wait until after the use case has been addressed (i.e., after the data are analyzed and possibly a report drafted for internal review). Eventually, all parties must agree that the data linking was conducted accurately and is complete.


2 The elements shown are from the list of elements in the DaSy Data System Framework, subcomponent System Design and Development, Quality Indicator 3.
3 First initial is sometimes used to reduce issues with misspellings or nicknames.
4 Sex is an important element for analysis and is usually brought in and matched across data sets. However, it is frequently not scored in the algorithm.

Published July 2022.