Get Started with RadReportParser

Anatomy of Radiology Report

Understanding the structure of radiology reports is essential for effective text extraction. This section explains how radreportparser models and processes radiology report text.

Report Structure

A radiology report typically consists of several distinct sections, each serving a specific purpose in communicating the radiological findings. radreportparser uses a section-based parsing approach, where:

Each section is identified by a section keyword (e.g., “HISTORY:”, “FINDINGS:”)
Section content extends from its keyword until the next section keyword
Sections are non-overlapping and sequential

The typical section of radiology report ¹ follows the structure in Figure 1:

Note

Section order can vary between institutions and report types. radreportparser is designed to handle flexible section ordering.

Radiology Report Data Model

radreportparser uses the RadReport dataclass to represent structured report data. This class provides:

Type-safe section storage
Easy access to individual sections
Conversion to standard data formats

Here’s an example of working with the RadReport class:

from radreportparser import RadReport

# Create a RadReport instance
report = RadReport(
    title="EMERGENCY MDCT OF THE BRAIN",
    history="A 25-year-old female presents with headache...",
    technique="Axial helical scan of the brain...",
    findings="The brain shows age-appropriate volume...",
    impression="No intracranial hemorrhage..."
)

# Display the report object
report

RadReport(title='EMERGENCY MDCT OF THE BRAIN', history='A 25-year-old female presents with headache...', technique='Axial helical scan of the brain...', comparison=None, findings='The brain shows age-appropriate volume...', impression='No intracranial hemorrhage...')

Each section can be accessed as an attribute:

report.title

'EMERGENCY MDCT OF THE BRAIN'

# Sections that weren't specified return None
type(report.comparison)

NoneType

The RadReport class supports serialization to both Python dictionaries and JSON format:

Convert to python dictonary:

report.to_dict()

{'title': 'EMERGENCY MDCT OF THE BRAIN',
 'history': 'A 25-year-old female presents with headache...',
 'technique': 'Axial helical scan of the brain...',
 'comparison': None,
 'findings': 'The brain shows age-appropriate volume...',
 'impression': 'No intracranial hemorrhage...'}

Convert to JSON object:

print(report.to_json(indent = 2))

{
  "title": "EMERGENCY MDCT OF THE BRAIN",
  "history": "A 25-year-old female presents with headache...",
  "technique": "Axial helical scan of the brain...",
  "comparison": null,
  "findings": "The brain shows age-appropriate volume...",
  "impression": "No intracranial hemorrhage..."
}

Tip

Use to_dict(exclude_none=True) or to_json(exclude_none=True) to omit empty sections from the output.

Extract Report

The RadReportExtractor class provides high-level methods for extracting structured sections from radiology report text. Let’s walk through the extraction process using a sample report:

# Sample brain CT report (with markdown formatting)
report_text = """
EMERGENCY MDCT OF THE BRAIN

**HISTORY:** A 25-year-old female presents with headache. Physical examination reveals no focal neurological deficits.

TECHNIQUE: Axial helical scan of the brain performed with coronal and sagittal reconstructions.

*Comparison:* None.

Findings:

The brain shows age-appropriate volume with normal parenchymal attenuation and gray-white differentiation. No acute infarction or hemorrhage identified. The ventricles are normal in size without intraventricular hemorrhage. No extra-axial collection, midline shift, or brain herniation. The vascular structures appear normal. The calvarium and skull base show no fracture. Visualized paranasal sinuses, mastoids, and upper cervical spine are unremarkable.

=== IMPRESSION ===

- No intracranial hemorrhage, acute large territorial infarction, extra-axial collection, midline shift, brain herniation, or skull fracture identified.
"""

Initializing the Extractor

Before extracting sections, we need to initialize a RadReportExtractor() instance:

from radreportparser import RadReportExtractor, is_re2_available

# Initialize extractor with default configuration using build-in `re` module
extractor = RadReportExtractor()

Alternatively, you can use the Google re2 module (must be installed separately) for faster regex processing by specifying the backend parameter:

if is_re2_available():
    extractor2 = RadReportExtractor(backend="re2")

The RadReportExtractor() uses section keywords to identify the start of each section. These keywords are defined as regular expressions and can be customized during initialization using the keys_* parameters.

The default keywords are defined in the KeyWord enum:

from radreportparser import KeyWord

# Example: inspect default keywords for the history section
KeyWord.HISTORY.value

['[^\\w\\n]*History[^\\w\\n]*',
 '[^\\w\\n]*Indications?[^\\w\\n]*',
 '[^\\w\\n]*clinical\\s+history[^\\w\\n]*',
 '[^\\w\\n]*clinical\\s+indications?[^\\w\\n]*']

Extract All Sections

The extract_all() method provides a convenient way to extract all sections at once. It returns a RadReport instance containing the structured data:

# Extract all sections from the report
report = extractor.extract_all(report_text)
report

RadReport(title='EMERGENCY MDCT OF THE BRAIN', history='A 25-year-old female presents with headache. Physical examination reveals no focal neurological deficits.', technique='Axial helical scan of the brain performed with coronal and sagittal reconstructions.', comparison='None.', findings='The brain shows age-appropriate volume with normal parenchymal attenuation and gray-white differentiation. No acute infarction or hemorrhage identified. The ventricles are normal in size without intraventricular hemorrhage. No extra-axial collection, midline shift, or brain herniation. The vascular structures appear normal. The calvarium and skull base show no fracture. Visualized paranasal sinuses, mastoids, and upper cervical spine are unremarkable.', impression='- No intracranial hemorrhage, acute large territorial infarction, extra-axial collection, midline shift, brain herniation, or skull fracture identified.')

# Using `re2` backend
if is_re2_available():
    print(extractor2.extract_all(report_text))

The extracted sections can be accessed individually:

# Access specific sections
print("Title:", report.title)
print("\nHistory:", report.history)

Title: EMERGENCY MDCT OF THE BRAIN

History: A 25-year-old female presents with headache. Physical examination reveals no focal neurological deficits.

The RadReport instance can be easily converted to standard data formats:

# Convert to Python dictionary
report.to_dict()

{'title': 'EMERGENCY MDCT OF THE BRAIN',
 'history': 'A 25-year-old female presents with headache. Physical examination reveals no focal neurological deficits.',
 'technique': 'Axial helical scan of the brain performed with coronal and sagittal reconstructions.',
 'comparison': 'None.',
 'findings': 'The brain shows age-appropriate volume with normal parenchymal attenuation and gray-white differentiation. No acute infarction or hemorrhage identified. The ventricles are normal in size without intraventricular hemorrhage. No extra-axial collection, midline shift, or brain herniation. The vascular structures appear normal. The calvarium and skull base show no fracture. Visualized paranasal sinuses, mastoids, and upper cervical spine are unremarkable.',
 'impression': '- No intracranial hemorrhage, acute large territorial infarction, extra-axial collection, midline shift, brain herniation, or skull fracture identified.'}

# Convert to JSON (pretty-printed)
print(report.to_json(indent = 2))

{
  "title": "EMERGENCY MDCT OF THE BRAIN",
  "history": "A 25-year-old female presents with headache. Physical examination reveals no focal neurological deficits.",
  "technique": "Axial helical scan of the brain performed with coronal and sagittal reconstructions.",
  "comparison": "None.",
  "findings": "The brain shows age-appropriate volume with normal parenchymal attenuation and gray-white differentiation. No acute infarction or hemorrhage identified. The ventricles are normal in size without intraventricular hemorrhage. No extra-axial collection, midline shift, or brain herniation. The vascular structures appear normal. The calvarium and skull base show no fracture. Visualized paranasal sinuses, mastoids, and upper cervical spine are unremarkable.",
  "impression": "- No intracranial hemorrhage, acute large territorial infarction, extra-axial collection, midline shift, brain herniation, or skull fracture identified."
}

Tip

The extract_all() method accepts these parameters:

include_key: Whether to include section keywords in output (default: False)
word_boundary: Whether to use word boundaries in pattern matching (default: False)

Extract Individual Section

You can also extract specific sections using individual methods like extract_history(), extract_findings(), etc. These methods provide more control over extraction parameters.

For example, to extract just the history section:

# Include section keyword (default)
extractor.extract_history(report_text)

'**HISTORY:** A 25-year-old female presents with headache. Physical examination reveals no focal neurological deficits.'

Keyword Inclusion

You can control whether the section keyword is included in the output using the include_key parameter.

# Exclude section Keyword
extractor.extract_history(report_text, include_key=False)

'A 25-year-old female presents with headache. Physical examination reveals no focal neurological deficits.'

Match Strategy

The match_strategy parameter controls how “next section keyword” of each section boundaries are determined.

# Sample report with ambiguous sections
report_reversed = """
HISTORY: Patient with cough 2 days ago
IMPRESSION: Unremarkable
FINDINGS: Normal findings
"""

“greedy” strategy (default)

This stretegy mark the ending of each section by reaching any of the “next section keywords”.

extractor.extract_history(report_reversed, match_strategy="greedy")

'HISTORY: Patient with cough 2 days ago'

‘sequential’ strategy

This stretegy mark the ending of each section by checking each “next section keywords” list sequentially.

extractor.extract_history(report_reversed, match_strategy="sequential")

'HISTORY: Patient with cough 2 days ago\nIMPRESSION: Unremarkable'

Since the keywords for “KeyWord.FINDINGS” comes before that of KeyWord.IMPRESSION, the end of “history” section would terminate just before the “KeyWord.FINDINGS” matched.

Manually Configure Section Keywords

You can customize the RadReportExtractor() to match different report formats by providing your own section keywords. This is particularly useful when working with institution-specific report templates.

Here’s an example using a chest radiograph report format:

# Sample chest X-ray report
chest_xray_text = """
CHEST RADIOGRAPH

CLINICAL INFORMATION: 65-year-old with productive cough

PROCEDURE: PA and lateral chest radiograph

DESCRIPTION:
- Clear lung fields bilaterally
- Normal cardiac silhouette
- No pleural effusion

CONCLUSION:
1. Normal chest radiograph
"""

# Custom section keywords for chest X-ray reports
extractor_custom = RadReportExtractor(
    keys_history=["CLINICAL INFORMATION:", "INDICATION:"],
    keys_technique=["PROCEDURE:", "EXAMINATION:"],
    keys_findings=["FINDINGS:", "DESCRIPTION:"],
    keys_impression=["CONCLUSION:", "IMPRESSION:"]
)

# Extract sections using custom configuration
chest_report = extractor_custom.extract_all(chest_xray_text)
chest_report

RadReport(title='CHEST RADIOGRAPH', history='65-year-old with productive cough', technique='PA and lateral chest radiograph', comparison='', findings='- Clear lung fields bilaterally\n- Normal cardiac silhouette\n- No pleural effusion', impression='1. Normal chest radiograph')

chest_report.history

'65-year-old with productive cough'

chest_report.impression

'1. Normal chest radiograph'

Tip

When configuring custom keywords:

Use variations of section headers commonly seen in your reports
Order keywords from most to least specific for better matching
Remember that matching is case-insensitive by default

Change Regular Expression Backend

Extract Section

While the RadReportExtractor()class is designed specifically for radiology reports, radreportparser also provides a more generic text extraction functionality through the SectionExtractor class, which can extract text sections from any document using custom start and end markers.

Basic Usage

The SectionExtractor class requires:

start_keys: A list of possible starting keywords, each one represented by Python regular expression.
end_keys: A list of possible ending keywords (exclusive), each one represented by Python regular expression.

The SectionExtractor extract text from the start_keys and unitil (but not include) the end_keys.

Here’s a simple example with a radiology report:

from radreportparser import SectionExtractor

# Sample radiology text
rad_text = """
FINDINGS: Normal chest CT
IMPRESSION: No acute abnormality
"""

# Create extractor for findings section
findings_extractor = SectionExtractor(
    start_keys=["FINDINGS:"],
    end_keys=["IMPRESSION:"]
)

# Extract findings
findings = findings_extractor.extract(rad_text)
print(findings)

FINDINGS: Normal chest CT

Flexible Text Extraction

The class’s flexibility makes it useful for parsing various medical documents. Here is an example for pathology report

path_text = """
**SPECIMEN:** Right breast core biopsy

**GROSS DESCRIPTION:** Three cores of tan-white tissue
**MICROSCOPIC EXAMINATION:** 

- The specimen shows normal breast tissue with fibrous stroma

== DIAGNOSIS ==
- Benign breast tissue
- No evidence of malignancy
"""

# Create extractor for microscopic examination
micro_extractor = SectionExtractor(
    start_keys=[r"(:?\W*)MICROSCOPIC EXAMINATION(:?\W*)"],
    end_keys=[r"(:?\W*)DIAGNOSIS(:?\W*)"],
    include_start_keys = False
)
# Extract microscopic section
micro = micro_extractor.extract(path_text)
print(micro)

The specimen shows normal breast tissue with fibrous stroma

Note

The \W* regex pattern is used here to match zero or more non-word character (see re for python regular expression).
include_start_keys = False to exclude start_keys

Extract All Sections

extract_all() method can be used to extract one or more sections that any of start_keys matches.

conversation_text = """
Human: Hi

AI: Hello, How can I help you?

Human:
- What is 1+1
- Think step-by-step

AI: 
1 + 1 = 2
Answer: 2
"""

human_section_extractor = SectionExtractor(
    start_keys=["human:"],
    end_keys=["AI:"],
    include_start_keys = False
)

# Extract all section that match `start_keys`
human_section_extractor.extract_all(conversation_text)

['Hi', '- What is 1+1\n- Think step-by-step']

Advanced Features

Multiple Start/End Keys

You can provide multiple possible section markers:

# Multiple ways to mark findings sections
clinical_note = """
=== OBSERVATIONS ===
Patient appears well

=== ASSESSMENT ===
Normal exam
"""

clincal_extractor = SectionExtractor(
    start_keys=[r"(?:\W*)FINDINGS(?:\W*)", r"(?:\W*)OBSERVATIONS(?:\W*)"],
    end_keys=[r"(?:\W*)PLAN(?:\W*)", r"(?:\W*)ASSESSMENT(?:\W*)"],
    include_start_keys = False
)
clincal_extractor.extract(clinical_note)

'Patient appears well'

Word Boundaries

Use word_boundary=True for more precise matching:

text = """
FINDING Normal
IMPRESSION Abnormal
"""

# Without word boundaries
extractor_no_boundary = SectionExtractor(
    start_keys=["FINDING"],
    end_keys=["IMP"],
    word_boundary=False
)
extractor_no_boundary.extract(text)

'FINDING Normal'

# With word boundaries
extractor_with_boundary = SectionExtractor(
    start_keys=["FINDING"],
    end_keys=["IMP"],
    word_boundary=True
)
# No match for `end_keys` 
extractor_with_boundary.extract(text)

'FINDING Normal\nIMPRESSION Abnormal'

Match Strategy Control

Choose between ‘greedy’ and ‘sequential’ matching:

text = """
OBSERVATION: None 
FINDING: Unremarkable
ASSESSMENT: Stable
NOTES: Continue monitoring
"""

Greedy matching (default):

Section ends by matching any of the start_keys or end_keys

extractor_greedy = SectionExtractor(
    start_keys=["FINDING:", "OBSERVATION:"],
    end_keys=["NOTES:", "ASSESSMENT:"],
    match_strategy="greedy"
)
extractor_greedy.extract(text)

'OBSERVATION: None \nFINDING: Unremarkable'

Sequential matching:

Section ends by matching the start_keys or end_keys sequentially

extractor_sequential = SectionExtractor(
    start_keys=["FINDING:", "OBSERVATION:"],
    end_keys=["NOTES:", "ASSESSMENT:"],
    match_strategy="sequential"
)
extractor_sequential.extract(text)

'FINDING: Unremarkable\nASSESSMENT: Stable'

Footnotes

According to the experience of an author of this package.↩︎