SectionExtractor

SectionExtractor(
    self,
    start_keys: list[str] | None,
    end_keys: list[str] | None,
    include_start_keys: bool = True,
    word_boundary: bool = False,
    flags: Union[re.RegexFlag, int] = re.IGNORECASE,
    match_strategy: Literal['greedy', 'sequential'] = 'greedy',
    backend: Literal['re', 're2'] = 're',
)

Extract sections from text based on start and end keys.

This class extracts sections of text that begin with any of the start keys and end just before any of the end keys (i.e., the end keys are not included in the result).
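The start/end behavior described above can be approximated with the standard re module. The following is a minimal sketch, not the library's implementation: the helper name extract_between is hypothetical, and running to the end of the text when no end key is found is an assumption made for illustration.

```python
import re

def extract_between(text, start_keys, end_keys, include_start=True, flags=re.IGNORECASE):
    """Hypothetical helper: slice from the first start key up to the next end key."""
    start_pos = 0    # with no start keys, begin at the start of the text
    search_from = 0  # where the search for an end key begins
    if start_keys:
        start_re = re.compile("|".join(f"(?:{k})" for k in start_keys), flags)
        start_match = start_re.search(text)
        if start_match is None:
            return ""  # no start key matched, so there is no section
        start_pos = start_match.start() if include_start else start_match.end()
        search_from = start_match.end()
    end_pos = len(text)  # assumption: run to the end of the text if no end key matches
    if end_keys:
        end_re = re.compile("|".join(f"(?:{k})" for k in end_keys), flags)
        end_match = end_re.search(text, search_from)
        if end_match is not None:
            end_pos = end_match.start()  # stop just before the end key
    return text[start_pos:end_pos].strip()

report = "HISTORY: Cough. FINDINGS: Normal. IMPRESSION: Clear."
print(extract_between(report, ["FINDINGS:"], ["IMPRESSION:", "CONCLUSION:"]))
print(extract_between(report, ["FINDINGS:"], ["IMPRESSION:"], include_start=False))
print(extract_between(report, None, ["FINDINGS:"]))
```

The three calls illustrate the default case, the effect of dropping the start key, and extraction from the beginning of the text when no start keys are given.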

The match_strategy argument selects the strategy used to match both start and end keys: 'greedy' (the default) or 'sequential'.

Parameters

Name Type Description Default
start_keys list[str] | None List of possible section start markers, given as regular expressions. If None, the section is extracted from the beginning of the text. required
end_keys list[str] | None List of possible section end markers, given as regular expressions. The matched end key is not included in the extracted section. If None, the section is extracted until the end of the text. required
include_start_keys bool Whether to include the start key in the extracted section. Default is True. True
word_boundary bool Whether to wrap the keys in word boundaries. Default is False. False
flags Union[re.RegexFlag, int] Regex flags to use in pattern matching. For the 're' backend, these are passed directly to re.compile(). For the 're2' backend, these are converted to re2.Options properties. Default is re.IGNORECASE. re.IGNORECASE
match_strategy (greedy, sequential) Strategy for matching both start and end keys. "greedy"
backend (re, re2) Regex backend to use: 're' is the standard Python regex engine (default); 're2' is Google's RE2 engine (must be installed). "re"
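To make the word_boundary and flags parameters concrete, here is a small sketch using only the standard re module. The helper compile_key is hypothetical, and wrapping a key as \b(?:key)\b is an assumption about what "word boundary" means here, not a statement of how the library builds its patterns.

```python
import re

def compile_key(key, word_boundary=False, flags=re.IGNORECASE):
    """Hypothetical helper: optionally wrap a key in word-boundary anchors, then compile it."""
    pattern = rf"\b(?:{key})\b" if word_boundary else key
    return re.compile(pattern, flags)

text = "No pneumothorax. Preimpression notes. Impression: clear."

plain = compile_key("impression")
bounded = compile_key("impression", word_boundary=True)

# Without boundaries, the key first matches inside "Preimpression".
print(plain.search(text).start())
# With boundaries, only the standalone word "Impression" matches.
print(bounded.search(text).start())

# Flags can be combined with the bitwise OR operator before being passed on.
combined = compile_key("^impression", flags=re.IGNORECASE | re.MULTILINE)
print(bool(combined.search("Findings: ok\nIMPRESSION: clear")))
```

Under this sketch's assumption, a key that ends in punctuation, such as "FINDINGS:", would match only when the colon is followed directly by a word character, which is one reason to leave boundary wrapping off for header-style keys.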

Examples

from radreportparser import SectionExtractor
# Create an extractor for finding text between headers
extractor = SectionExtractor(
    start_keys=["FINDINGS:"],
    end_keys=["IMPRESSION:", "CONCLUSION:"]
)
print(extractor)
SectionExtractor(start_keys=['FINDINGS:'], end_keys=['IMPRESSION:', 'CONCLUSION:'], include_start_keys=True, word_boundary=False, flags=re.IGNORECASE, match_strategy='greedy', backend='re')

Methods

Name Description
extract Extract a section from the text using configured patterns.
extract_all Extract all sections from the text that match the configured patterns.

extract

SectionExtractor.extract(text: str, verbose: bool = True)

Extract a section from the text using configured patterns.

Extract a section from text if any of the start_keys matches. If the start_keys match at more than one position in text, the section starting at the first match is returned. The matching strategy is controlled by the match_strategy argument passed to SectionExtractor().

Parameters

Name Type Description Default
text str The input text to extract section from. required
verbose bool If True and the start_keys match at more than one position in text, print a message to standard output. True

Returns

Name Type Description
str The extracted section text. Returns an empty string if no section is found.

Examples

# Create an extractor for finding text
from radreportparser import SectionExtractor
extractor = SectionExtractor(
    start_keys=["FINDINGS:"],
    end_keys=["IMPRESSION:"]
)
# Extract section from text
text = "FINDINGS: Normal. IMPRESSION: Clear."
section = extractor.extract(text)
print(section)
FINDINGS: Normal.

extract_all

SectionExtractor.extract_all(text: str)

Extract all sections from the text that match the configured patterns.

Extract one or more sections from text if any of the start_keys matches. The matching strategy is controlled by the match_strategy argument passed to SectionExtractor().

Parameters

Name Type Description Default
text str The input text to extract sections from. required

Returns

Name Type Description
List[str] List of extracted section texts. Returns an empty list if no sections are found.

Examples

# Create an extractor for finding text
from radreportparser import SectionExtractor
extractor = SectionExtractor(
    start_keys=["FINDING:"],
    end_keys=["IMPRESSION:"]
)
text = '''
FINDING: First observation
IMPRESSION: OK
FINDING: Second observation
IMPRESSION: Also OK
'''
sections = extractor.extract_all(text)
print(sections)
['FINDING: First observation', 'FINDING: Second observation']
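For readers who want to see the repeated-section logic spelled out, the example above can be reproduced with a plain re loop. This is a rough sketch under stated assumptions, not the library's code: extract_all_between is a hypothetical helper, it takes a single start and end pattern, and it treats a missing end key as "run to the end of the text".

```python
import re

def extract_all_between(text, start_key, end_key, flags=re.IGNORECASE):
    """Hypothetical helper: collect every span from a start key up to the next end key."""
    start_re = re.compile(start_key, flags)
    end_re = re.compile(end_key, flags)
    sections = []
    pos = 0
    while True:
        start_match = start_re.search(text, pos)
        if start_match is None:
            break  # no further start key in the remaining text
        end_match = end_re.search(text, start_match.end())
        # Assumption: without an end key, the section runs to the end of the text.
        end_pos = end_match.start() if end_match else len(text)
        sections.append(text[start_match.start():end_pos].strip())
        if end_match is None:
            break
        pos = end_match.end()  # resume scanning after the end key
    return sections

text = """
FINDING: First observation
IMPRESSION: OK
FINDING: Second observation
IMPRESSION: Also OK
"""
print(extract_all_between(text, "FINDING:", "IMPRESSION:"))
```

Resuming the scan after each end key keeps the extracted sections from overlapping.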