sherpa_ai.output_parsers package#

Submodules#

sherpa_ai.output_parsers.base module#

class sherpa_ai.output_parsers.base.BaseOutputParser[source]#

Bases: ABC

Abstract base class for output parsers.

Defines the interface for parsing output text.

parse_output(text: str) -> str#

This method should be implemented by subclasses to parse the input text.

abstract parse_output(**kwargs) str[source]#

Abstract method to be implemented by subclasses for parsing output text.

Parameters:

text (str) – The input text to be parsed.

Returns:

The parsed output text.

Return type:

str
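As an illustration, a minimal subclass might look like the following sketch. The abstract base here is a stand-in for the documented interface, and UppercaseParser is a hypothetical example, not part of the package:

```python
from abc import ABC, abstractmethod


class BaseOutputParser(ABC):
    """Stand-in for sherpa_ai.output_parsers.base.BaseOutputParser."""

    @abstractmethod
    def parse_output(self, text: str, **kwargs) -> str:
        """Parse the input text and return the parsed result."""


class UppercaseParser(BaseOutputParser):
    """Hypothetical subclass: returns the input text upper-cased."""

    def parse_output(self, text: str, **kwargs) -> str:
        return text.upper()


parser = UppercaseParser()
print(parser.parse_output("hello"))  # HELLO
```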

class sherpa_ai.output_parsers.base.BaseOutputProcessor[source]#

Bases: ABC

Abstract base class for output processors.

Defines the interface for processing output text.

count#

Class variable tracking the number of failed validations.

Type:

int

process_output(text: str) -> Tuple[bool, str]#

This method should be implemented by subclasses to process the input text.

__call__(text: str) -> Tuple[bool, str]#

A convenient shorthand for process_output; it calls process_output and returns the result.

count: int = 0#
abstract process_output(text: str, **kwargs) ValidationResult[source]#

Abstract method to be implemented by subclasses for processing output text.

Parameters:

text (str) – The input text to be processed.

Returns:

The result of the processing, including the validity status, the processed text, and optional feedback.

Return type:

ValidationResult

reset_state()[source]#
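A minimal sketch of the processor interface, assuming the simplified Tuple[bool, str] return described above (the real methods return a ValidationResult). NonEmptyProcessor is a hypothetical example, not part of the package:

```python
from abc import ABC, abstractmethod
from typing import Tuple


class BaseOutputProcessor(ABC):
    """Stand-in for sherpa_ai.output_parsers.base.BaseOutputProcessor."""

    count: int = 0  # number of failed validations

    @abstractmethod
    def process_output(self, text: str, **kwargs) -> Tuple[bool, str]:
        """Process the input text, returning (is_valid, processed_text)."""

    def __call__(self, text: str, **kwargs) -> Tuple[bool, str]:
        # Convenient shorthand for process_output.
        return self.process_output(text, **kwargs)

    def reset_state(self) -> None:
        self.count = 0


class NonEmptyProcessor(BaseOutputProcessor):
    """Hypothetical processor: rejects empty or whitespace-only text."""

    def process_output(self, text: str, **kwargs) -> Tuple[bool, str]:
        if not text.strip():
            self.count += 1
            return False, text
        return True, text


proc = NonEmptyProcessor()
print(proc("some output"))  # (True, 'some output')
```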

sherpa_ai.output_parsers.citation_validation module#

class sherpa_ai.output_parsers.citation_validation.CitationValidation(sequence_threshold=0.7, jaccard_threshold=0.7, token_overlap=0.7)[source]#

Bases: BaseOutputProcessor

A class for adding citations to generated text based on a list of resources.

This class inherits from the abstract class BaseOutputProcessor and provides methods to add citations to each sentence in the generated text, based on reference texts and links provided in the resources list.

sequence_threshold#

Threshold for the ratio of the longest common subsequence to the text length. Default is 0.7.

Type:

float

jaccard_threshold#

Jaccard similarity threshold. Default is 0.7.

Type:

float

token_overlap#

Token overlap threshold. Default is 0.7.

Type:

float

Typical usage example: `python citation_parser = CitationValidation(sequence_threshold=0.7, jaccard_threshold=0.7, token_overlap=0.7) result = citation_parser.add_citations(generated_text, list_of_resources) `

add_citation_to_sentence(sentence: str, resources: list[ActionResource])[source]#

Uses a list of resources to add citations to a sentence.

Returns:

A pair (citation_ids, citation_links): a list of citation identifiers and a list of citation links (URLs).

Return type:

tuple

add_citations(text: str, resources: list[dict]) ValidationResult[source]#
calculate_token_overlap(sentence1, sentence2) tuple[source]#

Calculates the percentage of token overlap between two sentences.

Tokenizes the input sentences and calculates the percentage of token overlap by finding the intersection of the token sets and dividing it by the length of each sentence’s token set.

Parameters:
  • sentence1 (str) – The first sentence for token overlap calculation.

  • sentence2 (str) – The second sentence for token overlap calculation.

Returns:

A tuple containing two float values representing the percentage of token overlap for sentence1 and sentence2, respectively.

Return type:

tuple
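A sketch of the overlap computation described above, assuming simple whitespace tokenization (the library's tokenizer may differ):

```python
def calculate_token_overlap(sentence1: str, sentence2: str) -> tuple:
    """Percentage of overlapping tokens relative to each sentence's token set."""
    tokens1 = set(sentence1.lower().split())
    tokens2 = set(sentence2.lower().split())
    if not tokens1 or not tokens2:
        return 0.0, 0.0
    common = tokens1 & tokens2
    # Divide the intersection size by each sentence's own token-set size.
    return len(common) / len(tokens1), len(common) / len(tokens2)


overlap1, overlap2 = calculate_token_overlap("the cat sat", "the cat ran home")
print(overlap1, overlap2)  # 2/3 of sentence1's tokens, 1/2 of sentence2's tokens
```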

flatten_nested_list(nested_list: list[list[str]]) list[str][source]#

Flattens a nested list of strings into a single list of strings.

Parameters:

nested_list (list[list[str]]) – The nested list of strings to be flattened.

Returns:

A flat list containing all non-empty strings from the nested list.

Return type:

list[str]

format_sentence_with_citations(sentence, ids, links)[source]#

Appends citations to sentence

get_failure_message() str[source]#
jaccard_index(sentence1, sentence2) float[source]#

Calculates the Jaccard index between two sentences.

The Jaccard index is a measure of similarity between two sets, defined as the size of the intersection divided by the size of the union of the sets.

Parameters:
  • sentence1 (str) – The first sentence for Jaccard index calculation.

  • sentence2 (str) – The second sentence for Jaccard index calculation.

Returns:

The Jaccard index representing the similarity between the two sentences.

Return type:

float
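The definition above can be sketched directly on token sets; whitespace tokenization here is an assumption, not necessarily what the library uses:

```python
def jaccard_index(sentence1: str, sentence2: str) -> float:
    """Jaccard index: |intersection| / |union| of the two token sets."""
    a = set(sentence1.lower().split())
    b = set(sentence2.lower().split())
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


print(jaccard_index("the cat sat", "the cat ran"))  # 0.5
```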

longest_common_subsequence(text1: str, text2: str) int[source]#

Calculates the length of the longest common subsequence between two texts.

A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters.

Parameters:
  • text1 (str) – The first text for calculating the longest common subsequence.

  • text2 (str) – The second text for calculating the longest common subsequence.

Returns:

The length of the longest common subsequence between the two texts.

Return type:

int
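The LCS length can be computed with the classic dynamic-programming recurrence; this is a standard textbook sketch, not necessarily the package's implementation:

```python
def longest_common_subsequence(text1: str, text2: str) -> int:
    """O(len(text1) * len(text2)) dynamic-programming LCS length."""
    m, n = len(text1), len(text2)
    # dp[i][j] = LCS length of text1[:i] and text2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i - 1] == text2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


print(longest_common_subsequence("abcde", "ace"))  # 3
```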

process_output(text: str, belief: Belief, **kwargs) ValidationResult[source]#

Add citations to sentences in the generated text using resources, based on a fact-checking model.

Parameters:
  • text (str) – The text which needs citations/references added.

  • belief (Belief) – Belief of the agent that generated the text.

Returns:

The result of citation processing. is_valid is True when citation processing succeeds or no citation resources are provided, False otherwise. result contains the formatted text with citations. feedback provides additional optional information.

Return type:

ValidationResult

Typical usage example:

`python resources = [ActionResource(source="http://example.com/source1", content="Some reference text.")] citation_parser = CitationValidation() result = citation_parser.add_citations("Text needing citations.", resources) `

resources_from_belief(belief: Belief) list[ActionResource][source]#

Returns a list of all resources within belief.actions.

split_paragraph_into_sentences(paragraph: str) list[str][source]#

Uses NLTK’s sent_tokenize to split the given paragraph into a list of sentences.

Parameters:

paragraph (str) – The input paragraph to be tokenized into sentences.

Returns:

A list of sentences extracted from the input paragraph.

Return type:

list[str]

sherpa_ai.output_parsers.md_to_slack_parse module#

Post-processors for outputs from the LLM.

class sherpa_ai.output_parsers.md_to_slack_parse.MDToSlackParse[source]#

Bases: BaseOutputParser

A post-processor for converting Markdown links to Slack-compatible format.

This class inherits from the BaseOutputParser and provides a method to parse and convert Markdown-style links to Slack-compatible format in the input text.

Attributes: - pattern (str): Regular expression pattern for identifying Markdown links.

Methods: - parse_output(text: str) -> str:

Parses and converts Markdown links to Slack-compatible format in the input text.

Example Usage: `python md_to_slack_parser = MDToSlackParse() result = md_to_slack_parser.parse_output("Check out [this link](http://example.com)!") `

parse_output(text: str) str[source]#

Parses the input text and replaces Markdown-style links with Slack-compatible links.

Parameters:
  • text (str) – The input text containing Markdown-style links.

Returns:

The modified text with Markdown links replaced by Slack-compatible links.

Return type:

str
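The conversion can be sketched with a single regular-expression substitution, turning Markdown `[text](url)` into Slack's `<url|text>` format. The pattern below is an assumption about how the class might match links, not the package's actual pattern:

```python
import re

# Matches Markdown links: [link text](url)
MD_LINK_PATTERN = r"\[([^\]]+)\]\(([^)]+)\)"


def md_links_to_slack(text: str) -> str:
    """Replace Markdown-style links with Slack-style <url|text> links."""
    return re.sub(MD_LINK_PATTERN, r"<\2|\1>", text)


print(md_links_to_slack("Check out [this link](http://example.com)!"))
# Check out <http://example.com|this link>!
```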

sherpa_ai.output_parsers.number_validation module#

class sherpa_ai.output_parsers.number_validation.NumberValidation[source]#

Bases: BaseOutputProcessor

Validates the presence or absence of numerical information in a given piece of text.

Typical usage example:

`python number_validator = NumberValidation(source="document") result = number_validator.process_output("The document contains important numbers: 123, 456.") `

get_failure_message() str[source]#
process_output(text: str, belief: Belief, **kwargs) ValidationResult[source]#

Verifies that all numbers within text exist in the belief source text.

Parameters:
  • text – The text to be analyzed

  • belief – Belief of the Agent that generated text

Returns:

The result of number validation. If any number in the text to be processed doesn’t exist in the source text, validation is invalid and contains a feedback string. Otherwise validation is valid.

Return type:

ValidationResult
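The check described above can be sketched as follows, assuming a simple regex for integers and decimals; the library's number extraction may be more elaborate, and this helper is hypothetical:

```python
import re

NUMBER_PATTERN = r"\d+(?:\.\d+)?"


def numbers_exist_in_source(text: str, source: str) -> bool:
    """Return True if every number found in `text` also appears in `source`."""
    source_numbers = set(re.findall(NUMBER_PATTERN, source))
    return all(num in source_numbers for num in re.findall(NUMBER_PATTERN, text))


print(numbers_exist_in_source("Revenue grew 12.5 percent", "Q3 revenue was up 12.5"))  # True
print(numbers_exist_in_source("Revenue grew 13 percent", "Q3 revenue was up 12.5"))    # False
```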

sherpa_ai.output_parsers.validation_result module#

class sherpa_ai.output_parsers.validation_result.ValidationResult(*, is_valid: bool, result: str, feedback: str = '')[source]#

Bases: BaseModel

Represents the result of validating a string of content.

is_valid#

Indicates whether the validation result is valid (True) or not (False).

Type:

bool

result#

The output of the validation process. A string of validated content.

Type:

str

feedback#

Additional feedback or information about the validation result. Default is an empty string.

Type:

str, optional

Example Usage: `python validation_result = ValidationResult(is_valid=True, result="Some validated text", feedback="No issues found.") `

feedback: str#
is_valid: bool#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

result: str#

Module contents#

class sherpa_ai.output_parsers.CitationValidation(sequence_threshold=0.7, jaccard_threshold=0.7, token_overlap=0.7)[source]#

Bases: BaseOutputProcessor

A class for adding citations to generated text based on a list of resources.

This class inherits from the abstract class BaseOutputProcessor and provides methods to add citations to each sentence in the generated text, based on reference texts and links provided in the resources list.

sequence_threshold#

Threshold for the ratio of the longest common subsequence to the text length. Default is 0.7.

Type:

float

jaccard_threshold#

Jaccard similarity threshold. Default is 0.7.

Type:

float

token_overlap#

Token overlap threshold. Default is 0.7.

Type:

float

Typical usage example: `python citation_parser = CitationValidation(sequence_threshold=0.7, jaccard_threshold=0.7, token_overlap=0.7) result = citation_parser.add_citations(generated_text, list_of_resources) `

add_citation_to_sentence(sentence: str, resources: list[ActionResource])[source]#

Uses a list of resources to add citations to a sentence.

Returns:

A pair (citation_ids, citation_links): a list of citation identifiers and a list of citation links (URLs).

Return type:

tuple

add_citations(text: str, resources: list[dict]) ValidationResult[source]#
calculate_token_overlap(sentence1, sentence2) tuple[source]#

Calculates the percentage of token overlap between two sentences.

Tokenizes the input sentences and calculates the percentage of token overlap by finding the intersection of the token sets and dividing it by the length of each sentence’s token set.

Parameters:
  • sentence1 (str) – The first sentence for token overlap calculation.

  • sentence2 (str) – The second sentence for token overlap calculation.

Returns:

A tuple containing two float values representing the percentage of token overlap for sentence1 and sentence2, respectively.

Return type:

tuple

flatten_nested_list(nested_list: list[list[str]]) list[str][source]#

Flattens a nested list of strings into a single list of strings.

Parameters:

nested_list (list[list[str]]) – The nested list of strings to be flattened.

Returns:

A flat list containing all non-empty strings from the nested list.

Return type:

list[str]

format_sentence_with_citations(sentence, ids, links)[source]#

Appends citations to sentence

get_failure_message() str[source]#
jaccard_index(sentence1, sentence2) float[source]#

Calculates the Jaccard index between two sentences.

The Jaccard index is a measure of similarity between two sets, defined as the size of the intersection divided by the size of the union of the sets.

Parameters:
  • sentence1 (str) – The first sentence for Jaccard index calculation.

  • sentence2 (str) – The second sentence for Jaccard index calculation.

Returns:

The Jaccard index representing the similarity between the two sentences.

Return type:

float

longest_common_subsequence(text1: str, text2: str) int[source]#

Calculates the length of the longest common subsequence between two texts.

A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters.

Parameters:
  • text1 (str) – The first text for calculating the longest common subsequence.

  • text2 (str) – The second text for calculating the longest common subsequence.

Returns:

The length of the longest common subsequence between the two texts.

Return type:

int

process_output(text: str, belief: Belief, **kwargs) ValidationResult[source]#

Add citations to sentences in the generated text using resources, based on a fact-checking model.

Parameters:
  • text (str) – The text which needs citations/references added.

  • belief (Belief) – Belief of the agent that generated the text.

Returns:

The result of citation processing. is_valid is True when citation processing succeeds or no citation resources are provided, False otherwise. result contains the formatted text with citations. feedback provides additional optional information.

Return type:

ValidationResult

Typical usage example:

`python resources = [ActionResource(source="http://example.com/source1", content="Some reference text.")] citation_parser = CitationValidation() result = citation_parser.add_citations("Text needing citations.", resources) `

resources_from_belief(belief: Belief) list[ActionResource][source]#

Returns a list of all resources within belief.actions.

split_paragraph_into_sentences(paragraph: str) list[str][source]#

Uses NLTK’s sent_tokenize to split the given paragraph into a list of sentences.

Parameters:

paragraph (str) – The input paragraph to be tokenized into sentences.

Returns:

A list of sentences extracted from the input paragraph.

Return type:

list[str]

class sherpa_ai.output_parsers.EntityValidation[source]#

Bases: BaseOutputProcessor

Process and validate the presence of entities in the generated text.

This class inherits from the BaseOutputProcessor and provides a method to process the generated text and validate the presence of entities based on a specified source.

Methods: - process_output(text: str, belief: Belief) -> ValidationResult:

Process the generated text and validate the presence of entities.

  • get_failure_message() -> str:

    Returns a failure message to be displayed when the validation fails.

check_entities_match(result: str, source: str, stage: TextSimilarityMethod, llm: BaseLanguageModel)[source]#

Check if entities extracted from a question are present in an answer.

Parameters:
  • result (str) – Answer text.

  • source (str) – Question text.

  • stage (TextSimilarityMethod) – Stage of the check (0, 1, or 2).

Returns:

Result of the check.

Return type:

dict

get_failure_message() str[source]#
process_output(text: str, belief: Belief, llm: BaseLanguageModel = None, **kwargs) ValidationResult[source]#

Verifies that entities within text exist in the belief source text.

Parameters:
  • text – The text to be processed.

  • belief – The belief object of the agent that generated the output.

  • iteration_count – The iteration count for validation processing: 0 selects basic text similarity, 1 selects text similarity by metrics, and any other value selects text similarity by LLM.

Returns:

The result of the validation. If any entity in the text to be processed doesn’t exist in the source text, validation is invalid and contains a feedback string. Otherwise, validation is valid.

Return type:

ValidationResult

similarity_picker(value: int)[source]#

Picks a text similarity state based on the provided iteration count value.

Parameters:

value (int) – The iteration count value used to determine the text similarity state. - 0: Use BASIC text similarity. - 1: Use text similarity BY_METRICS. - Default: Use text similarity BY_LLM.

Returns:

The selected text similarity state.

Return type:

TextSimilarityState
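The selection logic above can be sketched with a small enum; the member names and values here are assumptions based on the documented mapping, not the package's actual definitions:

```python
from enum import Enum


class TextSimilarityState(Enum):
    BASIC = 0
    BY_METRICS = 1
    BY_LLM = 2


def similarity_picker(value: int) -> TextSimilarityState:
    """Map an iteration count to a similarity state: 0 -> BASIC,
    1 -> BY_METRICS, anything else -> BY_LLM."""
    if value == 0:
        return TextSimilarityState.BASIC
    if value == 1:
        return TextSimilarityState.BY_METRICS
    return TextSimilarityState.BY_LLM


print(similarity_picker(5))  # TextSimilarityState.BY_LLM
```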

class sherpa_ai.output_parsers.LinkParser[source]#

Bases: BaseOutputParser

A class for parsing and modifying links in text using specified patterns.

This class inherits from the abstract class BaseOutputParser and provides methods to parse and modify links in the input text. It includes functionality to replace links with symbols and symbols with links based on predefined patterns.

Attributes:
  • links (list) – A list to store unique links encountered during parsing.

  • link_to_id (dict) – A dictionary mapping links to their corresponding symbols.

  • count (int) – Counter for generating unique symbols for new links.

  • output_counter (int) – Counter for reindexing output.

  • reindex_mapping (dict) – A mapping of original document IDs to reindexed IDs.

  • url_pattern (str) – Regular expression pattern for identifying links in the input text.

  • doc_id_pattern (str) – Regular expression pattern for identifying document IDs in the input text.

  • link_symbol (str) – Format string for representing link symbols.

Methods: - parse_output(text: str, tool_output: bool = False) -> str:

Parses and modifies links in the input text based on the specified patterns.

parse_output(text: str, tool_output=False) str[source]#

Parses and modifies links in the input text based on the specified patterns.

Args: - text (str): The input text containing links or symbols to be parsed. - tool_output (bool): A flag indicating whether the input text is tool-generated. Default is False.

Returns: - str: The modified text with links replaced by symbols or symbols replaced by links.

class sherpa_ai.output_parsers.MDToSlackParse[source]#

Bases: BaseOutputParser

A post-processor for converting Markdown links to Slack-compatible format.

This class inherits from the BaseOutputParser and provides a method to parse and convert Markdown-style links to Slack-compatible format in the input text.

Attributes: - pattern (str): Regular expression pattern for identifying Markdown links.

Methods: - parse_output(text: str) -> str:

Parses and converts Markdown links to Slack-compatible format in the input text.

Example Usage: `python md_to_slack_parser = MDToSlackParse() result = md_to_slack_parser.parse_output("Check out [this link](http://example.com)!") `

parse_output(text: str) str[source]#

Parses the input text and replaces Markdown-style links with Slack-compatible links.

Parameters:
  • text (str) – The input text containing Markdown-style links.

Returns:

The modified text with Markdown links replaced by Slack-compatible links.

Return type:

str

class sherpa_ai.output_parsers.NumberValidation[source]#

Bases: BaseOutputProcessor

Validates the presence or absence of numerical information in a given piece of text.

Typical usage example:

`python number_validator = NumberValidation(source="document") result = number_validator.process_output("The document contains important numbers: 123, 456.") `

get_failure_message() str[source]#
process_output(text: str, belief: Belief, **kwargs) ValidationResult[source]#

Verifies that all numbers within text exist in the belief source text.

Parameters:
  • text – The text to be analyzed

  • belief – Belief of the Agent that generated text

Returns:

The result of number validation. If any number in the text to be processed doesn’t exist in the source text, validation is invalid and contains a feedback string. Otherwise validation is valid.

Return type:

ValidationResult