3. Add Citation to Sherpa Response#

Note

This tutorial assumes that you have already finished the Create a PDF Reader with Sherpa tutorial.

In this tutorial, we will add a citation to the Sherpa response. This is a great way to validate the response from Sherpa and to give credit to the source of the data.

3.2. Add Citation to Customized Actions#

The above example shows how to add citations to the Google search action. However, sometimes we may also want to add citations to the responses from the document search action. In this case, we need to add the citations to the response manually.

The DocumentSearch action currently inherits from the BaseAction class. To attach citations to the response, it needs to inherit from the BaseRetrievalAction class instead, which provides an add_resources method. The add_resources method takes a list of dictionaries, where each dictionary should contain the following keys (see the sketch after the list):

  • Document: Content of the resource; in this case, a chunk of the document.

  • Source: Source of the resource, such as a URL or paragraph number; in this case, the chunk id.
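
For illustration, here is a minimal sketch of the expected argument; the chunk texts are placeholders, and doc_search stands for a hypothetical DocumentSearch instance:

resources = [
    {"Document": "Text of the first chunk...", "Source": "chunk_0"},
    {"Document": "Text of the second chunk...", "Source": "chunk_1"},
]
# doc_search is a hypothetical DocumentSearch instance
doc_search.add_resources(resources)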

To include citations in the response, let's first add the source of each chunk of the document to its metadata. For this, we modify the __init__ method of the DocumentSearch action, and we change its parent class from BaseAction to BaseRetrievalAction so that we can use the add_resources method.

# New optional import if you want to save the resources to a file
import json
from sherpa_ai.actions.base import BaseRetrievalAction
# End of the new optional import

class DocumentSearch(BaseRetrievalAction):  # Note the parent class is now BaseRetrievalAction
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        # load the pdf and create the vector store
        self._chroma = Chroma(embedding_function=self.embedding_function)
        documents = PDFMinerLoader(self.filename).load()
        documents = SentenceTransformersTokenTextSplitter(chunk_overlap=0).split_documents(documents)

        documents_to_save = []
        # This is the new code to add the source to the metadata
        for i, document in enumerate(documents):
            document.metadata["chunk_id"] = f"chunk_{i}"
            documents_to_save.append(
                {
                    "Document": document.page_content,
                    "Source": document.metadata["chunk_id"],
                }
            )

        with open("resources.json", "w") as f:
            json.dump(documents_to_save, f)
        # End of the new code

        logger.info(f"Adding {len(documents)} documents to the vector store")
        self._chroma.add_documents(documents)
        logger.info("Finished adding documents to the vector store")

In the above code, we also save the resources to a file called resources.json. This step is optional, but it is helpful: you can later use a cited chunk id to look up the source of the citation.
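
For example, you can look up the text behind a cited chunk id with a few lines of Python (a minimal sketch, assuming resources.json was written as above and chunk_5 is one of the cited chunk ids):

import json

# Load the saved resources and index them by chunk id
with open("resources.json") as f:
    resources = json.load(f)

by_source = {r["Source"]: r["Document"] for r in resources}
# Print the first 200 characters of the chunk cited as chunk_5
print(by_source["chunk_5"][:200])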

Next, when we execute the search, we add the resources using the add_resources method so that the CitationValidation can later be aware of these resources.

def execute(self, query):
    """
    Execute the action by searching the document store for the query

    Args:
        query (str): The query to search for

    Returns:
        str: The search results combined into a single string
    """

    results = self._chroma.search(query, search_type="mmr", k=self.k)

    # This is the new code to add the resources
    resources = [
        {"Document": result.page_content, "Source": result.metadata["chunk_id"]}
        for result in results
    ]
    self.add_resources(resources)
    # End of the new code

    return "\n\n".join([result.page_content for result in results])

We are done! To test the citation validation, let's again remove the Google search action from the agent configuration and run the agent. Ask a question that triggers the document search action; the agent should return the response with citations.

qa_agent:
    ...
    actions:
        - ${doc_search}
        # - ${google_search}

Ask me a question: What is data leakage
2024-05-09 00:24:57.552 | INFO     | sherpa_ai.agents.base:run:70 - Action selected: ('DocumentSearch', {'query': 'What is data leakage'})
Data leakage refers to the potential for data to be unintentionally exposed or disclosed to unauthorized parties [1](doc:chunk_5), [3](doc:chunk_45). In the context provided, data leakage is discussed in relation to the presence of inter-dataset code duplication and the implications for the evaluation of language models in software engineering research [1](doc:chunk_5). It is highlighted as a potential threat that researchers need to consider when working with pre-training and fine-tuning datasets for language models [1](doc:chunk_5). By acknowledging the risk of data leakage due to code duplication, researchers can enhance the robustness of their evaluation methodologies and improve the validity of their results [1](doc:chunk_5).

Note

Check the resources.json file to see the source of the citation from the chunk ids.

3.3. Conclusion#

Finally, let's add both actions back to the agent configuration and run the agent to test the citation validation.

qa_agent:
    ...
    actions:
        - ${doc_search}
        - ${google_search}

Ask me a question: What is data leakage in machine learning
2024-05-09 00:28:18.724 | INFO     | sherpa_ai.agents.base:run:70 - Action selected: ('DocumentSearch', {'query': 'data leakage in machine learning'})
2024-05-09 00:28:19.878 | INFO     | sherpa_ai.agents.base:run:70 - Action selected: ('Google Search', {'query': 'What is data leakage in machine learning'})
Data leakage in machine learning occurs when the training data includes information about the target that will not be available during prediction [1](doc:chunk_12), [5](https://machinelearningmastery.com/data-leakage-machine-learning/). This can lead to the model performing well on the training set but poorly in production [1](doc:chunk_12), [2](doc:chunk_30), [3](doc:chunk_41), [5](https://machinelearningmastery.com/data-leakage-machine-learning/). Leakage can affect the evaluation of machine learning models, especially in scenarios involving pre-training and fine-tuning, as it poses a threat to the validity of the evaluations [1](doc:chunk_12).

Important

Currently, the citation output is in Markdown format: the first part is the id of the citation and the second part is the source of the citation. We will soon add the option to customize the citation output format.
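
In the meantime, if you need the citation ids and sources programmatically, you can parse them out of the response text yourself (a minimal sketch, not part of the Sherpa API; the response string is abridged from the output above):

import re

response = (
    "Data leakage in machine learning occurs when the training data includes "
    "information about the target [1](doc:chunk_12), "
    "[5](https://machinelearningmastery.com/data-leakage-machine-learning/)."
)
# Each citation looks like [id](source); capture both parts
citations = re.findall(r"\[(\d+)\]\(([^)]+)\)", response)
for citation_id, source in citations:
    print(f"citation {citation_id} -> {source}")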