sherpa_ai.scrape package#

Overview#

The scrape package provides utilities for extracting and processing information from external sources like files, websites, and repositories. These tools enable agents to gather relevant information from structured and unstructured data sources.

Key Components#

  • QuestionWithFileHandler: Handler for downloading and processing files attached to questions

  • extract_github_readme: Utilities for retrieving README content from GitHub repositories

  • PromptReconstructor: Tools for rebuilding prompts from URLs mentioned in Slack messages

Example Usage#

from sherpa_ai.scrape.extract_github_readme import (
    extract_github_readme,
    get_owner_and_repo,
)
from sherpa_ai.scrape.file_scraper import QuestionWithFileHandler

# Parse the owner and repository name from a GitHub URL
owner, repo = get_owner_and_repo("https://github.com/openai/gpt-3")
print(owner, repo)

# Extract README content from a GitHub repository (returns None on failure)
readme = extract_github_readme("https://github.com/openai/gpt-3")
if readme:
    print(readme[:100])

# Answer a question using an attached file (token, IDs, and llm come
# from your application context)
handler = QuestionWithFileHandler(
    question="What's in the document?",
    files=[{"id": "123", "filetype": "pdf"}],
    token="oauth_token",
    user_id="user123",
    team_id="team456",
    llm=language_model,
)
result = handler.reconstruct_prompt_with_file()

Submodules#

  • sherpa_ai.scrape.extract_github_readme: Utilities for retrieving and processing README files from GitHub repositories.

  • sherpa_ai.scrape.file_scraper: Tools for extracting and parsing content from local files in various formats.

  • sherpa_ai.scrape.prompt_reconstructor: Functionality for rebuilding and formatting prompts from extracted content.

sherpa_ai.scrape.extract_github_readme module#

GitHub README extraction module for Sherpa AI.

This module provides functionality for extracting and processing README files from GitHub repositories. It handles authentication, content extraction, and storage of README content in vector databases.

sherpa_ai.scrape.extract_github_readme.get_owner_and_repo(url)[source]#

Extract owner and repository name from GitHub URL.

This function parses a GitHub repository URL to extract the owner’s username and repository name.

Parameters:

url (str) – GitHub repository URL (e.g., ‘https://github.com/owner/repo’).

Returns:

A tuple containing:
  • owner (str): Repository owner’s username

  • repo (str): Repository name

Return type:

tuple[str, str]

Example

>>> url = "https://github.com/openai/gpt-3"
>>> owner, repo = get_owner_and_repo(url)
>>> print(owner, repo)
openai gpt-3
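
For illustration, a minimal sketch of how such parsing could be implemented with the standard library (not necessarily the module's own implementation):

import urllib.parse

def parse_owner_and_repo(url):
    # Hypothetical helper: "https://github.com/openai/gpt-3" has the
    # URL path "/openai/gpt-3"; its first two segments are owner and repo.
    segments = urllib.parse.urlparse(url).path.strip("/").split("/")
    return segments[0], segments[1]
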
sherpa_ai.scrape.extract_github_readme.extract_github_readme(repo_url)[source]#

Extract README content from a GitHub repository.

This function downloads and extracts the content of a repository’s README file (either .md or .rst). It also saves the content to a vector store for future reference.

Parameters:

repo_url (str) – GitHub repository URL.

Returns:

README content if found and successfully extracted, None otherwise.

Return type:

Optional[str]

Example

>>> url = "https://github.com/openai/gpt-3"
>>> content = extract_github_readme(url)
>>> if content:
...     print(content[:50])
# GPT-3: Language Models are Few-Shot Learners...
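
As a hedged sketch of the retrieval step, GitHub's public REST API can serve a repository's README directly; the module's actual endpoint and authentication handling may differ:

import requests

def fetch_readme(owner, repo, token=None):
    # GET /repos/{owner}/{repo}/readme returns the README; the Accept
    # header asks for the raw file body instead of JSON metadata.
    headers = {"Accept": "application/vnd.github.raw"}
    if token:  # optional: raises rate limits, enables private repos
        headers["Authorization"] = f"Bearer {token}"
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/readme",
        headers=headers,
    )
    return response.text if response.status_code == 200 else None
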
sherpa_ai.scrape.extract_github_readme.save_to_pine_cone(content, metadatas)[source]#

Save content to Pinecone vector store.

This function saves text content and associated metadata to a Pinecone vector store for efficient retrieval. It uses OpenAI embeddings for vectorization.

Parameters:
  • content (str) – Text content to be stored.

  • metadatas (list) – List of metadata dictionaries for the content.

Raises:

ImportError – If the pinecone-client package is not installed.

Example

>>> content = "# Project Documentation\nThis is a guide..."
>>> metadata = [{"type": "github", "url": "https://github.com/org/repo"}]
>>> save_to_pine_cone(content, metadata)  # Saves to vector store
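
A minimal sketch of such a save, assuming the classic LangChain Pinecone wrapper with OpenAI embeddings; the index name and credentials below are placeholders, not the module's actual configuration:

import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

def save_to_vector_store(content, metadatas):
    # Placeholder credentials; real values come from the environment.
    pinecone.init(api_key="YOUR_PINECONE_KEY", environment="YOUR_ENV")
    Pinecone.from_texts(
        texts=[content],
        embedding=OpenAIEmbeddings(),  # OpenAI embeddings for vectorization
        metadatas=metadatas,
        index_name="readme-index",  # hypothetical index name
    )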

sherpa_ai.scrape.file_scraper module#

File scraping and handling module for Sherpa AI.

This module provides functionality for downloading, processing, and analyzing files attached to questions. It handles various file types including PDF, text, markdown, HTML, and XML files.

class sherpa_ai.scrape.file_scraper.QuestionWithFileHandler(question, files, token, user_id, team_id, llm)[source]#

Bases: object

Handler for questions with attached files.

This class manages the process of downloading, processing, and analyzing files attached to questions. It supports various file types and handles token limits and content summarization.

question#

The user’s question to be answered.

Type:

str

token#

OAuth token for file access.

Type:

str

files#

List of file information dictionaries.

Type:

list

user_id#

ID of the user asking the question.

Type:

str
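
team_id#

ID of the team associated with the user.

Type:

str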

llm#

Language model for text processing.

Type:

Any

Example

>>> handler = QuestionWithFileHandler(
...     question="What's in the document?",
...     files=[{"id": "123", "filetype": "pdf"}],
...     token="oauth_token",
...     user_id="user123",
...     team_id="team456",
...     llm=language_model
... )
>>> result = handler.reconstruct_prompt_with_file()
>>> print(result["status"])
success
reconstruct_prompt_with_file()[source]#

Reconstruct the prompt using the attached file.

This method downloads the file, processes its content, and combines it with the original question to create a more informed prompt.

Returns:

A dictionary containing:
  • status (str): ‘success’ or ‘error’

  • data (str): Reconstructed prompt if successful

  • message (str): Error message if failed

Return type:

dict

Example

>>> result = handler.reconstruct_prompt_with_file()
>>> if result["status"] == "success":
...     print(result["data"])
Based on the PDF content...
download_file(file)[source]#

Download and extract content from a file.

This method downloads a file using its URL and extracts its content based on the file type. Supports PDF, text, markdown, HTML, and XML.

Parameters:

file (dict) – File information dictionary containing:
  • id (str): File identifier
  • mimetype (str): MIME type
  • url_private_download (str): Download URL
  • filetype (str): File extension

Returns:

A dictionary containing:
  • status (str): ‘success’ or ‘error’

  • data (str): File content if successful

  • message (str): Error message if failed

Return type:

dict

Example

>>> file_info = {
...     "id": "123",
...     "filetype": "pdf",
...     "url_private_download": "https://example.com/doc.pdf"
... }
>>> result = handler.download_file(file_info)
>>> if result["status"] == "success":
...     print(len(result["data"]))
1024
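
For illustration, the extraction step typically dispatches on the file type; a hedged sketch using common third-party parsers (the module's own parser choices may differ):

import io

def extract_text(raw_bytes, filetype):
    # Hypothetical dispatcher covering the file types listed above.
    if filetype == "pdf":
        from pdfminer.high_level import extract_text as pdf_extract
        return pdf_extract(io.BytesIO(raw_bytes))
    if filetype in ("txt", "text", "md", "markdown"):
        return raw_bytes.decode("utf-8")
    if filetype in ("html", "xml"):
        from bs4 import BeautifulSoup
        return BeautifulSoup(raw_bytes, "html.parser").get_text()
    raise ValueError(f"Unsupported filetype: {filetype}")
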
prompt_reconstruct(file_info, data)[source]#

Reconstruct the prompt with file content.

This method processes the file content, handles token limits, and combines the content with the original question to create an enhanced prompt.

Parameters:
  • file_info (dict) – File information dictionary containing:
      • filetype (str): File extension
      • name (str): File name
      • title (str): File title

  • data (str) – Content of the file.

Returns:

A dictionary containing:
  • status (str): ‘success’ or ‘error’

  • data (str): Reconstructed prompt if successful

  • message (str): Error message if failed

Return type:

dict

Example

>>> file_info = {
...     "filetype": "pdf",
...     "name": "document.pdf",
...     "title": "Important Doc"
... }
>>> result = handler.prompt_reconstruct(file_info, "content...")
>>> if result["status"] == "success":
...     print(result["data"])
Based on the PDF "Important Doc"...
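
The token-limit handling can be sketched with a tokenizer such as tiktoken; the limit below is an assumed placeholder, and the summarization fallback is left abstract:

import tiktoken

def within_token_limit(text, limit=3000):
    # Count tokens the way OpenAI chat models do; if the file content
    # exceeds the limit it should be summarized before prompting.
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text)) <= limit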

sherpa_ai.scrape.prompt_reconstructor module#

Prompt reconstruction module for Sherpa AI.

This module provides functionality for reconstructing prompts by incorporating content from URLs mentioned in Slack messages. It handles scraping, summarizing, and integrating web content into questions.
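
As a hedged sketch of the URL-extraction step: Slack normally wraps links in angle brackets, optionally with a "|label" suffix, so a regular expression can recover them (illustrative only, not the module's exact pattern):

import re

def extract_urls(slack_text):
    # Matches <https://example.com> and <https://example.com|label>,
    # capturing only the URL before any "|label" suffix.
    return re.findall(r"<(https?://[^|>]+)[^>]*>", slack_text)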

class sherpa_ai.scrape.prompt_reconstructor.PromptReconstructor(question, slack_message, llm)[source]#

Bases: object

Prompt reconstructor for URL-enhanced questions.

This class handles the process of enhancing questions by incorporating content from URLs mentioned in Slack messages. It scrapes the URLs, summarizes their content, and integrates the summaries into the original question.

question#

The original question to be enhanced.

Type:

str

slack_message#

Slack message containing URLs.

Type:

dict

llm#

Language model for text processing.

Type:

Any

Example

>>> reconstructor = PromptReconstructor(
...     question="How does this library work?",
...     slack_message={"text": "Check https://github.com/org/repo"},
...     llm=language_model
... )
>>> enhanced = reconstructor.reconstruct_prompt()
>>> print(enhanced)
Based on the GitHub repo...
reconstruct_prompt()[source]#

Reconstruct the prompt by incorporating URL content.

This method extracts URLs from the Slack message, scrapes their content, summarizes it while respecting token limits, and integrates the summaries into the original question.

Returns:

The enhanced question incorporating URL content summaries.

Return type:

str

Example

>>> reconstructor = PromptReconstructor(
...     question="How to use this?",
...     slack_message={"text": "See docs at https://docs.com"},
...     llm=language_model
... )
>>> enhanced = reconstructor.reconstruct_prompt()
>>> print(enhanced)
Based on the documentation at docs.com...
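
Putting the steps together, the overall flow can be sketched as: extract URLs from the message, scrape each page, summarize under the token budget, and prepend the summaries to the question. Every helper name below is hypothetical:

def reconstruct(question, slack_text, llm):
    summaries = []
    for url in extract_urls(slack_text):   # hypothetical extractor (see above)
        page_text = scrape_url(url)        # hypothetical scraper
        summaries.append(llm.predict(f"Summarize:\n{page_text}"))
    # Enhanced prompt: URL summaries followed by the original question
    return "\n".join(summaries) + "\n\n" + question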

Module contents#