sherpa_ai.scrape package#
Overview#
The scrape package provides utilities for extracting and processing information from external sources such as files, websites, and repositories. These tools enable agents to gather relevant information from structured and unstructured data sources.
Key Components#
QuestionWithFileHandler: Handles questions with attached files, from download through summarization
extract_github_readme: Retrieves README content from GitHub repositories
PromptReconstructor: Rebuilds prompts by incorporating content from URLs in Slack messages
Example Usage#
from sherpa_ai.scrape.extract_github_readme import extract_github_readme, get_owner_and_repo
from sherpa_ai.scrape.file_scraper import QuestionWithFileHandler

# Parse the owner and repository name from a GitHub URL
owner, repo = get_owner_and_repo("https://github.com/openai/gpt-3")
print(f"Owner: {owner}, Repo: {repo}")

# Extract the README from a GitHub repository
readme = extract_github_readme("https://github.com/openai/gpt-3")
if readme:
    print(readme[:100])

# Answer a question using an attached file
handler = QuestionWithFileHandler(
    question="What's in the document?",
    files=[{"id": "123", "filetype": "pdf"}],
    token="oauth_token",
    user_id="user123",
    team_id="team456",
    llm=language_model,
)
result = handler.reconstruct_prompt_with_file()
Submodules#
Module | Description
---|---
sherpa_ai.scrape.extract_github_readme | Utilities for retrieving and processing README files from GitHub repositories.
sherpa_ai.scrape.file_scraper | Tools for extracting and parsing content from local files in various formats.
sherpa_ai.scrape.prompt_reconstructor | Functionality for rebuilding and formatting prompts from extracted content.
sherpa_ai.scrape.extract_github_readme module#
GitHub README extraction module for Sherpa AI.
This module provides functionality for extracting and processing README files from GitHub repositories. It handles authentication, content extraction, and storage of README content in vector databases.
- sherpa_ai.scrape.extract_github_readme.get_owner_and_repo(url)[source]#
Extract owner and repository name from GitHub URL.
This function parses a GitHub repository URL to extract the owner’s username and repository name.
- Parameters:
url (str) – GitHub repository URL (e.g., ‘https://github.com/owner/repo’).
- Returns:
- A tuple containing:
owner (str): Repository owner’s username
repo (str): Repository name
- Return type:
tuple[str, str]
Example
>>> url = "https://github.com/openai/gpt-3"
>>> owner, repo = get_owner_and_repo(url)
>>> print(owner, repo)
openai gpt-3
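For intuition, this parsing reduces to taking the first two segments of the URL path. A minimal sketch of that idea (a hypothetical stand-in, not this module's actual implementation):

from urllib.parse import urlparse

def parse_github_url(url):
    # Hypothetical helper: the first two path segments are owner and repo
    parts = urlparse(url).path.strip("/").split("/")
    return parts[0], parts[1]

print(parse_github_url("https://github.com/openai/gpt-3"))  # ('openai', 'gpt-3')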
- sherpa_ai.scrape.extract_github_readme.extract_github_readme(repo_url)[source]#
Extract README content from a GitHub repository.
This function downloads and extracts the content of a repository’s README file (either .md or .rst). It also saves the content to a vector store for future reference.
- Parameters:
repo_url (str) – GitHub repository URL.
- Returns:
- README content if found and successfully extracted,
None otherwise.
- Return type:
Optional[str]
Example
>>> url = "https://github.com/openai/gpt-3"
>>> content = extract_github_readme(url)
>>> if content:
...     print(content[:50])
# GPT-3: Language Models are Few-Shot Learners...
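The extraction described above maps naturally onto GitHub's REST API, which serves a repository's README as base64-encoded JSON. A sketch of that flow, assuming the standard endpoint is used (this is not necessarily the module's exact code path):

import base64
import requests

def fetch_readme(owner, repo):
    # GET /repos/{owner}/{repo}/readme returns the README as base64 JSON
    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}/readme")
    if resp.status_code != 200:
        return None
    return base64.b64decode(resp.json()["content"]).decode("utf-8")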
- sherpa_ai.scrape.extract_github_readme.save_to_pine_cone(content, metadatas)[source]#
Save content to Pinecone vector store.
This function saves text content and associated metadata to a Pinecone vector store for efficient retrieval. It uses OpenAI embeddings for vectorization.
- Parameters:
content (str) – Text content to be stored.
metadatas (list) – List of metadata dictionaries for the content.
- Raises:
ImportError – If pinecone-client package is not installed.
Example
>>> content = "# Project Documentation\nThis is a guide..."
>>> metadata = [{"type": "github", "url": "https://github.com/org/repo"}]
>>> save_to_pine_cone(content, metadata)  # Saves to vector store
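The docstring implies the common embed-then-upsert pattern: vectorize the text with OpenAI embeddings and store it in a Pinecone index. A hedged sketch of that pattern (the index name and credentials are placeholders, and the exact client calls in this module may differ):

# Assumed dependencies: pinecone-client, langchain, openai
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")  # placeholder credentials
embeddings = OpenAIEmbeddings()
Pinecone.from_texts([content], embeddings, metadatas=metadatas, index_name="your-index")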
sherpa_ai.scrape.file_scraper module#
File scraping and handling module for Sherpa AI.
This module provides functionality for downloading, processing, and analyzing files attached to questions. It handles various file types including PDF, text, markdown, HTML, and XML files.
- class sherpa_ai.scrape.file_scraper.QuestionWithFileHandler(question, files, token, user_id, team_id, llm)[source]#
Bases: object
Handler for questions with attached files.
This class manages the process of downloading, processing, and analyzing files attached to questions. It supports various file types and handles token limits and content summarization.
- question#
The user’s question to be answered.
- Type:
str
- token#
OAuth token for file access.
- Type:
str
- files#
List of file information dictionaries.
- Type:
list
- user_id#
ID of the user asking the question.
- Type:
str
- team_id#
ID of the team the user belongs to.
- Type:
str
- llm#
Language model for text processing.
- Type:
Any
Example
>>> handler = QuestionWithFileHandler(
...     question="What's in the document?",
...     files=[{"id": "123", "filetype": "pdf"}],
...     token="oauth_token",
...     user_id="user123",
...     team_id="team456",
...     llm=language_model,
... )
>>> result = handler.reconstruct_prompt_with_file()
>>> print(result["status"])
success
- reconstruct_prompt_with_file()[source]#
Reconstruct the prompt using the attached file.
This method downloads the file, processes its content, and combines it with the original question to create a more informed prompt.
- Returns:
- A dictionary containing:
status (str): ‘success’ or ‘error’
data (str): Reconstructed prompt if successful
message (str): Error message if failed
- Return type:
dict
Example
>>> result = handler.reconstruct_prompt_with_file()
>>> if result["status"] == "success":
...     print(result["data"])
Based on the PDF content...
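Because both outcomes are part of the documented return contract, callers should branch on status and surface message on failure:

result = handler.reconstruct_prompt_with_file()
if result["status"] == "success":
    prompt = result["data"]
else:
    print("File processing failed:", result["message"])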
- download_file(file)[source]#
Download and extract content from a file.
This method downloads a file using its URL and extracts its content based on the file type. Supports PDF, text, markdown, HTML, and XML.
- Parameters:
file (dict) – File information dictionary containing:
- id (str): File identifier
- mimetype (str): MIME type
- url_private_download (str): Download URL
- filetype (str): File extension
- Returns:
- A dictionary containing:
status (str): ‘success’ or ‘error’
data (str): File content if successful
message (str): Error message if failed
- Return type:
dict
Example
>>> file_info = {
...     "id": "123",
...     "filetype": "pdf",
...     "url_private_download": "https://example.com/doc.pdf"
... }
>>> result = handler.download_file(file_info)
>>> if result["status"] == "success":
...     print(len(result["data"]))
1024
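Extraction like this typically dispatches on the file extension. A minimal sketch of the idea (the library choices below, such as pypdf, are assumptions rather than this module's actual backends):

import io

def extract_text(filetype, raw):
    # Hypothetical dispatcher covering the documented file types
    if filetype == "pdf":
        from pypdf import PdfReader  # assumed PDF backend
        reader = PdfReader(io.BytesIO(raw))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if filetype in ("txt", "md", "html", "xml"):
        return raw.decode("utf-8", errors="replace")
    raise ValueError(f"Unsupported file type: {filetype}")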
- prompt_reconstruct(file_info, data)[source]#
Reconstruct the prompt with file content.
This method processes the file content, handles token limits, and combines the content with the original question to create an enhanced prompt.
- Parameters:
file_info (dict) – File information dictionary containing:
- filetype (str): File extension
- name (str): File name
- title (str): File title
data (str) – Content of the file.
- Returns:
- A dictionary containing:
status (str): ‘success’ or ‘error’
data (str): Reconstructed prompt if successful
message (str): Error message if failed
- Return type:
dict
Example
>>> file_info = {
...     "filetype": "pdf",
...     "name": "document.pdf",
...     "title": "Important Doc"
... }
>>> result = handler.prompt_reconstruct(file_info, "content...")
>>> if result["status"] == "success":
...     print(result["data"])
Based on the PDF "Important Doc"...
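The token-limit handling mentioned above usually means summarizing oversized content before it is spliced into the prompt. A hedged sketch of that pattern (tiktoken and llm.predict are assumed stand-ins, not this module's exact tooling):

import tiktoken

def fit_to_budget(text, question, llm, max_tokens=3000):
    # Hypothetical budget check: summarize when the content is too long
    enc = tiktoken.get_encoding("cl100k_base")
    if len(enc.encode(text)) > max_tokens:
        text = llm.predict(f"Summarize this to help answer '{question}':\n{text}")
    return f"Context:\n{text}\n\nQuestion: {question}"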
sherpa_ai.scrape.prompt_reconstructor module#
Prompt reconstruction module for Sherpa AI.
This module provides functionality for reconstructing prompts by incorporating content from URLs mentioned in Slack messages. It handles scraping, summarizing, and integrating web content into questions.
- class sherpa_ai.scrape.prompt_reconstructor.PromptReconstructor(question, slack_message, llm)[source]#
Bases: object
Prompt reconstructor for URL-enhanced questions.
This class handles the process of enhancing questions by incorporating content from URLs mentioned in Slack messages. It scrapes the URLs, summarizes their content, and integrates the summaries into the original question.
- question#
The original question to be enhanced.
- Type:
str
- slack_message#
Slack message containing URLs.
- Type:
dict
- llm#
Language model for text processing.
- Type:
Any
Example
>>> reconstructor = PromptReconstructor(
...     question="How does this library work?",
...     slack_message={"text": "Check https://github.com/org/repo"},
...     llm=language_model,
... )
>>> enhanced = reconstructor.reconstruct_prompt()
>>> print(enhanced)
Based on the GitHub repo...
- reconstruct_prompt()[source]#
Reconstruct the prompt by incorporating URL content.
This method extracts URLs from the Slack message, scrapes their content, summarizes it while respecting token limits, and integrates the summaries into the original question.
- Returns:
The enhanced question incorporating URL content summaries.
- Return type:
str
Example
>>> reconstructor = PromptReconstructor(
...     question="How to use this?",
...     slack_message={"text": "See docs at https://docs.com"},
...     llm=language_model,
... )
>>> enhanced = reconstructor.reconstruct_prompt()
>>> print(enhanced)
Based on the documentation at docs.com...
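The first step of reconstruct_prompt is pulling URLs out of the Slack message. Slack renders links in angle brackets, optionally with a |label suffix, so a minimal extraction sketch might look like this (a hypothetical helper, not this module's code):

import re

def extract_urls(slack_text):
    # Slack renders links as <https://example.com|label>
    return [m.split("|")[0] for m in re.findall(r"<(https?://[^>\s]+)>", slack_text)]

print(extract_urls("Check <https://github.com/org/repo|repo>"))  # ['https://github.com/org/repo']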