sw-search CLI

Summary: Command-line tool for building, searching, and managing vector search indexes for AI agent knowledge bases.

Overview

The sw-search tool builds vector search indexes from documents for use with the native_vector_search skill.

Capabilities:

  • Build indexes from documents (MD, TXT, PDF, DOCX, RST, PY)
  • Multiple chunking strategies for different content types
  • SQLite and PostgreSQL/pgvector storage backends
  • Interactive search shell for index exploration
  • Export chunks to JSON for review or external processing
  • Migrate indexes between backends
  • Search via remote API endpoints

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    Documents    │───▶│  Index Builder   │───▶│  .swsearch DB   │
│ (MD, PDF, etc.) │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│      Agent      │───▶│   Search Skill   │───▶│  Search Engine  │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

The system provides:

  • Offline Search: No external API calls or internet required
  • Hybrid Search: Combines vector similarity and keyword search
  • Smart Chunking: Intelligent document segmentation with context preservation
  • Advanced Query Processing: NLP-enhanced query understanding
  • Flexible Deployment: Local embedded mode or remote server mode
  • SQLite Storage: Portable .swsearch index files
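
To make the idea of hybrid search concrete, here is a minimal sketch of how a vector score and a keyword score could be blended. The linear formula, the function name `hybrid_score`, and the assumption that both scores are already normalized to the 0.0-1.0 range are illustrative only; the weighting sw-search applies internally is not described on this page.

```python
def hybrid_score(vector_score: float, keyword_score: float,
                 keyword_weight: float = 0.3) -> float:
    """Blend two relevance signals with a linear weight (hypothetical formula)."""
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be between 0.0 and 1.0")
    return (1.0 - keyword_weight) * vector_score + keyword_weight * keyword_score


# (chunk id, vector similarity, keyword score): made-up values for illustration.
candidates = [("chunk-a", 0.82, 0.10), ("chunk-b", 0.70, 0.90)]

# Rank candidates by the blended score, best first.
ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)
```

With these made-up numbers, the chunk with the strong keyword match moves ahead of the chunk with the slightly better vector similarity.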

Command Modes

sw-search operates in five modes:

Mode      Syntax                        Purpose
build     sw-search ./docs              Build search index
search    sw-search search FILE QUERY   Search an existing index
validate  sw-search validate FILE       Validate index integrity
migrate   sw-search migrate FILE        Migrate between backends
remote    sw-search remote URL QUERY    Search via remote API

Quick Start

## Build index from documentation
sw-search ./docs --output knowledge.swsearch

## Search the index
sw-search search knowledge.swsearch "how to create an agent"

## Interactive search shell
sw-search search knowledge.swsearch --shell

## Validate index
sw-search validate knowledge.swsearch

Building Indexes

Index Structure

Each .swsearch file is a SQLite database containing:

  • Document chunks with embeddings and metadata
  • Full-text search index (SQLite FTS5) for keyword search
  • Configuration and model information
  • Synonym cache for query expansion

This portable format allows you to build indexes once and distribute them with your agents.
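
Because a .swsearch file is an ordinary SQLite database, it can be inspected with Python's standard sqlite3 module. The sketch below only lists table names and makes no assumption about the schema; the helper name `list_tables` is ours, not part of the SDK.

```python
import sqlite3
from pathlib import Path


def list_tables(index_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database file."""
    # Open read-only (mode=ro) so inspection can never modify the index.
    uri = Path(index_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("knowledge.swsearch")` on a built index should show where chunks, full-text data, and configuration are stored.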

Basic Usage

## Build from single directory
sw-search ./docs

## Build from multiple directories
sw-search ./docs ./examples --file-types md,txt,py

## Build from individual files
sw-search README.md ./docs/guide.md ./src/main.py

## Mixed sources (directories and files)
sw-search ./docs README.md ./examples specific_file.txt

## Specify output file
sw-search ./docs --output ./knowledge.swsearch

Build Options

Option            Default           Description
--output FILE     sources.swsearch  Output file or collection
--output-dir DIR  (none)            Output directory
--output-format   index             Output: index or json
--backend         sqlite            Storage: sqlite or pgvector
--file-types      md,txt,rst        Comma-separated extensions
--exclude         (none)            Glob patterns to exclude
--languages       en                Language codes
--tags            (none)            Tags for all chunks
--validate        false             Validate after building
--verbose         false             Detailed output

Chunking Strategies

Choose the right strategy for your content:

Strategy   Best For                        Key Options
sentence   General prose, articles         --max-sentences-per-chunk
sliding    Code, technical documentation   --chunk-size, --overlap-size
paragraph  Structured documents            (none)
page       PDFs with distinct pages        (none)
semantic   Coherent topic grouping         --semantic-threshold
topic      Long documents by subject       --topic-threshold
qa         Question-answering apps         (none)
markdown   Documentation with code blocks  (preserves structure)
json       Pre-chunked content             (none)

Sentence Chunking (Default)

Groups sentences together:

## Default: 5 sentences per chunk
sw-search ./docs --chunking-strategy sentence

## Custom sentence count
sw-search ./docs \
--chunking-strategy sentence \
--max-sentences-per-chunk 10

## Split on multiple newlines
sw-search ./docs \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--split-newlines 2
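
The grouping behaviour can be sketched in a few lines of Python. This is a simplified illustration, not the tool's implementation: the regex-based sentence splitter and the function name `sentence_chunks` are assumptions, and the real strategy relies on an NLP backend for sentence detection.

```python
import re


def sentence_chunks(text: str, max_sentences_per_chunk: int = 5) -> list[str]:
    """Group consecutive sentences into chunks of at most N sentences."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + max_sentences_per_chunk])
        for i in range(0, len(sentences), max_sentences_per_chunk)
    ]
```

For example, six short sentences with `max_sentences_per_chunk=2` produce three chunks of two sentences each.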

Sliding Window Chunking

Fixed-size chunks with overlap:

sw-search ./docs \
--chunking-strategy sliding \
--chunk-size 100 \
--overlap-size 20
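
A sliding window can be sketched as follows. The sketch assumes that chunk size and overlap are measured in whitespace-separated words, which this page does not state; the function name `sliding_window_chunks` is illustrative.

```python
def sliding_window_chunks(text: str, chunk_size: int = 100,
                          overlap_size: int = 20) -> list[str]:
    """Split text into fixed-size word windows that overlap their neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

Each chunk repeats the last `overlap_size` words of the previous one, so a sentence cut at a window boundary still appears whole in at least one chunk when the overlap is large enough.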

Paragraph Chunking

Splits on double newlines:

sw-search ./docs \
--chunking-strategy paragraph \
--file-types md,txt,rst

Page Chunking

Best for PDFs:

sw-search ./docs \
--chunking-strategy page \
--file-types pdf

Semantic Chunking

Groups semantically similar sentences:

sw-search ./docs \
--chunking-strategy semantic \
--semantic-threshold 0.6
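
One plausible reading of the threshold is sketched below: consecutive sentences stay in the same chunk while the cosine similarity of their embeddings is at or above the threshold. The comparison rule, the function names, and the use of precomputed vectors are assumptions for illustration; the actual grouping logic may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], vectors: list[list[float]],
                    threshold: float = 0.6) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if len(sentences) != len(vectors):
        raise ValueError("need exactly one vector per sentence")
    chunks: list[list[str]] = []
    for i, sentence in enumerate(sentences):
        if i > 0 and cosine(vectors[i - 1], vectors[i]) >= threshold:
            chunks[-1].append(sentence)  # similar enough: extend current chunk
        else:
            chunks.append([sentence])    # topic shift: open a new chunk
    return chunks
```

Under this reading, a higher threshold yields more, smaller chunks, and a lower threshold yields fewer, larger ones.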

Topic Chunking

Detects topic changes:

sw-search ./docs \
--chunking-strategy topic \
--topic-threshold 0.2

QA Chunking

Optimized for question-answering:

sw-search ./docs --chunking-strategy qa

Markdown Chunking

The markdown strategy is specifically designed for documentation that contains code examples. It understands markdown structure and adds rich metadata for better search results.

sw-search ./docs \
--chunking-strategy markdown \
--file-types md

Features:

  • Header-based chunking: Splits at markdown headers (h1, h2, h3...) for natural boundaries
  • Code block detection: Identifies fenced code blocks and extracts language (python, bash, etc.)
  • Smart tagging: Adds "code" tags to chunks with code, plus language-specific tags
  • Section hierarchy: Preserves full path (e.g., "API Reference > AgentBase > Methods")
  • Code protection: Never splits inside code blocks
  • Metadata enrichment: Header levels stored as searchable metadata

Example Metadata:

{
  "chunk_type": "markdown",
  "h1": "API Reference",
  "h2": "AgentBase",
  "h3": "add_skill Method",
  "has_code": true,
  "code_languages": ["python", "bash"],
  "tags": ["code", "code:python", "code:bash", "depth:3"]
}

Search Benefits:

When users search for "example code Python":

  • Chunks with code blocks get automatic 20% boost
  • Python-specific code gets language match bonus
  • Vector similarity provides primary semantic ranking
  • Metadata tags provide confirmation signals
  • Results blend semantic + structural relevance
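
The effect of these signals on ranking can be sketched as follows. Only the 20% boost for chunks with code comes from the list above; the size of the language bonus, the way it is added, and the function name `boosted_score` are hypothetical.

```python
def boosted_score(base_score: float, metadata: dict,
                  query_languages: tuple[str, ...] = (),
                  code_boost: float = 1.2,
                  language_bonus: float = 0.05) -> float:
    """Adjust a vector-similarity score using markdown chunk metadata."""
    score = base_score
    if metadata.get("has_code"):
        score *= code_boost  # 20% boost for chunks containing code blocks
    # Hypothetical bonus when the query names a language found in the chunk.
    if set(query_languages) & set(metadata.get("code_languages", [])):
        score += language_bonus
    return score


chunk_meta = {"has_code": True, "code_languages": ["python", "bash"]}
score = boosted_score(0.5, chunk_meta, query_languages=("python",))
```

Vector similarity remains the primary signal in this sketch; the metadata only nudges chunks that are already semantically close.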

Best Used With:

  • API documentation with code examples
  • Tutorial content with inline code
  • Technical guides with multiple languages
  • README files with usage examples

Usage with pgvector:

sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/db" \
--output docs_collection \
--chunking-strategy markdown

JSON Chunking

The json strategy allows you to provide pre-chunked content in a structured format. This is useful when you need custom control over how documents are split and indexed.

Expected JSON Format:

{
  "chunks": [
    {
      "chunk_id": "unique_id",
      "type": "content",
      "content": "The actual text content",
      "metadata": {
        "section": "Introduction",
        "url": "https://example.com/docs/intro",
        "custom_field": "any_value"
      },
      "tags": ["intro", "getting-started"]
    }
  ]
}

Usage:

## First preprocess your documents into JSON chunks
python your_preprocessor.py input.txt -o chunks.json

## Then build the index using JSON strategy
sw-search chunks.json --chunking-strategy json --file-types json
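
A preprocessor such as the `your_preprocessor.py` placeholder above could look like the following minimal sketch. It splits on blank lines and emits the expected JSON format; the splitting rule, the function names, and the `preprocessed` tag are illustrative choices, not requirements of sw-search.

```python
import json


def build_chunks(text: str, source_url: str) -> dict:
    """Turn blank-line-separated paragraphs into the expected chunk format."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {
        "chunks": [
            {
                "chunk_id": f"chunk_{i}",
                "type": "content",
                "content": paragraph,
                "metadata": {"section": f"Paragraph {i}", "url": source_url},
                "tags": ["preprocessed"],
            }
            for i, paragraph in enumerate(paragraphs, start=1)
        ]
    }


def write_chunks(text: str, source_url: str, output_path: str) -> None:
    """Write the chunk document to a JSON file for the json strategy."""
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(build_chunks(text, source_url), fh, indent=2)
```

The resulting file can then be passed to sw-search with `--chunking-strategy json --file-types json`.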

Best Used For:

  • API documentation with complex structure
  • Documents that need custom parsing logic
  • Preserving specific metadata relationships
  • Integration with external preprocessing tools

Model Selection

Choose an embedding model based on the trade-off between speed and quality:

Alias  Model              Dims  Speed  Quality
mini   all-MiniLM-L6-v2   384   ~5x    Good
base   all-mpnet-base-v2  768   1x     High
large  all-mpnet-base-v2  768   1x     Highest

## Fast model (default, recommended for most cases)
sw-search ./docs --model mini

## Balanced model
sw-search ./docs --model base

## Best quality
sw-search ./docs --model large

## Full model name
sw-search ./docs --model sentence-transformers/all-mpnet-base-v2
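
The embedding dimension also affects index size. The back-of-the-envelope estimate below assumes each embedding value is stored as a 4-byte float, which this page does not confirm, and it ignores text, metadata, and full-text index overhead.

```python
# Dimensions per alias, taken from the model table above.
MODEL_DIMS = {"mini": 384, "base": 768, "large": 768}


def embedding_bytes(num_chunks: int, model_alias: str,
                    bytes_per_value: int = 4) -> int:
    """Estimate raw embedding storage (assumes float32 values)."""
    return num_chunks * MODEL_DIMS[model_alias] * bytes_per_value


for alias in ("mini", "base"):
    size_mb = embedding_bytes(1523, alias) / 1_000_000
    print(f"{alias}: about {size_mb:.1f} MB of raw embeddings for 1523 chunks")
```

Under that assumption, a 768-dimension model needs twice the embedding storage of the 384-dimension mini model for the same number of chunks.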

File Filtering

## Specific file types
sw-search ./docs --file-types md,txt,rst,py

## Exclude patterns
sw-search ./docs --exclude "**/test/**,**/__pycache__/**,**/.git/**"

## Language filtering
sw-search ./docs --languages en,es,fr
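
To preview which paths a comma-separated exclude list might catch, the sketch below uses Python's fnmatch module. This only approximates glob matching; sw-search's own matcher may treat `**` and top-level directories differently, and the helper names are ours.

```python
from fnmatch import fnmatchcase


def parse_patterns(value: str) -> list[str]:
    """Split a comma-separated --exclude value into individual patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def is_excluded(path: str, patterns: list[str]) -> bool:
    """Return True if the path matches any pattern (fnmatch semantics)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


patterns = parse_patterns("**/test/**,**/__pycache__/**,**/.git/**")
for path in ("docs/guide.md", "docs/test/fixtures.md", "src/__pycache__/x.py"):
    print(path, "excluded" if is_excluded(path, patterns) else "kept")
```

Checking a handful of real paths this way before a long build can catch a pattern that is too broad or too narrow.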

Tags and Metadata

Add tags during build for filtered searching:

## Add tags to all chunks
sw-search ./docs --tags documentation,api,v2

## Filter by tags when searching
sw-search search index.swsearch "query" --tags documentation

Searching Indexes

## Search with query
sw-search search knowledge.swsearch "how to create an agent"

## Limit results
sw-search search knowledge.swsearch "API reference" --count 3

## Verbose output with scores
sw-search search knowledge.swsearch "configuration" --verbose

Search Options

Option                Default  Description
--count               5        Number of results
--distance-threshold  0.0      Minimum similarity score
--tags                (none)   Filter by tags
--query-nlp-backend   nltk     NLP backend: nltk or spacy
--keyword-weight      (auto)   Manual keyword weight (0.0-1.0)
--model               (index)  Override embedding model
--json                false    Output as JSON
--no-content          false    Hide content, show metadata only
--verbose             false    Detailed output

Output Formats

## Human-readable (default)
sw-search search knowledge.swsearch "query"

## JSON output
sw-search search knowledge.swsearch "query" --json

## Metadata only
sw-search search knowledge.swsearch "query" --no-content

## Full verbose output
sw-search search knowledge.swsearch "query" --verbose

Filter by Tags

## Single tag
sw-search search knowledge.swsearch "functions" --tags documentation

## Multiple tags
sw-search search knowledge.swsearch "API" --tags api,reference

Interactive Search Shell

Load index once and search multiple times:

sw-search search knowledge.swsearch --shell

Shell commands:

Command         Description
help            Show help
exit/quit/q     Exit shell
count=N         Set result count
tags=tag1,tag2  Set tag filter
verbose         Toggle verbose output
<query>         Search for query

Example session:

$ sw-search search knowledge.swsearch --shell
Search Shell - Index: knowledge.swsearch
Backend: sqlite
Index contains 1523 chunks from 47 files
Model: sentence-transformers/all-MiniLM-L6-v2
Type 'exit' or 'quit' to leave, 'help' for options
------------------------------------------------------------

search> how to create an agent
Found 5 result(s) for 'how to create an agent' (0.034s):
...

search> count=3
Result count set to: 3

search> SWAIG functions
Found 3 result(s) for 'SWAIG functions' (0.028s):
...

search> exit
Goodbye!

PostgreSQL/pgvector Backend

The search system supports multiple storage backends. Choose based on your deployment needs:

Backend Comparison

Feature                       SQLite            pgvector
Setup complexity              None              Requires PostgreSQL
Scalability                   Limited           Excellent
Concurrent access             Poor              Excellent
Update capability             Rebuild required  Real-time
Performance (small datasets)  Excellent         Good
Performance (large datasets)  Poor              Excellent
Deployment                    File copy         Database connection
Multi-agent support           Separate copies   Shared knowledge base

SQLite Backend (Default):

  • File-based .swsearch indexes
  • Portable single-file format
  • No external dependencies
  • Best for: Single-agent deployments, development, small to medium datasets

pgvector Backend:

  • Server-based PostgreSQL storage
  • Efficient similarity search with IVFFlat/HNSW indexes
  • Multiple agents can share the same knowledge base
  • Real-time updates without rebuilding
  • Best for: Production deployments, multi-agent systems, large datasets

Building with pgvector

## Build to pgvector
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection

## With markdown strategy
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection \
--chunking-strategy markdown

## Overwrite existing collection
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection \
--overwrite

Search pgvector Collection

sw-search search docs_collection "how to create an agent" \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost/knowledge"

Migration

Migrate indexes between backends:

## Get index information
sw-search migrate --info ./docs.swsearch

## Migrate SQLite to pgvector
sw-search migrate ./docs.swsearch --to-pgvector \
--connection-string "postgresql://user:pass@localhost/db" \
--collection-name docs_collection

## Migrate with overwrite
sw-search migrate ./docs.swsearch --to-pgvector \
--connection-string "postgresql://user:pass@localhost/db" \
--collection-name docs_collection \
--overwrite

Migration Options

Option               Description
--info               Show index information
--to-pgvector        Migrate SQLite to pgvector
--to-sqlite          Migrate pgvector to SQLite (planned)
--connection-string  PostgreSQL connection string
--collection-name    Target collection name
--overwrite          Overwrite existing collection
--batch-size         Chunks per batch (default: 100)
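
Batching means chunks are written in groups rather than one at a time or all at once. The generic sketch below shows the idea; it is not the migration code itself, and the helper name `batches` is illustrative.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], batch_size: int = 100) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# An index with 1523 chunks would be written in 16 batches of up to 100.
chunk_ids = list(range(1523))
batch_count = sum(1 for _ in batches(chunk_ids, 100))
```

A larger batch size means fewer round trips to the database at the cost of more memory per write.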

Local vs Remote Modes

The search skill supports both local and remote operation modes.

Local Mode (Default)

Searches are performed directly in the agent process using the embedded search engine.

Pros:

  • Faster (no network latency)
  • Works offline
  • Simple deployment
  • Lower operational complexity

Cons:

  • Higher memory usage per agent
  • Index files must be distributed with each agent
  • Updates require redeploying agents

Configuration in Agent:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",  # Local file
    "nlp_backend": "nltk"
})

Remote Mode

Searches are performed via HTTP API to a centralized search server.

Pros:

  • Lower memory usage per agent
  • Centralized index management
  • Easy updates without redeploying agents
  • Better scalability for multiple agents
  • Shared resources

Cons:

  • Network dependency
  • Additional infrastructure complexity
  • Potential latency

Configuration in Agent:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "remote_url": "http://localhost:8001",  # Search server
    "index_name": "docs",
    "nlp_backend": "nltk"
})

Automatic Mode Detection

The skill automatically detects which mode to use:

  • If remote_url is provided → Remote mode
  • If index_file is provided → Local mode
  • Remote mode takes priority if both are specified
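
The detection rules above can be sketched as a small function. The function name `detect_mode` and the error raised when neither key is present are assumptions; only the precedence order comes from the list above.

```python
def detect_mode(config: dict) -> str:
    """Pick the search mode from a skill configuration dictionary."""
    if config.get("remote_url"):
        return "remote"  # remote_url wins even if index_file is also set
    if config.get("index_file"):
        return "local"
    raise ValueError("expected either 'remote_url' or 'index_file'")


mode = detect_mode({
    "index_file": "docs.swsearch",
    "remote_url": "http://localhost:8001",
})
```

Here `mode` is "remote", because remote mode takes priority when both keys are specified.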

Running a Remote Search Server

  1. Start the search server:

python examples/search_server_standalone.py

  2. The server provides an HTTP API:

    • POST /search - Search the indexes
    • GET /health - Health check and available indexes
    • POST /reload_index - Add or reload an index

  3. Test the API:

curl -X POST "http://localhost:8001/search" \
-H "Content-Type: application/json" \
-d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'
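
The same request can be sent from Python using only the standard library. This sketch mirrors the curl call above; the helper names are ours, and because the response schema is not documented here, the parsed JSON is returned as-is.

```python
import json
import urllib.request


def build_search_request(base_url: str, query: str, index_name: str,
                         count: int = 3) -> urllib.request.Request:
    """Build a POST /search request with the same JSON body as the curl example."""
    payload = json.dumps(
        {"query": query, "index_name": index_name, "count": count}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url: str, query: str, index_name: str,
           count: int = 3, timeout: float = 30.0):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(base_url, query, index_name, count)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server from step 1 running, `search("http://localhost:8001", "how to create an agent", "docs")` performs the same query as the curl command.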

Remote Search CLI

Search via remote API endpoint from the command line:

## Basic remote search
sw-search remote http://localhost:8001 "how to create an agent" \
--index-name docs

## With options
sw-search remote localhost:8001 "API reference" \
--index-name docs \
--count 3 \
--verbose

## JSON output
sw-search remote localhost:8001 "query" \
--index-name docs \
--json

Remote Options

Option                Default     Description
--index-name          (required)  Name of the index to search
--count               5           Number of results
--distance-threshold  0.0         Minimum similarity score
--tags                (none)      Filter by tags
--timeout             30          Request timeout in seconds
--json                false       Output as JSON
--no-content          false       Hide content
--verbose             false       Detailed output

Validation

Verify index integrity:

## Validate index
sw-search validate ./docs.swsearch

## Verbose validation
sw-search validate ./docs.swsearch --verbose

Output:

✓ Index is valid: ./docs.swsearch
Chunks: 1523
Files: 47

Configuration:
embedding_model: sentence-transformers/all-MiniLM-L6-v2
embedding_dimensions: 384
chunking_strategy: markdown
created_at: 2025-01-15T10:30:00

JSON Export

Export chunks for review or external processing:

## Export to single JSON file
sw-search ./docs \
--output-format json \
--output all_chunks.json

## Export to directory (one file per source)
sw-search ./docs \
--output-format json \
--output-dir ./chunks/

## Build index from exported JSON
sw-search ./chunks/ \
--chunking-strategy json \
--file-types json \
--output final.swsearch
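
Before rebuilding from exported JSON, it can help to review the chunks programmatically. The sketch below assumes the export uses the same top-level "chunks" layout as the JSON chunking format described earlier, which seems likely since exports can be fed back in with the json strategy but is not confirmed here; the function names are illustrative.

```python
import json
from collections import Counter


def summarize_chunks(document: dict) -> dict:
    """Report chunk count, average content length, and tag frequencies."""
    chunks = document.get("chunks", [])
    lengths = [len(chunk.get("content", "")) for chunk in chunks]
    tag_counts = Counter(tag for chunk in chunks for tag in chunk.get("tags", []))
    return {
        "chunk_count": len(chunks),
        "average_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "tag_counts": dict(tag_counts),
    }


def summarize_file(path: str) -> dict:
    """Load an exported JSON file and summarize its chunks."""
    with open(path, encoding="utf-8") as fh:
        return summarize_chunks(json.load(fh))
```

Very short or very long average lengths are a quick hint that a different chunking strategy may suit the content better.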

NLP Backend Selection

Choose NLP backend for processing:

Backend  Speed   Quality  Install Size
nltk     Fast    Good     Included
spacy    Slower  Better   Requires: pip install signalwire-agents[search-nlp]

## Index with NLTK (default)
sw-search ./docs --index-nlp-backend nltk

## Index with spaCy (better quality)
sw-search ./docs --index-nlp-backend spacy

## Query with NLTK
sw-search search index.swsearch "query" --query-nlp-backend nltk

## Query with spaCy
sw-search search index.swsearch "query" --query-nlp-backend spacy

Complete Configuration Example

sw-search ./docs ./examples README.md \
--output ./knowledge.swsearch \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--file-types md,txt,rst,py \
--exclude "**/test/**,**/__pycache__/**" \
--languages en,es,fr \
--model sentence-transformers/all-mpnet-base-v2 \
--tags documentation,api \
--index-nlp-backend nltk \
--validate \
--verbose

Using with Skills

After building an index, use it with the native_vector_search skill:

from signalwire_agents import AgentBase

agent = AgentBase(name="search-agent")

## Add search skill with built index
agent.add_skill("native_vector_search", {
    "index_file": "./knowledge.swsearch",
    "tool_name": "search_docs",
    "tool_description": "Search the documentation"
})

Output Formats

Format    Extension   Description
swsearch  .swsearch   SQLite-based portable index (default)
json      .json       JSON export of chunks
pgvector  (database)  PostgreSQL with pgvector extension

Installation Requirements

The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:

Basic Search (~500MB)

pip install "signalwire-agents[search]"

Includes:

  • Core search functionality
  • Sentence transformers for embeddings
  • SQLite FTS5 for keyword search
  • Basic document processing (text, markdown)

Full Document Processing (~600MB)

pip install "signalwire-agents[search-full]"

Adds:

  • PDF processing (PyPDF2)
  • DOCX processing (python-docx)
  • HTML processing (BeautifulSoup4)
  • Additional file format support

Advanced NLP Features (~700MB)

pip install "signalwire-agents[search-nlp]"

Adds:

  • spaCy for advanced text processing
  • NLTK for linguistic analysis
  • Enhanced query preprocessing
  • Language detection

Additional Setup Required:

python -m spacy download en_core_web_sm

Performance Note: Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. Only recommended if you have sufficient CPU power and can tolerate longer response times.

All Search Features (~700MB)

pip install "signalwire-agents[search-all]"

Includes everything above.

Additional Setup Required:

python -m spacy download en_core_web_sm

Query-Only Mode (~400MB)

pip install "signalwire-agents[search-queryonly]"

For agents that only need to query pre-built indexes without building new ones.

PostgreSQL Vector Support

pip install "signalwire-agents[pgvector]"

Adds PostgreSQL with pgvector extension support for production deployments.

NLP Backend Selection

You can choose which NLP backend to use for query processing:

Backend  Speed                 Quality  Notes
nltk     Fast (~50-100ms)      Good     Default, good for most use cases
spacy    Slower (~150-300ms)   Better   Better POS tagging and entity recognition

Configure via --index-nlp-backend (build) or --query-nlp-backend (search) flags.

API Reference

For programmatic access to the search system, use the Python API directly.

SearchEngine Class

from signalwire_agents.search import SearchEngine

## Load an index
engine = SearchEngine("docs.swsearch")

## Perform search
results = engine.search(
query_vector=[...], # Optional: pre-computed query vector
enhanced_text="search query", # Enhanced query text
count=5, # Number of results
similarity_threshold=0.0, # Minimum similarity score
tags=["documentation"] # Filter by tags
)

## Get index statistics
stats = engine.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total files: {stats['total_files']}")

IndexBuilder Class

from signalwire_agents.search import IndexBuilder

## Create index builder
builder = IndexBuilder(
model_name="sentence-transformers/all-mpnet-base-v2",
chunk_size=500,
chunk_overlap=50,
verbose=True
)

## Build index
builder.build_index(
source_dir="./docs",
output_file="docs.swsearch",
file_types=["md", "txt"],
exclude_patterns=["**/test/**"],
tags=["documentation"]
)

Troubleshooting

Issue                        Solution
pgvector errors              pip install signalwire-agents[pgvector]
PDF processing fails         pip install signalwire-agents[search-full]
spaCy not found              pip install signalwire-agents[search-nlp]
No results found             Try different chunking strategy
Index too large              Use --model mini, reduce file types
Connection refused (remote)  Check search server is running