sw-search CLI
Summary: Command-line tool for building, searching, and managing vector search indexes for AI agent knowledge bases.
Overview
The sw-search tool builds vector search indexes from documents for use with the native_vector_search skill.
Capabilities:
- Build indexes from documents (MD, TXT, PDF, DOCX, RST, PY)
- Multiple chunking strategies for different content types
- SQLite and PostgreSQL/pgvector storage backends
- Interactive search shell for index exploration
- Export chunks to JSON for review or external processing
- Migrate indexes between backends
- Search via remote API endpoints
Architecture
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    Documents    │───▶│  Index Builder   │───▶│  .swsearch DB   │
│ (MD, PDF, etc.) │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│      Agent      │───▶│   Search Skill   │───▶│  Search Engine  │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
The system provides:
- Offline Search: No external API calls or internet required
- Hybrid Search: Combines vector similarity and keyword search
- Smart Chunking: Intelligent document segmentation with context preservation
- Advanced Query Processing: NLP-enhanced query understanding
- Flexible Deployment: Local embedded mode or remote server mode
- SQLite Storage: Portable .swsearch index files
Command Modes
sw-search operates in five modes:
| Mode | Syntax | Purpose |
|---|---|---|
| build | sw-search ./docs | Build search index |
| search | sw-search search FILE QUERY | Search existing index |
| validate | sw-search validate FILE | Validate index integrity |
| migrate | sw-search migrate FILE | Migrate between backends |
| remote | sw-search remote URL QUERY | Search via remote API |
Quick Start
## Build index from documentation
sw-search ./docs --output knowledge.swsearch
## Search the index
sw-search search knowledge.swsearch "how to create an agent"
## Interactive search shell
sw-search search knowledge.swsearch --shell
## Validate index
sw-search validate knowledge.swsearch
Building Indexes
Index Structure
Each .swsearch file is a SQLite database containing:
- Document chunks with embeddings and metadata
- Full-text search index (SQLite FTS5) for keyword search
- Configuration and model information
- Synonym cache for query expansion
This portable format allows you to build indexes once and distribute them with your agents.
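Because an index is plain SQLite, you can peek inside one with Python's standard sqlite3 module. A quick sketch (it only lists whatever tables are present, since the exact schema may vary between SDK versions):
import sqlite3

# A .swsearch file is an ordinary SQLite database
conn = sqlite3.connect("knowledge.swsearch")

# List the tables backing the index (chunk storage, FTS5 keyword index, config, ...)
for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY name"
):
    print(name)

conn.close()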
Basic Usage
## Build from single directory
sw-search ./docs
## Build from multiple directories
sw-search ./docs ./examples --file-types md,txt,py
## Build from individual files
sw-search README.md ./docs/guide.md ./src/main.py
## Mixed sources (directories and files)
sw-search ./docs README.md ./examples specific_file.txt
## Specify output file
sw-search ./docs --output ./knowledge.swsearch
Build Options
| Option | Default | Description |
|---|---|---|
| --output FILE | sources.swsearch | Output file or collection |
| --output-dir DIR | (none) | Output directory |
| --output-format | index | Output: index or json |
| --backend | sqlite | Storage: sqlite or pgvector |
| --file-types | md,txt,rst | Comma-separated extensions |
| --exclude | (none) | Glob patterns to exclude |
| --languages | en | Language codes |
| --tags | (none) | Tags for all chunks |
| --validate | false | Validate after building |
| --verbose | false | Detailed output |
Chunking Strategies
Choose the right strategy for your content:
| Strategy | Best For | Key Options |
|---|---|---|
| sentence | General prose, articles | --max-sentences-per-chunk |
| sliding | Code, technical documentation | --chunk-size, --overlap-size |
| paragraph | Structured documents | (none) |
| page | PDFs with distinct pages | (none) |
| semantic | Coherent topic grouping | --semantic-threshold |
| topic | Long documents by subject | --topic-threshold |
| qa | Question-answering apps | (none) |
| markdown | Documentation with code blocks | (preserves structure) |
| json | Pre-chunked content | (none) |
Sentence Chunking (Default)
Groups sentences together:
## Default: 5 sentences per chunk
sw-search ./docs --chunking-strategy sentence
## Custom sentence count
sw-search ./docs \
--chunking-strategy sentence \
--max-sentences-per-chunk 10
## Split on multiple newlines
sw-search ./docs \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--split-newlines 2
Sliding Window Chunking
Fixed-size chunks with overlap:
sw-search ./docs \
--chunking-strategy sliding \
--chunk-size 100 \
--overlap-size 20
Paragraph Chunking
Splits on double newlines:
sw-search ./docs \
--chunking-strategy paragraph \
--file-types md,txt,rst
Page Chunking
Best for PDFs:
sw-search ./docs \
--chunking-strategy page \
--file-types pdf
Semantic Chunking
Groups semantically similar sentences:
sw-search ./docs \
--chunking-strategy semantic \
--semantic-threshold 0.6
Topic Chunking
Detects topic changes:
sw-search ./docs \
--chunking-strategy topic \
--topic-threshold 0.2
QA Chunking
Optimized for question-answering:
sw-search ./docs --chunking-strategy qa
Markdown Chunking
The markdown strategy is specifically designed for documentation that contains code examples. It understands markdown structure and adds rich metadata for better search results.
sw-search ./docs \
--chunking-strategy markdown \
--file-types md
Features:
- Header-based chunking: Splits at markdown headers (h1, h2, h3...) for natural boundaries
- Code block detection: Identifies fenced code blocks and extracts the language (python, bash, etc.)
- Smart tagging: Adds "code" tags to chunks that contain code, plus language-specific tags
- Section hierarchy: Preserves the full header path (e.g., "API Reference > AgentBase > Methods")
- Code protection: Never splits inside code blocks
- Metadata enrichment: Header levels stored as searchable metadata
Example Metadata:
{
"chunk_type": "markdown",
"h1": "API Reference",
"h2": "AgentBase",
"h3": "add_skill Method",
"has_code": true,
"code_languages": ["python", "bash"],
"tags": ["code", "code:python", "code:bash", "depth:3"]
}
Search Benefits:
When users search for "example code Python":
- Chunks with code blocks get automatic 20% boost
- Python-specific code gets language match bonus
- Vector similarity provides primary semantic ranking
- Metadata tags provide confirmation signals
- Results blend semantic + structural relevance
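The exact weights are internal to the search engine, but the blend can be pictured roughly as follows. This is a sketch with illustrative numbers; only the 20% code boost is stated above, the other values are hypothetical:
def blended_score(vector_similarity, keyword_score, chunk_tags, query_languages):
    # Primary signal: semantic similarity from the embedding model
    score = vector_similarity
    # Chunks containing code blocks get the 20% boost described above
    if "code" in chunk_tags:
        score *= 1.20
    # Hypothetical bonus for a language match (e.g., the query mentions "Python")
    if any(f"code:{lang}" in chunk_tags for lang in query_languages):
        score *= 1.10
    # Keyword (FTS5) relevance folded in as a secondary signal
    return score + 0.1 * keyword_score  # 0.1 is an illustrative weight

# Chunk tagged by the markdown strategy, query asking for Python examples
print(blended_score(0.82, 0.5, ["code", "code:python"], ["python"]))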
Best Used With:
- API documentation with code examples
- Tutorial content with inline code
- Technical guides with multiple languages
- README files with usage examples
Usage with pgvector:
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/db" \
--output docs_collection \
--chunking-strategy markdown
JSON Chunking
The json strategy allows you to provide pre-chunked content in a structured format. This is useful when you need custom control over how documents are split and indexed.
Expected JSON Format:
{
"chunks": [
{
"chunk_id": "unique_id",
"type": "content",
"content": "The actual text content",
"metadata": {
"section": "Introduction",
"url": "https://example.com/docs/intro",
"custom_field": "any_value"
},
"tags": ["intro", "getting-started"]
}
]
}
Usage:
## First preprocess your documents into JSON chunks
python your_preprocessor.py input.txt -o chunks.json
## Then build the index using JSON strategy
sw-search chunks.json --chunking-strategy json --file-types json
Best Used For:
- API documentation with complex structure
- Documents that need custom parsing logic
- Preserving specific metadata relationships
- Integration with external preprocessing tools
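The preprocessor in the usage example above is your own script. As an illustration, here is a minimal hypothetical version that emits the expected JSON format, splitting a text file on blank lines:
import argparse
import json

# Minimal illustrative preprocessor: one chunk per blank-line-separated
# paragraph. A real preprocessor would apply custom parsing logic here.
parser = argparse.ArgumentParser()
parser.add_argument("input")
parser.add_argument("-o", "--output", required=True)
args = parser.parse_args()

with open(args.input) as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

chunks = [
    {
        "chunk_id": f"{args.input}-{i}",
        "type": "content",
        "content": text,
        "metadata": {"source": args.input, "paragraph": i},
        "tags": ["preprocessed"],
    }
    for i, text in enumerate(paragraphs)
]

with open(args.output, "w") as f:
    json.dump({"chunks": chunks}, f, indent=2)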
Model Selection
Choose an embedding model based on the speed/quality trade-off:
| Alias | Model | Dims | Speed | Quality |
|---|---|---|---|---|
| mini | all-MiniLM-L6-v2 | 384 | ~5x | Good |
| base | all-mpnet-base-v2 | 768 | 1x | High |
| large | all-mpnet-base-v2 | 768 | 1x | Highest |
## Fast model (default, recommended for most cases)
sw-search ./docs --model mini
## Balanced model
sw-search ./docs --model base
## Best quality
sw-search ./docs --model large
## Full model name
sw-search ./docs --model sentence-transformers/all-mpnet-base-v2
File Filtering
## Specific file types
sw-search ./docs --file-types md,txt,rst,py
## Exclude patterns
sw-search ./docs --exclude "**/test/**,**/__pycache__/**,**/.git/**"
## Language filtering
sw-search ./docs --languages en,es,fr
Tags and Metadata
Add tags during build for filtered searching:
## Add tags to all chunks
sw-search ./docs --tags documentation,api,v2
## Filter by tags when searching
sw-search search index.swsearch "query" --tags documentation
Searching Indexes
Basic Search
## Search with query
sw-search search knowledge.swsearch "how to create an agent"
## Limit results
sw-search search knowledge.swsearch "API reference" --count 3
## Verbose output with scores
sw-search search knowledge.swsearch "configuration" --verbose
Search Options
| Option | Default | Description |
|---|---|---|
| --count | 5 | Number of results |
| --distance-threshold | 0.0 | Minimum similarity score |
| --tags | (none) | Filter by tags |
| --query-nlp-backend | nltk | NLP backend: nltk or spacy |
| --keyword-weight | (auto) | Manual keyword weight (0.0-1.0) |
| --model | (index) | Override embedding model |
| --json | false | Output as JSON |
| --no-content | false | Hide content, show metadata only |
| --verbose | false | Detailed output |
Output Formats
## Human-readable (default)
sw-search search knowledge.swsearch "query"
## JSON output
sw-search search knowledge.swsearch "query" --json
## Metadata only
sw-search search knowledge.swsearch "query" --no-content
## Full verbose output
sw-search search knowledge.swsearch "query" --verbose
Filter by Tags
## Single tag
sw-search search knowledge.swsearch "functions" --tags documentation
## Multiple tags
sw-search search knowledge.swsearch "API" --tags api,reference
Interactive Search Shell
Load index once and search multiple times:
sw-search search knowledge.swsearch --shell
Shell commands:
| Command | Description |
|---|---|
| help | Show help |
| exit/quit/q | Exit shell |
| count=N | Set result count |
| tags=tag1,tag2 | Set tag filter |
| verbose | Toggle verbose output |
| <query> | Search for query |
Example session:
$ sw-search search knowledge.swsearch --shell
Search Shell - Index: knowledge.swsearch
Backend: sqlite
Index contains 1523 chunks from 47 files
Model: sentence-transformers/all-MiniLM-L6-v2
Type 'exit' or 'quit' to leave, 'help' for options
------------------------------------------------------------
search> how to create an agent
Found 5 result(s) for 'how to create an agent' (0.034s):
...
search> count=3
Result count set to: 3
search> SWAIG functions
Found 3 result(s) for 'SWAIG functions' (0.028s):
...
search> exit
Goodbye!
PostgreSQL/pgvector Backend
The search system supports multiple storage backends. Choose based on your deployment needs:
Backend Comparison
| Feature | SQLite | pgvector |
|---|---|---|
| Setup complexity | None | Requires PostgreSQL |
| Scalability | Limited | Excellent |
| Concurrent access | Poor | Excellent |
| Update capability | Rebuild required | Real-time |
| Performance (small datasets) | Excellent | Good |
| Performance (large datasets) | Poor | Excellent |
| Deployment | File copy | Database connection |
| Multi-agent support | Separate copies | Shared knowledge base |
SQLite Backend (Default):
- File-based .swsearch indexes
- Portable single-file format
- No external dependencies
- Best for: Single-agent deployments, development, small to medium datasets
pgvector Backend:
- Server-based PostgreSQL storage
- Efficient similarity search with IVFFlat/HNSW indexes
- Multiple agents can share the same knowledge base
- Real-time updates without rebuilding
- Best for: Production deployments, multi-agent systems, large datasets
Building with pgvector
## Build to pgvector
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection
## With markdown strategy
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection \
--chunking-strategy markdown
## Overwrite existing collection
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection \
--overwrite
Search pgvector Collection
sw-search search docs_collection "how to create an agent" \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost/knowledge"
Migration
Migrate indexes between backends:
## Get index information
sw-search migrate --info ./docs.swsearch
## Migrate SQLite to pgvector
sw-search migrate ./docs.swsearch --to-pgvector \
--connection-string "postgresql://user:pass@localhost/db" \
--collection-name docs_collection
## Migrate with overwrite
sw-search migrate ./docs.swsearch --to-pgvector \
--connection-string "postgresql://user:pass@localhost/db" \
--collection-name docs_collection \
--overwrite
Migration Options
| Option | Description |
|---|---|
| --info | Show index information |
| --to-pgvector | Migrate SQLite to pgvector |
| --to-sqlite | Migrate pgvector to SQLite (planned) |
| --connection-string | PostgreSQL connection string |
| --collection-name | Target collection name |
| --overwrite | Overwrite existing collection |
| --batch-size | Chunks per batch (default: 100) |
Local vs Remote Modes
The search skill supports both local and remote operation modes.
Local Mode (Default)
Searches are performed directly in the agent process using the embedded search engine.
Pros:
- Faster (no network latency)
- Works offline
- Simple deployment
- Lower operational complexity
Cons:
- Higher memory usage per agent
- Index files must be distributed with each agent
- Updates require redeploying agents
Configuration in Agent:
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch", # Local file
"nlp_backend": "nltk"
})
Remote Mode
Searches are performed via HTTP API to a centralized search server.
Pros:
- Lower memory usage per agent
- Centralized index management
- Easy updates without redeploying agents
- Better scalability for multiple agents
- Shared resources
Cons:
- Network dependency
- Additional infrastructure complexity
- Potential latency
Configuration in Agent:
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"remote_url": "http://localhost:8001", # Search server
"index_name": "docs",
"nlp_backend": "nltk"
})
Automatic Mode Detection
The skill automatically detects which mode to use:
- If remote_url is provided → Remote mode
- If index_file is provided → Local mode
- Remote mode takes priority if both are specified
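For instance, if both keys are supplied, the remote configuration wins (a sketch combining the two examples above):
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "remote_url": "http://localhost:8001",  # remote mode takes priority
    "index_name": "docs",
    "index_file": "docs.swsearch",  # ignored while remote_url is set
    "nlp_backend": "nltk"
})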
Running a Remote Search Server
1. Start the search server:
python examples/search_server_standalone.py
2. The server exposes an HTTP API:
- POST /search - Search the indexes
- GET /health - Health check and available indexes
- POST /reload_index - Add or reload an index
3. Test the API:
curl -X POST "http://localhost:8001/search" \
-H "Content-Type: application/json" \
-d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'
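The same request from Python with the requests library (a minimal sketch; the response is printed as-is, since its exact shape is defined by the server):
import requests

# Same query as the curl example above
resp = requests.post(
    "http://localhost:8001/search",
    json={"query": "how to create an agent", "index_name": "docs", "count": 3},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())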
Remote Search CLI
Search via remote API endpoint from the command line:
## Basic remote search
sw-search remote http://localhost:8001 "how to create an agent" \
--index-name docs
## With options
sw-search remote localhost:8001 "API reference" \
--index-name docs \
--count 3 \
--verbose
## JSON output
sw-search remote localhost:8001 "query" \
--index-name docs \
--json
Remote Options
| Option | Default | Description |
|---|---|---|
| --index-name | (required) | Name of the index to search |
| --count | 5 | Number of results |
| --distance-threshold | 0.0 | Minimum similarity score |
| --tags | (none) | Filter by tags |
| --timeout | 30 | Request timeout in seconds |
| --json | false | Output as JSON |
| --no-content | false | Hide content |
| --verbose | false | Detailed output |
Validation
Verify index integrity:
## Validate index
sw-search validate ./docs.swsearch
## Verbose validation
sw-search validate ./docs.swsearch --verbose
Output:
✓ Index is valid: ./docs.swsearch
Chunks: 1523
Files: 47
Configuration:
embedding_model: sentence-transformers/all-MiniLM-L6-v2
embedding_dimensions: 384
chunking_strategy: markdown
created_at: 2025-01-15T10:30:00
JSON Export
Export chunks for review or external processing:
## Export to single JSON file
sw-search ./docs \
--output-format json \
--output all_chunks.json
## Export to directory (one file per source)
sw-search ./docs \
--output-format json \
--output-dir ./chunks/
## Build index from exported JSON
sw-search ./chunks/ \
--chunking-strategy json \
--file-types json \
--output final.swsearch
NLP Backend Selection
Choose an NLP backend for processing:
| Backend | Speed | Quality | Install Size |
|---|---|---|---|
| nltk | Fast | Good | Included |
| spacy | Slower | Better | Requires: pip install signalwire-agents[search-nlp] |
## Index with NLTK (default)
sw-search ./docs --index-nlp-backend nltk
## Index with spaCy (better quality)
sw-search ./docs --index-nlp-backend spacy
## Query with NLTK
sw-search search index.swsearch "query" --query-nlp-backend nltk
## Query with spaCy
sw-search search index.swsearch "query" --query-nlp-backend spacy
Complete Configuration Example
sw-search ./docs ./examples README.md \
--output ./knowledge.swsearch \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--file-types md,txt,rst,py \
--exclude "**/test/**,**/__pycache__/**" \
--languages en,es,fr \
--model sentence-transformers/all-mpnet-base-v2 \
--tags documentation,api \
--index-nlp-backend nltk \
--validate \
--verbose
Using with Skills
After building an index, use it with the native_vector_search skill:
from signalwire_agents import AgentBase
agent = AgentBase(name="search-agent")
## Add search skill with built index
agent.add_skill("native_vector_search", {
"index_path": "./knowledge.swsearch",
"tool_name": "search_docs",
"tool_description": "Search the documentation"
})
Output Formats
| Format | Extension | Description |
|---|---|---|
| swsearch | .swsearch | SQLite-based portable index (default) |
| json | .json | JSON export of chunks |
| pgvector | (database) | PostgreSQL with pgvector extension |
Installation Requirements
The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:
Basic Search (~500MB)
pip install "signalwire-agents[search]"
Includes:
- Core search functionality
- Sentence transformers for embeddings
- SQLite FTS5 for keyword search
- Basic document processing (text, markdown)
Full Document Processing (~600MB)
pip install "signalwire-agents[search-full]"
Adds:
- PDF processing (PyPDF2)
- DOCX processing (python-docx)
- HTML processing (BeautifulSoup4)
- Additional file format support
Advanced NLP Features (~700MB)
pip install "signalwire-agents[search-nlp]"
Adds:
- spaCy for advanced text processing
- NLTK for linguistic analysis
- Enhanced query preprocessing
- Language detection
Additional Setup Required:
python -m spacy download en_core_web_sm
Performance Note: Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. Only recommended if you have sufficient CPU power and can tolerate longer response times.
All Search Features (~700MB)
pip install "signalwire-agents[search-all]"
Includes everything above.
Additional Setup Required:
python -m spacy download en_core_web_sm
Query-Only Mode (~400MB)
pip install "signalwire-agents[search-queryonly]"
For agents that only need to query pre-built indexes without building new ones.
PostgreSQL Vector Support
pip install "signalwire-agents[pgvector]"
Adds PostgreSQL with pgvector extension support for production deployments.
NLP Backend Selection
You can choose which NLP backend to use for query processing:
| Backend | Speed | Quality | Notes |
|---|---|---|---|
| nltk | Fast (~50-100ms) | Good | Default, good for most use cases |
| spacy | Slower (~150-300ms) | Better | Better POS tagging and entity recognition |
Configure via --index-nlp-backend (build) or --query-nlp-backend (search) flags.
API Reference
For programmatic access to the search system, use the Python API directly.
SearchEngine Class
from signalwire_agents.search import SearchEngine
## Load an index
engine = SearchEngine("docs.swsearch")
## Perform search
results = engine.search(
query_vector=[...], # Optional: pre-computed query vector
enhanced_text="search query", # Enhanced query text
count=5, # Number of results
similarity_threshold=0.0, # Minimum similarity score
tags=["documentation"] # Filter by tags
)
## Get index statistics
stats = engine.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total files: {stats['total_files']}")
IndexBuilder Class
from signalwire_agents.search import IndexBuilder
## Create index builder
builder = IndexBuilder(
model_name="sentence-transformers/all-mpnet-base-v2",
chunk_size=500,
chunk_overlap=50,
verbose=True
)
## Build index
builder.build_index(
source_dir="./docs",
output_file="docs.swsearch",
file_types=["md", "txt"],
exclude_patterns=["**/test/**"],
tags=["documentation"]
)
Troubleshooting
| Issue | Solution |
|---|---|
| Search not available | pip install "signalwire-agents[search]" |
| pgvector errors | pip install "signalwire-agents[pgvector]" |
| PDF processing fails | pip install "signalwire-agents[search-full]" |
| spaCy not found | pip install "signalwire-agents[search-nlp]" |
| No results found | Try a different chunking strategy |
| Poor search quality | Use --model base or larger chunks |
| Index too large | Use --model mini, reduce file types |
| Connection refused (remote) | Check that the search server is running |
Related Documentation
- native_vector_search Skill - Using search indexes in agents
- Skills Overview - Adding skills to agents
- DataSphere Integration - Cloud-based search alternative