Elasticsearch – DECODE X NET

While testing LLMs and RAG, I felt the need to learn document search from the basics, so I tried out Elasticsearch, a search engine.

Environment: Python, Docker / WSL2 / Windows 11

docker-compose.yml

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    networks:
      - elastic

volumes:
  es_data:
    driver: local

networks:
  elastic:
    driver: bridge

docker-compose up -d

I did the work with heavy use of Claude Code.
First, the basic operations.

#!/usr/bin/env python3
"""
Elasticsearch sample program
Indexes, searches, and deletes a few simple documents
"""

from elasticsearch import Elasticsearch
import time

def connect_to_elasticsearch():
    """Connect to Elasticsearch"""
    es = Elasticsearch(
        [{'host': 'localhost', 'port': 9200, 'scheme': 'http'}],
        request_timeout=30,  # the 8.x client deprecates the `timeout` keyword
        max_retries=10,
        retry_on_timeout=True
    )
    
    # Verify the connection
    try:
        # Web API: GET /
        # curl -X GET "localhost:9200/"
        info = es.info()
        print(f"✅ Elasticsearchに接続しました: {info['version']['number']}")
        return es
    except Exception as e:
        print(f"❌ Elasticsearchへの接続に失敗しました: {e}")
        return None

def create_sample_data():
    """Create sample data"""
    return [
        {
            "id": 1,
            "title": "Python入門",
            "content": "Pythonは簡単で強力なプログラミング言語です",
            "author": "田中太郎",
            "category": "プログラミング",
            "tags": ["python", "初心者", "プログラミング"]
        },
        {
            "id": 2,
            "title": "Elasticsearch入門",
            "content": "Elasticsearchは高速な全文検索エンジンです",
            "author": "佐藤花子",
            "category": "データベース",
            "tags": ["elasticsearch", "検索", "データベース"]
        },
        {
            "id": 3,
            "title": "Docker活用法",
            "content": "Dockerでアプリケーションを効率的にデプロイする方法",
            "author": "鈴木次郎",
            "category": "インフラ",
            "tags": ["docker", "デプロイ", "コンテナ"]
        },
        {
            "id": 4,
            "content": "一太郎と花子"
        }
    ]

def index_documents(es, index_name, documents):
    """Index the documents"""
    print(f"\n📝 ドキュメントを '{index_name}' インデックスに登録中...")
    
    for doc in documents:
        try:
            # Web API: PUT /{index}/_doc/{id}
            # curl -X PUT "localhost:9200/{index}/_doc/{id}" -H 'Content-Type: application/json' -d '{"field":"value"}'
            response = es.index(
                index=index_name,
                id=doc['id'],
                body=doc
            )
            title = doc.get('title', 'タイトルなし')
            print(f"   ✅ ドキュメント ID {doc['id']}: {title}")
        except Exception as e:
            print(f"   ❌ ドキュメント ID {doc['id']} の登録に失敗: {e}")
    
    # Refresh the index so the new documents become searchable
    # Web API: POST /{index}/_refresh
    # curl -X POST "localhost:9200/{index}/_refresh"
    es.indices.refresh(index=index_name)
    print(f"✅ 全ドキュメントの登録完了")

def search_documents(es, index_name, query):
    """Search documents"""
    print(f"\n🔍 検索クエリ: '{query}'")
    
    try:
        search_body = {
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": ["title^2", "content", "author", "tags"]
                }
            },
            "highlight": {
                "fields": {
                    "title": {},
                    "content": {}
                }
            }
        }
        
        # Web API: GET /{index}/_search
        # curl -X GET "localhost:9200/{index}/_search" -H 'Content-Type: application/json' -d '{"query":{...}}'
        response = es.search(
            index=index_name,
            body=search_body
        )
        
        hits = response['hits']['hits']
        total = response['hits']['total']['value']
        
        print(f"📊 検索結果: {total}件")
        
        for i, hit in enumerate(hits, 1):
            source = hit['_source']
            score = hit['_score']
            
            title = source.get('title', 'タイトルなし')
            author = source.get('author', '著者なし')
            category = source.get('category', 'カテゴリなし')
            content = source.get('content', '内容なし')
            
            print(f"\n{i}. [{score:.2f}] {title}")
            print(f"   著者: {author}")
            print(f"   カテゴリ: {category}")
            print(f"   内容: {content}")
            
            # Show highlights
            if 'highlight' in hit:
                for field, highlights in hit['highlight'].items():
                    print(f"   🔍 {field}: {highlights[0]}")
                    
        return hits
        
    except Exception as e:
        print(f"❌ 検索に失敗しました: {e}")
        return []

def get_all_documents(es, index_name):
    """Fetch all documents"""
    print(f"\n📋 '{index_name}' の全ドキュメント:")
    
    try:
        # Web API: GET /{index}/_search?size=100
        # curl -X GET "localhost:9200/{index}/_search?size=100" -H 'Content-Type: application/json' -d '{"query":{"match_all":{}}}'
        response = es.search(
            index=index_name,
            body={"query": {"match_all": {}}},
            size=100
        )
        
        hits = response['hits']['hits']
        total = response['hits']['total']['value']
        
        print(f"📊 総ドキュメント数: {total}件")
        
        for hit in hits:
            source = hit['_source']
            title = source.get('title', 'タイトルなし')
            author = source.get('author', '著者なし')
            print(f"   ID {source['id']}: {title} ({author})")
            
    except Exception as e:
        print(f"❌ ドキュメント取得に失敗しました: {e}")

def delete_index(es, index_name):
    """Delete the index"""
    try:
        # Web API: HEAD /{index}
        # curl -I "localhost:9200/{index}"
        if es.indices.exists(index=index_name):
            # Web API: DELETE /{index}
            # curl -X DELETE "localhost:9200/{index}"
            es.indices.delete(index=index_name)
            print(f"🗑️  インデックス '{index_name}' を削除しました")
        else:
            print(f"ℹ️  インデックス '{index_name}' は存在しません")
    except Exception as e:
        print(f"❌ インデックス削除に失敗しました: {e}")

def main():
    """Main routine"""
    print("🚀 Elasticsearch サンプルプログラム開始")
    print("=" * 50)
    
    # Connect to Elasticsearch
    es = connect_to_elasticsearch()
    if not es:
        return
    
    index_name = "sample_documents"
    
    # Delete any existing index (clean start)
    delete_index(es, index_name)
    
    # Prepare the sample data
    documents = create_sample_data()
    
    # Index the documents
    index_documents(es, index_name, documents)
    
    # Wait briefly (for the index refresh)
    time.sleep(1)
    
    # Show all documents
    get_all_documents(es, index_name)
    
    # Search test
    search_queries = [
        "花子"
    ]
    
    for query in search_queries:
        search_documents(es, index_name, query)
    
    print("\n" + "=" * 50)
    print("✅ サンプルプログラム完了")

if __name__ == "__main__":
    main()

Output

🚀 Elasticsearch サンプルプログラム開始
==================================================
✅ Elasticsearchに接続しました: 8.11.0
🗑️  インデックス 'sample_documents' を削除しました

📝 ドキュメントを 'sample_documents' インデックスに登録中...
   ✅ ドキュメント ID 1: Python入門
   ✅ ドキュメント ID 2: Elasticsearch入門
   ✅ ドキュメント ID 3: Docker活用法
   ✅ ドキュメント ID 4: タイトルなし
✅ 全ドキュメントの登録完了

📋 'sample_documents' の全ドキュメント:
📊 総ドキュメント数: 4件
   ID 1: Python入門 (田中太郎)
   ID 2: Elasticsearch入門 (佐藤花子)
   ID 3: Docker活用法 (鈴木次郎)
   ID 4: タイトルなし (著者なし)

🔍 検索クエリ: '花子'
📊 検索結果: 2件

1. [2.96] タイトルなし
   著者: 著者なし
   カテゴリ: カテゴリなし
   内容: 一太郎と花子
   🔍 content: 一太郎と<em>花</em><em>子</em>

2. [1.96] Elasticsearch入門
   著者: 佐藤花子
   カテゴリ: データベース
   内容: Elasticsearchは高速な全文検索エンジンです

==================================================
✅ サンプルプログラム完了
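
The highlight `一太郎と<em>花</em><em>子</em>` suggests that the default analyzer indexed each kanji as its own token, so the query `花子` is matched as the two terms `花` and `子`. The sketch below imitates that behaviour in plain Python to show why both document 4 (via `content`) and document 2 (via `author`) come back as hits. `rough_tokens` and `matched_terms` are hypothetical helpers written only for this illustration, and the regular expression is a rough approximation of the tokenizer, not its actual implementation.

```python
import re

# Rough stand-in for the tokenization seen in the highlight output:
# ASCII words and katakana runs stay together, while each kanji or
# hiragana character becomes its own token. Illustration only.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u30a0-\u30ff]+|[\u3040-\u309f\u4e00-\u9fff]")


def rough_tokens(text):
    """Split text into approximate index terms (lowercased)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def matched_terms(query, text):
    """Return the query terms that also occur in the text (OR semantics)."""
    doc_terms = set(rough_tokens(text))
    return [t for t in rough_tokens(query) if t in doc_terms]


print(rough_tokens("一太郎と花子"))          # ['一', '太', '郎', 'と', '花', '子']
print(matched_terms("花子", "一太郎と花子"))  # ['花', '子']  (document 4, content)
print(matched_terms("花子", "佐藤花子"))      # ['花', '子']  (document 2, author)
print(matched_terms("花子", "田中太郎"))      # []            (document 1, author)
```

Under this assumption, a name such as `花田` would also match on the shared character `花`, which is one reason Japanese text is usually indexed with a language-aware analyzer rather than the default one.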

I ran a variety of tests by repeatedly modifying the AI-generated code and giving it new instructions.

Next, I try a few more basic operations.

simple_search.py

#!/usr/bin/env python3
"""
Simple Elasticsearch Search Sample
Basic text search sample
"""

from elasticsearch import Elasticsearch
import time

def wait_for_elasticsearch(es_client, max_retries=30):
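    """Wait until Elasticsearch is up"""
    for i in range(max_retries):
        try:
            # ping() returns False instead of raising when the node is unreachable
            if es_client.ping():
                print("Elasticsearch接続成功")
                return True
        except Exception:
            pass
        # Sleep on every failed attempt, not only when an exception was raised
        print(f"Elasticsearch接続待機中... ({i+1}/{max_retries})")
        time.sleep(2)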
    return False

def setup_index(es_client, index_name):
    """Create and configure the index"""
    if es_client.indices.exists(index=index_name):
        es_client.indices.delete(index=index_name)
    
    mapping = {
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "standard"},
                "content": {"type": "text", "analyzer": "standard"},
                "category": {"type": "keyword"},
                "created_at": {"type": "date"}
            }
        }
    }
    
    es_client.indices.create(index=index_name, body=mapping)
    print(f"インデックス '{index_name}' を作成しました")

def insert_sample_data(es_client, index_name):
    """Insert sample data"""
    documents = [
        {
            "title": "Python入門",
            "content": "Pythonは初心者にも学びやすいプログラミング言語です。データ分析や機械学習にも使われます。",
            "category": "programming",
            "created_at": "2024-01-01"
        },
        {
            "title": "機械学習の基礎",
            "content": "機械学習はAIの一分野で、データからパターンを学習します。教師あり学習と教師なし学習があります。",
            "category": "ai",
            "created_at": "2024-01-02"
        },
        {
            "title": "Elasticsearchガイド",
            "content": "Elasticsearchは分散検索エンジンです。全文検索やデータ分析に優れています。",
            "category": "database",
            "created_at": "2024-01-03"
        },
        {
            "title": "Webアプリケーション開発",
            "content": "WebアプリケーションはHTTPプロトコルを使用してユーザーとやり取りします。フロントエンドとバックエンドで構成されます。",
            "category": "web",
            "created_at": "2024-01-04"
        },
        {
            "title": "データベース設計",
            "content": "データベース設計では正規化が重要です。適切な設計により性能とデータ整合性を保てます。",
            "category": "database",
            "created_at": "2024-01-05"
        }
    ]
    
    for i, doc in enumerate(documents):
        es_client.index(index=index_name, id=i+1, body=doc)
    
    es_client.indices.refresh(index=index_name)
    print(f"{len(documents)}件のドキュメントを挿入しました")

def simple_text_search(es_client, index_name, query_text):
    """Simple text search"""
    print(f"\n=== シンプル検索: '{query_text}' ===")
    
    query = {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": ["title", "content"],
                #"type": "best_fields"
                "type": "phrase"
            }
        }
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print(f"検索結果: {response['hits']['total']['value']}件")
    for hit in response['hits']['hits']:
        print(f"スコア: {hit['_score']:.2f}")
        print(f"タイトル: {hit['_source']['title']}")
        print(f"カテゴリ: {hit['_source']['category']}")
        print(f"内容: {hit['_source']['content'][:50]}...")
        print("-" * 50)

def filtered_search(es_client, index_name, query_text, category):
    """Search with a filter"""
    print(f"\n=== フィルタ付き検索: '{query_text}' (カテゴリ: {category}) ===")
    
    query = {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["title", "content"]
                    }
                },
                "filter": {
                    "term": {"category": category}
                }
            }
        }
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print(f"検索結果: {response['hits']['total']['value']}件")
    for hit in response['hits']['hits']:
        print(f"スコア: {hit['_score']:.2f}")
        print(f"タイトル: {hit['_source']['title']}")
        print(f"カテゴリ: {hit['_source']['category']}")
        print("-" * 50)

def range_search(es_client, index_name, start_date, end_date):
    """Date range search"""
    print(f"\n=== 日付範囲検索: {start_date} ～ {end_date} ===")
    
    query = {
        "query": {
            "range": {
                "created_at": {
                    "gte": start_date,
                    "lte": end_date
                }
            }
        },
        "sort": [
            {"created_at": {"order": "desc"}}
        ]
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print(f"検索結果: {response['hits']['total']['value']}件")
    for hit in response['hits']['hits']:
        print(f"タイトル: {hit['_source']['title']}")
        print(f"作成日: {hit['_source']['created_at']}")
        print("-" * 50)

def main():
    es_client = Elasticsearch([{'host': 'localhost', 'port': 9200, 'scheme': 'http'}])
    
    if not wait_for_elasticsearch(es_client):
        print("Elasticsearchに接続できませんでした")
        return
    
    index_name = "simple_search_demo"
    
    setup_index(es_client, index_name)
    insert_sample_data(es_client, index_name)
    
    simple_text_search(es_client, index_name, "性能")
    simple_text_search(es_client, index_name, "学習")
    simple_text_search(es_client, index_name, "教師なし")
    
    filtered_search(es_client, index_name, "データ", "database")
    
    range_search(es_client, index_name, "2024-01-01", "2024-01-03")

if __name__ == "__main__":
    main()

Output

Elasticsearch接続成功
インデックス 'simple_search_demo' を作成しました
5件のドキュメントを挿入しました

=== シンプル検索: '性能' ===
検索結果: 1件
スコア: 2.72
タイトル: データベース設計
カテゴリ: database
内容: データベース設計では正規化が重要です。適切な設計により性能とデータ整合性を保てます。...
--------------------------------------------------

=== シンプル検索: '学習' ===
検索結果: 2件
スコア: 2.84
タイトル: 機械学習の基礎
カテゴリ: ai
内容: 機械学習はAIの一分野で、データからパターンを学習します。教師あり学習と教師なし学習があります。...
--------------------------------------------------
スコア: 1.74
タイトル: Python入門
カテゴリ: programming
内容: Pythonは初心者にも学びやすいプログラミング言語です。データ分析や機械学習にも使われます。...
--------------------------------------------------

=== シンプル検索: '教師なし' ===
検索結果: 1件
スコア: 4.12
タイトル: 機械学習の基礎
カテゴリ: ai
内容: 機械学習はAIの一分野で、データからパターンを学習します。教師あり学習と教師なし学習があります。...
--------------------------------------------------

=== フィルタ付き検索: 'データ' (カテゴリ: database) ===
検索結果: 2件
スコア: 0.32
タイトル: Elasticsearchガイド
カテゴリ: database
--------------------------------------------------
スコア: 0.28
タイトル: データベース設計
カテゴリ: database
--------------------------------------------------

=== 日付範囲検索: 2024-01-01 ～ 2024-01-03 ===
検索結果: 3件
タイトル: Elasticsearchガイド
作成日: 2024-01-03
--------------------------------------------------
タイトル: 機械学習の基礎
作成日: 2024-01-02
--------------------------------------------------
タイトル: Python入門
作成日: 2024-01-01
--------------------------------------------------
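
The single hit for `教師なし` comes from `"type": "phrase"` in `simple_text_search`: the analyzed query terms have to appear next to each other and in the same order. With the commented-out `best_fields` type, the terms are combined with OR by default, so any document containing just one of the characters would also match. The following sketch compares the two behaviours on `content` values copied from the sample data. The tokenizer is a rough approximation of the `standard` analyzer, and `rough_tokens`, `contains_phrase`, and `contains_any` are hypothetical helpers for illustration, not Elasticsearch APIs.

```python
import re

# Approximate tokenization: ASCII words and katakana runs stay together,
# each kanji or hiragana character becomes its own term. Illustration only.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u30a0-\u30ff]+|[\u3040-\u309f\u4e00-\u9fff]")


def rough_tokens(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]


def contains_phrase(query, text):
    """True if the query terms occur consecutively and in order."""
    q, d = rough_tokens(query), rough_tokens(text)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


def contains_any(query, text):
    """True if at least one query term occurs (OR semantics)."""
    doc_terms = set(rough_tokens(text))
    return any(t in doc_terms for t in rough_tokens(query))


docs = {
    "Python入門": "Pythonは初心者にも学びやすいプログラミング言語です。データ分析や機械学習にも使われます。",
    "機械学習の基礎": "機械学習はAIの一分野で、データからパターンを学習します。教師あり学習と教師なし学習があります。",
    "Webアプリケーション開発": "WebアプリケーションはHTTPプロトコルを使用してユーザーとやり取りします。フロントエンドとバックエンドで構成されます。",
    "データベース設計": "データベース設計では正規化が重要です。適切な設計により性能とデータ整合性を保てます。",
}

phrase_hits = [title for title, text in docs.items() if contains_phrase("教師なし", text)]
any_hits = [title for title, text in docs.items() if contains_any("教師なし", text)]

print(phrase_hits)  # ['機械学習の基礎']
print(any_hits)     # ['機械学習の基礎', 'Webアプリケーション開発', 'データベース設計']
```

In this approximation the OR variant also picks up documents that merely contain `し` or `な`, which is why the phrase type gives cleaner results for unsegmented Japanese text here.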

Next is a sample that searches for similar text.
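
The script in this part maps its vector fields as `dense_vector` with `"similarity": "cosine"`. As far as I understand the Elasticsearch documentation, kNN search on such a field reports `_score` as `(1 + cosine) / 2`, so scores fall between 0 and 1 rather than between -1 and 1. The sketch below works through that mapping in plain Python. `cosine` and `knn_score` are hypothetical helper names, and the vectors are made-up toy values.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score(a, b):
    """Assumed mapping from cosine similarity to the reported score."""
    return (1 + cosine(a, b)) / 2


print(knn_score([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(knn_score([1.0, 0.0], [0.0, 1.0]))   # 0.5  (orthogonal)
print(knn_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0  (opposite direction)
```

With random mock embeddings like the ones generated in the script, most pairs are close to orthogonal, so scores near 0.5 are to be expected.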

similarity_search.py

#!/usr/bin/env python3
"""
Similarity Search Sample
Similarity search (vector search) sample
"""

from elasticsearch import Elasticsearch
import numpy as np
import time

def wait_for_elasticsearch(es_client, max_retries=30):
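    """Wait until Elasticsearch is up"""
    for i in range(max_retries):
        try:
            # ping() returns False instead of raising when the node is unreachable
            if es_client.ping():
                print("Elasticsearch接続成功")
                return True
        except Exception:
            pass
        # Sleep on every failed attempt, not only when an exception was raised
        print(f"Elasticsearch接続待機中... ({i+1}/{max_retries})")
        time.sleep(2)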
    return False

def setup_vector_index(es_client, index_name):
    """Index settings for vector search"""
    if es_client.indices.exists(index=index_name):
        es_client.indices.delete(index=index_name)
    
    settings = {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": 0
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "standard"
                },
                "content": {
                    "type": "text",
                    "analyzer": "standard"
                },
                "category": {
                    "type": "keyword"
                },
                "embedding": {
                    "type": "dense_vector",
                    "dims": 128,
                    "index": True,
                    "similarity": "cosine"
                },
                "tfidf_vector": {
                    "type": "dense_vector",
                    "dims": 50,
                    "index": True,
                    "similarity": "cosine"
                }
            }
        }
    }
    
    es_client.indices.create(index=index_name, body=settings)
    print(f"ベクトル検索用インデックス '{index_name}' を作成しました")

def generate_mock_embedding(text, dims=128):
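    """Generate a mock text-embedding vector"""
    # Note: str hash() is salted per process unless PYTHONHASHSEED is fixed,
    # so these mock vectors change from run to run.
    np.random.seed(hash(text) % 2**32)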
    vector = np.random.normal(0, 1, dims)
    return (vector / np.linalg.norm(vector)).tolist()

def generate_tfidf_vector(text, vocabulary, dims=50):
    """Generate a mock TF-IDF-style vector"""
    # split() only breaks on whitespace, so unspaced Japanese runs count as single "words"
    words = text.lower().split()
    vector = np.zeros(dims)
    
    for word in words:
        if word in vocabulary:
            idx = vocabulary[word] % dims
            vector[idx] += 1
    
    # Set a minimal value to avoid a zero vector
    if np.linalg.norm(vector) == 0:
        vector[0] = 0.1
    
    # Normalize
    vector = vector / np.linalg.norm(vector)
    
    return vector.tolist()

def insert_vector_data(es_client, index_name):
    """Insert sample data with vectors"""
    documents = [
        {
            "title": "Python プログラミング入門",
            "content": "Python は初心者に優しいプログラミング言語です。シンプルな構文と豊富なライブラリが特徴で、データサイエンスやWeb開発に広く使われています。",
            "category": "programming"
        },
        {
            "title": "Java オブジェクト指向プログラミング",
            "content": "Java は強い型付けのオブジェクト指向プログラミング言語です。エンタープライズアプリケーションの開発に多く使用されています。",
            "category": "programming"
        },
        {
            "title": "機械学習の基礎理論",
            "content": "機械学習は人工知能の中核技術です。教師あり学習、教師なし学習、強化学習などの手法があり、データからパターンを学習します。",
            "category": "ai"
        },
        {
            "title": "深層学習とニューラルネットワーク",
            "content": "深層学習は多層のニューラルネットワークを使った機械学習手法です。画像認識や自然言語処理で優れた性能を発揮します。",
            "category": "ai"
        },
        {
            "title": "データベース管理システム",
            "content": "RDBMSはリレーショナルデータベース管理システムです。SQL言語を使ってデータの操作や管理を行います。",
            "category": "database"
        },
        {
            "title": "NoSQL データベース",
            "content": "NoSQLデータベースは従来のRDBMSとは異なる設計思想を持ちます。ドキュメント型、グラフ型、キーバリュー型などがあります。",
            "category": "database"
        },
        {
            "title": "Web フロントエンド開発",
            "content": "HTML、CSS、JavaScriptを使ったWebフロントエンド開発。ユーザーインターフェースの構築とユーザー体験の向上が重要です。",
            "category": "web"
        },
        {
            "title": "REST API の設計",
            "content": "RESTfulなAPIの設計原則。HTTPメソッドを適切に使い、リソース指向の設計を行うことが重要です。",
            "category": "web"
        }
    ]
    
    vocabulary = {}
    for doc in documents:
        words = (doc['title'] + ' ' + doc['content']).lower().split()
        for word in words:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    
    for i, doc in enumerate(documents):
        full_text = doc['title'] + ' ' + doc['content']
        doc['embedding'] = generate_mock_embedding(full_text)
        doc['tfidf_vector'] = generate_tfidf_vector(full_text, vocabulary)
        
        es_client.index(index=index_name, id=i+1, body=doc)
    
    es_client.indices.refresh(index=index_name)
    print(f"{len(documents)}件のベクトル付きドキュメントを挿入しました")
    return vocabulary

def cosine_similarity_search(es_client, index_name, query_text, k=3):
    """Cosine similarity search: kNN over the `embedding` field."""
    print(f"\n=== コサイン類似度検索: '{query_text}' (上位{k}件) ===")
    
    query_vector = generate_mock_embedding(query_text)
    
    query = {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100
        },
        "_source": ["title", "content", "category"]
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print(f"類似文書: {len(response['hits']['hits'])}件")
    for hit in response['hits']['hits']:
        print(f"類似度スコア: {hit['_score']:.4f}")
        print(f"タイトル: {hit['_source']['title']}")
        print(f"カテゴリ: {hit['_source']['category']}")
        print(f"内容: {hit['_source']['content'][:80]}...")
        print("-" * 60)

def hybrid_search(es_client, index_name, query_text, k=3):
    """Hybrid search: kNN vector search combined with a full-text multi_match query."""
    print(f"\n=== ハイブリッド検索: '{query_text}' ===")
    
    query_vector = generate_mock_embedding(query_text)
    
    query = {
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": query_text,
                            "fields": ["title^2", "content"],
                            "type": "best_fields"
                        }
                    }
                ]
            }
        },
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 50,
            "boost": 0.5
        },
        "_source": ["title", "content", "category"]
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print(f"検索結果: {len(response['hits']['hits'])}件")
    for hit in response['hits']['hits']:
        print(f"総合スコア: {hit['_score']:.4f}")
        print(f"タイトル: {hit['_source']['title']}")
        print(f"カテゴリ: {hit['_source']['category']}")
        print("-" * 60)

def filtered_vector_search(es_client, index_name, query_text, category_filter, k=3):
    """Vector search restricted by a category filter."""
    print(f"\n=== フィルタ付きベクトル検索: '{query_text}' (カテゴリ: {category_filter}) ===")
    
    query_vector = generate_mock_embedding(query_text)
    
    query = {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,
            "filter": {
                "term": {
                    "category": category_filter
                }
            }
        },
        "_source": ["title", "content", "category"]
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print(f"フィルタリング後の類似文書: {len(response['hits']['hits'])}件")
    for hit in response['hits']['hits']:
        print(f"類似度スコア: {hit['_score']:.4f}")
        print(f"タイトル: {hit['_source']['title']}")
        print(f"カテゴリ: {hit['_source']['category']}")
        print("-" * 60)

def tfidf_similarity_search(es_client, index_name, query_text, vocabulary, k=3):
    """Similarity search over the TF-IDF-style vectors."""
    print(f"\n=== TF-IDF類似度検索: '{query_text}' ===")
    
    query_vector = generate_tfidf_vector(query_text, vocabulary)
    
    query = {
        "knn": {
            "field": "tfidf_vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100
        },
        "_source": ["title", "content", "category"]
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print(f"TF-IDF類似文書: {len(response['hits']['hits'])}件")
    for hit in response['hits']['hits']:
        print(f"類似度スコア: {hit['_score']:.4f}")
        print(f"タイトル: {hit['_source']['title']}")
        print(f"カテゴリ: {hit['_source']['category']}")
        print("-" * 60)

def multi_vector_search(es_client, index_name, query_text, vocabulary):
    """Compare searches that use the two different vector fields."""
    print(f"\n=== 複数ベクトル検索比較: '{query_text}' ===")
    
    semantic_vector = generate_mock_embedding(query_text)
    tfidf_vector = generate_tfidf_vector(query_text, vocabulary)
    
    # Search with the (mock) semantic vector
    print("セマンティックベクトル検索:")
    semantic_query = {
        "knn": {
            "field": "embedding",
            "query_vector": semantic_vector,
            "k": 3,
            "num_candidates": 50
        },
        "_source": ["title", "content", "category"]
    }
    
    semantic_response = es_client.search(index=index_name, body=semantic_query)
    for hit in semantic_response['hits']['hits']:
        print(f"  スコア: {hit['_score']:.4f} - {hit['_source']['title']}")
    
    # Search with the TF-IDF-style vector
    print("\nTF-IDFベクトル検索:")
    tfidf_query = {
        "knn": {
            "field": "tfidf_vector",
            "query_vector": tfidf_vector,
            "k": 3,
            "num_candidates": 50
        },
        "_source": ["title", "content", "category"]
    }
    
    tfidf_response = es_client.search(index=index_name, body=tfidf_query)
    for hit in tfidf_response['hits']['hits']:
        print(f"  スコア: {hit['_score']:.4f} - {hit['_source']['title']}")
    
    print("-" * 60)

def similarity_aggregation(es_client, index_name):
    """Basic aggregation: document count per category, plus one sample document for each."""
    print("\n=== 基本集約分析 ===")
    
    query = {
        "size": 0,
        "aggs": {
            "category_count": {
                "terms": {
                    "field": "category",
                    "size": 10
                }
            }
        }
    }
    
    response = es_client.search(index=index_name, body=query)
    
    print("カテゴリ別の文書数:")
    for bucket in response['aggregations']['category_count']['buckets']:
        print(f"  {bucket['key']}: {bucket['doc_count']}件")
    
    # Show one sample document per category (the first hit of a term query)
    print("\n各カテゴリの代表文書:")
    for bucket in response['aggregations']['category_count']['buckets']:
        category = bucket['key']
        
        category_query = {
            "query": {
                "term": {"category": category}
            },
            "size": 1,
            "_source": ["title", "category"]
        }
        
        category_response = es_client.search(index=index_name, body=category_query)
        if category_response['hits']['hits']:
            doc = category_response['hits']['hits'][0]['_source']
            print(f"  {category}: {doc['title']}")

def main():
    es_client = Elasticsearch([{'host': 'localhost', 'port': 9200, 'scheme': 'http'}])
    
    if not wait_for_elasticsearch(es_client):
        print("Elasticsearchに接続できませんでした")
        return
    
    index_name = "similarity_search_demo"
    
    setup_vector_index(es_client, index_name)
    vocabulary = insert_vector_data(es_client, index_name)
    
    cosine_similarity_search(es_client, index_name, "プログラミング 学習")
    cosine_similarity_search(es_client, index_name, "アルゴリズム")
    cosine_similarity_search(es_client, index_name, "データ設計")
    
    hybrid_search(es_client, index_name, "Web開発")
    
    filtered_vector_search(es_client, index_name, "オブジェクト", "programming")
    
    tfidf_similarity_search(es_client, index_name, "データベース システム", vocabulary)
    
    multi_vector_search(es_client, index_name, "人工知能 技術", vocabulary)
    
    similarity_aggregation(es_client, index_name)

if __name__ == "__main__":
    main()
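
One detail of the vocabulary built in `insert_vector_data` is easy to miss: `str.split()` only splits on whitespace, and Japanese text has almost none, so each "word" is usually a long run of characters. The following is a minimal sketch of that effect, with a character-bigram tokenizer shown as one possible alternative. The helper names `whitespace_tokens` and `char_bigrams` are hypothetical and are not part of the script above.

```python
def whitespace_tokens(text):
    """Tokenize the way the script does: lowercase, then split on whitespace."""
    return text.lower().split()


def char_bigrams(text):
    """Hypothetical alternative: overlapping 2-character chunks, whitespace removed."""
    compact = "".join(text.lower().split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


title = "データベース管理システム"
query = "データベース システム"

title_tokens = whitespace_tokens(title)
query_tokens = whitespace_tokens(query)
print(title_tokens)   # the whole title is a single token
print(query_tokens)   # the query splits into two tokens

# No whitespace token is shared, so this title cannot match the query.
print(set(title_tokens) & set(query_tokens))

# Bigrams overlap wherever the two strings share consecutive characters.
shared = set(char_bigrams(title)) & set(char_bigrams(query))
print(sorted(shared))
```

With whitespace tokens, a query for "データベース システム" shares nothing with the title "データベース管理システム", which would explain why that document is missing from the top results of the TF-IDF search. A morphological analyzer would be the more usual choice for Japanese; bigrams are used here only because they need no extra packages.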


Output

Elasticsearch接続成功
ベクトル検索用インデックス 'similarity_search_demo' を作成しました
8件のベクトル付きドキュメントを挿入しました

=== コサイン類似度検索: 'プログラミング 学習' (上位3件) ===
類似文書: 3件
類似度スコア: 0.6140
タイトル: データベース管理システム
カテゴリ: database
内容: RDBMSはリレーショナルデータベース管理システムです。SQL言語を使ってデータの操作や管理を行います。...
------------------------------------------------------------
類似度スコア: 0.5438
タイトル: 深層学習とニューラルネットワーク
カテゴリ: ai
内容: 深層学習は多層のニューラルネットワークを使った機械学習手法です。画像認識や自然言語処理で優れた性能を発揮します。...
------------------------------------------------------------
類似度スコア: 0.5413
タイトル: Python プログラミング入門
カテゴリ: programming
内容: Python は初心者に優しいプログラミング言語です。シンプルな構文と豊富なライブラリが特徴で、データサイエンスやWeb開発に広く使われています。...
------------------------------------------------------------

=== コサイン類似度検索: 'アルゴリズム' (上位3件) ===
類似文書: 3件
類似度スコア: 0.5529
タイトル: Java オブジェクト指向プログラミング
カテゴリ: programming
内容: Java は強い型付けのオブジェクト指向プログラミング言語です。エンタープライズアプリケーションの開発に多く使用されています。...
------------------------------------------------------------
類似度スコア: 0.5487
タイトル: 機械学習の基礎理論
カテゴリ: ai
内容: 機械学習は人工知能の中核技術です。教師あり学習、教師なし学習、強化学習などの手法があり、データからパターンを学習します。...
------------------------------------------------------------
類似度スコア: 0.5437
タイトル: Python プログラミング入門
カテゴリ: programming
内容: Python は初心者に優しいプログラミング言語です。シンプルな構文と豊富なライブラリが特徴で、データサイエンスやWeb開発に広く使われています。...
------------------------------------------------------------

=== コサイン類似度検索: 'データ設計' (上位3件) ===
類似文書: 3件
類似度スコア: 0.6146
タイトル: 深層学習とニューラルネットワーク
カテゴリ: ai
内容: 深層学習は多層のニューラルネットワークを使った機械学習手法です。画像認識や自然言語処理で優れた性能を発揮します。...
------------------------------------------------------------
類似度スコア: 0.5394
タイトル: Python プログラミング入門
カテゴリ: programming
内容: Python は初心者に優しいプログラミング言語です。シンプルな構文と豊富なライブラリが特徴で、データサイエンスやWeb開発に広く使われています。...
------------------------------------------------------------
類似度スコア: 0.5282
タイトル: NoSQL データベース
カテゴリ: database
内容: NoSQLデータベースは従来のRDBMSとは異なる設計思想を持ちます。ドキュメント型、グラフ型、キーバリュー型などがあります。...
------------------------------------------------------------

=== ハイブリッド検索: 'Web開発' ===
検索結果: 5件
総合スコア: 11.6025
タイトル: Web フロントエンド開発
カテゴリ: web
------------------------------------------------------------
総合スコア: 2.7280
タイトル: Python プログラミング入門
カテゴリ: programming
------------------------------------------------------------
総合スコア: 1.9928
タイトル: Java オブジェクト指向プログラミング
カテゴリ: programming
------------------------------------------------------------
総合スコア: 0.9029
タイトル: 深層学習とニューラルネットワーク
カテゴリ: ai
------------------------------------------------------------
総合スコア: 0.2708
タイトル: NoSQL データベース
カテゴリ: database
------------------------------------------------------------

=== フィルタ付きベクトル検索: 'オブジェクト' (カテゴリ: programming) ===
フィルタリング後の類似文書: 2件
類似度スコア: 0.4790
タイトル: Java オブジェクト指向プログラミング
カテゴリ: programming
------------------------------------------------------------
類似度スコア: 0.4639
タイトル: Python プログラミング入門
カテゴリ: programming
------------------------------------------------------------

=== TF-IDF類似度検索: 'データベース システム' ===
TF-IDF類似文書: 3件
類似度スコア: 0.7887
タイトル: NoSQL データベース
カテゴリ: database
------------------------------------------------------------
類似度スコア: 0.5000
タイトル: Python プログラミング入門
カテゴリ: programming
------------------------------------------------------------
類似度スコア: 0.5000
タイトル: Java オブジェクト指向プログラミング
カテゴリ: programming
------------------------------------------------------------

=== 複数ベクトル検索比較: '人工知能 技術' ===
セマンティックベクトル検索:
  スコア: 0.5625 - Web フロントエンド開発
  スコア: 0.5287 - NoSQL データベース
  スコア: 0.5149 - 深層学習とニューラルネットワーク

TF-IDFベクトル検索:
  スコア: 0.9082 - Python プログラミング入門
  スコア: 0.5000 - Java オブジェクト指向プログラミング
  スコア: 0.5000 - 機械学習の基礎理論
------------------------------------------------------------

=== 基本集約分析 ===
カテゴリ別の文書数:
  ai: 2件
  database: 2件
  programming: 2件
  web: 2件

各カテゴリの代表文書:
  ai: 機械学習の基礎理論
  database: データベース管理システム
  programming: Python プログラミング入門
  web: Web フロントエンド開発
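
The kNN `_score` values above are not raw cosine similarities. For a `dense_vector` field with `"similarity": "cosine"`, Elasticsearch reports the score as `(1 + cosine) / 2`, so unrelated vectors land near 0.5. As a rough check, the sketch below reproduces two of the TF-IDF scores by hand. It assumes the whitespace-based vocabulary built in `insert_vector_data` and that no two vocabulary words share a vector slot. `knn_score_from_cosine` is a hypothetical helper name.

```python
import math


def knn_score_from_cosine(cosine):
    """Map a cosine similarity in [-1, 1] to the kNN _score range [0, 1]."""
    return (1.0 + cosine) / 2.0


# Query 'データベース システム': only 'データベース' is in the vocabulary, so the
# query vector has a single non-zero slot. The 'NoSQL データベース' document has
# three whitespace tokens with count 1 each, one of them in that same slot.
cos_nosql = 1.0 / math.sqrt(3.0)
print(f"{knn_score_from_cosine(cos_nosql):.4f}")   # 0.7887

# Query '人工知能 技術' matches no vocabulary word, so the zero-vector fallback
# sets slot 0. Slot 0 belongs to 'python', which occurs twice in the Python
# document next to two other tokens: cosine = 2 / sqrt(2**2 + 1 + 1).
cos_python = 2.0 / math.sqrt(6.0)
print(f"{knn_score_from_cosine(cos_python):.4f}")  # 0.9082

# Vectors with no shared slot are orthogonal: cosine 0.
print(f"{knn_score_from_cosine(0.0):.4f}")         # 0.5000
```

Read this way, the 0.9082 for "Python プログラミング入門" on the "人工知能 技術" query looks like a side effect of the zero-vector fallback rather than a real match, and the 0.5000 entries are documents that share nothing with the query.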


The results looked questionable to me, so I asked about them.

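● Here is why Python ranks high for the "オブジェクト" query:

  Similarity calculation in the vector search (lines 220-248)

  1. Pseudo embedding vector generation (lines 67-71):
  np.random.seed(hash(text) % 2**32)
  vector = np.random.normal(0, 1, dims)
  return (vector / np.linalg.norm(vector)).tolist()
  2. Vectorizing each document:
    - Python document: "Python プログラミング入門 Python は初心者に優しい..."
    - Java document: "Java オブジェクト指向プログラミング Java は強い型付けのオブジェクト指向..."
  3. Why Python ranks higher:
    - The vectors for the "オブジェクト" query and for each document are generated from hash-based seeds
    - The cosine similarity comes out as 0.5109 vs 0.4661
    - The Java document actually contains "オブジェクト", but the pseudo-vector calculation happened to give Python the higher similarity
  4. With real embedding vectors:
    - The Java document, which contains "オブジェクト", should get the higher score
    - Right now generate_mock_embedding only produces pseudo vectors, so the result is random

  True semantic similarity search needs an actual language model, such as:
  - OpenAI Embeddings
  - Sentence-BERT
  - Japanese BERT embeddings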
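
To make items 3 and 4 concrete without downloading a model, here is a minimal sketch that swaps the random mock embedding for character-bigram count vectors. This is a purely lexical stand-in, not a language model and not what the script above does. It only shows that once the vector depends on the text content, the document that contains the query string wins. `bigram_counts` and `cosine` are hypothetical helper names, and the two document strings are shortened copies of the sample data.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Count overlapping 2-character chunks (lowercased, whitespace removed)."""
    compact = "".join(text.lower().split())
    return Counter(compact[i:i + 2] for i in range(len(compact) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = {
    "Python": "Python プログラミング入門 Python は初心者に優しいプログラミング言語です。",
    "Java": "Java オブジェクト指向プログラミング Java は強い型付けのオブジェクト指向プログラミング言語です。",
}
query = bigram_counts("オブジェクト")
scores = {name: cosine(query, bigram_counts(text)) for name, text in docs.items()}

# Print the documents from most to least similar.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

The Python text shares no bigram with "オブジェクト", so its similarity is exactly zero and the Java document ranks first. A real embedding model would additionally capture related meanings that share no characters, which bigram counts cannot do.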


Note: the vectors are random, so the results change from run to run.
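
One likely reason the numbers differ between runs is that Python salts `hash()` for strings per process unless `PYTHONHASHSEED` is fixed, so `hash(text) % 2**32` yields a different seed every time the script starts. Below is a minimal sketch of a reproducible variant, assuming a SHA-256 digest is an acceptable seed source. `stable_mock_embedding` is a hypothetical name, and the sketch uses the standard library `random` module instead of NumPy so that it runs without extra packages. The vectors are still random and carry no meaning.

```python
import hashlib
import math
import random


def stable_mock_embedding(text, dims=128):
    """Pseudo-embedding whose seed depends only on the text, not on the process."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")  # 32-bit seed from the first 4 bytes
    rng = random.Random(seed)
    vector = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]  # unit length, as in the original function


first = stable_mock_embedding("オブジェクト")
second = stable_mock_embedding("オブジェクト")
print(first == second)  # True
print(len(first))       # 128
```

This only makes the demo repeatable. The ranking it produces is still arbitrary.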

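I am still working out what this really means, deepening my understanding as I go.
Generative AI also keeps suggesting further things to try, such as an approach based on an inverted index, or ways to do this in Python alone without Elasticsearch, which is very instructive. I wanted this for my own reference, so I have posted the above to the blog even though it is still a work in progress.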
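
As an illustration of the inverted-index idea in plain Python, without Elasticsearch, here is a minimal sketch. It is only a toy: the function names `build_inverted_index` and `search_all_terms` are hypothetical, the three sample strings are pre-spaced by hand so that whitespace splitting yields usable terms, and there is no ranking, only an AND match.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each whitespace-separated term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents that contain every term of the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample data, spaced by hand for this sketch.
docs = {
    1: "Python プログラミング 入門",
    2: "Java オブジェクト指向 プログラミング",
    3: "NoSQL データベース",
}
index = build_inverted_index(docs)
print(sorted(search_all_terms(index, "プログラミング")))       # [1, 2]
print(sorted(search_all_terms(index, "java プログラミング")))  # [2]
```

The lookup cost depends on the number of query terms and the size of their posting sets rather than on scanning every document, which is the basic reason search engines keep this structure.
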
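AI teaches me all sorts of things at a tremendous pace, so the amount a human has to learn keeps growing. In the end, I have the feeling that what gets built cannot go beyond what the engineer understands.
(If it does, checking everything afterwards is hard.)
I plan to organize this and write it up on the blog bit by bit.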