GPTCache

Semantic Caching

LLM에 질의를 할 때, 같은 질문에 대해서는 Cache로 빠르게 응답하고 싶었다. 정확히 동일한 문자열에 대해서 Cache를 이용하면, cache hit rate이 상당히 떨어져서 비효율적인 것이 자명했다. What is the happiest memory? 문자열에 대해서 Cache를 하고 있다라고 해도, What is happiest memory?라고 the 전치사를 빼먹으면 의미는 동일하지만 Cache hit을 하지 않을 것이다. 그래서 chatGPT나 Llama 2 같은 LLM을 사용할 때, 어떻게 cache를 할 수 있을지 궁금해졌다. 찾아보니 오픈소스 프로젝트인 GPTCache가 있었고, 아래처럼 설명하고 있다.

However, using an exact match approach for LLM caches is less effective due to the complexity and variability of LLM queries, resulting in a low cache hit rate. To address this issue, GPTCache adopt alternative strategies like semantic caching. Semantic caching identifies and stores similar or related queries, thereby increasing cache hit probability and enhancing overall caching efficiency.

Semantic caching이라는 용어가 나오는데, 의미적으로 같은 것을 cache할 수 있다는 것을 의미한다. 어떻게 Semantic caching이 되는지는 아래와 같이 설명되어 있다.

GPTCache employs embedding algorithms to convert queries into embeddings and uses a vector store for similarity search on these embeddings.

LLM에게 질문을 하면(Query) 그 질문을 컴퓨터가 이해할 수 있도록 vector 형식의 데이터로 만들고(Embeddings), 그것을 저장소에 저장한다. 이제 다른 질문을 했을 때 이미 유사한 질문을 했는지를 판단하게 되는데, 이것은 유사한 Embedding이 있는지 확인하는 작업이 진행된다. 최종적으로 유사한 Embedding 있다면 cache된 데이터를 사용하고, 없다면 LLM에서 새로운 답변을 생성하게 된다.

Testing with Langchain

GPTCache는 Langchain을 지원하고, 문서에서 쉽게 GPTCache를 연동하는 방법을 설명하고 있다. 그래서 Langchain으로 사용하여 테스트를 하게 되었다. Langchain 공식 문서의 Quickstart를 따라서 진행하였고, Mac에서 Ollama를 설치하여 로컬에서 llama2 모델을 사용하였다. langserve 라이브러리를 통해서 FastAPI framework로 API server를 만들 수 있었다.

필요한 모듈들을 virtualenv에 설치를 한다.

pyenv virtualenv 3.9 langchain
pyenv activate langchain
pip install langchain
pip install gptcache
pip install "langserve[server]"

그리고 아래처럼 코드를 작성하여 실행을 한다.

main.py

import hashlib
from fastapi import FastAPI
from langserve import add_routes
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache
from langchain_community.llms import Ollama
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.globals import set_llm_cache

app = FastAPI(
    title="LangChain Server",
    version="1.0",
    description="A simple api server using Langchain's Runnable interfaces",
)

def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")

set_llm_cache(GPTCache(init_gptcache))

prompt = ChatPromptTemplate.from_messages([
  ("system", "You are world class technical documentation writer."),
  ("user", "{input}")
])
llm = Ollama(model="llama2")
output_parser = StrOutputParser()
chain = prompt | llm | output_parser

add_routes(
    app,
    chain,
    path="/myllm",
)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="localhost", port=8000)

이제 localhost에 실행중인 API 서버에 curl 질의를 해본다. 처음에는 시간이 좀 걸리지만, 두번 째 요청부터는 따르게 응답이 오는 것을 확인할 수 있다.

curl --location --request POST 'http://localhost:8000/myllm/invoke' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "input": {
            "input": "what is the best way to learn new languages?"
        }
    }'

그리고 이제 아래와 같은 query 문장을 해도 Cache 값으로 바로 응답하는 것도 확인할 수 있다.

what is the best way to learn new languages?
what is best way to learn new languages?
what is best way to run new languages?
what is best way to learn languages?

그런데 다른 답변을 기대할 만한 질문에서도 동일한 Cache 값으로 응답한다. False positive 결과도 쉽게 확인이 되었다.

what is the best way to learn Korean?
Where is the best place to study?
How do you study math?

이제 완전 다른 질문을 하면 LLM으로부터 새롭게 답변을 생성하고, 그다음에 유사한 질문에는 Cache 값을 다시 사용하는 것을 확인할 수 있었다.

what is embeddings?
please explain embeddings?

False Positive

위에서 How do you study math?를 했을 때도 what is the best way to learn new languages?의 대답 Cache 값을 전달하였다. How do you study math?는 의미적으로 다르기 때문에 새로 LLM에서 응답을 가져오길 기대했는데, Cache 값을 가져온 것이다. 🧐 그래서 어떻게 Cache가 되는 조건을 조절 할 수 있을지 더 자세히 알아보게 되었다.

init_similar_cache의 default 설정

GPTCache 문서에서 init_similar_cache method의 인자 값을 어떻게 줄 수 있는지 설명이 되어 있다. 이 문서에서 default로 onnx+sqlite+faiss라고 설명되어 있다.

The init_similar_cache method in the api package defaults to similar matching of onnx+sqlite+faiss

GPTCache source코드에서 해당 method를 확인해보았다. 그러니 각 구성요소가 아래처럼 default로 설정되어 있었다.

embedding: Onnx
data_manage: sqlite,faiss
evaluation: SearchDistanceEvaluation

def init_similar_cache(
    data_dir: str = "api_cache",
    cache_obj: Optional[Cache] = None,
    pre_func: Callable = get_prompt,
    embedding: Optional[BaseEmbedding] = None,
    data_manager: Optional[DataManager] = None,
    evaluation: Optional[SimilarityEvaluation] = None,
    post_func: Callable = temperature_softmax,
    config: Config = Config(),
):
    if not embedding:
        embedding = Onnx()
    if not data_manager:
        data_manager = manager_factory(
            "sqlite,faiss",
            data_dir=data_dir,
            vector_params={"dimension": embedding.dimension},
        )
    if not evaluation:
        evaluation = SearchDistanceEvaluation()
    cache_obj = cache_obj if cache_obj else cache
    cache_obj.init(
        pre_embedding_func=pre_func,
        embedding_func=embedding.to_embeddings,
        data_manager=data_manager,
        similarity_evaluation=evaluation,
        post_process_messages_func=post_func,
        config=config,
    )

그렇기 때문에 main.py를 실행했다면, similar_cache_{hash값} 디렉터리 안에는 아래와 같이 두개의 파일이 생성된다. default로 저장소를 sqlite3가 지정되어 있고, vector store로는 faiss가 설정되어 있다.

sqlite.db
faiss.index

문서에서 embeddings 종류들에 대한 설명과 예제 잘 정리 되어 있다. default로 설정된 Onnx는 아래처럼 Onnx model을 통해서 Embedding을 만들게 된다.

from gptcache.embedding import Onnx

test_sentence = 'Hello, world.'
encoder = Onnx(model='GPTCache/paraphrase-albert-onnx')
embed = encoder.to_embeddings(test_sentence)
print(embed)

evaluation은 SearchDistanceEvaluation로 default로 설정이 되었는데, 이에 대한 설명은 문서에서 확인할 수 있다. 그리고 Cache 설정을 잘 하는 방법에 대한 문서에서는 각 Similarity Evaluation 종류별로 특징을 설명하고 있다. SearchDistanceEvaluation는 빠르지만 정확도는 상당히 떨어진다.

SearchDistanceEvaluation, vector search distance, simple, fast, but not very accurate

SearchDistanceEvaluation는 search한 결과값의 score를 그대로 사용한다. 그래서 default로 설정된 Faiss의 검색결과에 따라서 score가 결정되게 된다. Faiss의 결색결과에 대하 예제 코드는 아래처럼 작성할 수 있다.

cached_msg_1는 GPTCache에 LLM 응답값이 Cache되어 있다. Vector store는 Faiss로 설정되어 있다.
query_msg_1는 Cache를 사용하지 않았던 Query 값을 설정하였다.
query_msg_2는 False Positive 결과가 나왔던 Query 값을 설정하였다.
이제 Onnx 모델로 Embedding들을 만들고, cached_msg_1의 embedding은 Faiss에 추가한다.
query_msg_1의 embedding으로 Faiss search를 해서 가장 가까운 값을 가져온다.
query_msg_2의 embedding으로 Faiss search를 해서 가장 가까운 값을 가져온다.

from gptcache.similarity_evaluation import KReciprocalEvaluation
from gptcache.similarity_evaluation import KReciprocalEvaluation
from gptcache.manager.vector_data.faiss import Faiss
from gptcache.manager.vector_data.base import VectorData
import numpy as np
from gptcache.embedding import Onnx

cached_msg_1 = '''
System: You are world class technical documentation writer.
Human: how is the best way to learn new languages?
'''

query_msg_1 = '''
System: You are world class technical documentation writer.
Human: what is embeddings?
'''

query_msg_2 = '''
System: You are world class technical documentation writer.
Human: How do you study math?
'''

encoder = Onnx(model='GPTCache/paraphrase-albert-onnx')
cached = encoder.to_embeddings(cached_msg_1)
query1 = encoder.to_embeddings(query_msg_1)
query2 = encoder.to_embeddings(query_msg_2)
faiss = Faiss('./none', encoder.dimension, 10)
cached_magnitude = np.linalg.norm(cached)
query1_magnitude = np.linalg.norm(query1)
query2_magnitude = np.linalg.norm(query2)
faiss.mul_add([VectorData(id=0, data=cached / cached_magnitude)])
print(faiss.search(query1 / query1_magnitude, 1))
print(faiss.search(query2 / query2_magnitude, 1))

결과는 what is embeddings?의 경우에는 0.98192865값이 나왔고, How do you study math?의 경우에는 0.700819 값이 나왔다. GPTcache source code를 살펴보니 similarity_threshold 값이 0.8로 설정되어 있다. 따라서 How do you study math?는 threshold보다 작기 때문에 해당 Cache 값을 사용한 것이다.

[(0.98192865, 0)]
[(0.700819, 0)]

Cache 값이 여러개일 때는 Faiss에서 search할 때 가장 가까운 값부터 가져오게 된다. 예제 코드를 아래처럼 작성할 수 있다.

Faiss에 두개의 vector 값이 존재하는 환경을 구성한다. faiss.mul_add를 통해서 두개의 vector값을 다른 id로 저장한다.
Faiss id 1에 더 의미적으로 가까운 please explain embeddings? 값으로 search 결과를 확인해본다.

from gptcache.similarity_evaluation import KReciprocalEvaluation
from gptcache.similarity_evaluation import KReciprocalEvaluation
from gptcache.manager.vector_data.faiss import Faiss
from gptcache.manager.vector_data.base import VectorData
import numpy as np
from gptcache.embedding import Onnx

cached_msg_1 = '''
System: You are world class technical documentation writer.
Human: how is the best way to learn new languages?
'''

cached_msg_2 = '''
System: You are world class technical documentation writer.
Human: what is embeddings?
'''

query_msg = '''
System: You are world class technical documentation writer.
Human: please explain embeddings?
'''

encoder = Onnx(model='GPTCache/paraphrase-albert-onnx')
cached_1 = encoder.to_embeddings(cached_msg_1)
cached_2 = encoder.to_embeddings(cached_msg_2)
new = encoder.to_embeddings(query_msg)
faiss = Faiss('./none', encoder.dimension, 10)
magnitude1 = np.linalg.norm(cached_1)
magnitude2 = np.linalg.norm(cached_2)
new_magnitude = np.linalg.norm(new)
faiss.mul_add([VectorData(id=0, data=cached_1 / magnitude1)])
faiss.mul_add([VectorData(id=1, data=cached_2 / magnitude2)])
print(faiss.search(new / new_magnitude, 1))

faiss.search(new / new_magnitude, 1)에서 1로 가장 가까운 값 하나만 보여주도록 했기 때문에 아래처럼 결과가 나온다. Tuple에서 두번째 값은 id값을 나타내고, id가 기대한것처럼 1로 나오는 것을 확인할 수 있다.

[(0.15015657, 1)]

similarity_threshold 적용

default에서 어떻게 evaluation이 진행되어서 score가 나오고, threshold 값을 통해서 Cache 값을 사용할지 말지 결정하는 로직을 확인하였다. 이제 langserve로 구성한 API server에 아래와 같이 Config(similarity_threshold=0.1)를 적용하여 위에서 발생한 False positive 발생을 개선할 수 있다. Cache hit rate는 줄어들겠지만, 더 유사도가 높은 Query만 Cache 값을 사용하게 된다.

import hashlib
from fastapi import FastAPI
from langserve import add_routes
from gptcache import Cache, Config
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache
from langchain_community.llms import Ollama
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.globals import set_llm_cache

app = FastAPI(
    title="LangChain Server",
    version="1.0",
    description="A simple api server using Langchain's Runnable interfaces",
)

def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}", config=Config(similarity_threshold=0.1))

set_llm_cache(GPTCache(init_gptcache))

prompt = ChatPromptTemplate.from_messages([
  ("system", "You are world class technical documentation writer."),
  ("user", "{input}")
])
llm = Ollama(model="llama2")
output_parser = StrOutputParser()
chain = prompt | llm | output_parser

add_routes(
    app,
    chain,
    path="/myllm",
)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="localhost", port=8000)

sqlite3 데이터 확인

Sqlite3에 생성된 테이블 구조와 데이터도 확인해보았다.

sqlite3 sqlite.db

테이블을 아래와 같이 생성된다.

sqlite> .tables
gptcache_answer        gptcache_question_dep  gptcache_session
gptcache_question      gptcache_report

gptcache_question 테이블에 embedding_data라는 column이 있고, 이제 Query가 Embedding으로 변환되어서 binary로 저장되고 있는 것을 확인할 수 있다.

sqlite> .schema gptcache_question
CREATE TABLE gptcache_question (
        id INTEGER NOT NULL,
        question VARCHAR(3000) NOT NULL,
        create_on DATETIME,
        last_access DATETIME,
        embedding_data BLOB,
        deleted INTEGER,
        PRIMARY KEY (id)
);

그리고 gptcache_report 테이블에서는 similarity라는 column이 있고, 이것은 유사도 값을 담고 있다. 위에서 How do you study math?도 동일한 답변을 얻었는데, 이 similarity 값이 3.2이다. what is best way to run new languages? learn을 run으로 잘못 적은 경우에는 이 값이 3.98이다. distance_max가 default로 4로 설정이 되어 있고, Faiss search로 가져온 결과를 뺀 결과이다. getcache_report 테이블 값을 통해서 Query마다 어떤 유사도로 어떤 cache 값을 사용했는지 확인할 수 있다.

sqlite> .schema gptcache_report
CREATE TABLE gptcache_report (
        id INTEGER NOT NULL,
        user_question VARCHAR(3000) NOT NULL,
        cache_question_id INTEGER NOT NULL,
        cache_question VARCHAR(3000) NOT NULL,
        cache_answer VARCHAR(3000) NOT NULL,
        similarity FLOAT NOT NULL,
        cache_delta_time FLOAT NOT NULL,
        cache_time DATETIME,
        extra VARCHAR(3000),
        PRIMARY KEY (id)
);

Faiss segmentation fault with torch

torch를 설치하고 Faiss를 사용해서 GPTCache를 사용할 때, Segmentation fault가 발생하였다. torch는 2.1.2 버전을 사용하고, faiss-cpu 1.7.4 버전을 사용하고 있었다. 왜 Segmentation fault가 나는지 이해할 수가 없었고, 삽질을 한 끝에 혹시나 하는 마음에 faiss-cpu를 1.7.0으로 설치하여 실행했다. 그랬더니 이제 해당 에러 없이 정상적으로 실행이 되었다.😭

HuggingFace

langchain에서 HuggingFace Local Pipeline도 연동할 수 있다. CPU에서 엄청 오래 걸렸지만 응답을 정상적으로 받았다.

import os
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

model_id = "tiiuae/falcon-7b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    max_length=5000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
)

llm = HuggingFacePipeline(pipeline = pipeline, model_kwargs = {'temperature':0})
template = """
You are an ethical hacker and programmer. Help the following question with brilliant answers.
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Create a python script to send a DNS packet using scapy with a secret payload "

print(llm_chain.invoke(question))

결론

Langchain과 GPTCache를 통해서 간단하게 LLM으로 서비스하는 API 서버를 구성하고, Semantic caching을 적용할 수 있었다. 그 과정에서 Embeddings, Faiss 등을 새로 알게 되었다.