Building the knowledge base
We now define the process to extract text content from a PDF file and generate embeddings from it.
Text extraction
Deploy a text extraction service
We will use Apache Tika, a toolkit for extracting text from a wide variety of file formats. Create the function, run it as a service, and obtain its URL:
# Deploy Tika as a container service, exposing its default port 9998
tika_function = project.new_function("tika", kind="container", image="apache/tika:latest-full")
tika_run = tika_function.run("serve", service_ports=[{"port": 9998, "target_port": 9998}], wait=True)

# Refresh the run to pick up the service details, then read the service URL
service = tika_run.refresh().status.service
print("Service status:", service)

TIKA_URL = tika_run.status.to_dict()["service"]["url"]
print(TIKA_URL)
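Before moving on, we can verify that the service responds. The Tika server exposes a plain-text /version endpoint as part of its standard API, so a minimal check with the requests library looks like this:

import requests

# The Tika server returns its version string as plain text at /version
resp = requests.get(f"{TIKA_URL}/version")
print(resp.status_code, resp.text)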
Extract the text
We create a Python function which reads an artifact from the platform's repository, leverages the Tika service to extract the textual content, and writes the result to an HTML file.
extract_function = project.new_function(
name="extract",
kind="python",
python_version="PYTHON3_10",
code_src="src/extract.py",
handler="extract_text"
)
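The handler itself lives in src/extract.py, which is not shown here. As a rough sketch, assuming the runtime injects the project, passes the input artifact as an entity object with a download() method, and forwards tika_url as a keyword argument (the artifact and parameter names match the run call below; everything else is illustrative), it might look like this:

import requests

def extract_text(project, artifact, tika_url):
    # Fetch the PDF artifact to a local path
    pdf_path = artifact.download()

    # Send the raw bytes to Tika's /tika endpoint, asking for HTML output
    with open(pdf_path, "rb") as f:
        response = requests.put(f"{tika_url}/tika", data=f, headers={"Accept": "text/html"})
    response.raise_for_status()

    # Write the extracted content to a file and register it as a new artifact
    with open("output.html", "w") as f:
        f.write(response.text)
    project.log_artifact(name=f"{artifact.name}_output.html", kind="artifact", source="output.html")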
We store the PDF file as an artifact and download it. You are free to change the address to whichever PDF file you would like.
pdf = project.new_artifact("document.pdf", kind="artifact", path="https://raw.githubusercontent.com/scc-digitalhub/digitalhub-tutorials/master/s7-rag/resources/document.pdf")
pdf.download("document.pdf")
Then, we run the function by passing it the artifact and the URL to Tika:
extract_run = extract_function.run("job", inputs={"artifact": pdf.key}, parameters={"tika_url": TIKA_URL}, wait=True)
Let's read the file and check the content is correct:
# Retrieve the artifact produced by the extract function and download it
html_artifact = project.get_artifact("document.pdf_output.html")
html_artifact.download()

with open('./artifact/output.html', 'r') as file:
    file_content = file.read()
print(file_content)
Embeddings
Embeddings are vectors of floating-point numbers that represent pieces of text: the closer two vectors are in this space, the more strongly related the texts they represent.
We need to deploy a suitable model to generate embeddings from the extracted text.
embed_function = project.new_function(
"embed",
kind="kubeai-text",
model_name="embmodel",
features=["TextEmbedding"],
engine="VLLM",
url="hf://thenlper/gte-base",
)
embed_run = embed_function.run("serve", wait=True)
status = embed_run.refresh().status
print("Service status:", status.state)
EMBED_URL = status.to_dict()["service"]["url"]
EMBED_MODEL = status.to_dict()["openai"]["model"]
print(f"service {EMBED_URL} with model {EMBED_MODEL}")
Let's check that the model is ready. We need the OpenAI client installed:
%pip install -qU openai
from openai import OpenAI
client = OpenAI(api_key="ignored", base_url=f"{EMBED_URL}/v1")
response = client.embeddings.create(
input="Your text goes here.",
model=EMBED_MODEL
)
response
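To see what these vectors are good for, we can compare a few embeddings directly. The following illustration (the sentences and helper names are our own) computes cosine similarities; the related pair should score noticeably higher than the unrelated one:

import numpy as np

def embed(text):
    # Request a single embedding vector from the deployed model
    return np.array(client.embeddings.create(input=text, model=EMBED_MODEL).data[0].embedding)

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = embed("The cat sat on the mat.")
b = embed("A kitten is resting on a rug.")
c = embed("Stock markets fell sharply today.")

print("related:  ", cosine(a, b))
print("unrelated:", cosine(a, c))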
Embedding generation
We define a function to read the text from the repository and push the data into the vector store.
embedder_function = project.new_function(
name="embedder",
kind="python",
python_version="PYTHON3_10",
requirements=[
"transformers==4.50.3",
"psycopg_binary",
"openai",
"langchain-text-splitters",
"langchain-community",
"langgraph",
"langchain-core",
"langchain-huggingface",
"langchain_postgres",
"langchain[openai]",
"beautifulsoup4",
],
code_src="src/embedder.py",
handler="process",
)
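The chunking and indexing logic lives in src/embedder.py, which is not shown here. A minimal sketch of its process handler might look like the following; the handler signature, the POSTGRES_URL variable, and the collection name are assumptions, while the two EMBEDDING_* environment variables are set by the run below:

import os
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

def process(project, input):
    # Download the HTML artifact and strip the markup
    with open(input.download()) as f:
        text = BeautifulSoup(f.read(), "html.parser").get_text()

    # Split the document into overlapping chunks suitable for embedding
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(text)

    # Point the OpenAI-compatible embeddings client at the KubeAI service
    embeddings = OpenAIEmbeddings(
        model=os.environ["EMBEDDING_MODEL_NAME"],
        base_url=f"{os.environ['EMBEDDING_SERVICE_URL']}/v1",
        api_key="ignored",
        check_embedding_ctx_length=False,  # skip OpenAI-specific token counting
    )

    # Push the chunks into the pgvector-backed store (connection string assumed)
    store = PGVector(
        embeddings=embeddings,
        collection_name="documents",
        connection=os.environ["POSTGRES_URL"],
    )
    store.add_texts(chunks)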
Parameters are as follows:
- The embedding model is served at EMBED_URL with model EMBED_MODEL.
- The input artifact (HTML) is html_artifact.
embedder_run = embedder_function.run(
    "job",
    inputs={"input": html_artifact.key},
    envs=[
        {"name": "EMBEDDING_SERVICE_URL", "value": EMBED_URL},
        {"name": "EMBEDDING_MODEL_NAME", "value": EMBED_MODEL},
    ],
    wait=True,
)
Check that the run has completed:
embedder_run.status.state
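Once the state reads COMPLETED, you can optionally query the vector store to confirm the chunks were indexed. This assumes you know the Postgres connection string and the collection name used by the embedder (POSTGRES_URL and "documents" are hypothetical here):

from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

# Reuse the deployed embedding service for query-time embeddings
embeddings = OpenAIEmbeddings(
    model=EMBED_MODEL,
    base_url=f"{EMBED_URL}/v1",
    api_key="ignored",
    check_embedding_ctx_length=False,
)
store = PGVector(embeddings=embeddings, collection_name="documents", connection=POSTGRES_URL)

for doc in store.similarity_search("What is this document about?", k=3):
    print(doc.page_content[:100])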