Semantic search
Semantic search is a type of search technology that aims to understand the meaning behind a user’s query in order to return more relevant results. It analyzes the relationships between words, concepts, and phrases to determine the user’s intent and the context of the search. Semantic search goes beyond traditional keyword-based search, using natural language processing and machine learning algorithms to return results that better match what the user actually means.
Introduction
Semantic search is a search technology that goes beyond traditional keyword-based search methods to understand the context and meaning behind a user’s query. This technology uses natural language processing and machine learning algorithms to analyze the relationships between words, concepts, and phrases to provide more relevant results. The goal of semantic search is to help users find what they are truly looking for, rather than just matching keywords in a database. It considers the intent behind a query and provides results based on the context of the search, making it easier for users to find the information they need.
For example, let’s say a user searches for “best pizza in town”. In a traditional keyword-based search, the results may simply show websites containing the words “best”, “pizza”, and “town”. However, with semantic search, the technology would understand the user’s intent to find the best pizza restaurant in their area and would provide relevant results such as a list of popular pizza restaurants near the user’s location, or a review site listing the top-rated pizza places. This way, the user is more likely to find what they’re actually looking for, rather than having to sift through irrelevant results.
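To make this idea concrete, here is a minimal sketch using spaCy’s word vectors (a tooling assumption; the candidate sentences are invented for illustration). It compares the query against one sentence that matches the intent and one that merely reuses the keywords; the similarity scores are computed from vectors rather than literal term overlap.

import spacy

# Assumes the medium English model is installed:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

query = nlp("best pizza in town")
intent_match = nlp("A top-rated pizzeria serving wood-fired Neapolitan pies near you")
keyword_match = nlp("The best way to leave town is to avoid rush hour")

# Similarity is computed from word vectors, so it reflects meaning
# rather than exact keyword overlap with the query.
print("intent match:  ", query.similarity(intent_match))
print("keyword match: ", query.similarity(keyword_match))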
Libraries
Here are some popular libraries used for semantic search:
- Elasticsearch: An open-source, distributed search engine that supports complex search queries and offers a scalable solution for semantic search.
- Apache Lucene: A high-performance, full-featured search library that provides a solid foundation for building custom search engines and applications.
- Solr: An open-source enterprise search platform that provides advanced features for semantic search, including faceted search, geospatial search, and text analysis.
- Gensim: An open-source NLP library that provides tools for topic modeling, document similarity, and semantic indexing.
- spaCy: A popular NLP library that supports named entity recognition, text classification, and word vectors for semantic search.
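To make the Gensim entry above concrete, here is a minimal sketch of semantic indexing with a toy corpus (the documents, query, and topic count are invented for illustration, not a definitive pipeline). It weights a bag-of-words corpus with TF-IDF, projects it into a small latent semantic space, and ranks the documents against a query.

from gensim import corpora, models, similarities

# A toy corpus; in practice these would be full documents.
documents = [
    "semantic search understands the intent behind a query",
    "keyword search matches literal terms in the index",
    "word embeddings capture relationships between concepts and phrases",
]
texts = [doc.lower().split() for doc in documents]

# Bag-of-words corpus with TF-IDF weighting.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

# Latent semantic indexing projects documents into a small topic space,
# so related wording can match even without sharing every keyword.
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])

# Rank all documents against the query, highest similarity first.
query_bow = dictionary.doc2bow("search by intent".lower().split())
scores = index[lsi[tfidf[query_bow]]]
print(sorted(enumerate(scores), key=lambda pair: -pair[1]))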
Python Code Example
To perform a semantic search on a PDF file in Python, you can use the following steps:
- Extract the text from the PDF file using a library such as PyPDF2 or pdfminer.
- Pre-process the text to clean and prepare it for analysis, including removing stop words, stemming or lemmatizing the words, and creating a document-term matrix.
- Use a semantic search library such as gensim or spaCy to create word vectors or similarity scores based on the pre-processed text.
- Perform the search by comparing the query to the word vectors or similarity scores and returning the most relevant results.
Note that these steps are high-level and may require additional code to be written, depending on the specific requirements of your search application.
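The pre-processing step, for instance, can be sketched with spaCy as below (a simplified illustration; the choice to drop stop words and punctuation and to lemmatize is one common option, not a requirement):

import spacy

nlp = spacy.load("en_core_web_md")

def preprocess(raw_text: str) -> str:
    """Lowercase, drop stop words and punctuation, and lemmatize the text."""
    doc = nlp(raw_text)
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]
    return " ".join(tokens)

print(preprocess("The pizzas were baked in a wood-fired oven."))
# e.g. "pizza bake wood fire oven" (exact output depends on the model version)

The complete example below ties the four steps together into a small search_engine class that crawls PDF links from a JSON listing, extracts their text, and ranks sentences against a query.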
import json
from io import BytesIO

import PyPDF2
import requests
import spacy

# Requires the medium English model: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")


class search_engine():
    def __init__(self, url):
        self.url = url

    def crawl_pdf_on_webpage(self):
        # Send a GET request to the URL and collect the PDF download links
        # from the JSON listing it returns.
        response = requests.get(self.url)
        pdf_links = []
        if response.status_code == 200:
            try:
                contents = json.loads(response.text)
                for item in contents:
                    if item["name"].endswith(".pdf"):
                        pdf_links.append(item["download_url"])
            except json.JSONDecodeError as e:
                print("Error decoding JSON:", e)
        else:
            print("Request failed with status code", response.status_code)
        print("Links extracted")
        return pdf_links

    def extract_texts(self, pdf_num=2):
        """
        pdf_num: int, default: 2
            Controls the number of PDFs to crawl.
        """
        text = ""
        i = 0
        for pdf_link in self.crawl_pdf_on_webpage():
            # Download the PDF and open it from memory.
            response = requests.get(pdf_link)
            pdf_file = PyPDF2.PdfReader(BytesIO(response.content))
            # Extract the text from every page of the PDF.
            for page in pdf_file.pages:
                text += page.extract_text()
            i += 1
            if i >= pdf_num:
                break
        # Pre-process the text with spaCy (tokenization, sentence splitting, vectors).
        doc = nlp(text)
        return doc

    def search(self, doc, query, score):
        # Compare the query against each sentence in the document and keep
        # the sentences whose vector similarity exceeds the threshold.
        query = nlp(query)
        sentences = [sent.text for sent in doc.sents if query.similarity(sent) > score]
        sentences = ["Item " + str(i) + ": --> " + str(sen) for i, sen in enumerate(sentences)]
        return sentences
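A hypothetical usage of the class might look like the following. The URL is a placeholder assumption: the crawler expects a JSON listing whose items carry "name" and "download_url" fields (for example, a GitHub contents API endpoint for a folder of PDFs), and the 0.5 similarity threshold is an arbitrary example value.

# Hypothetical example; the URL and threshold are placeholders.
engine = search_engine("https://api.github.com/repos/<owner>/<repo>/contents/papers")
doc = engine.extract_texts(pdf_num=2)
results = engine.search(doc, query="semantic similarity between documents", score=0.5)
for line in results:
    print(line)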
If you want to run the code, please refer here
Conclusion
In conclusion, semantic search is a way to search for meaning and relationships in text data. It can be used to perform more advanced and accurate searches compared to simple keyword-based searches. To perform a semantic search on a PDF file in Python, you can extract the text from the PDF using a library such as PyPDF2, pre-process the text, and use a library such as spaCy to create word vectors or similarity scores based on the pre-processed text. Finally, you can compare the query to the word vectors or similarity scores and return the most relevant results. This example demonstrated how to perform a semantic search in a basic manner, and further customization may be required for specific use cases.