import jsonlines
from itertools import groupby
from operator import itemgetter
from ebooklib import epub, ITEM_DOCUMENT, ITEM_NAVIGATION
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Upper bound on the number of characters in a single chunk of text.
MAX_CHUNK_CHARS = 4000


def main():
    epub_file = "The_Lord_of_the_Rings.epub"
    jsonl_file = "The_Lord_of_the_Rings.jsonl"
    print(f"Process {epub_file}...")
    chunks = epub_text(epub_file)
    print(f"Identified {len(chunks)} textual elements.")
    for ix, chunk in enumerate(chunks):
        chunk["index"] = ix
        chunk["title"] = "The Lord of the Rings"
        # Turn the table-of-contents path into volume/book metadata.
        vol_book = path_to_volume_book(chunk["path"])
        if vol_book:
            chunk.update(vol_book)
        del chunk["path"]
    print(f"Writing {jsonl_file}...")
    with jsonlines.open(jsonl_file, "w") as out_file:
        out_file.write_all(chunks)
    print("Done.")


def epub_text(epub_file):
    book = epub.read_epub(epub_file)
    toc = table_of_contents(book)
    contents = []
    for source in toc:
        node = toc[source]
        item = book.get_item_with_href(source)
        chunks = chapter_contents(item, node)
        page_chunks = coalesce_pages(chunks)
        contents.extend(page_chunks)
    return contents


def table_of_contents(book):
    nav_items = book.get_items_of_type(ITEM_NAVIGATION)
    nav_item = next(nav_items)
    ncx = BeautifulSoup(nav_item.get_content(), "html.parser")
    np_nodes = []
    for np in ncx.find("navmap").find_all("navpoint", recursive=False):
        nodes = process_navpoint(np)
        np_nodes.extend(nodes)
    # Map each source document to its table-of-contents node.
    toc = {}
    for node in np_nodes:
        toc[node["source"]] = node
    return toc


def process_navpoint(navpoint, path=[]):
    node = {
        "source": navpoint.content["src"],
        "label": navpoint.find("navlabel").get_text().strip(),
        "path": path,
    }
    node.update(attr_values(navpoint.attrs))
    # Children inherit this node's path, extended with its label.
    child_path = path + [node["label"]]
    nodes = [node]
    for child_np in navpoint.find_all("navpoint", recursive=False):
        child_nodes = process_navpoint(child_np, child_path)
        nodes.extend(child_nodes)
    return nodes


def attr_values(attrs):
    "Book-specific interpretation of TOC attributes"
    vals = {
        "class": attrs["class"][0],
        "id": attrs["id"],
        "playorder": attrs["playorder"],
    }
    return vals


def chapter_contents(item, node):
    chapter_chunks = []
    soup = BeautifulSoup(item.get_body_content(), "html.parser")
    page = "-"
    if soup.div:
        root_tag = soup.div
    else:
        root_tag = soup
    # Iterate over every top-level tag
    for tag in root_tag.find_all(True, recursive=False):
        if ((tag.name == "a") and
                ("id" in tag.attrs) and
                (tag["id"].startswith("page"))):
            # A page anchor: remember it for the chunks that follow.
            page = tag["id"]
        else:
            chunk = {
                "text": tag.get_text().strip(),
                "page": page,
            }
            chunk.update(node)
            chapter_chunks.append(chunk)
    return chapter_chunks


def coalesce_pages(chunks):
    """Combine the texts of items that share the same page."""
    keys = ["page"]
    key_func = itemgetter(*keys)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=MAX_CHUNK_CHARS,
        chunk_overlap=0,
        separators=[".\n", "\n\n", "\r\n", "\n"],
    )
    page_chunks = []
    for page_key, page_iter in groupby(chunks, key_func):
        chunk_nodes = [pi for pi in page_iter]
        page_chunk_proto = chunk_nodes[0]
        page_text = "\n".join(pi["text"] for pi in chunk_nodes)
        # Split pages that exceed MAX_CHUNK_CHARS into smaller pieces.
        page_texts = text_splitter.split_text(page_text)
        for text in page_texts:
            if text:
                page_chunk = page_chunk_proto.copy()
                page_chunk["text"] = text
                page_chunks.append(page_chunk)
    return page_chunks

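
def path_to_volume_book(path):
    match path:
        case []:
            vol_book = None
        case [volume]:
            vol_book = {"volume": volume.title()}
        case [volume, book]:
            vol_book = {
                "volume": volume.title(),
                "book": book.title(),
            }
        case _:
            # Added fallback: a deeper path would otherwise leave
            # vol_book unassigned and raise an UnboundLocalError.
            vol_book = None
    return vol_book


if __name__ == "__main__":
    main()
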
New generative AI technologies can be useful for delving into detailed, world-building stories like Tolkien’s The Lord of the Rings.
First, a few example questions and responses.
Motivation
Readers of J.R.R. Tolkien’s books and watchers of the movie adaptations develop varying levels of understanding of the complex characters, geography, and lore of Middle-earth. Some dive in headfirst, poring over the driest sections of The Silmarillion and other related texts and writings, while others enjoy the stories in a more transactional way, understanding only what is needed to follow the story.
For those looking for a more in-depth experience, a discussion with a well-versed Tolkien expert or reading along with others in a book club can be an avenue towards deeper understanding and enjoyment. But we don’t always have someone around who fits the bill as a discussion partner.
My son has recently started delving into The Lord of the Rings again, and it gave me the idea of using generative AI as a tool for enhancing the Tolkien experience.
This is an example of a question I asked after indexing the textual content of The Lord of the Rings books and making them available through a custom conversational retrieval chatbot using Retrieval-Augmented Generation.
Retrieval-Augmented Generation
I am a big fan of using Retrieval-Augmented Generation, or RAG, as a way of using Large Language Models to interact with, summarize, and answer questions about a set of texts. In my work in technology and learning at Harvard Business School, we have been indexing the textual elements of our active, social, case-based online business courses to create course assistant chatbots and interactive teaching elements.
The basic gist of RAG is to divide the source text into a set of chunks that are then indexed using vector embeddings, creating a numeric vector for each textual chunk that represents (at some level) the semantics of the text. The chunks and their associated vectors are stored in a vector database. This database can then be used to find a set of documents related to a query or conversation, which can be passed as context to a Large Language Model (LLM) to synthesize an answer.
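The retrieval half of that loop can be sketched in a few lines. This is only an illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample chunks and function names are made up for the example rather than taken from my chatbot.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Pair every chunk with its vector (the 'vector database')."""
    return [(chunk, embed(chunk["text"])) for chunk in chunks]


def retrieve(index, query, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Made-up sample chunks, for illustration only.
chunks = [
    {"page": "page1", "text": "gimli forged gates of mithril and steel"},
    {"page": "page2", "text": "legolas sailed over the sea"},
    {"page": "page3", "text": "the hobbits returned to the shire"},
]
index = build_index(chunks)
top = retrieve(index, "who forged the gates of mithril", k=1)
print(top[0]["page"])
```

In a real system the retrieved chunks would then be passed to the LLM as context alongside the question.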
The advantage of using RAG, compared to using an LLM like ChatGPT without it, is that it focuses the conversation directly on the text, minimizes bias and hallucinations, and provides the ability to show direct references and links to the textual chunks used to create the LLM response. The current architecture of LLMs does not, on its own, allow them to provide direct references to source materials.
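One way to make those references possible is to number the retrieved chunks in the prompt and keep their metadata alongside. The sketch below assumes chunks carrying the `label`, `page`, and `text` fields produced by the chunking program; `format_reference` and `build_prompt` are hypothetical helpers written for this illustration, and the prompt wording is just one possibility.

```python
def format_reference(chunk):
    """Human-readable pointer back to where a chunk came from."""
    return f'{chunk["label"]}, {chunk["page"]}'


def build_prompt(question, retrieved_chunks):
    """Number each retrieved chunk so the answer can cite it."""
    context_blocks = []
    for number, chunk in enumerate(retrieved_chunks, start=1):
        header = f"[{number}] ({format_reference(chunk)})"
        context_blocks.append(header + "\n" + chunk["text"])
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the numbered passages below, "
        "and cite passages by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# A shortened version of the sample chunk from the JSONL file.
retrieved = [
    {
        "label": "APPENDIX A: ANNALS OF THE KINGS AND RULERS",
        "page": "page1071",
        "text": "After the fall of Sauron, Gimli brought south a part of "
                "the Dwarf-folk of Erebor.",
    }
]
prompt = build_prompt("What did Gimli do after the war?", retrieved)
print(prompt)
```

Because each numbered passage maps back to a chunk, the chatbot can show the label and page for every passage the answer cites.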
Indexing the text of The Lord of the Rings
I used an ePub version of The Lord of the Rings that included all three volumes and six books, along with the appendices. The custom chunking program (see Appendix) produces a JSONL file with each line containing a chunk of text and associated metadata like this:
{"class": "appendix",
"id": "appe-1",
"index": 991,
"label": "APPENDIX A: ANNALS OF THE KINGS AND RULERS",
"page": "page1071",
"playorder": "79",
"source": "LordoftheRings_appe-1.html",
"text": ".\n"
"After the fall of Sauron, Gimli brought south a part of the "
"Dwarf-folk of Erebor, and he became Lord of the Glittering Caves.\r\n"
" He and his people did great works in Gondor and Rohan. For "
"Minas Tirith they forged gates of mithril and steel to replace those "
"broken by the Witch-king. Legolas his friend also brought south "
"Elves out of Greenwood, and they\r\n"
" dwelt in Ithilien, and it became once again the fairest "
"country in all the westlands.\n"
"But when King Elessar gave up his life Legolas followed at last the "
"desire of his heart and sailed over Sea.",
"title": "The Lord of the Rings"}
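Reading the file back is straightforward, since each line is a standalone JSON object (the record above is pretty-printed for readability). Here is a minimal sketch using only the standard library; `read_chunks` is a hypothetical helper, and the three sample records, including the "chapter" class value, are placeholders rather than real output.

```python
import json
import os
import tempfile
from collections import Counter


def read_chunks(jsonl_path):
    """Load one chunk dictionary per non-empty line of a JSONL file."""
    chunks = []
    with open(jsonl_path, encoding="utf-8") as in_file:
        for line in in_file:
            line = line.strip()
            if line:
                chunks.append(json.loads(line))
    return chunks


# Placeholder records using the same field names as the real file.
sample = [
    {"index": 0, "class": "chapter", "page": "page21", "text": "placeholder"},
    {"index": 1, "class": "appendix", "page": "page1071", "text": "placeholder"},
    {"index": 2, "class": "appendix", "page": "page1072", "text": "placeholder"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as out_file:
        for record in sample:
            out_file.write(json.dumps(record) + "\n")
    loaded = read_chunks(path)

# Count how many chunks fall under each table-of-contents class.
per_class = Counter(chunk["class"] for chunk in loaded)
print(per_class)
```

The same loop is a natural place to compute embeddings and load the chunks into a vector index.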
Other things to try
It would definitely make sense to add The Hobbit and The Silmarillion to the system to allow for a broader range of questions.
Using a multi-modal embedding model would allow for indexing the maps in the books as images, which might add interesting capabilities.
It would also be useful to run broader RAG searches in order to create different types of summaries across topics, characters, or family lines.