What AI Actually Means in IP - Video 3

In this third episode of the 4-part Lighthouse IP AI series, Tim Lagemaat and Kacper Gorski explore what AI actually means in the context of intellectual property. We explain the difference between traditional rule-based systems and modern AI approaches such as machine learning and natural language processing. The discussion also highlights the limitations of large language models, including hallucination and the inability to reliably retrieve factual patent data. Finally, we introduce the concept of semantic search, showing how vectorisation and embeddings enable searching by meaning rather than keywords, and why high-quality, structured data is essential for making AI work in practice.

Read the transcript of this video below

Tim

“Everybody is talking about AI, but what does AI actually mean in this context?”

Kacper

“In the IP world, AI mostly means two things right now.

The first is natural language processing: the ability for machines to read and interpret text.

The second is machine learning: allowing systems to learn patterns from data instead of following manually written rules.

The older approach relied on expert systems, essentially encoding human rules into software.

For example: if a patent mentions X and is classified under Y, then it is relevant.

That works until the data becomes too large and too complex, which happened a long time ago.

The newer approach, and what everyone is excited about, involves models that learn directly from the data itself.

These models can identify patterns and relationships that no human could realistically code manually.”
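
The rule-based approach Kacper describes can be sketched in a few lines of Python. This is only an illustration, not any real system: the function name `is_relevant`, the keyword, the classification code, and the three records are hypothetical stand-ins for the "X" and "Y" in his example.

```python
def is_relevant(patent, keyword, classification):
    """Hand-written expert rule: relevant if the text mentions the keyword
    AND the patent carries the given classification code."""
    mentions = keyword.lower() in patent["text"].lower()
    classified = classification in patent["classifications"]
    return mentions and classified


# Hypothetical records, invented purely for illustration.
patents = [
    {"id": "A", "text": "A battery cooling system for vehicles", "classifications": ["H01M"]},
    {"id": "B", "text": "A battery-free sensor", "classifications": ["G01D"]},
    {"id": "C", "text": "Thermal management of energy storage cells", "classifications": ["H01M"]},
]

matches = [p["id"] for p in patents if is_relevant(p, "battery", "H01M")]
print(matches)  # ['A']
```

Record C covers closely related ground without using the word "battery", so the rule misses it. Writing and maintaining a rule for every such case by hand is the part that stops scaling.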

Tim

“Okay, but what if I ask ChatGPT about a specific patent?”

Kacper

“You should try it.

Ask ChatGPT to find patents relevant to a specific technology.

It will often provide a confident and well-structured answer.

It may even cite patent numbers.

The problem is that those patent numbers may not exist.

The citations may be fabricated.

Descriptions may blend together details from multiple unrelated patents into something that sounds convincing but is factually incorrect.

This is hallucination.

The model generates statistically plausible text based on its training data, but it is not actually searching a patent database.”

Tim

“Why does it get things wrong?”

Kacper

“Because large language models like ChatGPT are trained to predict the next word in a sequence.

They are extremely good at generating fluent and coherent text, but they are not retrieval systems.

They do not truly “look things up.” They generate.

When you ask about something highly specific, such as a patent number or legal claim, the model fills in the gaps with what seems most statistically likely, not necessarily what is true.

In a domain where one incorrect patent number or one incorrect claim interpretation can affect legal strategy or investment decisions, that becomes a serious problem.”
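
A toy version of next-word prediction shows how fluent but false output can arise. The sketch below is a simple bigram counter, assumed here purely for illustration; it is far simpler than ChatGPT or any real large language model, and the corpus and patent numbers are invented.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; the patent numbers are made up for illustration.
# The "US 222" statement appears twice, as if seen in two source documents.
corpus = [
    "patent US 111 covers a battery",
    "patent US 222 covers a sensor",
    "patent US 222 covers a sensor",
    "patent US 333 covers a battery",
    "patent US 444 covers a battery",
]

# Count which word follows which (a bigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1


def generate(start, length):
    """Repeatedly append the most frequent next word seen in training."""
    words = [start]
    for _ in range(length):
        options = following[words[-1]]
        if not options:
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


sentence = generate("patent", 5)
print(sentence)            # patent US 222 covers a battery
print(sentence in corpus)  # False
```

Every step picks the statistically most likely next word, yet the result pairs the most common number with the most common subject, a statement that appears nowhere in the training text. Nothing in the loop ever checks a database.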

Tim

“Okay, so what is the difference between understanding words and understanding meaning?”

Kacper

“Traditional search matches the exact words you type.

If you search for “automobile,” you may not find patents using the word “vehicle” or “car,” unless you manually add those terms.

Understanding meaning is different.

It captures the concepts behind the words.

An AI system that understands meaning knows that “automobile,” “vehicle,” “car,” and “motorised transport” all refer to overlapping concepts.

That is what modern embeddings do.

They convert text into numerical representations where similar meanings cluster together, regardless of the specific wording or language.”
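
The keyword limitation is easy to reproduce. The following is a minimal sketch with four invented document snippets and a hypothetical `keyword_search` helper; it matches whole words only and is not meant to represent any particular search product.

```python
# Invented snippets, for illustration only.
documents = {
    "doc1": "an automobile with a hybrid drivetrain",
    "doc2": "a vehicle braking system",
    "doc3": "a car seat with side airbags",
    "doc4": "a method for brewing coffee",
}


def keyword_search(query_terms, docs):
    """Return ids of documents containing at least one query term as a whole word."""
    terms = {t.lower() for t in query_terms}
    return sorted(
        doc_id for doc_id, text in docs.items()
        if terms & set(text.lower().split())
    )


print(keyword_search(["automobile"], documents))
# ['doc1']
print(keyword_search(["automobile", "vehicle", "car"], documents))
# ['doc1', 'doc2', 'doc3']
```

The second query only succeeds because the searcher already knew which synonyms to add. Searching by meaning aims to remove that manual step.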

Tim

“Where does vector search fit into this? And what exactly is data vectorisation?”

Kacper

“Vectorisation is the process of converting text, in our case patent documents, into numerical representations called embeddings.

Each document becomes a point in a high-dimensional space.

The key insight is that documents with similar meanings end up close to one another in that space, even when they use completely different words or languages.

Vector search allows you to search by concept.

You can describe what you are looking for and retrieve patents that are semantically similar, not just keyword matches.

You can do this at document level, finding patents similar to another patent, or at sentence level, finding specific claims related to a concept.

Both approaches matter for different use cases.”
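
Searching by closeness in vector space can be sketched with cosine similarity. The three-dimensional vectors below are hand-made assumptions chosen so that the transport-related texts sit near each other; real embeddings come from a trained model and typically have hundreds of dimensions. The function names are hypothetical and this is not Lighthouse IP's implementation.

```python
import math

# Hand-made 3-dimensional "embeddings", invented for illustration.
embeddings = {
    "an automobile with a hybrid drivetrain": [0.9, 0.1, 0.0],
    "a vehicle braking system": [0.8, 0.2, 0.1],
    "a method for brewing coffee": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vector, index, top_k=2):
    """Return the top_k texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


# Assumed embedding for the query "motorised transport".
query = [0.85, 0.15, 0.05]
results = vector_search(query, embeddings)
print(results)
```

The query shares no words with either transport document, yet both are returned ahead of the coffee document because their vectors point in nearly the same direction. The same mechanism applies whether each vector stands for a whole document or a single sentence.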

Tim

“How does this work with Lighthouse data and patent data?”

Kacper

“We have taken our structured patent data, consisting of 30 years of cleaned, normalised, and deduplicated records across 170 authorities, and vectorised it using domain-specific models trained on IP data.

That is the critical point.

We did not simply take an off-the-shelf model and point it at raw patent text.

We built embeddings designed to understand patent language, claim structures, and technical disclosures.

We effectively built patent-language models that understand the way IP professionals work and search.”

Tim

“What is at stake when the AI gets it wrong?”

Kacper

“Quite honestly, everything.

If a company is making freedom-to-operate decisions and the search misses a critical patent, that could mean lawsuits, injunctions, or millions in damages.

If you are conducting due diligence for an acquisition and the IP landscape analysis is incomplete, then you are making decisions based on flawed intelligence.

When companies spend millions or even billions on R&D, “good enough” search is simply not good enough.

You need precision.

You need recall.

And most importantly, you need to trust the data underneath the AI.”
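
Precision and recall, as used above, have standard definitions: precision is the share of retrieved results that are relevant, and recall is the share of relevant documents that were retrieved. A small worked example, using invented patent identifiers and a hypothetical helper name:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented identifiers: the search returned four patents,
# while three patents were actually relevant.
retrieved = ["P1", "P2", "P3", "P4"]
relevant = ["P1", "P2", "P5"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.50 recall=0.67
```

In a freedom-to-operate setting, the missed patent P5 is the recall failure that matters most, while P3 and P4 are the precision cost paid in reviewer time.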