Innovate Articles
Programmatic Patent Searches Using Google’s BigQuery & Public Patent Data
By: James H. Moeller
Information regarding patents and patent applications is important for a variety of business activities occurring in the intellectual property marketplace. For example, a prosecution-oriented prior art search, or a litigation-oriented infringement analysis, or even a research project focused on landscaping for strategic business intelligence, all require access to patent information resources. Google’s combination of its BigQuery data warehouse service along with its public patent datasets is providing a new type of patent information resource that’s better positioned for the growing trend of integrating patent information together with data science programmatic analysis for more customized solutions by data-savvy practitioners.
Google’s BigQuery and patent datasets are different from other resources because of its combination of cost and capabilities. It’s inexpensive, as no subscription is required to access the patent information beyond the basic BigQuery data access fees. In fact, for BigQuery the first 1 TB of access per user, per month, is free and then billed at only $5.00 per terabyte thereafter. In addition, the patent datasets are provided as ready-made SQL databases, through Google’s cloud services, and thus don’t require the user to import or manage their own database. BigQuery is also accessible via all the popular analytics analysis platforms such as Google Data Studio, Tableau, Looker, Excel, and others. Finally, BigQuery provides programmatic access to the patent data (via SQL queries and REST APIs for Java, .NET, and Python) as a valuable capability to enable customized data science applications such as user-defined semantic analysis and machine learning functions.
In contrast, other third-party resources that provide programmatic access to large patent databases for customized data science applications, or provide more ready-made functions for sophisticated analysis, are all more expensive subscription services. In addition, resources that provide free patent information, typically do so via a limited Web interface and / or via downloadable datasets where the user is required to manage their own database.
Google’s BigQuery and its patent datasets are thus a cost effective and powerful platform for patent research and analysis. The first steps toward utilizing this platform are to understand what’s included in the datasets and how to execute the fundamental SQL query methods of access. Overall there are 19 different datasets spanning information such as patent classifications, standards essential patents, chemical compounds, patented drugs, patent litigation, patent publications, and more. This report is a tutorial on exploring and characterizing specifically the “patents.publications” dataset and on exemplifying a simple keyword phrase SQL query as a basis for more sophisticated patent analysis.
Characterizing Google’s Public Patent Data
Most data science projects begin with an analysis of the problem or issue to be addressed and follow that with the preparatory data collecting, formatting and cleaning, all before any insightful analysis begins. But with Google’s BigQuery and the public patent datasets, that preliminary work is not needed. However, that still doesn’t mean a user can jump directly into insightful analysis. An understanding of the data that’s available is required.
Google’s “patents.publications” dataset, accessible via a Google Cloud Portal account, contains bibliographic information from a very broad set of worldwide patents as well as full-text information for U.S. patents. On the “Details” tab of the dataset description, you’ll find the size of the table, the number of rows, and the date when the table was last updated. On the “Schema” and “Preview” tabs you’ll find a brief description of every field in the dataset and an example record.
Characterizing the datasets further requires some basic data exploration via SQL queries. For the patents.publication dataset, its insightful to initially query for the date and geographic coverage to get a feel for the timeliness and global breadth of the information. Then, to enable the keyword phrase queries, it’s useful to explore some text fields on which those queries can be executed.
Query #1 below looks for the MIN and MAX patent publication dates, which shows the earliest publication date of July 4, 1782 and the most recent date of Sept 11, 2018. A similar query can be written for MIN and MAX patent grant dates. These are shown in Figure 1. Google Data Studio is used as the presentation medium, so the figures below are screen-shots of the report pages. The live embedded report can be view on the Moeller Ventures website at the following link. https://www.moellerventures.com/index.php/CharGPatPubDataPatentsPublications.
Query #1
#standardSQL
-- PublishedPatentApps_PerYear_PerCountry
SELECT
MIN(publication_date) AS Earliest_Patent_Publication_Date,
MAX(publication_date) AS MostRecent_Patent_Publication_Date
FROM
`patents-public-data.patents.publications` AS patentsdb
WHERE
publication_date > 0;
Figure 1
Query #2 below helps gain an understanding of the geographic coverage of the dataset by showing the total number of patent applications by country. A similar query can be used to list the number of granted patents. These tables are shown in Figure 1 and Figure 2. Note that the granted patents table includes both Utility and Design patents.
Query #2
#standardSQL
-- Applications_Per_Country
SELECT country_code AS Country_Code, COUNT(*) AS Number_of_Patent_Apps
FROM (
SELECT ANY_VALUE(country_code) AS Country_Code
FROM `patents-public-data.patents.publications` AS patentsdb
GROUP BY application_number
)
GROUP BY Country_Code
ORDER BY Number_of_Patent_Apps DESC;
Figure 2
Finally, Query #3 is used to find text fields on which keyword phrase queries can be executed. This query lists the total number of patents, by country, that had an English abstract that was not empty (i.e. Not NULL). That table is also shown in Figure 2. This table shows that there are English patent abstracts for ~49 million of the ~76 million patent applications present in the dataset. As a further verification of the data, a similar Not NULL query can be executed on the patent claims field and the patent description field. Those results are shown in Figure 3 and, as expected, only show a result for the U.S., since the dataset only includes bibliographic patent information (no claims or descriptions) for non-U.S. patents.
Query #3
#standardSQL
-- NotNULLAbstracts_PerCountry
SELECT COUNT(*) AS Number_of_Patents, country_code AS Country_Code
FROM (
SELECT ANY_VALUE(country_code) AS Country_Code
FROM `patents-public-data.patents.publications` AS patentsdb,
UNNEST(abstract_localized) AS abstract_info
WHERE
abstract_info.text IS NOT NULL
AND
abstract_info.language = 'en'
AND
CHARACTER_LENGTH(abstract_info.text) > 10
GROUP BY application_number
)
GROUP BY Country_Code
ORDER BY Number_of_Patents DESC;
Figure 3
Example Keyword Phrase Search Query
Now armed with a better understanding of the patents.publications dataset, the next objective is to work with some keyword phrase queries to derive some intelligence. From a keyword phrase perspective, the abstract is the only text field that spans the international patent applications in the dataset, so that will be the focus in order to provide an international perspective to the results. As noted above, there are ~49 million English abstracts spanning the patent applications from the various countries as listed in the right-hand table of Figure 2. The query chosen to exemplify a keyword phrase search is one that simply produces time-series data representing the number of patent applications that use a specified keyword phrase.
Query #4 implements that keyword phrase, time-series data search and uses the keyword phrase of “internet of things”. That keyword phrase was chosen because it’s a relative new patent literature term within the last decade, but the query can be modified to search for any keyword phrase. In addition, the WHERE clause of Query #4 can be used to limit the search to a particular country or it can be removed to show worldwide results.
So, Figure 4 shows the histogram of the phrase “internet of things” from a global patent application perspective and, while difficult to observe on the chart because of the scale, indicates that the earliest patent literature usage (at least in the abstract) was in December of 2007, but the term really started to get popular midyear 2010 and continues to ramp through 2017. Figure 5 shows the results specifically for the U.S. across the ~8.7 million U.S. patent applications and indicates peak usage approximately midyear 2016. The live embedded report can be viewed at the following link; https://www.moellerventures.com/index.php/GPatPubDataIoTKeyPhrase.
Query #4
#standardSQL
-- This counts the number of U.S. patents matching the phrase on a monthly basis.
WITH
Patent_Matches AS
(
SELECT
PARSE_DATE('%Y%m%d', SAFE_CAST(ANY_VALUE(patentsdb.filing_date) AS STRING)) AS Patent_Filing_Date,
patentsdb.application_number AS Patent_Application_Number,
ANY_VALUE(abstract_info.text) AS Patent_Title,
ANY_VALUE(abstract_info.language) AS Patent_Title_Language
FROM
`patents-public-data.patents.publications` AS patentsdb,
UNNEST(abstract_localized) AS abstract_info
WHERE
LOWER(abstract_info.text) LIKE '%internet of things%'
AND patentsdb.country_code = 'US'
GROUP BY
Patent_Application_Number
),
Date_Series_Table AS
(
SELECT day, 0 AS Number_of_Patents
FROM UNNEST (GENERATE_DATE_ARRAY(
(SELECT MIN(Patent_Filing_Date) FROM Patent_Matches),
(SELECT MAX(Patent_Filing_Date) FROM Patent_Matches)
)) AS day
)
SELECT SAFE_CAST(FORMAT_DATE('%Y-%m',Date_Series_Table.day) AS STRING) AS Patent_Date_YearMonth, COUNT(Patent_Matches.Patent_Application_Number) AS Number_of_Patent_Applications
FROM Patent_Matches
RIGHT JOIN Date_Series_Table
ON Patent_Matches.Patent_Filing_Date = Date_Series_Table.day
GROUP BY Patent_Date_YearMonth
ORDER BY Patent_Date_YearMonth;
Figure 4
Figure 5
As a comparison, Figure 6 shows the term’s usage in patent applications filed in China (queried across ~15 million patent applications) and shows the very high usage of “internet of things” in Chinese intellectual property over the last eight years. In fact, the China numbers are so dramatic that they really dwarf the term’s usage in patent literature from any other country. This trend also correlates with the dramatic rise in patent application filings in China over the last five to ten years.
Figure 6
The Take-Aways
- The combination of BigQuery and the patents.publications dataset, creates a platform that excels at the capability to quickly and inexpensively query information from a large number of patents and applications.
- As a dataset characterization example, the the BigQuery patent.publications dataset was explored via SQL queries and was shown to have a current date range coverage of July 4, 1782 through Sept 11, 2018. In addition, from a geographic standpoint, it was shown to contain bibliographic information for over 76 million patents and applications worldwide and information on 12 million U.S. patents and applications, including ~8.7 million U.S. patent and applications with English abstracts.
- As an analysis example, a keyword phrase-matching SQL query was utilized to find patents and patent applications of interest and present that information in a time-series form that can be plotted for better visualization and understanding. The keyword phrase, time-series data query exemplified in this report can be modified to search for different keyword phrases and different countries and can be used as a basis for more complex patent analysis.
Jim Moeller is a U.S. Registered Patent Agent and Intellectual property / competitive intelligence research consultant with an affinity to apply data science to projects where it can add real value. His experience spans 15 years of independent consulting, 5 years in the investment banking business, and 10 years with various technology companies. His domain expertise covers wide areas of electronics technologies, including Internet-of Things (IoT), wireless and mobile communications, broadband telecommunications, and components. https://www.MoellerVentures.com