Fetching the latest batch of news from the GDELT GKG
The GDELT Project, dubbed "A Global Database of Society", is an excellent, freely available resource for world news.
The GKG🔗
In particular, I make regular use of the GDELT Knowledge Graph (GKG) raw data files for getting a snapshot of the current news. The GKG is updated every 15 minutes, and is essentially a dump of all the latest news, including URLs as well as a bunch of useful metadata.
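To make the update cadence concrete, each 15-minute batch is published as a zipped, tab-delimited CSV named after the batch timestamp (YYYYMMDDHHMMSS, UTC). The timestamp below is purely illustrative:
# Illustrative example: the batch starting at 12:00 UTC on 1 February 2024 lives at
# http://data.gdeltproject.org/gdeltv2/20240201120000.gkg.csv.zip
example_batch_url = "http://data.gdeltproject.org/gdeltv2/20240201120000.gkg.csv.zip"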
Some caveats🔗
This notebook demonstrates the use of the GKG to obtain a list of URLs representing the latest news snapshot, and to download the content of those news articles where available. In some cases, such as paywalls or bot-thwarting measures, you may not get everything with this approach, but I've been able to obtain a fairly representative sample of the news this way.
Project ideas🔗
The resulting text dumps from this notebook can be used for any number of news-related projects. Try building a classifier using the metadata provided, or build a topic model using BERTopic. In the future, I will publish examples of these ideas and more.
Imports🔗
import datetime
import logging
import os
import shutil
import tempfile
import urllib
import urllib.parse
import urllib.request
import zipfile
from pathlib import Path
try:
    import goose3
except ModuleNotFoundError:
    !pip install goose3 --quiet
    import goose3
Fetch the latest GKG batch of news URLs🔗
GKG_URL = "http://data.gdeltproject.org/gdeltv2/%s"
DT_FORMAT = "%Y%m%d%H%M%S"
GKG_DATA_DIR = Path("gkg-data")
GKG_HEADER = [
    "GKGRECORDID",
    "DATE",
    "SourceCollectionIdentifier",
    "SourceCommonName",
    "DocumentIdentifier",
    "Counts",
    "V2Counts",
    "Themes",
    "V2Themes",
    "Locations",
    "V2Locations",
    "Persons",
    "V2Persons",
    "Organizations",
    "V2Organizations",
    "V2Tone",
    "Dates",
    "GCAM",
    "SharingImage",
    "RelatedImages",
    "SocialImageEmbeds",
    "SocialVideoEmbeds",
    "Quotations",
    "AllNames",
    "Amounts",
    "TranslationInfo",
    "Extras",
]
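GKG_HEADER is not used by the URL-extraction code below (which simply grabs column index 4, DocumentIdentifier), but it is handy if you want the full metadata. As a rough sketch, assuming pandas (≥ 1.3, for encoding_errors) is installed and a .gkg.csv batch has already been extracted into GKG_DATA_DIR by the code further down, you could load it like this:
import csv
import pandas as pd

# csv_path is illustrative; point it at any extracted .gkg.csv batch file.
csv_path = next(GKG_DATA_DIR.glob("*.gkg.csv"))
gkg = pd.read_csv(
    csv_path,
    sep="\t",
    names=GKG_HEADER,
    header=None,
    dtype=str,
    quoting=csv.QUOTE_NONE,  # GKG fields contain stray quote characters
    encoding="utf8",
    encoding_errors="surrogateescape",
)
gkg[["SourceCommonName", "DocumentIdentifier", "V2Themes"]].head()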
Utilities🔗
Resources for fetching GKG data and extracting the URLs.
def download(url, tofile=None):
    """Fetch a file from `url` and save it to `tofile`."""
    if tofile is None:
        tofile = os.path.basename(urllib.parse.urlparse(url).path)
    with urllib.request.urlopen(url) as response, open(tofile, "wb") as outfile:
        shutil.copyfileobj(response, outfile)
    return tofile
def gkg_filename(dt=None):
    """Returns the filename of the csv.zip GKG file associated with the latest
    15-minute period before dt, or the current time if dt is None.
    """
    if dt is None:
        dt = datetime.datetime.utcnow()
    minute = 15 * (dt.minute // 15)
    dt = dt.replace(minute=minute, second=0, microsecond=0)
    dt_str = dt.strftime(DT_FORMAT)
    return f"{dt_str}.gkg.csv.zip"
def fetch_gkg_urls(force=False, dt=None):
    """Yields the article URLs found in the latest GKG batch, or yields nothing
    if the most recent GKG file has already been processed.
    Optionally pass in a historical datetime to be used rather than the
    current time for determining the latest 15-minute batch to fetch.
    """
    fn = gkg_filename(dt)
    download_file = GKG_DATA_DIR / fn
    csv_file = GKG_DATA_DIR / fn[:-len(".zip")]
    if download_file.exists() and not force:
        print("Skipping file for previously processed period:", fn)
        return
    url = GKG_URL % fn
    print("Fetching gkg file:", url)
    try:
        fn = download(url, tofile=download_file)
    except Exception as e:
        print("Unable to download GKG file:", e)
        return
    with zipfile.ZipFile(download_file, "r") as ref:
        ref.extractall(GKG_DATA_DIR)
    with open(csv_file, encoding="utf8", errors="surrogateescape") as f:
        for i, line in enumerate(f):
            # Column 4 is DocumentIdentifier (the article URL); see GKG_HEADER.
            yield line.split("\t")[4].strip()
GKG_DATA_DIR.mkdir(parents=True, exist_ok=True)
urls = list(fetch_gkg_urls(force=False))
Take a peek at the URLs🔗
urls[:10]
Log level ERROR🔗
Goose3 tends to emit a lot of warnings about not being able to resolve article publication dates to UTC. Since we only care about the text, we set the log level to ERROR to squelch those warnings.
Skipping failures🔗
The code below does the bad thing of catching the general Exception, and simply skips any URLs it cannot fetch. If you need to squeeze out as many of these URLs as possible, you might want to dig into the specific errors.
Alternative article extractors🔗
There are a number of good libraries available for article text extraction; experiment to see what works best for your needs. Some options include (a minimal trafilatura sketch follows the list):
- goose3 (used here)
- BoilerPy3
- html-text (extracts all the text, not just the primary article)
- Newspaper3k
- trafilatura
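As a rough sketch of swapping in one of the alternatives, here is roughly what a single-URL extraction with trafilatura looks like (assuming trafilatura has been pip-installed; both calls return None when they cannot fetch or extract anything):
import trafilatura

def extract_with_trafilatura(url):
    """Illustrative alternative to goose3: download a page and extract the main text."""
    downloaded = trafilatura.fetch_url(url)  # returns None on fetch failure
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)  # returns None if no main text is found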
Be prepared to wait🔗
It takes a while to fetch the whole batch of web pages. There is a crude dotted status indicator just to make things appear alive. In a production scenario, you will probably want an approach that spawns multiple threads or processes for fetching; a minimal threaded sketch follows the tip below.
🔥 Pro tip: If you are trying to get an end-to-end run of the notebook just to see how things work, iterate over just a slice of the urls (e.g. urls[:10]) and carry on.
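For reference, here is a rough sketch of the threaded approach mentioned above, using only the standard library's concurrent.futures. The helper name, the worker count, and the urls[:50] slice are illustrative choices, not part of the notebook's main flow; creating a Goose instance per call is wasteful but sidesteps any thread-safety questions.
from concurrent.futures import ThreadPoolExecutor

def fetch_text(url):
    """Extract the cleaned article text for one URL, or return None on any failure."""
    g = goose3.Goose()  # one Goose per call keeps the workers independent
    try:
        return g.extract(url=url).cleaned_text
    except Exception:
        return None

# max_workers=8 is an arbitrary starting point; tune for your bandwidth and politeness.
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded_texts = [t for t in pool.map(fetch_text, urls[:50]) if t]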
goose = goose3.Goose()
texts = []
crawler_logger = logging.getLogger("goose3.crawler")
crawler_logger.setLevel(logging.ERROR)
MIN_TEXT_LENGTH = 200 # Sometimes the extracted text just isn't worth keeping
for i, url in enumerate(urls):
    if i % 10 == 0:
        print(".", end=" ")  # just trying to look alive
    try:
        article = goose.extract(url=url)
    except Exception:
        """Warning: buried general Exception is considered bad practice.
        There are various reasons a fetch might not succeed. If you are getting
        a lot of skipped URLs, you may want to dig into this and handle each
        error specifically.
        """
        print("Error. Skipping URL:", url)
        continue
    title = article.title
    descr = article.meta_description
    text = article.cleaned_text
    if len(text) >= MIN_TEXT_LENGTH:
        texts.append(text)
print("Done!")
Take a peek at the texts🔗
len(texts)
texts[0]
Optionally save the texts🔗
You might do this above instead, depending on your goals. Here we save the acquired texts to disk in anticipation of loading and using them later. Note, however, that this approach only saves to ephemeral space. If you need long-term storage, you will want to mount your Drive account and save to an appropriate path; a Colab-specific sketch follows the save code below.
TEXT_DIR = Path("texts")
TEXT_DIR.mkdir(parents=True, exist_ok=True)
for i, text in enumerate(texts):
    fp = TEXT_DIR / f"{i:04}.txt"
    with fp.open("w", encoding="utf8") as outfile:
        outfile.write(text)
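If you do need the texts to survive the session (for example on Colab), a rough sketch of persisting to Google Drive instead; the drive.mount call is Colab-specific and the target path is just an example:
from google.colab import drive

drive.mount("/content/drive")  # prompts for authorization in Colab
PERSISTENT_TEXT_DIR = Path("/content/drive/MyDrive/gkg-texts")  # example path
PERSISTENT_TEXT_DIR.mkdir(parents=True, exist_ok=True)
for i, text in enumerate(texts):
    (PERSISTENT_TEXT_DIR / f"{i:04}.txt").write_text(text, encoding="utf8")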
Reloading texts🔗
The texts can now be reloaded as follows:
texts = []
for fp in sorted(list(TEXT_DIR.iterdir())):
    texts.append(fp.open(encoding="utf8").read())
texts[0]