
Fetching the latest batch of articles from the GDELT GKG.🔗

The GDELT Project, dubbed "A Global Database of Society," is an excellent, freely available resource for world news.

The GKG🔗

In particular, I make regular use of the GDELT Knowledge Graph (GKG) raw data files for getting a snapshot of the current news. The GKG is updated every 15 minutes, and is essentially a dump of all the latest news, including URLs as well as a bunch of useful metadata.

Some caveats🔗

This notebook demonstrates using the GKG to obtain a list of URLs representing the latest news snapshot, and then downloading the content of those news articles where available. In some cases, such as paywalls or bot-thwarting measures, you won't get everything with this approach, but I've been able to obtain a fairly representative slice of the news this way.

Project ideas🔗

The resulting text dumps from this notebook can be used for any number of news-related projects. Try building a classifier using the metadata provided, or build a topic model using BERTopic. In the future, I will publish examples of these ideas and more.
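
As a taste of the topic-modeling idea, a fit over the texts collected at the end of this notebook might look roughly like the sketch below. This is an illustration only: BERTopic is not installed or used elsewhere in this notebook, and texts refers to the list built later on.

# Sketch: cluster the scraped article texts into topics with BERTopic.
from bertopic import BERTopic

topic_model = BERTopic(language="english", verbose=True)
topics, probs = topic_model.fit_transform(texts)

# Inspect the most prominent topics and their keyword representations.
print(topic_model.get_topic_info().head(10))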

Imports🔗

In [1]:
import datetime
import logging
import os
import shutil
import tempfile
import urllib
import urllib.request
import zipfile
from pathlib import Path

try:
    import goose3
except ModuleNotFoundError:
    !pip install goose3 --quiet
    import goose3

Fetch the latest GKG batch of news URLs🔗

In [2]:
GKG_URL = "http://data.gdeltproject.org/gdeltv2/%s"
DT_FORMAT = "%Y%m%d%H%M%S"
GKG_DATA_DIR = Path("gkg-data")
GKG_HEADER = [
    "GKGRECORDID",
    "DATE",
    "SourceCollectionIdentifier",
    "SourceCommonName",
    "DocumentIdentifier",
    "Counts",
    "V2Counts",
    "Themes",
    "V2Themes",
    "Locations",
    "V2Locations",
    "Persons",
    "V2Persons",
    "Organizations",
    "V2Organizations",
    "V2Tone",
    "Dates",
    "GCAM",
    "SharingImage",
    "RelatedImages",
    "SocialImageEmbeds",
    "SocialVideoEmbeds",
    "Quotations",
    "AllNames",
    "Amounts",
    "TranslationInfo",
    "Extras"
]
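
The URL extraction below only uses column 4 (DocumentIdentifier), but GKG_HEADER names the full 27-field GKG 2.1 record. If you want the rest of the metadata, one option is to load the extracted CSV into pandas with these names. This is just a sketch: pandas is not otherwise used in this notebook, encoding_errors requires pandas 1.3+, and the filename shown is an example taken from a later run.

# Sketch: load an extracted GKG .csv into a DataFrame using GKG_HEADER.
import csv

import pandas as pd

csv_path = GKG_DATA_DIR / "20221011233000.gkg.csv"  # example filename
gkg = pd.read_csv(
    csv_path,
    sep="\t",
    names=GKG_HEADER,
    header=None,
    dtype=str,
    quoting=csv.QUOTE_NONE,  # GKG fields are tab-delimited and unquoted
    encoding="utf8",
    encoding_errors="surrogateescape",
)
print(gkg[["SourceCommonName", "DocumentIdentifier", "V2Tone"]].head())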

Utilities🔗

Helper functions for fetching GKG data and extracting the article URLs.

In [3]:
def download(url, tofile=None):
    """Fetch a file from `url` and save it to `tofile`."""
    if tofile is None:
        tofile = os.path.basename(urllib.parse.urlparse(url).path)
    with urllib.request.urlopen(url) as response, open(tofile, "wb") as outfile:
        shutil.copyfileobj(response, outfile)
    return tofile


def gkg_filename(dt=None):
    """Return the filename of the csv.zip GKG file for the latest
    15-minute period before dt, or before the current time if dt is None.
    """
    if dt is None:
        dt = datetime.datetime.utcnow()
    minute = 15 * (dt.minute // 15)
    dt = dt.replace(minute=minute, second=0, microsecond=0)
    dt_str = dt.strftime(DT_FORMAT)
    return f"{dt_str}.gkg.csv.zip"


def fetch_gkg_urls(force=False, dt=None):
    """Yield the article URLs from the latest GKG file, or yield nothing if
    that file has already been processed (unless force is True).

    Optionally pass in a historical datetime to be used rather than the
    current time for determining the latest 15-minute batch to fetch.
    """
    fn = gkg_filename(dt)
    download_file = GKG_DATA_DIR / fn
    csv_file = GKG_DATA_DIR / fn[:-len(".zip")]
    if download_file.exists() and not force:
        print("Skipping file for previously processed period:", fn)
        return
    url = GKG_URL % fn
    print("Fetching gkg file:", url)
    try:
        download(url, tofile=download_file)
    except Exception as e:
        print("Unable to download GKG file:", e)
        return
    with zipfile.ZipFile(download_file, "r") as ref:
        ref.extractall(GKG_DATA_DIR)
    with open(csv_file, encoding="utf8", errors="surrogateescape") as f:
        for line in f:
            # Column 4 is DocumentIdentifier: the article URL.
            yield line.split("\t")[4].strip()
In [4]:
GKG_DATA_DIR.mkdir(parents=True, exist_ok=True)
In [5]:
urls = list(fetch_gkg_urls(force=False))
Fetching gkg file: http://data.gdeltproject.org/gdeltv2/20221011233000.gkg.csv.zip

Take a peek at the URLs🔗

In [6]:
urls[:10]
Out[6]:
['https://www.newsy.com/stories/ohio-senate-debate-with-ryan-vance-descends-into-attacks/',
 'https://www.russiaherald.com/news/272905643/elon-musk-spoke-to-putin-before-presenting-ukraine-peace-proposal-report',
 'https://www.forbes.com/sites/forbes-personal-shopper/2022/10/11/best-espresso-machines/',
 'https://www.kvor.com/news/pfizer-exec-admits-covid-vaccine-never-tested-to-prevent-transmission/',
 'https://www.kvor.com/news/fetterman-stroke-recovery-wont-have-an-impact-if-elected/',
 'https://www.montrosepress.com/free_access/hospital-recognizes-leadership-at-annual-fall-clinics/article_c7364b94-4994-11ed-985b-1f9ddec93453.html',
 'https://www.nbcconnecticut.com/tag/well-water/',
 'https://www.sbstatesman.com/2022/10/11/dont-worry-darling-continues-to-romanticize-bad-men/',
 'https://news.yahoo.com/obamas-praise-iranian-women-girls-200453660.html',
 'https://wild104.iheart.com/content/2022-10-11-khloe-kardashian-gets-rare-tumor-removed-from-face-take-this-seriously/']

Fetch the article pages and extract their texts🔗

Goose3 is used here to fetch the texts of the articles from our list of URLs.

Log level ERROR🔗

Goose3 tends to log a lot of warnings about not being able to resolve article publication dates to UTC. We are just looking at the text, so we set the log level to ERROR to squelch those warnings.

Skipping failures🔗

The code below does that bad thing of catching a bare Exception, and simply skips any URL it cannot fetch. If you need to recover as many of these articles as possible, you might want to dig into the specific errors; one low-effort way to start is sketched just below.
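
One such starting point, purely a sketch and not part of the loop below, is to tally the exception types you encounter before deciding which ones deserve dedicated handling:

# Sketch: count exception types seen while fetching, to guide error handling.
from collections import Counter

error_counts = Counter()

def note_failure(url, exc):
    """Record the exception class so we can see what actually goes wrong."""
    error_counts[type(exc).__name__] += 1
    print("Error. Skipping URL:", url)

# In the loop below, `except Exception as e: note_failure(url, e)` would replace
# the bare print, and error_counts.most_common() then shows what failed most.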

Alternative article extractors🔗

There are a number of good libraries available for article text extraction besides goose3; newspaper3k and trafilatura are two well-known options. Experiment to see what works best for your needs; a quick trafilatura illustration follows.
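
As an illustration only (the notebook itself sticks with goose3), a minimal trafilatura version of the extraction step might look like this:

# Sketch: swap in trafilatura for article text extraction.
import trafilatura

def extract_with_trafilatura(url):
    """Return the extracted article text, or None if the fetch or extraction fails."""
    html = trafilatura.fetch_url(url)
    if html is None:
        return None
    return trafilatura.extract(html)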

Be prepared to wait🔗

It takes a while to fetch the whole batch of web pages. There is a crude dotted status indicator just to make things appear alive. In a production scenario, you will probably want an approach that spawns multiple threads or processes for fetching; a rough threaded sketch follows.
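
This is a rough illustration only; the worker count, timeouts, and goose3's behavior under threads are all things to verify before relying on it:

# Sketch: fetch article texts concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_text(url):
    """Extract the cleaned text for one URL; return None on any failure."""
    try:
        # A fresh Goose instance per call keeps things simple; reusing one per
        # worker would be more efficient if it proves safe for your workload.
        return goose3.Goose().extract(url=url).cleaned_text
    except Exception:
        return None

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_text, url) for url in urls]
    threaded_texts = [t for t in (f.result() for f in as_completed(futures)) if t]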

🔥 Pro tip: If you are trying to get an end-to-end run of the notebook just to see how things work, iterate over just a slice of the urls (e.g. urls[:10]) and carry on.

In [7]:
goose = goose3.Goose()
texts = []


crawler_logger = logging.getLogger("goose3.crawler")
crawler_logger.setLevel(logging.ERROR)

MIN_TEXT_LENGTH = 200 # Sometimes the extracted text just isn't worth keeping


for i, url in enumerate(urls):
    if i % 10 == 0:
        print(".", end=" ") # just trying to look alive
    try:
        article = goose.extract(url=url)
    except Exception:
        """Warning: buried general Exception is considered bad practice.
        
        There are various reasons a fetch might not succeed. If you are getting
        a lot of skipped URLs, you may want to dig into this and handle each
        error specifically.
        """
        print("Error. Skipping URL:", url)
        continue
    title = article.title             # extracted but unused below; keep if you need it
    descr = article.meta_description  # likewise available from goose3
    text = article.cleaned_text
    if len(text) >= MIN_TEXT_LENGTH:
        texts.append(text)

print("Done!")
. . . . . Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=432255
. . . . . . . . . . . . . . . . Error. Skipping URL: https://www.npr.org/2022/10/11/1128197781/the-supreme-court-hears-the-pork-industrys-case-against-an-animal-welfare-law
. . Error. Skipping URL: https://www.npr.org/2022/10/11/1128208223/supreme-court-pork-producers-sue-california
. . . . . . . . . Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=430541&whichpage=2
. . Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=432256
. . . . . . . . . . . . . . . . . Error. Skipping URL: https://www.stuff.co.nz/environment/130146301/government-announces-new-measures-to-protect-south-island-hectors-dolphins
. . Error. Skipping URL: https://www.stuff.co.nz/life-style/homed/nz-house-garden/300671006/pappardelle-with-duck-ragout-recipe
. . . . . . . . . . Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=432050&whichpage=2
. . . . . . . . . . 
ERROR:urllib3.connection:Certificate did not match expected hostname: markets.buffalonews.com. Certificate: {'subject': ((('commonName', '*.financialcontent.com'),),), 'issuer': ((('countryName', 'GB'),), (('stateOrProvinceName', 'Greater Manchester'),), (('localityName', 'Salford'),), (('organizationName', 'Sectigo Limited'),), (('commonName', 'Sectigo RSA Domain Validation Secure Server CA'),)), 'version': 3, 'serialNumber': 'E7612383D9F0E303A71AD943BDD22F23', 'notBefore': 'Apr 22 00:00:00 2022 GMT', 'notAfter': 'May 23 23:59:59 2023 GMT', 'subjectAltName': (('DNS', '*.financialcontent.com'), ('DNS', 'financialcontent.com')), 'OCSP': ('http://ocsp.sectigo.com',), 'caIssuers': ('http://crt.sectigo.com/SectigoRSADomainValidationSecureServerCA.crt',)}
Error. Skipping URL: https://markets.buffalonews.com/buffnews/article/etrendy-2022-10-11-a-closer-look-at-hong-kongs-fugitives-in-the-us
. . . Error. Skipping URL: https://www.npr.org/transcripts/1128032407
. Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=350667&whichpage=1304
. . Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=432257
. Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=432254
. . Error. Skipping URL: https://www.npr.org/transcripts/1127995338
. . . Error. Skipping URL: https://www.tap.info.tn/en/Portal-Politics/15632206-president-saied
. . . . 
ERROR:urllib3.connection:Certificate did not match expected hostname: markets.buffalonews.com. Certificate: {'subject': ((('commonName', '*.financialcontent.com'),),), 'issuer': ((('countryName', 'GB'),), (('stateOrProvinceName', 'Greater Manchester'),), (('localityName', 'Salford'),), (('organizationName', 'Sectigo Limited'),), (('commonName', 'Sectigo RSA Domain Validation Secure Server CA'),)), 'version': 3, 'serialNumber': 'E7612383D9F0E303A71AD943BDD22F23', 'notBefore': 'Apr 22 00:00:00 2022 GMT', 'notAfter': 'May 23 23:59:59 2023 GMT', 'subjectAltName': (('DNS', '*.financialcontent.com'), ('DNS', 'financialcontent.com')), 'OCSP': ('http://ocsp.sectigo.com',), 'caIssuers': ('http://crt.sectigo.com/SectigoRSADomainValidationSecureServerCA.crt',)}
Error. Skipping URL: https://markets.buffalonews.com/buffnews/article/getnews-2022-10-11-sandy-ut-ac-repair-by-one-stop-heating-and-air-conditioning-for-all-heating-and-air-conditioning-woes
. . . Error. Skipping URL: https://www.stuff.co.nz/national/quizzes/300699690/quiz-take-the-three-strikes-trivia-test-for-october-12-2022
. Error. Skipping URL: https://www.stuff.co.nz/national/crime/300710637/repeat-child-sex-offender-cut-off-ankle-monitor-to-abuse-a-toddler
. . Error. Skipping URL: https://www.tap.info.tn/en/Portal-Politics/15632639-president-saied-pm
. . . . . . . Error. Skipping URL: https://classiccountry957.iheart.com/content/2022-10-11-blake-shelton-to-step-away-from-the-voice-after-his-23rd-seasons/
. Error. Skipping URL: https://www.stuff.co.nz/travel/destinations/nz/wellington/130143828/the-tripadvisor-restaurant-awards-were-a-big-snub-to-wellington-but-why
. . . . . . . Error. Skipping URL: https://www.tap.info.tn/en/Portal-Economy/15631995-economic-growth-in
. . . . . . . . Error. Skipping URL: https://www.stuff.co.nz/national/crime/300710713/man-charged-with-manslaughter-after-levin-car-crash-death
. . . . . . . . . . . . . . Error. Skipping URL: https://www.tap.info.tn/en/Portal-Politics/15631969-former-mp-ghazi
. . . . Error. Skipping URL: https://www.coincommunity.com/forum/topic.asp?TOPIC_ID=432006&whichpage=2
. . . . . Done!

Take a peek at the texts🔗

In [8]:
len(texts)
Out[8]:
1252
In [9]:
texts[0]
Out[9]:
'Rep. Tim Ryan – a 10-term congressman – and Republican JD Vance are fighting for Ohio\'s open U.S. Senate seat being vacated by GOP Sen. Rob Portman.\n\nThe first debate between Democratic U.S. Rep. Tim Ryan and Republican JD Vance descended quickly into attacks Monday, with the candidates for Ohio\'s open U.S. Senate seat accusing each other of being responsible for job losses and putting party loyalty ahead of voters’ needs.\n\nVance said Ryan had supported policies that led to a 10-year-old girl in Ohio being raped. Ryan said Vance had started a "fake nonprofit" to help people overcome addiction issues. The two accused each other of being beholden to their party, with Ryan echoing a comment from former President Donald Trump in calling Vance an "a— kisser" and Vance saying Ryan’s 100% voting record with President Joe Biden means he’s not the reasonable moderate he says he is.\n\nRelated Story What Are Ohioans Voting For In Their Senate Race?\n\nThe face-off between Ryan, a 10-term congressman, and Vance, a venture capitalist and author of "Hillbilly Elegy," for the seat being vacated by retiring GOP Sen. Rob Portman was one of the most contentious debates of the general election season so far. The race is one of the most expensive and closely watched of the midterms, with Democrats viewing it as a possible pickup opportunity in November.\n\nBoth candidates sought to tailor their messages to the working-class voters who could determine the election in an evening peppered with barbs and one-liners.\n\nRyan sought to paint Vance as an extremist, someone who associates with "crazies" from his party who falsely claim the 2020 election was stolen, support national abortion restrictions and contributed to the Jan. 6 attack on the U.S. Capitol.\n\n"You\'re running around with Ron DeSantis, the governor of Florida, who wants to ban books. You\'re running around with (Sen.) Lindsey Graham, who wants a national abortion ban. You\'re running around with (Rep.) Marjorie Taylor Greene, who\'s the absolute looniest politician in America," Ryan said.\n\nVance suggested Ryan’s focus on allegation of extremism was meant as a distraction from pocketbook issues important to voters, such as inflation and the price of groceries.\n\n"It’s close to Halloween and Tim Ryan has put on a costume where he pretends to be a reasonable moderate.” Had he been, Vance said, “Youngstown may not have lost 50,000 manufacturing jobs during your 20 years."\n\nRyan said: "I’m not gonna apologize for spending 20 years of my adult life slogging away to try to help one of the hardest economically hit regions of Ohio and dedicating my life to help that region come back. You ought to be ashamed of yourself, JD. You went off to California, you were drinking wine and eating cheese."\n\nVance countered that he left Ohio at 18 to join the Marines, and after working in Silicon Valley, he returned to Ohio to raise his family and start a business.\n\nDuring questioning about China, Ryan said Vance invested in China as a venture capitalist, the type of business move that exacerbated job losses in Ohio\'s manufacturing base. "The problem we\'re having now with inflation is our supply chains all went to China, and guys like him made a whole lot of money off that," Ryan said.\n\nVance said it is Democratic economic policies that have harmed manufacturing, saying, "They have completely gone to war against America\'s energy sector." 
He said he could not remember investing in China.\n\nOn abortion, Vance did not answer whether he would support Graham\'s proposed national ban after 15 weeks of pregnancy, with some exceptions. Vance said he thinks different states would likely want different laws but "some minimum national standard is totally fine with me."\n\nHe called himself "pro-life" but said he has "always believed in reasonable exceptions."\n\nRyan said he supports codifying the abortion rights established in Roe v. Wade, which was overturned by the U.S. Supreme Court in June. He said he opposes Ohio\'s law banning most abortions after fetal cardiac activity has been detected, as early as six weeks into pregnancy, which was blocked Friday.\n\nVance agreed with Ryan that a 10-year-old Ohio rape victim should not have had to leave the state for an abortion, but he said the fact the suspect was in the country illegally was a failure of weak border policies.\n\n"You voted so many times against the border wall funding, so many times for amnesty, Tim," Vance said. "If you had done your job, she would have never been raped in the first place."\n\nOn foreign policy, the pair parted ways on what the U.S. response should be if Russian President Vladimir Putin were to launch nuclear weapons in Ukraine.\n\nRyan said the U.S. should be prepared with a "swift and significant response," while Vance countered that the United States needs a "foreign policy establishment that puts the interests of our citizens first."\n\nRyan responded: "If JD had his way, Putin would be through Ukraine at this point. He’d be going into Poland."\n\n"If I had my way," Vance retorted, "you’d put money at the southern border, Tim, instead of launching tons of money into Ukraine." It echoed comments Vance had made in an interview before Russia\'s invasion of Ukraine, saying he didn\'t "really care what happens to Ukraine one way or another" because he wanted to see President Biden focus on his own country\'s border security.\n\nWhile the general election debate between Ryan and Vance was acrimonious, it didn\'t lead to a near-physical altercation, as an Ohio GOP Senate debate back in March during the primary season did. Former state Treasurer Josh Mandel and investment banker Mike Gibbons found themselves face to face on the debate stage, shouting at each other, while Vance told the two to stop fighting.\n\nAt the end of Monday\'s debate, Vance and Ryan shook hands.\n\nAdditional reporting by The Associated Press.'

Optionally save the texts🔗

You might do this above instead, depending on your goals. Here we save the acquired texts to disk in anticipation of loading and using them later. Note, however, that this approach only writes to the notebook's ephemeral storage. If you need long-term storage, you will want to mount your Google Drive and save to an appropriate path, as sketched below.
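
If you are running in Colab, that looks roughly like the following; the destination path is only an example, and you would then point the save loop below at it instead of the local texts directory:

# Colab only: mount Google Drive so saved texts survive the session.
from google.colab import drive

drive.mount("/content/drive")
PERSISTENT_TEXT_DIR = Path("/content/drive/MyDrive/gkg-texts")  # example destination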

In [10]:
TEXT_DIR = Path("texts")
TEXT_DIR.mkdir(parents=True, exist_ok=True)

for i, text in enumerate(texts):
    fp = TEXT_DIR / f"{i:04}.txt"
    with fp.open("w", encoding="utf8") as outfile:
        outfile.write(text)

Reloading texts🔗

The texts can now be reloaded as follows:

In [11]:
texts = []
for fp in sorted(TEXT_DIR.iterdir()):
    texts.append(fp.read_text(encoding="utf8"))
In [12]:
texts[0]
Out[12]:
'Sam Smith revealed that they got an interesting gift from fellow singer-songwriter Ed Sheeran. During their appearance on The Kelly Clarkson Show, Smith talked about the massive NSFW (Not Safe For Work) present Sheeran recently sent them.\n\n"It\'s actually wild. I thought it was a joke," Smith said during Tuesday\'s (October 11th) episode. "It\'s a six-foot-two marble penis. It\'s two tons. I have to get it craned into my house." Clarkson thought the gift was hilarious and wanted to know more. "In your foyer? Like, what\'s going to happen?" To which Smith replied, "Well, I want to turn it into a fountain, which I think will be hard to do."'