
Extracting and Processing Wikidata datasets

Wikidata is an open and rich knowledge base that draws on the structured data of its Wikimedia sister projects - Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. In simple terms, Wikidata is a collection of pages where we have items (identified with a Q), that have properties (identified with a P), and items can be connected to other items via such properties. For example, Douglas Adams (Q42) has a country of citizenship (P27) of United Kingdom (Q145).

Wikidata datasets are hence very useful for building graphs that connect different entities (items) with different relationships (properties).
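To make the parsing later on concrete, here is roughly what an item's JSON looks like (heavily abridged, with illustrative values; field names follow the standard Wikidata JSON format):

# abridged shape of a Wikidata item dict (illustrative values only)
entity = {
    'type': 'item',
    'id': 'Q42',
    'labels': {'en': {'value': 'Douglas Adams'}},
    'claims': {
        'P27': [  # country of citizenship
            {'mainsnak': {
                'datatype': 'wikibase-item',
                'datavalue': {'value': {'numeric-id': 145}},
            }},
        ],
    },
}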

There are two ways to get at such data, and I will describe both here. Both involve the qwikidata library, which we first install:

pip install qwikidata

The first approach, covered in the notebook here, involves downloading a dump of Wikidata from here.

We first create the object that wraps the dump file:

from qwikidata.json_dump import WikidataJsonDump

# example: we downloaded wikidata-20201130-all.json.bz2
date = 'wikidata-20201130-all'
filename = f'{date}.json.bz2'

# create an instance of WikidataJsonDump
wjd_dump_path = filename
wjd = WikidataJsonDump(wjd_dump_path)
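Before kicking off the long extraction run, it can help to peek at the first entity as a quick sanity check; the dump iterates lazily over entities as plain Python dicts:

# peek at the first entity in the dump
for entity_dict in wjd:
    print(entity_dict['id'], entity_dict['type'])
    break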

Assume we want to extract all relationships between a list of NYSE and NASDAQ companies.

import pandas as pd
from pathlib import Path

data_dir = Path('data')  # assumed location of the ticker CSVs

nasdaq_tickers = pd.read_csv(data_dir/'NASDAQ_wiki.csv', names=['ticker', 'wikidataid'])
nyse_tickers = pd.read_csv(data_dir/'NYSE_wiki.csv', names=['ticker', 'wikidataid'])
nasdaq_tickers['mkt'] = 'NASDAQ'
nyse_tickers['mkt'] = 'NYSE'
tickers = pd.concat([nyse_tickers, nasdaq_tickers], axis=0)
assert len(tickers) == len(nasdaq_tickers) + len(nyse_tickers)

Credits to this source for the list of NYSE and NASDAQ company entity IDs we used above.

Now we use the following loop (explained in the comments) to build the knowledge graph triplets.

from tqdm import tqdm

# This runs for a while, given the number of JSON entities in the dump.
# Entities are of 'item' or 'property' type; we scan for items whose id
# matches a company we want. Each matching item carries a dict of claims
# (the property statements), and each claim key can hold multiple
# statements, so we iterate through the list. We keep only statements
# whose target is another item ('wikibase-item'), since claims can also
# point to other values such as quantities or dates.
company_list = list(tickers.wikidataid)
company_triplets = []
for entity_dict in tqdm(wjd):
    if entity_dict['type'] == 'item' and entity_dict['id'] in company_list:
        company = entity_dict
        # print(company['id'], company['labels']['en']['value'])
        row = tickers[tickers.wikidataid == company['id']].iloc[0]
        try:
            for claim in company['claims'].keys():
                for claimno in range(len(company['claims'][claim])):
                    mainsnak = company['claims'][claim][claimno]['mainsnak']
                    if mainsnak['datatype'] == 'wikibase-item':
                        company_triplets.append({
                            'source': company['id'],
                            'source_name': company['labels']['en']['value'],
                            'relationship': claim,
                            'target': 'Q' + str(mainsnak['datavalue']['value']['numeric-id']),
                            'ticker': row.ticker,
                            'id': row.wikidataid,
                            'mkt': row.mkt,
                        })
        except KeyError:
            # e.g. no English label, or a snak without a datavalue
            company_triplets.append({
                'source': company['id'], 'source_name': 'NIL',
                'relationship': 'NIL', 'target': 'NIL',
                'ticker': row.ticker, 'id': row.wikidataid, 'mkt': row.mkt,
            })
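With the loop done, the triplets can be collected into a DataFrame and saved for graph building later (a minimal sketch; the output filename is just an example):

triplets_df = pd.DataFrame(company_triplets)
triplets_df.to_csv(data_dir/'company_triplets.csv', index=False)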

The second approach involves using the qwikidata library to download the information through the Wikidata API, and the notebook is available here. The steps are very similar, so just refer to the notebook.
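For reference, here is a minimal sketch of the API route using qwikidata's linked data interface; get_entity_dict_from_api returns the same entity dict structure as the dump, so the claim-parsing loop above carries over unchanged:

from qwikidata.linked_data_interface import get_entity_dict_from_api

# fetch a single entity by its Q id - same dict structure as in the dump
entity_dict = get_entity_dict_from_api('Q42')
print(entity_dict['labels']['en']['value'])  # Douglas Adams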

This repository has the full code.

And that’s it. Happy data explorations!

