I write about my explorations in AI and other quaintitative areas.
For more about me and my other interests, visit playgrd, quaintitative or socials below
Wikidata is an open and rich knowledge base that draws on the structured data of its Wikimedia sister projects - Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. In simple terms, Wikidata is a collection of pages made up of items (identified with a Q), which have properties (identified with a P) that connect them to other items. For example, Douglas Adams (Q42) has country of citizenship (P27) United Kingdom (Q145).
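To make this concrete, here is a heavily trimmed, hypothetical sketch of how such an item looks as data (the real Q42 record carries many more labels and claims) - this nested shape is what the code later in this post walks through:
# Conceptually, a Wikidata item is a nested dict: an id, labels, and claims keyed by property ID.
# Heavily trimmed, hypothetical view of Q42:
douglas_adams = {
    "id": "Q42",
    "labels": {"en": {"value": "Douglas Adams"}},
    "claims": {
        "P27": [  # country of citizenship
            {"mainsnak": {"datatype": "wikibase-item",
                          "datavalue": {"value": {"numeric-id": 145}}}},  # Q145 = United Kingdom
        ],
    },
}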
Wikidata datasets are hence very useful for building graphs that connect different entities (items) with different relationships (properties).
There are two ways to get at such data, both of which I will describe here. Both involve the qwikidata library, which we first install:
pip install qwikidata
The first approach, available in the notebook here, involves downloading a full JSON dump of Wikidata from here.
We first create the object to hold the dump.
from qwikidata.json_dump import WikidataJsonDump

# example: we downloaded wikidata-20201130-all.json.bz2
date = 'wikidata-20201130-all'
filename = f'{date}.json.bz2'

# create an instance of WikidataJsonDump pointing at the dump file
wjd_dump_path = filename
wjd = WikidataJsonDump(str(wjd_dump_path))
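As a quick sanity check (not part of the original notebook), we can iterate over the dump object; WikidataJsonDump yields one entity dict per entry:
# peek at the first entity to confirm the dump loads correctly
for entity in wjd:
    print(entity["type"], entity["id"])
    break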
Assume we want to extract all relationships between a list of NYSE and NASDAQ companies.
import pandas as pd
from pathlib import Path

# directory containing the ticker-to-Wikidata-ID CSVs (adjust to your setup)
data_dir = Path('data')

nasdaq_tickers = pd.read_csv(data_dir/'NASDAQ_wiki.csv', names=['ticker', 'wikidataid'])
nyse_tickers = pd.read_csv(data_dir/'NYSE_wiki.csv', names=['ticker', 'wikidataid'])
nasdaq_tickers['mkt'] = 'NASDAQ'
nyse_tickers['mkt'] = 'NYSE'
tickers = pd.concat([nyse_tickers, nasdaq_tickers], axis=0)
assert len(tickers) == len(nasdaq_tickers) + len(nyse_tickers)
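The combined tickers frame now has one row per company: the ticker symbol, its Wikidata entity ID, and the market it trades on. A quick (illustrative) peek:
# inspect the combined frame (rows shown are illustrative; actual content depends on the CSVs)
#   ticker  wikidataid   mkt
#   AAPL    Q312         NASDAQ
print(tickers.head())
print(tickers.mkt.value_counts())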
Credits to this source for the list of NYSE and NASDAQ company entity IDs we used above.
Now we use the following loop (explained in the comments) to build the knowledge graph triplets.
from tqdm import tqdm

# This runs for a while because of the number of JSON entries in the dump.
# There are 'item' and 'property' entries; we scan for 'item' entries whose
# IDs match the companies we want. Each matching item carries a dict of
# claims (its property statements). A claim key can hold several statements,
# so we iterate through each list and keep only statements of datatype
# 'wikibase-item' - claims can also point at quantities, strings, dates, etc.
company_list = list(tickers.wikidataid)
company_triplets = []

for index, entity_dict in tqdm(enumerate(wjd)):
    if entity_dict["type"] == "item" and entity_dict["id"] in company_list:
        company = entity_dict
        ticker_row = tickers[tickers.wikidataid == company['id']]
        try:
            for claim in company['claims'].keys():
                for claimno in range(len(company['claims'][claim])):
                    mainsnak = company['claims'][claim][claimno]['mainsnak']
                    if mainsnak['datatype'] == 'wikibase-item':
                        company_triplets.append({
                            'source': company['id'],
                            'source_name': company['labels']['en']['value'],
                            'relationship': claim,
                            'target': 'Q' + str(mainsnak['datavalue']['value']['numeric-id']),
                            'ticker': ticker_row.ticker.values[0],
                            'id': ticker_row.wikidataid.values[0],
                            'mkt': ticker_row.mkt.values[0]})
        except Exception:
            # some entities lack an English label or a datavalue; keep a placeholder row
            company_triplets.append({
                'source': company['id'],
                'source_name': 'NIL',
                'relationship': 'NIL',
                'target': 'NIL',
                'ticker': ticker_row.ticker.values[0],
                'id': ticker_row.wikidataid.values[0],
                'mkt': ticker_row.mkt.values[0]})
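With the loop done, the triplets can be collected into a DataFrame and, if you want an actual graph object, loaded into networkx. Using networkx is my own suggestion here rather than something the original notebook specifies:
import networkx as nx

# one row per (company, property, target entity) triplet
triplets_df = pd.DataFrame(company_triplets)

# directed graph with the Wikidata property ID stored on each edge
graph = nx.from_pandas_edgelist(triplets_df, source='source', target='target',
                                edge_attr='relationship', create_using=nx.DiGraph)
print(graph.number_of_nodes(), 'nodes,', graph.number_of_edges(), 'edges')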
The second approach uses the qwikidata library to fetch the same information through the Wikidata API, and the notebook is available here. The steps are very similar, so just refer to the notebook.
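For reference, a minimal sketch of what the API-based version could look like; the claim parsing mirrors the dump loop above, with error handling and the ticker columns left out:
from qwikidata.linked_data_interface import get_entity_dict_from_api

api_triplets = []
for qid in company_list:
    entity_dict = get_entity_dict_from_api(qid)  # fetch one entity over HTTP
    for claim, statements in entity_dict['claims'].items():
        for statement in statements:
            mainsnak = statement['mainsnak']
            # keep only claims that point at other Wikidata items
            if mainsnak.get('datatype') == 'wikibase-item' and 'datavalue' in mainsnak:
                api_triplets.append({
                    'source': qid,
                    'relationship': claim,
                    'target': 'Q' + str(mainsnak['datavalue']['value']['numeric-id'])})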
This repository has the full code.
And that’s it. Happy data explorations!