In this method, we link GitHub repositories to ROR ids using the URL listed for each repository's "owner" entity.
For example, for the fastpli project, we retrieve the owner, 3d-pli, and attempt to match its listed URL
(https://www.fz-juelich.de/de/inm/inm-1) to ROR. The URL's domain name matches Forschungszentrum Jülich in ROR,
so we link 3d-pli/fastpli to Forschungszentrum Jülich.
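The domain-matching step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names url_to_domain and match_url are hypothetical, stripping a leading "www." is an assumed normalization, and the one-entry mapping uses a placeholder instead of a real ROR id.

```python
from urllib.parse import urlparse


def url_to_domain(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    # urlparse only populates the host when a scheme is present
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def match_url(url, domain_to_ids):
    """Look up the URL's domain; return a list of ROR ids (empty if no match)."""
    return domain_to_ids.get(url_to_domain(url), [])


# Hypothetical one-entry mapping; the value is a placeholder, not a real ROR id
domain_to_ids = {"fz-juelich.de": ["<ROR id of Forschungszentrum Jülich>"]}

print(match_url("https://www.fz-juelich.de/de/inm/inm-1", domain_to_ids))
```

Matching on the domain instead of the full URL lets a department page such as the one above still resolve to its parent organization.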
Create a new virtual environment using your favorite tool, and install the project requirements, e.g.:
virtualenv venv
. venv/bin/activate
pip install -r ../requirements.txt
Create two environment variables: GITHUB_ACCESS_TOKEN (which should contain a GitHub personal access token with the
read:org and read:user scopes) and GITHUB_USER.
export GITHUB_ACCESS_TOKEN="your access token"
export GITHUB_USER="your github username"
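Before running the scripts, it can help to confirm that both variables are visible to Python. The check below is an optional sketch; the helper name missing_env_vars is hypothetical and is not part of the project.

```python
import os

REQUIRED_VARS = ("GITHUB_ACCESS_TOKEN", "GITHUB_USER")


def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("GitHub credentials are set.")
```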
The commands in this section only need to be run when you want to update the ROR data used in later scripts.
- Download the latest ROR data from Zenodo.
- Run python3 get_urls_from_bulk_ror.py. This will generate two JSON files: ror_domain_to_ids.json, which maps
  domain names of organizations in ROR to ROR ids, and ror_url_to_ids.json, which maps full URLs of organizations
  in ROR to ROR ids.
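Once generated, the two files can be inspected from Python. The sketch below assumes each file is a single JSON object whose keys are domains (or URLs) and whose values are lists of ROR ids; that layout is inferred from the description above, not confirmed, and the helper names load_mapping and ids_for are hypothetical.

```python
import json
from pathlib import Path


def load_mapping(path):
    """Load one of the generated JSON files into a dict."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def ids_for(key, mapping):
    """Return the ROR ids stored under `key` as a list (empty if absent)."""
    value = mapping.get(key, [])
    # Assumption: values are lists of ids; tolerate a single string as well
    return [value] if isinstance(value, str) else list(value)
```

For instance, ids_for("fz-juelich.de", load_mapping("ror_domain_to_ids.json")) would return the ids recorded for that domain, provided the keys are stored in that form.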
To retrieve the affiliation for a single GitHub repository owner, run gh_owner_to_ror.py with one argument, the
owner name. For example:
python3 gh_owner_to_ror.py stanfordnlp
ROR ids found for stanfordnlp: ['https://ror.org/00f54p054']
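For readers curious how an owner's listed URL can be obtained, the following is a rough sketch using the GitHub REST API, where the owner object exposes its website in the blog field. It is not necessarily how gh_owner_to_ror.py does it: the function names are hypothetical, only GITHUB_ACCESS_TOKEN is used, and the bearer-token header is one of several accepted ways to authenticate.

```python
import json
import os
import urllib.request

API_ROOT = "https://api.github.com"


def build_owner_request(owner, token=None):
    """Build a request for the public profile of a user or organization."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}/users/{owner}", headers=headers)


def owner_url(payload):
    """Pull the listed website out of a decoded owner payload, or None."""
    url = (payload.get("blog") or "").strip()
    return url or None


def fetch_owner_url(owner):
    """Query GitHub (network access required) and return the owner's URL."""
    request = build_owner_request(owner, os.environ.get("GITHUB_ACCESS_TOKEN"))
    with urllib.request.urlopen(request, timeout=30) as response:
        return owner_url(json.load(response))
```

Calling fetch_owner_url("stanfordnlp") would return whatever URL that owner currently lists, which can then be matched against the ROR mappings.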
We have previously run software-ROR linking on software from the ORCA dataset and on software from The Stack.
The scripts for these runs are available in get_{orca,stack}_org_rors.py and can be run without arguments to update
the default bulk data files.