Add PyPI User API #15769

Open
@import-pandas-as-numpy

What's the problem this feature will solve?

Currently, there is no way to enumerate user information programmatically without BigQuery. This is a problem for security organizations, which may use these characteristics to inform and contextualize automated detection engines. For instance, new users account for a disproportionate share of the malware uploaded. A long-lived account maintaining several packages is, consequently, less likely to upload malware. Additionally, individuals who maintain packages as part of teams or organizations (which may be recursively enumerated through this endpoint) are also statistically less likely to upload malware.

Describe the solution you'd like
I am proposing the following JSON endpoint/response:

GET /user/import-pandas-as-numpy/json
{
  "username": "import-pandas-as-numpy",
  "name": "Rem",
  "joined": "May 1, 2023",
  "packages": [
    {
      "package": "safepull",
      "latest_version": "v2.0.0",
      "last_released": "Apr 1, 2024"
    },
    {
      "package": "foo",
      "latest_version": "v1.1.1",
      "last_released": "Apr 11, 2024"
    }
  ]
}

The date/time fields can of course use a more appropriate data type where relevant; I have left them as strings to make the intent of each field clear.
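To illustrate how a consumer might read the proposed response, here is a minimal sketch in Python. It assumes the response shape shown above and the human-readable date strings as written (for example "May 1, 2023"); the class and function names (`UserSummary`, `PackageSummary`, `parse_user_response`) are hypothetical and not part of any existing PyPI or Warehouse API.

```python
import json
from dataclasses import dataclass
from datetime import date, datetime

# Assumption: dates arrive as strings such as "May 1, 2023", as in the
# proposal. A real endpoint may well use a different format.
DATE_FORMAT = "%b %d, %Y"


@dataclass
class PackageSummary:
    name: str
    latest_version: str
    last_released: date


@dataclass
class UserSummary:
    username: str
    name: str
    joined: date
    packages: list[PackageSummary]


def _parse_date(text: str) -> date:
    """Convert a proposal-style date string into a date object."""
    return datetime.strptime(text, DATE_FORMAT).date()


def parse_user_response(body: str) -> UserSummary:
    """Parse the JSON body of the proposed (hypothetical) user endpoint."""
    data = json.loads(body)
    packages = [
        PackageSummary(
            name=entry["package"],
            latest_version=entry["latest_version"],
            last_released=_parse_date(entry["last_released"]),
        )
        for entry in data["packages"]
    ]
    return UserSummary(
        username=data["username"],
        name=data["name"],
        joined=_parse_date(data["joined"]),
        packages=packages,
    )


# Sample body copied from the proposed response.
SAMPLE = """
{
  "username": "import-pandas-as-numpy",
  "name": "Rem",
  "joined": "May 1, 2023",
  "packages": [
    {"package": "safepull", "latest_version": "v2.0.0", "last_released": "Apr 1, 2024"},
    {"package": "foo", "latest_version": "v1.1.1", "last_released": "Apr 11, 2024"}
  ]
}
"""

user = parse_user_response(SAMPLE)
print(user.username, user.joined.isoformat(), [p.name for p in user.packages])
```

If the endpoint ended up returning machine-readable timestamps instead, only `_parse_date` would need to change.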

Additional context
I have concerns about the misuse of this feature to scrape PyPI's users, potentially facilitating things like automated social engineering attacks by mapping out relationships between authors. I propose that this feature be restricted to credible security organizations, following the conventions of the anti-malware APIs currently in development.

I would like to take a moment to elaborate on how we intend to use this to more effectively secure the Python Package Index:

  • We (@vipyrsec) have, in the past, been asked to perform additional scans on a malicious author's uploads to canvass the full scope of potentially malicious packages on a given account. This endpoint would facilitate those additional scans and allow us to treat users, rather than individual packages, as potentially malicious. The benefit is twofold: the odds of a given user staging an undetected payload are significantly reduced, because we can run more computationally expensive static code analysis tools on these packages, and we can do this programmatically to inform our reporting in anticipation of such a request.

  • We often use account age in conjunction with less clear-cut malicious behavior. An account that has historically maintained good standing and has contributed multiple non-malicious packages over some interval is less likely to have staged malicious code and thus requires less critical examination. A new user uploading "azure-data-interactions" is much more alarming than a long-standing package maintainer uploading the same package.
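As a rough illustration of the account-age point, the sketch below shows how account age and package count might feed a triage decision. The function names and both thresholds (`min_age_days`, `min_packages`) are hypothetical values chosen for the example; they do not describe our actual detection rules.

```python
from datetime import date


def account_age_days(joined: date, today: date) -> int:
    """Number of days between account creation and the given day."""
    return (today - joined).days


def needs_deep_scan(
    joined: date,
    package_count: int,
    today: date,
    min_age_days: int = 180,  # hypothetical threshold
    min_packages: int = 2,    # hypothetical threshold
) -> bool:
    """Return True when the account is new or has little publishing history.

    Such accounts would be routed to the slower, more expensive analysis;
    established accounts would receive the lighter default checks.
    """
    is_new = account_age_days(joined, today) < min_age_days
    has_little_history = package_count < min_packages
    return is_new or has_little_history


today = date(2024, 4, 11)

# Long-lived account with two packages: lighter checks.
print(needs_deep_scan(date(2023, 5, 1), 2, today))

# Ten-day-old account with a single upload: deep scan.
print(needs_deep_scan(date(2024, 4, 1), 1, today))
```

In practice this signal would be one input among several rather than a verdict on its own, since an established account can still be compromised.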
