Skip to content

[RFC]: add support for bootstrap and jackknife resampling #133

Closed
@notacountry

Description

@notacountry

Full name

Zach Land

University status

Yes

University name

Imperial College London

University program

Joint Mathematics and Computer Science

Expected graduation

2027

Short biography

Hi! I'm Zach, and I am a first-year Math & CS undergraduate at Imperial College London with a 90.9% grade average.
I am relatively new to open source development, although I did spend 3 years as an active Wiki contributor and administrator (including development of backend templates & data structures).
My primary interest lies in finance and data science, and I want to pursue a career in the field. I am well-versed in statistical methods, both from my university courses and recreationally in Python and C++, where I've recently been reading, implementing, and optimizing various algorithms.
My best programming language is most likely C#, but I have significant experience programming in C++, Haskell, Java, and Kotlin, and in web development with CSS, HTML, and JavaScript with React.

Stdlib interests me in particular because I don't need to turn to another language such as Python or R for data analysis. I feel I can contribute significantly to its mathematics and statistics modules.

A full profile of my work experience and projects can be found on my LinkedIn.

Timezone

Greenwich Mean Time (GMT, UTC+00:00)

Contact details

email:zach123land@gmail.com

Platform

Windows

Editor

My IDE tends to depend on the language. I use Visual Studio for C# and Sublime Text for HTML, but in general I gravitate towards VSCode since it was easy to adapt to and use. I am looking toward transitioning to NeoVim in the near future though.

Note that my Laptop (from which I will be mainly contributing) uses Windows, however my personal PC uses Fedora Linux so I am okay with programming in and using both.

Programming experience

Key Project

One very recent project that springs to mind is one that my team and I are developing for Imperial College London's month-long AI Agent Hackathon. Our WIP is an accessibility extension to the UK's National Health Service app that allows users to scan medical documents, which are processed using various NLP algorithms such as keyword and sentiment analysis in order to summarize, translate, and signpost (with a focus on correctness and responsibility, as we are dealing with important information which may at times be fatally incorrect).

We are using React Native for our frontend (and I worked on porting several NHS frontend assets into the framework), with a python backend.

Additional Projects

  • Blog Website (JS):
    • Programmed, hosted, and maintained a blog website with search, tagging, and comment functionality.
  • Chatbot & Finance Simulation (C#):
    • Designed and implemented a chatbot which used the Wikipedia API to quiz users and measure their typing speed.
    • Developed a stock market simulator which increased user engagement. Users could buy and sell virtual stocks.
  • Free School Uniform Website (JS):
    • Implemented a website allowing users to search, sort, and order stock from a database that updated in real-time.
  • Black-Scholes research (Python):
    • Analytically and numerically quantified the effect of interest rate on options pricing.
  • British Physics Computational Challenge (Python):
    • Won a gold medal due to my unique (2D and 3D) visualizations and simulations of Kepler's Laws and Newton's Universal Law of Gravitation.

JavaScript experience

I have had a decent amount of experience with JavaScript since I develop websites quite frequently (see programming experience above).

I am very interested in Haskell and the functional programming paradigm in general. To this end, my favorite feature of JavaScript is currying and the ability to manipulate functions as variables. This was actually used in my recent stdlib pull request, wherein the function incrnanmminmaxabs( out, window ) returns an accumulator function which I modified to skip NaN values.

My least favorite feature is JavaScript's dynamic typing. Coming from my C# and Haskell experience, I am very used to having to declare (e.g.) parameter types and function return types, and it feels odd how this is not required in JavaScript. The worst part of this how JavaScript variables can change types. I feel that readability suffers (as someone who likes to precisely document and specify all of my code).

Node.js experience

I learnt Node.js for the first time in preparation for our AI Agents Hackathon (which I mentioned in my programming experience above).

C/Fortran experience

None.

Interest in stdlib

I mentioned this previously, but stdlib provides you with the ability to perform complex calculations within your browser without having to spend time using other languages for data processing.

I don't have too much experience with using stdlib, but one thing I really enjoyed was its huge list of incremental statistics methods. I am interested in stream processing and I find these methods particularly useful. My pull request was related to this, and my showcase utilizes this stdlib feature to perform real-time analysis of stocks.

Version control

Yes

Contributions to stdlib

This pull request adds the package @stdlib/stats/incr/nanmminmaxabs. The package is a thin wrapper around @stdlib/stats/incr/mminmaxabs. It wraps a NaN propagating accumulator and skips NaN values.

stdlib showcase

My showcase uses @stdlib/stats/incr/mmean and @stdlib/plot to take an input stream of financial data, employ a simple moving average strategy, and plot the profit.

I am planning on implementing more (e.g. running an optimization algorithm to improve the sliding window sizes, or implementing more strategies that use stdlib), but unfortunately this application process is happening during a busy time of my life (easter vacation work experience, AI Agents Hackathon, studying for finals next month) so I may not be able to get around to it.

Goals

Stdlib offers some ways to calculate maximum likelihood estimates given some sample data (e.g. see @stdlib/stats/base/mean and @stdlib/stats/base/variance), however from my research I could find no real way in stdlib to calculate the distribution of estimates (i.e. estimators).

I propose to extend this functionality by implementing a resampling package which exposes various resampling methods such as Bootstrapping and Jackknife.

Please note that my ideas are an initial draft. The listed project idea does not mention any mentors to discuss the project with, so any feedback at this stage is invaluable.

I will introduce a package, @stdlib/stats/base/resampling, which contains various sub-packages such as the simple Monte Carlo case resampling algorithm or the more complex Bayesian bootstrap, and then expose these via a class implementation (similar to how @stdlib/stats/base/dists contains distribution packages with sub-packages calculating specific quantities).

This structure was mentioned by kgryte in 2022 on Gitter in reference to statistical distribution implementation, but I feel I can apply that methodology to this project.

Alternatively, we could do what SciPy's bootstrap implementation did and add the various possible algorithms as optional parameters to a bootstrap function, although I dislike this approach.

Inputs and Outputs
Each resampling function should

  • take as inputs: data in the form of a TypedArray, a function which outputs an estimate from a given data set (such as sample mean), and optionally a random number generator for non-deterministic algorithms; and
  • output a TypedArray containing the various estimates made from the bootstrap. This differs from the SciPy implementation, which outputs a confidence interval.

Below I outline some important goals for the project. I understand that implementing many different types of bootstrapping may be overkill for the scope of stdlib. Since the project is 12 weeks, I doubt I will be able to implement all of these different types. I will discuss with my mentors which of the mid-to-low-importance goals I should bring into the project.

High Importance

Mid Importance

  • Poisson bootstrapping is very useful for stream processing and so can be added as a package to @stdlib/stats/incr. Using a Poisson model, we don't need to know the size of the input dataset in advance (see this article from Google's data scientists).
  • Bayesian bootstrapping provides smoother and more reliable estimators, particularly when only a small sample is available. It may be adapted to include a prior representing prior information we have about the data.
  • Bag of Little Bootstraps (BLB) reduces computational complexity for very large data sets by partitioning the sample data into buckets, each of which is separately bootstrapped.

Low Importance

  • Standard bootstrapping assumes the input data is not correlated. One might consider implementing the Block Bootstrap, which preserves the dependence between data somewhat by splitting input into blocks before resampling. This is similar to BLB, but the blocks are partitioned differently. Gaussian proccess bootstrapping is also a consideration since it can also deal with uncorrelated data, however we (funnily enough) require [Idea]: add support for the multivariate normal distribution #11 to be worked upon first.
  • Smooth bootstrapping adds random noise in an attempt to smooth the sample data. Interestingly, smooth bootstrapping can use Kernel Density Estimation as the underlying distribution from which to resample, so we can utilize the pre-defined @stdlib/stats/kde2d package. That being said, it is more computationally efficient to add random noise directly rather than drawing samples from a smooth, continuous approximation with KDE.

Why this project?

What excites me about the proposal is the unique opportunity I have to contribute to open source on a widely-used library for a popular programming language. I feel that stdlib has a lot of room for growth in term of its mathematical capabilities and I would love to contribute to the project in a structured program such as GSoC. My skills in implementing statistical algorithms and bootstrapping lends itself to this project very well, in such a way where I felt lucky to come across a proposal that felt so tailored to what I enjoy doing.

Qualifications

  • I am quite knowledgeable on bootstrapping and statistics generally, and have experience fiddling around with statistics algorithms in C#, C++ and Python.
  • As an A-Level student, I was interested in learning more about statistics. I took MIT's online statistics course in my spare time. It covered maximum likelihood estimators in great detail and thus I am very comfortable working with them. I also read through the A-Level Further Statistics 2 textbook, which was a module that our school didn't allow us to take.
  • During my undergraduate study, we learnt about bootstrapping in the statistics module. Our lecturer is very interested in bootstrapping and mentioned how it was his favorite part of the course. I have a good understanding of bootstrapping from the course, and additionally I can get strong help regarding specific implementation details and optimizations from my lecturer throughout the program.

Prior art

Popular libraries with bootstrapping implementations:

Written Materials

Commitment

Please be aware that I will be doing a full-time 9-week internship which overlaps with most of GSoC, so I expect to be developing part-time. Also, between June 2-20, I have university programming coursework. Despite this, I spend much of my free time programming and, as such, I still believe I can manage my time effectively. This does mean I will be contributing most over the weekends.

Specifically, I will work on the project for around 1 hour every weekday, and as long possible on weekends (say >6 hours per day).

I will contribute and play around with stdlib before the project starts to get used to using Git on a Windows machine and iron out any annoying development environment errors. Note that between finals and my internship start date, I will be available to focus mainly on the GSoC project and thoroughly flesh out my ideas.

Schedule

It is difficult to structure a specific schedule for this project, since it depends on how many resampling methods I will implement during the project. My goals can easily be separated into several mini-projects so I can re-scale the scope of the project easily. The below schedule is highly flexible and includes two weeks of leeway to ensure I will be able to submit the deliverables on time. If the weeks of leeway are not used, I can use the accumulated extra time at the end to implement more mid-to-low importance goals.

Preparation
I will submit more basic pull requests, get used to the stdlib development progress, and iron out development environment errors.

Community Bonding Period (May 8 - June 1):
I will attempt to arrange a voice chat with my mentors to discus the project. We discuss the specific implementation details, edge cases to worry about, how and when to structure and submit my pull requests, etc.
Speaking directly should be a good way to build strong relationships that will last throughout the program.

Week 1 (June 2-8):

  • I will develop the code for the Monte Carlo case sampling algorithm.
  • I will attempt to arrange an informal code review with my mentors.

Week 2 (June 9-15):

  • I will write robust tests and extensive documentation of the Monte Carlo case sampling algorithm.
  • I will submit a pull request and have a formal code review.

Week 3 (June 16-22):

  • I will develop the code for the Jackknife resampling algorithm.
  • I will attempt to arrange an informal code review with my mentors.

Week 4 (June 23-29):

  • My internship begins on June 23. I expect the tasks from now onward to be completed slower.
  • I will write robust tests and extensive documentation for the Jackknife resampling algorithm.
  • I will submit a pull request and have a formal code review.

Week 5 (June 30-July 6):

  • One week of leeway to account for unexpected delays.

Week 6: (Midterm) (July 7-13):

  • I will implement a confidence interval function using basic bootstrap.
  • I will attempt to arrange an informal code review with my mentors.

Week 7 (July 14-21):

  • I will write robust tests and extensive documentation for the basic bootstrap.
  • I will submit a pull request and have a more formal code review.

Week 8-9 (July 21-August 3):

  • Begin implementing one of my suggested Mid-Importance goals.
  • I will attempt to arrange an informal code review with my mentors.

Week 10 (August 4-10):

  • I will write robust tests and extensive documentation for the suggested Mid-Importance goal.
  • I will submit a pull request and have a more formal code review.

Week 11 (August 11-17):

  • One week of leeway to account for unexpected delays.
  • Code freeze.

Week 12 (August 17-24):

  • Reviewing and updating documentation and tests.
  • Final code review.

Final Week (August 25-September 1):

  • Project write-up (I will informally write down my experience throughout the project and bring it all together in a formal post near the end).
  • Project submission.

Related issues

Checklist

  • I have read and understood the Code of Conduct.
  • I have read and understood the application materials found in this repository.
  • I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
  • I have read and understood the patch requirement which is necessary for my application to be considered for acceptance.
  • I have read and understood the stdlib showcase requirement which is necessary for my application to be considered for acceptance.
  • The issue name begins with [RFC]: and succinctly describes your proposal.
  • I understand that, in order to apply to be a GSoC contributor, I must submit my final application to https://summerofcode.withgoogle.com/ before the submission deadline.

Metadata

Metadata

Assignees

Labels

20252025 GSoC proposal.received feedbackA proposal which has received feedback.rfcProject proposal.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions