Parallelize loading include objects #6501

Open: wants to merge 3 commits into base: alpha
21 changes: 21 additions & 0 deletions spec/ParseQuery.spec.js
@@ -2439,6 +2439,27 @@ describe('Parse.Query testing', () => {
});
});

it('handles nested includes with no results', async () => {
const obj = new TestObject({ foo: 'bar' });
await obj.save();

const query = new Parse.Query(TestObject)
.equalTo('objectId', obj.id)
.include('fake.path');

// This query should not reject
const results = await query.find();

// ... and we should get back the one object we saved
expect(results.length).toBe(1);

// ... and it should not add a `fake` attribute from the include path,
// since it doesn't exist on the object
expect(
Object.prototype.hasOwnProperty.call(results[0].attributes, 'fake')
).toBeFalse();
});

it("include doesn't make dirty wrong", function(done) {
const Parent = Parse.Object.extend('ParentObject');
const Child = Parse.Object.extend('ChildObject');
105 changes: 66 additions & 39 deletions src/RestQuery.js
@@ -747,30 +747,61 @@ RestQuery.prototype.handleExcludeKeys = function() {
};

// Augments this.response with data at the paths provided in this.include.
//
// Preconditions:
// - `this.include` is an array of arrays of strings; (in flow parlance, Array<Array<string>>)
//
// - `this.include` is de-duplicated. This ensures that we don't try to fetch
// the same objects twice.
//
// - For each value in `this.include` with length > 1, there is also
// an earlier value for the prefix of that value.
//
// Example: ['a', 'b', 'c'] in the array implies that ['a', 'b'] is also in
// the array, at an earlier position.
//
// This prevents trying to follow pointers on unfetched objects.
RestQuery.prototype.handleInclude = function() {
if (this.include.length == 0) {
return;
}

var pathResponse = includePath(
this.config,
this.auth,
this.response,
this.include[0],
this.restOptions
);
if (pathResponse.then) {
return pathResponse.then(newResponse => {
this.response = newResponse;
this.include = this.include.slice(1);
return this.handleInclude();
});
} else if (this.include.length > 0) {
this.include = this.include.slice(1);
return this.handleInclude();
}
// The list of includes forms a sort of tree - each path should wait to
// start loading until its parent path has finished loading (so that the
// pointers it is trying to read and fetch are in the object tree).
//
// So, for instance, if we have an include of ['a', 'b', 'c'], that must
// wait on the include of ['a', 'b'] to finish, which must wait on the include
// of ['a'] to finish.
//
// This `promises` object is a map of dotted paths to promises that resolve
// when that path has finished loading into the tree. One special case is the
// empty path (represented by the empty string). This represents the root of
// the tree, which is `this.response` and is already fetched. We set a
// pre-resolved promise at that level, meaning that include paths with only
// one component (like `['a']`) will chain onto that resolved promise and
// are immediately unblocked.
const promises = { '': Promise.resolve() };
Contributor:

I don't understand this idiom. Trying to figure out how I can figure it out :).

Contributor Author:

This is a little bit tricky, I could probably comment this more clearly!

What this intends is that entries in this.include that have no nesting will start processing right away, by chaining a then handler onto this resolved promise.

For an include with no nesting, its entry in this.include will be an array with one element: ['user'], for instance. From that path we extract a prefix, which in this case will be [].join('.') which is the empty string ''. We would find the above base case with promises[''].

It might be even clearer to write this as an inline default:

const promises = {};
this.include.forEach(path => {
  const prefix = path.slice(0, -1).join('.');
  const loadAfter = promises[prefix] || Promise.resolve();
  ...
});

That seems like it is more idiomatic JS, I think I'll probably follow up and switch it over.
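The inline-default version of this idiom can be exercised as a standalone sketch. The `loadPath` callback below is a stand-in for the real `includePath` subquery; all names are illustrative, not parse-server's actual API:

```javascript
// Build a map from dotted include paths to promises, where each path's
// load step waits on its parent prefix. Relies on the precondition that
// every prefix appears in `include` before its extensions.
async function loadIncludes(include, loadPath) {
  const order = []; // records completion order, for illustration
  const promises = {};
  include.forEach(path => {
    const prefix = path.slice(0, -1).join('.');
    // Root-level paths have prefix '' and no parent entry: start right away.
    const loadAfter = promises[prefix] || Promise.resolve();
    promises[path.join('.')] = loadAfter.then(() =>
      loadPath(path).then(() => order.push(path.join('.')))
    );
  });
  await Promise.all(Object.values(promises));
  return order;
}

// 'a' must finish before 'a.b', which must finish before 'a.b.c'.
loadIncludes(
  [['a'], ['a', 'b'], ['a', 'b', 'c']],
  () => new Promise(resolve => setTimeout(resolve, 10))
).then(order => console.log(order.join(' -> '))); // logs "a -> a.b -> a.b.c"
```

Because each entry chains onto its parent prefix's promise, root-level paths start immediately while nested paths wait exactly as long as they need to.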


this.include.forEach(path => {
Contributor (@acinader, Mar 13, 2020):

I think that having some kind of concurrency limiter would be a plus. Something that mimics what bluebird.map offers as an option?

I could imagine a scenario where you have an array of pointers as a field, and we're including for each of those objects, and you could end up with a dog pile of queries (this is off the cuff, I'd have to think through how that would actually work...)

Contributor Author:

I think that the way this works is that we'll end up issuing one query for each include, regardless of how many objects it is fetching. For each path we gather all the pointers at that path, and then do one subquery with where: { objectId: { $in: objectIds } }.

That said, there's a degenerate case where someone is trying to load in tons of include paths that could be dangerous, and a concurrency limiter might be a helpful thing to add. I don't love the idea of adding another promise implementation as a dependency. I could write a similar helper for native promises, but I wonder if someone has already created a nice implementation we can reuse... I'll look around a bit!

Contributor:

I also suspect that this could be written more naturally as

const promises = this.include.reduce(() => {}, Promise.resolve())

but I'm still trying to grok, so feel free to disregard... :).

Contributor Author (@noahsilas, Mar 13, 2020):

const promises = this.include.reduce(() => {}, Promise.resolve())

I think this is conceptually similar to what's going on in the existing implementation, in that it will wait for each include to finish loading before starting the next one, which leads to the sort of "waterfall" that I'd like to avoid.

The thought behind this implementation is that the include paths form a sort of tree. All the includes that have only one path component can be unblocked at once, since none of them have dependencies on any data that wasn't part of the initial query.

From there, as each of those paths resolves, they can unblock any nested includes that are one path component deeper; for instance, if you are include post,post.author,post.category, then we would start a subquery for pointers at the post path immediately, and as soon as that resolves, we would ungate the two subqueries that rely on that one, post.author and post.category.
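That unblocking behavior can be demonstrated with a small sketch that gates each stubbed subquery on its parent path (a hypothetical stand-in for handleInclude, not the PR's actual code):

```javascript
// Gate each stubbed subquery on its parent include path and track how many
// run at once (hypothetical helper, not the PR's handleInclude itself).
function gateIncludes(include, runSubquery) {
  const promises = { '': Promise.resolve() };
  include.forEach(path => {
    const parent = promises[path.slice(0, -1).join('.')];
    promises[path.join('.')] = parent.then(() => runSubquery(path));
  });
  return Promise.all(Object.values(promises));
}

let running = 0;
let peak = 0;
const subquery = () => {
  running += 1;
  peak = Math.max(peak, running);
  return new Promise(resolve =>
    setTimeout(() => { running -= 1; resolve(); }, 10)
  );
};

// post.author and post.category both start as soon as post finishes,
// so two subqueries end up in flight at the same time.
gateIncludes(
  [['post'], ['post', 'author'], ['post', 'category']],
  subquery
).then(() => console.log(peak)); // logs 2
```

The old recursive implementation would have run these three subqueries strictly one after another; here the two siblings overlap as soon as their shared parent resolves.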

Contributor:

I don't love the idea of adding another promise implementation as a dependency.

yes, I'm certainly not suggesting requiring bluebird, just an example...

Contributor (@acinader, Mar 13, 2020):

I think what you're doing here and the initiative you're taking is awesome. Honestly, I'm just trying to keep up!

My comments, as I've tried to point out, are me just thinking out loud as I try to process what you're presenting. Not that your replies sound defensive AT ALL, just want to make it clear :).

I think I should make a test case or two and step through the code so I can be sure that I understand, so I can then be helpful in reviewing this pr -- I'll do it, but it may take me a few days.

Having said all that, here's some feedback on your responses.

  1. I don't think you're on the same page with me on using reduce vs. forEach. Using reduce doesn't necessitate waiting before proceeding to the next iteration and that wasn't what I had in mind. My only point is that if you're declaring a variable that you then populate in a loop, it's a candidate for reduce (or map).

  2. I wasn't thinking explicitly about an array of different types; I was more thinking about an array of a dozen or so pointers of the same class, each of which has three or four pointer fields that are all included, and each of those has another one or two objects to include. In that situation, would this change introduce a potential dog pile? I don't know the answer, and I'd have to make a test and step through it to have any clue if I'm onto something or not... (you may already have an idea, but I don't).

Contributor Author:

Thank you so much for your feedback! It's definitely hard to convey tone in github comments, and I'm used to leaving reviews on stuff at work where I have preexisting relationships with the reviewers, so getting back into the habit of being a little less terse would probably help me convey my thoughts more effectively, haha.

Thanks for your attention and thoughtfulness on my PRs. I notice your profile says you're in SF - maybe I can get you a coffee or a drink sometime to show my appreciation!

And now, into the responses!

Re: 1

const promises = this.include.reduce(() => {}, Promise.resolve())

I think that because of the initial value being a Promise, I thought that you were suggesting using the reducer to create a promise chain, like this:

const promise = this.include.reduce(
  (runAfter, path) => runAfter.then(() => includePath(path, ...)),
  Promise.resolve()
);
await promise;

But it sounds like you are actually thinking about the reduce as a replacement for the usage of forEach to construct the object that indexes the promises by path. Is something like this what you're thinking about?

const promises = this.include.reduce((acc, path) => {
  const parentDottedPath = path.slice(0, -1).join('.');
  const myDottedPath = path.join('.');
  const runAfter = acc[parentDottedPath] || Promise.resolve();
  acc[myDottedPath] = runAfter.then(() => includePath(path, ...));
  return acc;
}, {});
await Promise.all(Object.values(promises));

I'm happy to make such a switch. I find myself often waffling on the usage of reduce when building up an object - using reduce nicely signifies "I am building a single value", which can help readability, but the ergonomics of it feel just a little off to me. I think it might be something about the type of the accumulator not being obvious until you reach the initialValue argument which is sort of dangling when the reducer function is inline this way. Maybe that suggests I should add some type annotations?

type PromiseMap = { [dottedPath: string]: Promise<void> }

const promises = this.include.reduce((acc: PromiseMap, path: Array<string>) => {
  ...
}, {});

Let me know what your preferences here are, and I'm happy to incorporate that feedback!

Re: 2

So, it sounds like you're envisioning a scenario like:

class Item extends Parse.Object { }

// imagine that there are many "items" with characteristics like this:
const children = _.times(100, idx => new Item({ number: idx }));
const parent = new Item({ number: 0, children });
await parent.save();

// initiate a query that starts prefetches on those big arrays, potentially recursing down
new Parse.Query(Item).include('children.children').find();

It seems like you're worried about a scenario like this, where we have huge fan-out in the number of documents fetched via include - our base query could get 100 documents, which could each reference 100 documents, which could each reference 100 documents, and that's potentially a million rows that we would be fetching.

This is certainly a legitimate concern! I think it's somewhat alleviated by how this is already implemented. At a high level, when we are doing an include on a path, the code looks roughly like this:

async includePath(path, object) {
  const pointers = recursivelyFindPointersAtPath(path, object);
  const groups = _.groupBy(pointers, p => p.className);
  const subqueries = _.map(groups, (pointers, className) => {
    const where = { objectId: { $in: pointers.map(p => p.objectId) } };
    return get(`/classes/${className}`, { where });
  });
  const responses = await Promise.all(subqueries);
  responses.forEach(response => substitutePointers(path, object, response));
}

So, for each included path, we do at most one query per included class, regardless of how many included pointers there are. In this step, we're also already letting the number of concurrent queries fan out to the number of distinct classes that we found pointers to in this path. This seems like another good place for a concurrency limit!
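The per-class batching described above can be sketched roughly like this (illustrative only; the pointer fields match the REST pointer format, but the helper is not parse-server's implementation):

```javascript
// Group REST-format pointers by className, producing one { where } clause
// per class. A sketch of the batching described above, not the actual
// parse-server implementation.
function buildSubqueryWheres(pointers) {
  const byClass = {};
  for (const p of pointers) {
    (byClass[p.className] = byClass[p.className] || []).push(p.objectId);
  }
  return Object.entries(byClass).map(([className, objectIds]) => ({
    className,
    where: { objectId: { $in: objectIds } },
  }));
}

const pointers = [
  { __type: 'Pointer', className: 'Item', objectId: 'x1' },
  { __type: 'Pointer', className: 'Item', objectId: 'x2' },
  { __type: 'Pointer', className: 'User', objectId: 'u1' },
];
// Three pointers collapse into two subqueries, one per class.
console.log(buildSubqueryWheres(pointers).length); // logs 2
```

This is why the number of subqueries per path is bounded by the number of distinct classes, not by the number of pointers.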

While looking at this, I noticed something that strikes me as potentially a related problem that I'd like to address later - it seems like the subquery is going to use the default limit of 100 rows. In that scenario, we'll end up dropping pointers to the objects that didn't fit in that batch 😨

But, that feels like a sort of separate problem, and is probably one that requires some kind of careful decision. It would definitely be a breaking change if some queries suddenly started returning something like ParseError.TOO_MANY_INCLUDED_OBJECTS.

Whew! That's a lot of words about how I'm thinking about these things. I think I'll try to verify the problem around too many included objects and open an issue about it tomorrow so we can discuss options about solving that separately.

Thanks again for your time and energy looking in to this with me! I have a couple high traffic queries that include something like 4 related objects, so I'm personally pretty excited about the potential performance boost here 😁

Contributor Author:

Oof - looks like I was wrong about the subquery being limited. It seems like the default limit: 100 is being applied by the ClassesRouter when the query is received by the server. This nested query doesn't traverse through that level of the stack, and so doesn't get any limit applied. So, the good news is that include should always find all the related documents! The bad news is that with a high degree of fan-out, this could become an enormous fetch.

Contributor (@acinader, Mar 13, 2020):

But that's another issue, right? I'm not worried about that problem, we can't prevent every user from shooting themselves in the foot and your good change doesn't exacerbate the issue I don't think beyond the concurrency limit.

The appropriate tool for large nested objects is a relation and I think the documentation makes that clear.

My only concern at this point with the change, which we definitely want, is to make sure that when we release this, queries that are maybe sluggish but work don't start bringing down data servers because we're doing 1k queries simultaneously instead of sequentially.

On the loop vs. reduce issue, you clearly understand my thought, so I'll now defer to your judgment. I also appreciate your detailed and thoughtful response.

I'm gonna step through your code today to make sure I'm properly grokking.

PS: we work DIRECTLY across Market Street from each other. In a couple of weeks when things get back to normal (hopefully), I'd love to meet up. In the meantime, if you have any spare cycles at all, I'd like to add you to the parse-server and js-sdk teams (do you use any of the other SDKs?). Also, you can contact me directly on the parseopensource slack channel if it's helpful in the meantime.

Contributor Author:

But that's another issue, right?

Yeah, the giant fan-out for large include lists is already present and unrelated to this change. I'm still going to think about this a little bit, but I think you're probably right that it's up to users to protect themselves from this. Maybe the follow up here will be to add a note to the documentation later?

when we release this, queries that are maybe sluggish but work, don't start bringing down data servers cause we're doing 1k queries simultaneously instead of sequentially

Ok, I think I understand! I think the expected fan-out that this introduces is roughly the largest number of included paths at a given depth. So, include('a.b.c', 'd.e.f') should have a concurrency of 2 (since there are exactly two chains of dependencies), while include('foo', 'bar', 'baz', 'biz') would have a concurrency of 4 (all of those includes are at depth 1 and can be processed in parallel).

The worst-case concurrency should be equal to the number of include paths. As an example, include('a', 'b.foo', 'b.bar', 'b.baz') would start with a concurrency of 2 (a and b); if the query for a is still running when the query for b completes, that would start the queries for the three fields dependent on b, and we would have a concurrency of 4.
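Under that analysis, the expected fan-out can be estimated by counting include paths per depth. A small illustrative helper (not part of the PR; it assumes the include list is already expanded into its dotted-path form):

```javascript
// Estimate the expected include fan-out as the largest number of include
// paths at any single depth. Illustrative helper, not part of the PR.
function maxPathsAtDepth(dottedIncludes) {
  const countsByDepth = {};
  for (const dotted of dottedIncludes) {
    const depth = dotted.split('.').length;
    countsByDepth[depth] = (countsByDepth[depth] || 0) + 1;
  }
  return Math.max(...Object.values(countsByDepth));
}

// Two independent chains: at most two paths load in parallel at a depth.
console.log(maxPathsAtDepth(['a', 'a.b', 'a.b.c', 'd', 'd.e', 'd.e.f'])); // logs 2

// Four root-level includes can all run in parallel.
console.log(maxPathsAtDepth(['foo', 'bar', 'baz', 'biz'])); // logs 4
```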

I think that having hundreds of paths in your include is a bit of a degenerate case, but certainly there's nothing stopping someone from using parse-server in that way. Maybe this is another good thing to add to the documentation around the include field?

One thing I'm noticing as I've been pushing patches on to this branch to make the tests all pass is that the traversal stuff is actually quite complex. I'm thinking that it might be worthwhile to drop that from this PR. I tried a bit more of a minimalistic approach, and that looks something like this: https://github.com/Hustle/parse-server/commits/concurrent-includes-minimal (note particularly 87d6d91). Does that seem like a preferable approach? I think it's sort of hard to interpret some of the trickiness in replacePointers, but it certainly seems less likely to introduce subtle breakages. I'd love your thoughts about the different approaches!

I've also added 03c02cd which hopefully will make it easier to step through what's going on in handleInclude. Let me know how that reads to you!

I'm mostly working in JS these days, so happy to keep contributing to parse-server and the JS SDK! What does it mean to be on those teams?

My fingers are crossed that things get back to normal fast! Looking forward to meeting you sometime then 😄

const dottedPath = path.join('.');

// Get the promise for the parent path
const parentDottedPath = path.slice(0, -1).join('.');
const parentPromise = promises[parentDottedPath];

// Once the parent promise has resolved, do this path's load step
const loadPromise = parentPromise.then(() =>
includePath(this.config, this.auth, this.response, path, this.restOptions)
);

// Put our promise into the promises map, so child paths can find and chain
// off of it
promises[dottedPath] = loadPromise;
});

return pathResponse;
// Wait for all includes to be fetched and merged in to the response tree
return Promise.all(Object.values(promises));
};

// Returns a promise of a processed set of results
@@ -821,7 +852,10 @@ RestQuery.prototype.runAfterFindTrigger = function() {

// Adds included values to the response.
// Path is a list of field names.
// Returns a promise for an augmented response.
//
// Modifies the response in place by replacing pointers with fetched objects
// Returns a promise that resolves when all includes have been fetched and
// substituted into the response.
function includePath(config, auth, response, path, restOptions = {}) {
var pointers = findPointers(response.results, path);
if (pointers.length == 0) {
@@ -905,19 +939,12 @@ function includePath(config, auth, response, path, restOptions = {}) {
return replace;
}, {});

var resp = {
results: replacePointers(response.results, path, replace),
};
if (response.count) {
resp.count = response.count;
}
return resp;
replacePointers(response.results, path, replace);
});
}

// Object may be a list of REST-format objects to find pointers in, or
// it may be a single object.
// If the path yields things that aren't pointers, this throws an error.
// Path is a list of fields to search into.
// Returns a list of pointers in REST format.
function findPointers(object, path) {
@@ -951,8 +978,8 @@ function findPointers(object, path) {
// in, or it may be a single object.
// Path is a list of fields to search into.
// replace is a map from object id -> object.
// Returns something analogous to object, but with the appropriate
// pointers inflated.
//
// Performs in-place replacements on object
function replacePointers(object, path, replace) {
if (object instanceof Array) {
return object
@@ -964,27 +991,27 @@ function replacePointers(object, path, replace) {
return object;
}

// base case: we're returning a replacement
if (path.length === 0) {
if (object && object.__type === 'Pointer') {
return replace[object.objectId];
}
return object;
}

// penultimate case: we're looking at the object holding a reference
// to be mutated
else if (path.length === 1) {
object[path[0]] = replacePointers(object[path[0]], [], replace);
return;
}

var subobject = object[path[0]];
if (!subobject) {
return object;
}
var newsub = replacePointers(subobject, path.slice(1), replace);
var answer = {};
for (var key in object) {
if (key == path[0]) {
answer[key] = newsub;
} else {
answer[key] = object[key];
}
}
return answer;

return replacePointers(subobject, path.slice(1), replace);
}
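The in-place mutation strategy above can be mirrored in a standalone sketch (simplified relative to the diff: it assumes plain JSON-style objects and only the common cases):

```javascript
// Standalone sketch of in-place pointer replacement along a path,
// mirroring the mutation strategy in the diff (simplified: assumes
// plain JSON-style objects).
function replacePointersInPlace(object, path, replace) {
  if (Array.isArray(object)) {
    object.forEach(item => replacePointersInPlace(item, path, replace));
    return;
  }
  if (typeof object !== 'object' || object === null) {
    return;
  }
  if (path.length === 1) {
    const value = object[path[0]];
    if (value && value.__type === 'Pointer') {
      // Penultimate case: mutate the holder rather than rebuilding it.
      object[path[0]] = replace[value.objectId] || value;
    }
    return;
  }
  replacePointersInPlace(object[path[0]], path.slice(1), replace);
}

const doc = { post: { __type: 'Pointer', className: 'Post', objectId: 'p1' } };
replacePointersInPlace(doc, ['post'], { p1: { objectId: 'p1', title: 'hello' } });
console.log(doc.post.title); // logs hello
```

Mutating in place avoids rebuilding a new object at every level of the recursion, which is what the rewritten replacePointers is doing.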

// Finds a subobject that has the given key, if there is one.