Description
Preconditions and environment
- See 2.4.8-beta1
- (Should be present from 2.4.4)
- In order for the performance impact to be measurable, the system would need to have many products in various categories
- Note: that in any case
dev:query-log:enable
will show the extra queries
- Note: that in any case
Steps to reproduce
- Create a csv file with two (2) columns
- sku and any other column should be enough
- 1-2 rows should be enough
- Run the product import logic with append option selected
- (Both the API and admin panel import should exhibit this behavior)
Expected result
Extra logic related to loading / saving category data will not be called
Actual result
- Logic related to finding the min position is called
- Logic related to insert, on duplicate key update, is called for the existing category product relations
- Logic related to loading the category collection for the entire site is called
Additional information
If the above three sets of calls; when the product to category table (catalog_category_product) is large loading the min position is the most expensive.
This is because their is no index on position along with category_id and the query needs to scan all entries for that category.
SELECT MIN(position) AS `position` FROM `catalog_category_product` WHERE (category_id = ?)
The insert of existing data is un-needed, but seems to cost less.
Logic related to loading the category collection for the entire site is called once in bulk; it costs a flat amount per import depending on the number of categories in the site.
Code related notes
5da1c357bfd9 appears to introduce most of the issue for 1 and 2 above.
It changes processRowCategories to load the existing category product data from the product.
The change likely was intended make the data available for AfterImportDataObserver as generating the new urls appears to require the existing categories. (It seems, that the data made available to AfterImportDataObserver will differ if a COL_CATEGORY is set, as if it is, the product's existing category data will not be loaded).
The issue is that Product::saveProductCategoriesPhase adds it directly into the categoriesCache
. Without differentiating between rowData and existing/loaded product data.
(And Product::_saveProductCategories is passed categoriesCache
with both rowData and existing product category associations and calling getProductCategoriesDataSave) which calls the min position query and the insert, even if COL_CATEGORY is not present in the import's columns)
For 3 above, CategoryProcessor calls initCategories during __construct
.
Possible fixes:
saveProductCategoriesPhase
andprocessRowCategories
can be adjusted, where processRowCategories returns some metadata to differentiate rowData from product loaded data. ThensaveProductCategoriesPhase
can add tocategoriesCache
with the value false. Then_saveProductCategories
can be called with only the 'true' values.- Alternatively,
processRowCategories
can be adjusted to not load the product's data, andgetProductCategories
can be adjusted to load the product data if rowData is missing. (processRowCategories
could also load the product data into a sperate property andgetProductCategories
can use which ever property has data available).- (Also,
AfterImportDataObserver
has it's own checks on rowData and in many cases loading the product to get the existing category data will not be needed).
- (Also,
- It should be possible to Proxy the
CategoryProcessor
if https://github.com/magento/magento2/blob/2.4.8-beta1/app/code/Magento/CatalogImportExport/Model/Import/Product.php#L1641C47-L1641C68clearFailedCategories
is called insideprocessRowCategories
after the check on !empty($rowData[self::COL_CATEGORY]
Release note
No response
Triage and priority
- Severity: S0 - Affects critical data or functionality and leaves users without workaround.
- Severity: S1 - Affects critical data or functionality and forces users to employ a workaround.
- Severity: S2 - Affects non-critical data or functionality and forces users to employ a workaround.
- Severity: S3 - Affects non-critical data or functionality and does not force users to employ a workaround.
- Severity: S4 - Affects aesthetics, professional look and feel, “quality” or “usability”.