Description
Preconditions and environment
- Magento version 2.4.6-p1
Steps to reproduce
- In the file "vendor/magento/module-customer/etc/di.xml", add your user agent as an additional item in the ignoredUserAgents argument shown below:

  <type name="Magento\Customer\Model\Visitor">
      <arguments>
          <argument name="ignoredUserAgents" xsi:type="array">
              <item name="google1" xsi:type="string">Googlebot/1.0 (googlebot@googlebot.com http://googlebot.com/)</item>
              <item name="google2" xsi:type="string">Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)</item>
              <item name="google3" xsi:type="string">Googlebot/2.1 (+http://www.googlebot.com/bot.html)</item>
          </argument>
      </arguments>
  </type>

- Recompile Magento.
- Generate the static files.
- Clear the caches.
- Visit a product page with your user agent (see the example commands after this list).
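For reference, the steps above can be run from the command line roughly as follows. This is only a sketch: the store URL and product path are placeholders, and the user agent shown is simply one of the default ignoredUserAgents entries; adapt them to your installation.

    # Recompile, regenerate the static files and flush the caches.
    bin/magento setup:di:compile
    bin/magento setup:static-content:deploy -f
    bin/magento cache:flush

    # Request a product page while sending one of the ignored user agents.
    curl -A 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)' 'https://www.example.com/some-product.html'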
Expected result
Since your user agent is configured to be ignored in the file "vendor/magento/module-customer/etc/di.xml", no record should be created in the table "report_viewed_product_index" when you visit any product page of the store.
Actual result
A record is created in the table "report_viewed_product_index" every time you visit a product page of the store.
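One way to observe this is to query the table directly after the visit; the database user and schema name below are placeholders for your installation.

    # A new row appears in the table even though the user agent is listed in ignoredUserAgents.
    mysql -u <db_user> -p <db_name> -e 'SELECT * FROM report_viewed_product_index;'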
Additional information
- The user agents listed in the file "vendor/magento/module-customer/etc/di.xml" are only ignored when they match the incoming user agent as complete strings. By default, the di.xml file contains 3 user agents associated with Googlebot. However, the current Googlebot user agent contains the string "Chrome/W.X.Y.Z", which changes from time to time based on the Chrome version used by the crawler, e.g. "Chrome/41.0.2272.96". So the Apache or Nginx log files have to be checked regularly in order to pick up new Googlebot user agents. The same happens with BingBot, and possibly with other bots. A relative match would be better, for example ignoring any user agent that contains the string "Googlebot/2.1" (see the illustration at the end of this section).
- It is not clear how the visitor_id and customer_id fields of the table "report_viewed_product_index" are updated. When a logged-in customer views a product page, the customer_id field gets a value while the visitor_id field is NULL. When the customer logs out and views another product page, the visitor_id field gets a value while the customer_id field is NULL. However, when the visitor has never logged in while browsing the store, both the visitor_id and customer_id fields are NULL.
Due to the above problems, the table "report_viewed_product_index":
- includes data coming from both real visitors and bots, and it is not possible to tell apart the data coming from bots so that at least the relevant records could be truncated. As a result, the statistics are polluted by bots.
- can become very large within a short time, depending on the number of products.
- generates slow queries when its size becomes large, as it is used in queries via INNER JOINs. A slow query causes a general performance problem on the database, as the involved tables stay open longer while waiting for the query to execute.
- is updated on every product page visit, generating unnecessary load on the database, given that a bot can crawl thousands of pages within a day.
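As a rough illustration of the matching problem described above (this is not the Magento implementation, just a sketch of the two matching strategies): the user agent below is an example of a Googlebot string whose embedded Chrome version changes over time; a complete-string comparison against the default di.xml entries misses it, while a substring match on "Googlebot/2.1" catches it.

    # Example Googlebot user agent; the "Chrome/41.0.2272.96" part varies over time.
    UA='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

    # Complete-string (exact) comparison, as described in this report for the current behaviour:
    printf '%s\n' "$UA" | grep -Fxq 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
        && echo 'ignored' || echo 'NOT ignored -> a row is written'

    # Relative (substring) comparison, as suggested above:
    printf '%s\n' "$UA" | grep -Fq 'Googlebot/2.1' \
        && echo 'ignored' || echo 'NOT ignored'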
Release note
No response
Triage and priority
- Severity: S0 - Affects critical data or functionality and leaves users without workaround.
- Severity: S1 - Affects critical data or functionality and forces users to employ a workaround.
- Severity: S2 - Affects non-critical data or functionality and forces users to employ a workaround.
- Severity: S3 - Affects non-critical data or functionality and does not force users to employ a workaround.
- Severity: S4 - Affects aesthetics, professional look and feel, “quality” or “usability”.