-
Notifications
You must be signed in to change notification settings - Fork 13.4k
[StaticDataLayout][PGO] Add profile format for static data layout, and the classes to operate on the profiles. #138170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-llvm-adt Author: Mingming Liu (mingmingl-llvm) ChangesContext: For https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744#p-336543-background-3, we propose to profile memory loads and stores via hardware events, symbolize the addresses of binary static data sections and feed the profile back into compiler for data partitioning. This change adds the profile format for static data layout, and the classes to operate on it. The profile and its format
In the profile's serialization format, strings are concatenated together and compressed. Individual records stores the index. A separate PR will connect this class to InstrProfReader/Writer via MemProfReader/Writer. Patch is 27.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138170.diff 7 Files Affected:
diff --git a/llvm/include/llvm/ADT/MapVector.h b/llvm/include/llvm/ADT/MapVector.h
index c11617a81c97d..fe0d106795c34 100644
--- a/llvm/include/llvm/ADT/MapVector.h
+++ b/llvm/include/llvm/ADT/MapVector.h
@@ -57,6 +57,8 @@ class MapVector {
return std::move(Vector);
}
+ ArrayRef<value_type> getArrayRef() const { return Vector; }
+
size_type size() const { return Vector.size(); }
/// Grow the MapVector so that it can contain at least \p NumEntries items
diff --git a/llvm/include/llvm/ProfileData/DataAccessProf.h b/llvm/include/llvm/ProfileData/DataAccessProf.h
new file mode 100644
index 0000000000000..2cce4945fddd5
--- /dev/null
+++ b/llvm/include/llvm/ProfileData/DataAccessProf.h
@@ -0,0 +1,156 @@
+//===- DataAccessProf.h - Data access profile format support ---------*- C++
+//-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains support to construct and use data access profiles.
+//
+// For the original RFC of this pass please see
+// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
+#define LLVM_PROFILEDATA_DATAACCESSPROF_H_
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/MapVector.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+
+#include <cstdint>
+#include <variant>
+
+namespace llvm {
+
+namespace data_access_prof {
+// The location of data in the source code.
+struct DataLocation {
+ // The filename where the data is located.
+ StringRef FileName;
+ // The line number in the source code.
+ uint32_t Line;
+};
+
+// The data access profiles for a symbol.
+struct DataAccessProfRecord {
+ // Represents a data symbol. The semantic comes in two forms: a symbol index
+ // for symbol name if `IsStringLiteral` is false, or the hash of a string
+ // content if `IsStringLiteral` is true. Required.
+ uint64_t SymbolID;
+
+ // The access count of symbol. Required.
+ uint64_t AccessCount;
+
+ // True iff this is a record for string literal (symbols with name pattern
+ // `.str.*` in the symbol table). Required.
+ bool IsStringLiteral;
+
+ // The locations of data in the source code. Optional.
+ llvm::SmallVector<DataLocation> Locations;
+};
+
+/// Encapsulates the data access profile data and the methods to operate on it.
+/// This class provides profile look-up, serialization and deserialization.
+class DataAccessProfData {
+public:
+ // SymbolID is either a string representing symbol name, or a uint64_t
+ // representing the content hash of a string literal.
+ using SymbolID = std::variant<StringRef, uint64_t>;
+ using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;
+
+ DataAccessProfData() : saver(Allocator) {}
+
+ /// Serialize profile data to the output stream.
+ /// Storage layout:
+ /// - Serialized strings.
+ /// - The encoded hashes.
+ /// - Records.
+ Error serialize(ProfOStream &OS) const;
+
+ /// Deserialize this class from the given buffer.
+ Error deserialize(const unsigned char *&Ptr);
+
+ /// Returns a pointer of profile record for \p SymbolID, or nullptr if there
+ /// isn't a record. Internally, this function will canonicalize the symbol
+ /// name before the lookup.
+ const DataAccessProfRecord *getProfileRecord(const SymbolID SymID) const;
+
+ /// Returns true if \p SymID is seen in profiled binaries and cold.
+ bool isKnownColdSymbol(const SymbolID SymID) const;
+
+ /// Methods to add symbolized data access profile. Returns error if duplicated
+ /// symbol names or content hashes are seen. The user of this class should
+ /// aggregate counters that corresponds to the same symbol name or with the
+ /// same string literal hash before calling 'add*' methods.
+ Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
+ Error addSymbolizedDataAccessProfile(
+ SymbolID SymbolID, uint64_t AccessCount,
+ const llvm::SmallVector<DataLocation> &Locations);
+ Error addKnownSymbolWithoutSamples(SymbolID SymbolID);
+
+ /// Returns a iterable StringRef for strings in the order they are added.
+ auto getStrings() const {
+ ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames(
+ StrToIndexMap.begin(), StrToIndexMap.end());
+ return llvm::make_first_range(RefSymbolNames);
+ }
+
+ /// Returns array reference for various internal data structures.
+ inline ArrayRef<
+ std::pair<std::variant<StringRef, uint64_t>, DataAccessProfRecord>>
+ getRecords() const {
+ return Records.getArrayRef();
+ }
+ inline ArrayRef<StringRef> getKnownColdSymbols() const {
+ return KnownColdSymbols.getArrayRef();
+ }
+ inline ArrayRef<uint64_t> getKnownColdHashes() const {
+ return KnownColdHashes.getArrayRef();
+ }
+
+private:
+ /// Serialize the symbol strings into the output stream.
+ Error serializeStrings(ProfOStream &OS) const;
+
+ /// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
+ /// start of the next payload.
+ Error deserializeStrings(const unsigned char *&Ptr,
+ const uint64_t NumSampledSymbols,
+ uint64_t NumColdKnownSymbols);
+
+ /// Decode the records and increment \p Ptr to the start of the next payload.
+ Error deserializeRecords(const unsigned char *&Ptr);
+
+ /// A helper function to compute a storage index for \p SymbolID.
+ uint64_t getEncodedIndex(const SymbolID SymbolID) const;
+
+ // `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to
+ // its record index.
+ MapVector<SymbolID, DataAccessProfRecord> Records;
+
+ // Use MapVector to keep input order of strings for serialization and
+ // deserialization.
+ StringToIndexMap StrToIndexMap;
+ llvm::SetVector<uint64_t> KnownColdHashes;
+ llvm::SetVector<StringRef> KnownColdSymbols;
+ // Keeps owned copies of the input strings.
+ llvm::BumpPtrAllocator Allocator;
+ llvm::UniqueStringSaver saver;
+};
+
+} // namespace data_access_prof
+} // namespace llvm
+
+#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
diff --git a/llvm/include/llvm/ProfileData/InstrProf.h b/llvm/include/llvm/ProfileData/InstrProf.h
index 2d011c89f27cb..8a6be22bdb1a4 100644
--- a/llvm/include/llvm/ProfileData/InstrProf.h
+++ b/llvm/include/llvm/ProfileData/InstrProf.h
@@ -357,6 +357,12 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
/// the duplicated profile variables for Comdat functions.
bool needsComdatForCounter(const GlobalObject &GV, const Module &M);
+/// \c NameStrings is a string composed of one of more possibly encoded
+/// sub-strings. The substrings are separated by 0 or more zero bytes. This
+/// method decodes the string and calls `NameCallback` for each substring.
+Error readAndDecodeStrings(StringRef NameStrings,
+ std::function<Error(StringRef)> NameCallback);
+
/// An enum describing the attributes of an instrumented profile.
enum class InstrProfKind {
Unknown = 0x0,
@@ -493,6 +499,11 @@ class InstrProfSymtab {
public:
using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;
+ // Returns the canonial name of the given PGOName. In a canonical name, all
+ // suffixes that begins with "." except ".__uniq." are stripped.
+ // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
+ static StringRef getCanonicalName(StringRef PGOName);
+
private:
using AddrIntervalMap =
IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
@@ -528,11 +539,6 @@ class InstrProfSymtab {
static StringRef getExternalSymbol() { return "** External Symbol **"; }
- // Returns the canonial name of the given PGOName. In a canonical name, all
- // suffixes that begins with "." except ".__uniq." are stripped.
- // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
- static StringRef getCanonicalName(StringRef PGOName);
-
// Add the function into the symbol table, by creating the following
// map entries:
// name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
diff --git a/llvm/lib/ProfileData/CMakeLists.txt b/llvm/lib/ProfileData/CMakeLists.txt
index eb7c2a3c1a28a..67a69d7761b2c 100644
--- a/llvm/lib/ProfileData/CMakeLists.txt
+++ b/llvm/lib/ProfileData/CMakeLists.txt
@@ -1,4 +1,5 @@
add_llvm_component_library(LLVMProfileData
+ DataAccessProf.cpp
GCOV.cpp
IndexedMemProfData.cpp
InstrProf.cpp
diff --git a/llvm/lib/ProfileData/DataAccessProf.cpp b/llvm/lib/ProfileData/DataAccessProf.cpp
new file mode 100644
index 0000000000000..cf538d6a1b28e
--- /dev/null
+++ b/llvm/lib/ProfileData/DataAccessProf.cpp
@@ -0,0 +1,246 @@
+#include "llvm/ProfileData/DataAccessProf.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Compression.h"
+#include "llvm/Support/Endian.h"
+#include "llvm/Support/Errc.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+#include "llvm/Support/raw_ostream.h"
+#include <sys/types.h>
+
+namespace llvm {
+namespace data_access_prof {
+
+// If `Map` has an entry keyed by `Str`, returns the entry iterator. Otherwise,
+// creates an owned copy of `Str`, adds a map entry for it and returns the
+// iterator.
+static MapVector<StringRef, uint64_t>::iterator
+saveStringToMap(MapVector<StringRef, uint64_t> &Map,
+ llvm::UniqueStringSaver &saver, StringRef Str) {
+ auto [Iter, Inserted] = Map.try_emplace(saver.save(Str), Map.size());
+ return Iter;
+}
+
+const DataAccessProfRecord *
+DataAccessProfData::getProfileRecord(const SymbolID SymbolID) const {
+ auto Key = SymbolID;
+ if (std::holds_alternative<StringRef>(SymbolID))
+ Key = InstrProfSymtab::getCanonicalName(std::get<StringRef>(SymbolID));
+
+ auto It = Records.find(Key);
+ if (It != Records.end())
+ return &It->second;
+
+ return nullptr;
+}
+
+bool DataAccessProfData::isKnownColdSymbol(const SymbolID SymID) const {
+ if (std::holds_alternative<uint64_t>(SymID))
+ return KnownColdHashes.count(std::get<uint64_t>(SymID));
+ return KnownColdSymbols.count(std::get<StringRef>(SymID));
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(SymbolID Symbol,
+ uint64_t AccessCount) {
+ uint64_t RecordID = -1;
+ bool IsStringLiteral = false;
+ SymbolID Key;
+ if (std::holds_alternative<uint64_t>(Symbol)) {
+ RecordID = std::get<uint64_t>(Symbol);
+ Key = RecordID;
+ IsStringLiteral = true;
+ } else {
+ StringRef SymbolName = std::get<StringRef>(Symbol);
+ if (SymbolName.empty())
+ return make_error<StringError>("Empty symbol name",
+ llvm::errc::invalid_argument);
+
+ StringRef CanonicalName = InstrProfSymtab::getCanonicalName(SymbolName);
+ Key = CanonicalName;
+ RecordID = saveStringToMap(StrToIndexMap, saver, CanonicalName)->second;
+ IsStringLiteral = false;
+ }
+
+ auto [Iter, Inserted] = Records.try_emplace(
+ Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
+ if (!Inserted)
+ return make_error<StringError>("Duplicate symbol or string literal added. "
+ "User of DataAccessProfData should "
+ "aggregate count for the same symbol. ",
+ llvm::errc::invalid_argument);
+
+ return Error::success();
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(
+ SymbolID SymbolID, uint64_t AccessCount,
+ const llvm::SmallVector<DataLocation> &Locations) {
+ if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+ return E;
+
+ auto &Record = Records.back().second;
+ for (const auto &Location : Locations)
+ Record.Locations.push_back(
+ {saveStringToMap(StrToIndexMap, saver, Location.FileName)->first,
+ Location.Line});
+
+ return Error::success();
+}
+
+Error DataAccessProfData::addKnownSymbolWithoutSamples(SymbolID SymbolID) {
+ if (std::holds_alternative<uint64_t>(SymbolID)) {
+ KnownColdHashes.insert(std::get<uint64_t>(SymbolID));
+ return Error::success();
+ }
+ StringRef SymbolName = std::get<StringRef>(SymbolID);
+ if (SymbolName.empty())
+ return make_error<StringError>("Empty symbol name",
+ llvm::errc::invalid_argument);
+ StringRef CanonicalSymName = InstrProfSymtab::getCanonicalName(SymbolName);
+ KnownColdSymbols.insert(CanonicalSymName);
+ return Error::success();
+}
+
+Error DataAccessProfData::deserialize(const unsigned char *&Ptr) {
+ uint64_t NumSampledSymbols =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ uint64_t NumColdKnownSymbols =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ if (Error E = deserializeStrings(Ptr, NumSampledSymbols, NumColdKnownSymbols))
+ return E;
+
+ uint64_t Num =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ for (uint64_t I = 0; I < Num; ++I)
+ KnownColdHashes.insert(
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr));
+
+ return deserializeRecords(Ptr);
+}
+
+Error DataAccessProfData::serializeStrings(ProfOStream &OS) const {
+ OS.write(StrToIndexMap.size());
+ OS.write(KnownColdSymbols.size());
+
+ std::vector<std::string> Strs;
+ Strs.reserve(StrToIndexMap.size() + KnownColdSymbols.size());
+ for (const auto &Str : StrToIndexMap)
+ Strs.push_back(Str.first.str());
+ for (const auto &Str : KnownColdSymbols)
+ Strs.push_back(Str.str());
+
+ std::string CompressedStrings;
+ if (!Strs.empty())
+ if (Error E = collectGlobalObjectNameStrings(
+ Strs, compression::zlib::isAvailable(), CompressedStrings))
+ return E;
+ const uint64_t CompressedStringLen = CompressedStrings.length();
+ // Record the length of compressed string.
+ OS.write(CompressedStringLen);
+ // Write the chars in compressed strings.
+ for (auto &c : CompressedStrings)
+ OS.writeByte(static_cast<uint8_t>(c));
+ // Pad up to a multiple of 8.
+ // InstrProfReader could read bytes according to 'CompressedStringLen'.
+ const uint64_t PaddedLength = alignTo(CompressedStringLen, 8);
+ for (uint64_t K = CompressedStringLen; K < PaddedLength; K++)
+ OS.writeByte(0);
+ return Error::success();
+}
+
+uint64_t DataAccessProfData::getEncodedIndex(const SymbolID SymbolID) const {
+ if (std::holds_alternative<uint64_t>(SymbolID))
+ return std::get<uint64_t>(SymbolID);
+
+ return StrToIndexMap.find(std::get<StringRef>(SymbolID))->second;
+}
+
+Error DataAccessProfData::serialize(ProfOStream &OS) const {
+ if (Error E = serializeStrings(OS))
+ return E;
+ OS.write(KnownColdHashes.size());
+ for (const auto &Hash : KnownColdHashes)
+ OS.write(Hash);
+ OS.write((uint64_t)(Records.size()));
+ for (const auto &[Key, Rec] : Records) {
+ OS.write(getEncodedIndex(Rec.SymbolID));
+ OS.writeByte(Rec.IsStringLiteral);
+ OS.write(Rec.AccessCount);
+ OS.write(Rec.Locations.size());
+ for (const auto &Loc : Rec.Locations) {
+ OS.write(getEncodedIndex(Loc.FileName));
+ OS.write32(Loc.Line);
+ }
+ }
+ return Error::success();
+}
+
+Error DataAccessProfData::deserializeStrings(const unsigned char *&Ptr,
+ uint64_t NumSampledSymbols,
+ uint64_t NumColdKnownSymbols) {
+ uint64_t Len =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ // With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are
+ // symbols with samples, and next N strings are known cold symbols.
+ uint64_t StringCnt = 0;
+ std::function<Error(StringRef)> addName = [&](StringRef Name) {
+ if (StringCnt < NumSampledSymbols)
+ saveStringToMap(StrToIndexMap, saver, Name);
+ else
+ KnownColdSymbols.insert(saver.save(Name));
+ ++StringCnt;
+ return Error::success();
+ };
+ if (Error E =
+ readAndDecodeStrings(StringRef((const char *)Ptr, Len), addName))
+ return E;
+
+ Ptr += alignTo(Len, 8);
+ return Error::success();
+}
+
+Error DataAccessProfData::deserializeRecords(const unsigned char *&Ptr) {
+ SmallVector<StringRef> Strings = llvm::to_vector(getStrings());
+
+ uint64_t NumRecords =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ for (uint64_t I = 0; I < NumRecords; ++I) {
+ uint64_t ID =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ bool IsStringLiteral =
+ support::endian::readNext<uint8_t, llvm::endianness::little>(Ptr);
+
+ uint64_t AccessCount =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ SymbolID SymbolID;
+ if (IsStringLiteral)
+ SymbolID = ID;
+ else
+ SymbolID = Strings[ID];
+ if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+ return E;
+
+ auto &Record = Records.back().second;
+
+ uint64_t NumLocations =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ Record.Locations.reserve(NumLocations);
+ for (uint64_t J = 0; J < NumLocations; ++J) {
+ uint64_t FileNameIndex =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ uint32_t Line =
+ support::endian::readNext<uint32_t, llvm::endianness::little>(Ptr);
+ Record.Locations.push_back({Strings[FileNameIndex], Line});
+ }
+ }
+ return Error::success();
+}
+} // namespace data_access_prof
+} // namespace llvm
diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp
index 88621787c1dd9..254f941acde82 100644
--- a/llvm/lib/ProfileData/InstrProf.cpp
+++ b/llvm/lib/ProfileData/InstrProf.cpp
@@ -573,12 +573,8 @@ Error InstrProfSymtab::addVTableWithName(GlobalVariable &VTable,
return Error::success();
}
-/// \c NameStrings is a string composed of one of more possibly encoded
-/// sub-strings. The substrings are separated by 0 or more zero bytes. This
-/// method decodes the string and calls `NameCallback` for each substring.
-static Error
-readAndDecodeStrings(StringRef NameStrings,
- std::function<Error(StringRef)> NameCallback) {
+Error readAndDecodeStrings(StringRef NameStrings,
+ std::function<Error(StringRef)> NameCallback) {
const uint8_t *P = NameStrings.bytes_begin();
const uint8_t *EndP = NameStrings.bytes_end();
while (P < EndP) {
diff --git a/llvm/unittests/ProfileData/MemProfTest.cpp b/llvm/unittests/ProfileData/MemProfTest.cpp
index 3e430aa4eae58..f6362448a5734 100644
--- a/llvm/unittests/ProfileData/MemProfTest.cpp
+++ b/llvm/unittests/ProfileData/MemProfTest.cpp
@@ -10,14 +10,17 @@
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/STLForwardCompat.h"
+#include "llvm/ADT/SmallVector.h"
#include "llvm/DebugInfo/DIContext.h"
#include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
#include "llvm/IR/Value.h"
#include "llvm/Object/ObjectFile.h"
+#include "llvm/ProfileData/DataAccessProf.h"
#include "llvm/ProfileData/MemProfData.inc"
#include "llvm/ProfileData/MemProfReader.h"
#include "llvm/ProfileData/MemProfYAML.h"
#include "llvm/Support/raw_ostream.h"
+#include "gmock/gmock-more-matchers.h"
#include "gmock/gmock.h"
#include "gtest/gtest.h"
@@ -36,6 +39,8 @@ using ::llvm::StringRef;
using ::llvm::object::SectionedAddress;
using ::llvm::symbolize::SymbolizableModule;
using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+using ::testing::HasSubstr;
using ::testing::IsEmpty;
using ::testing::Pair;
using ::testing::Return;
@@ -747,6 +752,162 @@ TEST(MemProf, YAMLParser) {
ElementsAre(0x3000)))));
}
+static std::string ErrorToString(Error E) {
+ std::string ErrMsg;
...
[truncated]
|
@llvm/pr-subscribers-pgo Author: Mingming Liu (mingmingl-llvm) ChangesContext: For https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744#p-336543-background-3, we propose to profile memory loads and stores via hardware events, symbolize the addresses of binary static data sections and feed the profile back into compiler for data partitioning. This change adds the profile format for static data layout, and the classes to operate on it. The profile and its format
In the profile's serialization format, strings are concatenated together and compressed. Individual records stores the index. A separate PR will connect this class to InstrProfReader/Writer via MemProfReader/Writer. Patch is 27.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138170.diff 7 Files Affected:
diff --git a/llvm/include/llvm/ADT/MapVector.h b/llvm/include/llvm/ADT/MapVector.h
index c11617a81c97d..fe0d106795c34 100644
--- a/llvm/include/llvm/ADT/MapVector.h
+++ b/llvm/include/llvm/ADT/MapVector.h
@@ -57,6 +57,8 @@ class MapVector {
return std::move(Vector);
}
+ ArrayRef<value_type> getArrayRef() const { return Vector; }
+
size_type size() const { return Vector.size(); }
/// Grow the MapVector so that it can contain at least \p NumEntries items
diff --git a/llvm/include/llvm/ProfileData/DataAccessProf.h b/llvm/include/llvm/ProfileData/DataAccessProf.h
new file mode 100644
index 0000000000000..2cce4945fddd5
--- /dev/null
+++ b/llvm/include/llvm/ProfileData/DataAccessProf.h
@@ -0,0 +1,156 @@
+//===- DataAccessProf.h - Data access profile format support ---------*- C++
+//-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains support to construct and use data access profiles.
+//
+// For the original RFC of this pass please see
+// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
+#define LLVM_PROFILEDATA_DATAACCESSPROF_H_
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/MapVector.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+
+#include <cstdint>
+#include <variant>
+
+namespace llvm {
+
+namespace data_access_prof {
+// The location of data in the source code.
+struct DataLocation {
+ // The filename where the data is located.
+ StringRef FileName;
+ // The line number in the source code.
+ uint32_t Line;
+};
+
+// The data access profiles for a symbol.
+struct DataAccessProfRecord {
+ // Represents a data symbol. The semantic comes in two forms: a symbol index
+ // for symbol name if `IsStringLiteral` is false, or the hash of a string
+ // content if `IsStringLiteral` is true. Required.
+ uint64_t SymbolID;
+
+ // The access count of symbol. Required.
+ uint64_t AccessCount;
+
+ // True iff this is a record for string literal (symbols with name pattern
+ // `.str.*` in the symbol table). Required.
+ bool IsStringLiteral;
+
+ // The locations of data in the source code. Optional.
+ llvm::SmallVector<DataLocation> Locations;
+};
+
+/// Encapsulates the data access profile data and the methods to operate on it.
+/// This class provides profile look-up, serialization and deserialization.
+class DataAccessProfData {
+public:
+ // SymbolID is either a string representing symbol name, or a uint64_t
+ // representing the content hash of a string literal.
+ using SymbolID = std::variant<StringRef, uint64_t>;
+ using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;
+
+ DataAccessProfData() : saver(Allocator) {}
+
+ /// Serialize profile data to the output stream.
+ /// Storage layout:
+ /// - Serialized strings.
+ /// - The encoded hashes.
+ /// - Records.
+ Error serialize(ProfOStream &OS) const;
+
+ /// Deserialize this class from the given buffer.
+ Error deserialize(const unsigned char *&Ptr);
+
+ /// Returns a pointer of profile record for \p SymbolID, or nullptr if there
+ /// isn't a record. Internally, this function will canonicalize the symbol
+ /// name before the lookup.
+ const DataAccessProfRecord *getProfileRecord(const SymbolID SymID) const;
+
+ /// Returns true if \p SymID is seen in profiled binaries and cold.
+ bool isKnownColdSymbol(const SymbolID SymID) const;
+
+ /// Methods to add symbolized data access profile. Returns error if duplicated
+ /// symbol names or content hashes are seen. The user of this class should
+ /// aggregate counters that corresponds to the same symbol name or with the
+ /// same string literal hash before calling 'add*' methods.
+ Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
+ Error addSymbolizedDataAccessProfile(
+ SymbolID SymbolID, uint64_t AccessCount,
+ const llvm::SmallVector<DataLocation> &Locations);
+ Error addKnownSymbolWithoutSamples(SymbolID SymbolID);
+
+ /// Returns a iterable StringRef for strings in the order they are added.
+ auto getStrings() const {
+ ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames(
+ StrToIndexMap.begin(), StrToIndexMap.end());
+ return llvm::make_first_range(RefSymbolNames);
+ }
+
+ /// Returns array reference for various internal data structures.
+ inline ArrayRef<
+ std::pair<std::variant<StringRef, uint64_t>, DataAccessProfRecord>>
+ getRecords() const {
+ return Records.getArrayRef();
+ }
+ inline ArrayRef<StringRef> getKnownColdSymbols() const {
+ return KnownColdSymbols.getArrayRef();
+ }
+ inline ArrayRef<uint64_t> getKnownColdHashes() const {
+ return KnownColdHashes.getArrayRef();
+ }
+
+private:
+ /// Serialize the symbol strings into the output stream.
+ Error serializeStrings(ProfOStream &OS) const;
+
+ /// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
+ /// start of the next payload.
+ Error deserializeStrings(const unsigned char *&Ptr,
+ const uint64_t NumSampledSymbols,
+ uint64_t NumColdKnownSymbols);
+
+ /// Decode the records and increment \p Ptr to the start of the next payload.
+ Error deserializeRecords(const unsigned char *&Ptr);
+
+ /// A helper function to compute a storage index for \p SymbolID.
+ uint64_t getEncodedIndex(const SymbolID SymbolID) const;
+
+ // `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to
+ // its record index.
+ MapVector<SymbolID, DataAccessProfRecord> Records;
+
+ // Use MapVector to keep input order of strings for serialization and
+ // deserialization.
+ StringToIndexMap StrToIndexMap;
+ llvm::SetVector<uint64_t> KnownColdHashes;
+ llvm::SetVector<StringRef> KnownColdSymbols;
+ // Keeps owned copies of the input strings.
+ llvm::BumpPtrAllocator Allocator;
+ llvm::UniqueStringSaver saver;
+};
+
+} // namespace data_access_prof
+} // namespace llvm
+
+#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
diff --git a/llvm/include/llvm/ProfileData/InstrProf.h b/llvm/include/llvm/ProfileData/InstrProf.h
index 2d011c89f27cb..8a6be22bdb1a4 100644
--- a/llvm/include/llvm/ProfileData/InstrProf.h
+++ b/llvm/include/llvm/ProfileData/InstrProf.h
@@ -357,6 +357,12 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
/// the duplicated profile variables for Comdat functions.
bool needsComdatForCounter(const GlobalObject &GV, const Module &M);
+/// \c NameStrings is a string composed of one of more possibly encoded
+/// sub-strings. The substrings are separated by 0 or more zero bytes. This
+/// method decodes the string and calls `NameCallback` for each substring.
+Error readAndDecodeStrings(StringRef NameStrings,
+ std::function<Error(StringRef)> NameCallback);
+
/// An enum describing the attributes of an instrumented profile.
enum class InstrProfKind {
Unknown = 0x0,
@@ -493,6 +499,11 @@ class InstrProfSymtab {
public:
using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;
+ // Returns the canonial name of the given PGOName. In a canonical name, all
+ // suffixes that begins with "." except ".__uniq." are stripped.
+ // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
+ static StringRef getCanonicalName(StringRef PGOName);
+
private:
using AddrIntervalMap =
IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
@@ -528,11 +539,6 @@ class InstrProfSymtab {
static StringRef getExternalSymbol() { return "** External Symbol **"; }
- // Returns the canonial name of the given PGOName. In a canonical name, all
- // suffixes that begins with "." except ".__uniq." are stripped.
- // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
- static StringRef getCanonicalName(StringRef PGOName);
-
// Add the function into the symbol table, by creating the following
// map entries:
// name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
diff --git a/llvm/lib/ProfileData/CMakeLists.txt b/llvm/lib/ProfileData/CMakeLists.txt
index eb7c2a3c1a28a..67a69d7761b2c 100644
--- a/llvm/lib/ProfileData/CMakeLists.txt
+++ b/llvm/lib/ProfileData/CMakeLists.txt
@@ -1,4 +1,5 @@
add_llvm_component_library(LLVMProfileData
+ DataAccessProf.cpp
GCOV.cpp
IndexedMemProfData.cpp
InstrProf.cpp
diff --git a/llvm/lib/ProfileData/DataAccessProf.cpp b/llvm/lib/ProfileData/DataAccessProf.cpp
new file mode 100644
index 0000000000000..cf538d6a1b28e
--- /dev/null
+++ b/llvm/lib/ProfileData/DataAccessProf.cpp
@@ -0,0 +1,246 @@
+#include "llvm/ProfileData/DataAccessProf.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Compression.h"
+#include "llvm/Support/Endian.h"
+#include "llvm/Support/Errc.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+#include "llvm/Support/raw_ostream.h"
+#include <sys/types.h>
+
+namespace llvm {
+namespace data_access_prof {
+
+// If `Map` has an entry keyed by `Str`, returns the entry iterator. Otherwise,
+// creates an owned copy of `Str`, adds a map entry for it and returns the
+// iterator.
+static MapVector<StringRef, uint64_t>::iterator
+saveStringToMap(MapVector<StringRef, uint64_t> &Map,
+ llvm::UniqueStringSaver &saver, StringRef Str) {
+ auto [Iter, Inserted] = Map.try_emplace(saver.save(Str), Map.size());
+ return Iter;
+}
+
+const DataAccessProfRecord *
+DataAccessProfData::getProfileRecord(const SymbolID SymbolID) const {
+ auto Key = SymbolID;
+ if (std::holds_alternative<StringRef>(SymbolID))
+ Key = InstrProfSymtab::getCanonicalName(std::get<StringRef>(SymbolID));
+
+ auto It = Records.find(Key);
+ if (It != Records.end())
+ return &It->second;
+
+ return nullptr;
+}
+
+bool DataAccessProfData::isKnownColdSymbol(const SymbolID SymID) const {
+ if (std::holds_alternative<uint64_t>(SymID))
+ return KnownColdHashes.count(std::get<uint64_t>(SymID));
+ return KnownColdSymbols.count(std::get<StringRef>(SymID));
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(SymbolID Symbol,
+ uint64_t AccessCount) {
+ uint64_t RecordID = -1;
+ bool IsStringLiteral = false;
+ SymbolID Key;
+ if (std::holds_alternative<uint64_t>(Symbol)) {
+ RecordID = std::get<uint64_t>(Symbol);
+ Key = RecordID;
+ IsStringLiteral = true;
+ } else {
+ StringRef SymbolName = std::get<StringRef>(Symbol);
+ if (SymbolName.empty())
+ return make_error<StringError>("Empty symbol name",
+ llvm::errc::invalid_argument);
+
+ StringRef CanonicalName = InstrProfSymtab::getCanonicalName(SymbolName);
+ Key = CanonicalName;
+ RecordID = saveStringToMap(StrToIndexMap, saver, CanonicalName)->second;
+ IsStringLiteral = false;
+ }
+
+ auto [Iter, Inserted] = Records.try_emplace(
+ Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
+ if (!Inserted)
+ return make_error<StringError>("Duplicate symbol or string literal added. "
+ "User of DataAccessProfData should "
+ "aggregate count for the same symbol. ",
+ llvm::errc::invalid_argument);
+
+ return Error::success();
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(
+ SymbolID SymbolID, uint64_t AccessCount,
+ const llvm::SmallVector<DataLocation> &Locations) {
+ if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+ return E;
+
+ auto &Record = Records.back().second;
+ for (const auto &Location : Locations)
+ Record.Locations.push_back(
+ {saveStringToMap(StrToIndexMap, saver, Location.FileName)->first,
+ Location.Line});
+
+ return Error::success();
+}
+
+Error DataAccessProfData::addKnownSymbolWithoutSamples(SymbolID SymbolID) {
+ if (std::holds_alternative<uint64_t>(SymbolID)) {
+ KnownColdHashes.insert(std::get<uint64_t>(SymbolID));
+ return Error::success();
+ }
+ StringRef SymbolName = std::get<StringRef>(SymbolID);
+ if (SymbolName.empty())
+ return make_error<StringError>("Empty symbol name",
+ llvm::errc::invalid_argument);
+ StringRef CanonicalSymName = InstrProfSymtab::getCanonicalName(SymbolName);
+ KnownColdSymbols.insert(CanonicalSymName);
+ return Error::success();
+}
+
+Error DataAccessProfData::deserialize(const unsigned char *&Ptr) {
+ uint64_t NumSampledSymbols =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ uint64_t NumColdKnownSymbols =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ if (Error E = deserializeStrings(Ptr, NumSampledSymbols, NumColdKnownSymbols))
+ return E;
+
+ uint64_t Num =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ for (uint64_t I = 0; I < Num; ++I)
+ KnownColdHashes.insert(
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr));
+
+ return deserializeRecords(Ptr);
+}
+
+Error DataAccessProfData::serializeStrings(ProfOStream &OS) const {
+ OS.write(StrToIndexMap.size());
+ OS.write(KnownColdSymbols.size());
+
+ std::vector<std::string> Strs;
+ Strs.reserve(StrToIndexMap.size() + KnownColdSymbols.size());
+ for (const auto &Str : StrToIndexMap)
+ Strs.push_back(Str.first.str());
+ for (const auto &Str : KnownColdSymbols)
+ Strs.push_back(Str.str());
+
+ std::string CompressedStrings;
+ if (!Strs.empty())
+ if (Error E = collectGlobalObjectNameStrings(
+ Strs, compression::zlib::isAvailable(), CompressedStrings))
+ return E;
+ const uint64_t CompressedStringLen = CompressedStrings.length();
+ // Record the length of compressed string.
+ OS.write(CompressedStringLen);
+ // Write the chars in compressed strings.
+ for (auto &c : CompressedStrings)
+ OS.writeByte(static_cast<uint8_t>(c));
+ // Pad up to a multiple of 8.
+ // InstrProfReader could read bytes according to 'CompressedStringLen'.
+ const uint64_t PaddedLength = alignTo(CompressedStringLen, 8);
+ for (uint64_t K = CompressedStringLen; K < PaddedLength; K++)
+ OS.writeByte(0);
+ return Error::success();
+}
+
+uint64_t DataAccessProfData::getEncodedIndex(const SymbolID SymbolID) const {
+ if (std::holds_alternative<uint64_t>(SymbolID))
+ return std::get<uint64_t>(SymbolID);
+
+ return StrToIndexMap.find(std::get<StringRef>(SymbolID))->second;
+}
+
+Error DataAccessProfData::serialize(ProfOStream &OS) const {
+ if (Error E = serializeStrings(OS))
+ return E;
+ OS.write(KnownColdHashes.size());
+ for (const auto &Hash : KnownColdHashes)
+ OS.write(Hash);
+ OS.write((uint64_t)(Records.size()));
+ for (const auto &[Key, Rec] : Records) {
+ OS.write(getEncodedIndex(Rec.SymbolID));
+ OS.writeByte(Rec.IsStringLiteral);
+ OS.write(Rec.AccessCount);
+ OS.write(Rec.Locations.size());
+ for (const auto &Loc : Rec.Locations) {
+ OS.write(getEncodedIndex(Loc.FileName));
+ OS.write32(Loc.Line);
+ }
+ }
+ return Error::success();
+}
+
+Error DataAccessProfData::deserializeStrings(const unsigned char *&Ptr,
+ uint64_t NumSampledSymbols,
+ uint64_t NumColdKnownSymbols) {
+ uint64_t Len =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ // With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are
+ // symbols with samples, and next N strings are known cold symbols.
+ uint64_t StringCnt = 0;
+ std::function<Error(StringRef)> addName = [&](StringRef Name) {
+ if (StringCnt < NumSampledSymbols)
+ saveStringToMap(StrToIndexMap, saver, Name);
+ else
+ KnownColdSymbols.insert(saver.save(Name));
+ ++StringCnt;
+ return Error::success();
+ };
+ if (Error E =
+ readAndDecodeStrings(StringRef((const char *)Ptr, Len), addName))
+ return E;
+
+ Ptr += alignTo(Len, 8);
+ return Error::success();
+}
+
+Error DataAccessProfData::deserializeRecords(const unsigned char *&Ptr) {
+ SmallVector<StringRef> Strings = llvm::to_vector(getStrings());
+
+ uint64_t NumRecords =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ for (uint64_t I = 0; I < NumRecords; ++I) {
+ uint64_t ID =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ bool IsStringLiteral =
+ support::endian::readNext<uint8_t, llvm::endianness::little>(Ptr);
+
+ uint64_t AccessCount =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ SymbolID SymbolID;
+ if (IsStringLiteral)
+ SymbolID = ID;
+ else
+ SymbolID = Strings[ID];
+ if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+ return E;
+
+ auto &Record = Records.back().second;
+
+ uint64_t NumLocations =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+ Record.Locations.reserve(NumLocations);
+ for (uint64_t J = 0; J < NumLocations; ++J) {
+ uint64_t FileNameIndex =
+ support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+ uint32_t Line =
+ support::endian::readNext<uint32_t, llvm::endianness::little>(Ptr);
+ Record.Locations.push_back({Strings[FileNameIndex], Line});
+ }
+ }
+ return Error::success();
+}
+} // namespace data_access_prof
+} // namespace llvm
diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp
index 88621787c1dd9..254f941acde82 100644
--- a/llvm/lib/ProfileData/InstrProf.cpp
+++ b/llvm/lib/ProfileData/InstrProf.cpp
@@ -573,12 +573,8 @@ Error InstrProfSymtab::addVTableWithName(GlobalVariable &VTable,
return Error::success();
}
-/// \c NameStrings is a string composed of one of more possibly encoded
-/// sub-strings. The substrings are separated by 0 or more zero bytes. This
-/// method decodes the string and calls `NameCallback` for each substring.
-static Error
-readAndDecodeStrings(StringRef NameStrings,
- std::function<Error(StringRef)> NameCallback) {
+Error readAndDecodeStrings(StringRef NameStrings,
+ std::function<Error(StringRef)> NameCallback) {
const uint8_t *P = NameStrings.bytes_begin();
const uint8_t *EndP = NameStrings.bytes_end();
while (P < EndP) {
diff --git a/llvm/unittests/ProfileData/MemProfTest.cpp b/llvm/unittests/ProfileData/MemProfTest.cpp
index 3e430aa4eae58..f6362448a5734 100644
--- a/llvm/unittests/ProfileData/MemProfTest.cpp
+++ b/llvm/unittests/ProfileData/MemProfTest.cpp
@@ -10,14 +10,17 @@
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/STLForwardCompat.h"
+#include "llvm/ADT/SmallVector.h"
#include "llvm/DebugInfo/DIContext.h"
#include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
#include "llvm/IR/Value.h"
#include "llvm/Object/ObjectFile.h"
+#include "llvm/ProfileData/DataAccessProf.h"
#include "llvm/ProfileData/MemProfData.inc"
#include "llvm/ProfileData/MemProfReader.h"
#include "llvm/ProfileData/MemProfYAML.h"
#include "llvm/Support/raw_ostream.h"
+#include "gmock/gmock-more-matchers.h"
#include "gmock/gmock.h"
#include "gtest/gtest.h"
@@ -36,6 +39,8 @@ using ::llvm::StringRef;
using ::llvm::object::SectionedAddress;
using ::llvm::symbolize::SymbolizableModule;
using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+using ::testing::HasSubstr;
using ::testing::IsEmpty;
using ::testing::Pair;
using ::testing::Return;
@@ -747,6 +752,162 @@ TEST(MemProf, YAMLParser) {
ElementsAre(0x3000)))));
}
+static std::string ErrorToString(Error E) {
+ std::string ErrMsg;
...
[truncated]
|
/// symbol names or content hashes are seen. The user of this class should | ||
/// aggregate counters that corresponds to the same symbol name or with the | ||
/// same string literal hash before calling 'add*' methods. | ||
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shorten the name --> addDataAccessProfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the semantics of the method, may be change the name to 'setDataAccessProfile'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
/// aggregate counters that corresponds to the same symbol name or with the | ||
/// same string literal hash before calling 'add*' methods. | ||
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount); | ||
Error addSymbolizedDataAccessProfile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document location parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Error addSymbolizedDataAccessProfile( | ||
SymbolID SymbolID, uint64_t AccessCount, | ||
const llvm::SmallVector<DataLocation> &Locations); | ||
Error addKnownSymbolWithoutSamples(SymbolID SymbolID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it the same as setting the symbol access count to zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and the idea is similar to SamplePGO's function symbol list.
public: | ||
// SymbolID is either a string representing symbol name, or a uint64_t | ||
// representing the content hash of a string literal. | ||
using SymbolID = std::variant<StringRef, uint64_t>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the use of StringRef, is there a risk of use-after-free error (if the caller passes in a Ref to short lived string)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question. This class keeps owned copy of strings (inside saver
member of type llvm::UniqueStringSaver
) and references the owned copies as opposed to caller's copy. FWIW, I moved saver
to the beginning of class member list to make sure it's destructed after other class members that references the strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably add a comment here and in the below type that the StringRef is owned by this class's saver object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also fixed a potential use-after-free bug in L69 with std::tie(Key, RecordID) = *saveStringToMap(StrToIndexMap, Saver, *CanonicalName);
in the updated change. Basically Key
should use the owned copy.
bool IsStringLiteral; | ||
|
||
// The locations of data in the source code. Optional. | ||
llvm::SmallVector<DataLocation> Locations; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, we can have many instances of DataAccessProfRecord
. Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>
, which still saves space relative to std::vector<...>
. I'm afraid that the inlined storage goes wasted (and multiplied by the number of DataAccessProfRecord
entries) if the real usage is way off the number of inlined entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>
I'd expect that many records have small single-digit number of locations, and definitely within the range of unsigned
, which is type for SmallVector<T, 0>
size and capacity per https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h.
Done.
return KnownColdHashes.count(std::get<uint64_t>(SymID)); | ||
return KnownColdSymbols.count(std::get<StringRef>(SymID)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we use contains
instead of count
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
auto [Iter, Inserted] = Records.try_emplace( | ||
Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: With try_emplace
, you should be able to drop the value type and expect the constructor arguments to be perfectly forwarded like:
auto [Iter, Inserted] = Records.try_emplace(Key, RecordID, AccessCount, IsStringLiteral);
By the way, when I try your PR, I get:
llvm/lib/ProfileData/DataAccessProf.cpp:67:71: error: missing field 'Locations' initializer [-Werror,-Wmissing-field-initializers]
Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated PR adds a constructor for DataAccessProfRecord
and does Records.try_emplace(Key, RecordID, AccessCount, IsStringLiteral);
as suggested. I also re-configured my cmake to set -DCMAKE_C_COMPILER
and -DCMAKE_CXX_COMPILER
to clang in the environment. Now I no longer see missing field 'Locations' initializer
.
In the future when C++20 is available for LLVM developer, we can possibly just use designated initializer here.
return deserializeRecords(Ptr); | ||
} | ||
|
||
Error DataAccessProfData::serializeStrings(ProfOStream &OS) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we name this something like serializeSymbolNames
to indicate that we are serializing symbol names? The same applies to deserializeString
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to make the names more targeted. Renamed to {serialize,deserialize}SymbolsAndFilenames
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for reviews! PTAL.
bool IsStringLiteral; | ||
|
||
// The locations of data in the source code. Optional. | ||
llvm::SmallVector<DataLocation> Locations; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>
I'd expect that many records have small single-digit number of locations, and definitely within the range of unsigned
, which is type for SmallVector<T, 0>
size and capacity per https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h.
Done.
/// symbol names or content hashes are seen. The user of this class should | ||
/// aggregate counters that corresponds to the same symbol name or with the | ||
/// same string literal hash before calling 'add*' methods. | ||
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
return KnownColdHashes.count(std::get<uint64_t>(SymID)); | ||
return KnownColdSymbols.count(std::get<StringRef>(SymID)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
auto [Iter, Inserted] = Records.try_emplace( | ||
Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated PR adds a constructor for DataAccessProfRecord
and does Records.try_emplace(Key, RecordID, AccessCount, IsStringLiteral);
as suggested. I also re-configured my cmake to set -DCMAKE_C_COMPILER
and -DCMAKE_CXX_COMPILER
to clang in the environment. Now I no longer see missing field 'Locations' initializer
.
In the future when C++20 is available for LLVM developer, we can possibly just use designated initializer here.
return deserializeRecords(Ptr); | ||
} | ||
|
||
Error DataAccessProfData::serializeStrings(ProfOStream &OS) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to make the names more targeted. Renamed to {serialize,deserialize}SymbolsAndFilenames
.
/// aggregate counters that corresponds to the same symbol name or with the | ||
/// same string literal hash before calling 'add*' methods. | ||
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount); | ||
Error addSymbolizedDataAccessProfile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Error addSymbolizedDataAccessProfile( | ||
SymbolID SymbolID, uint64_t AccessCount, | ||
const llvm::SmallVector<DataLocation> &Locations); | ||
Error addKnownSymbolWithoutSamples(SymbolID SymbolID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and the idea is similar to SamplePGO's function symbol list.
public: | ||
// SymbolID is either a string representing symbol name, or a uint64_t | ||
// representing the content hash of a string literal. | ||
using SymbolID = std::variant<StringRef, uint64_t>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question. This class keeps owned copy of strings (inside saver
member of type llvm::UniqueStringSaver
) and references the owned copies as opposed to caller's copy. FWIW, I moved saver
to the beginning of class member list to make sure it's destructed after other class members that references the strings.
llvm::BumpPtrAllocator Allocator; | ||
llvm::UniqueStringSaver Saver; | ||
|
||
// `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see SymbolToRecordIndex defined anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops, SymbolToRecordIndex
refers to deleted code. Removed it.
Originally I had
DenseMap<SymbolID, uint64_t> SymbolToRecordIndex;
SmallVector<DataAccessProfRecord> Records;
as opposed to
MapVector<SymbolID, DataAccessProfRecord>
, in order to get a ArrayRef
of the records.
Later I realized that it's simpler to use the latter and add MapVector::getArrayRef
. Strictly speaking MapVector::getArrayRef
returns ArrayRef<key, value>
rather than ArrayRef<value>
, but the implementation is overall simpler to have MapVector
.
|
||
/// Returns a iterable StringRef for strings in the order they are added. | ||
auto getStrings() const { | ||
ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use SymbolID instead of the std::pair for consistency and simplicity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated change used new MapVector::getArrayRef
as suggested, and this line is simplified away.
// Represents a data symbol. The semantic comes in two forms: a symbol index | ||
// for symbol name if `IsStringLiteral` is false, or the hash of a string | ||
// content if `IsStringLiteral` is true. Required. | ||
uint64_t SymbolID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little confusing that SymbolID is a different thing (a type) in the following class. Suggest making these different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I renamed the type std::variant<StringRef, uint64_t>
to SymbolHandle
.
llvm/include/llvm/ADT/MapVector.h
Outdated
@@ -57,6 +57,8 @@ class MapVector { | |||
return std::move(Vector); | |||
} | |||
|
|||
ArrayRef<value_type> getArrayRef() const { return Vector; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably put this into a separate prep patch with a test in llvm/unittests/ADT/MapVectorTest.cpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in #138726
public: | ||
// SymbolID is either a string representing symbol name, or a uint64_t | ||
// representing the content hash of a string literal. | ||
using SymbolID = std::variant<StringRef, uint64_t>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably add a comment here and in the below type that the StringRef is owned by this class's saver object.
uint64_t Len = | ||
support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr); | ||
|
||
// With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler to directly say "The first NumSampledSymbols strings are symbols with samples, and the next NumColdKnownSymbols are known cold symbols." ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
: SymbolID(SymbolID), AccessCount(AccessCount), | ||
IsStringLiteral(IsStringLiteral) {} | ||
|
||
// Represents a data symbol. The semantic comes in two forms: a symbol index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would the different forms be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantic of this field depends on the IsStringLiteral
field below. For a string literal, IsStringLiteral
is true and SymbolID
has the semantic of 'hash'; otherwise, IsStringLiteral
is false and SymbolID
represent an index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understood that, but my question was more about why in practice some would be string literals and some would be hashes. Might be useful to note this in a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understood that, but my question was more about why in practice some would be string literals and some would be hashes.
This makes sense. Added comment at L55 to explain why two forms are used.
/// This class provides profile look-up, serialization and deserialization. | ||
class DataAccessProfData { | ||
public: | ||
// SymbolID is either a string representing symbol name, or a uint64_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to question for above class - what determines which type is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type is determined by the symbol itself. This variant is uint64_t
for string literals (which has the unstable name pattern like .str.<N>[.llvm.<hash>]
), and the canonicalized symbol name for those symbols with stable mangled names relative to source code. Updated the comment. PTAL.
|
||
// Test the various scenarios when DataAccessProfData should return error on | ||
// invalid input. | ||
TEST(MemProf, DataAccessProfileError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the DataAccessProf tests should be moved to a dedicated DataAccessProfTest.cpp file to mirror where they are defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo some nits. Thanks for updating the patch!
|
||
Error DataAccessProfData::setDataAccessProfile( | ||
SymbolID SymbolID, uint64_t AccessCount, | ||
const llvm::SmallVector<DataLocation> &Locations) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer ArrayRef<DataLocation>
to const llvm::SmallVector<DataLocation> &
per:
https://llvm.org/docs/ProgrammersManual.html#llvm-adt-arrayref-h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Co-authored-by: Kazu Hirata <kazu@google.com>
✅ With the latest revision this PR passed the C/C++ code formatter. |
…ng vector (#138726) SetVector currently has a [similar method](https://github.com/llvm/llvm-project/blob/c956ed06dc1c1b340d0c589c472c438b9220b36d/llvm/include/llvm/ADT/SetVector.h#L90), and #138170 has a use case to get an ArrayRef of MapVector's underlying vector.
uint64_t AccessCount = | ||
support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr); | ||
|
||
SymbolID SymbolID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a different name for variable from the type name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I didn't notice this comment in my latest reply. Teresa suggested similarly in #138170 (comment). So this is done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for reviews! PTAL.
llvm/include/llvm/ADT/MapVector.h
Outdated
@@ -57,6 +57,8 @@ class MapVector { | |||
return std::move(Vector); | |||
} | |||
|
|||
ArrayRef<value_type> getArrayRef() const { return Vector; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in #138726
|
||
// Test the various scenarios when DataAccessProfData should return error on | ||
// invalid input. | ||
TEST(MemProf, DataAccessProfileError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
uint64_t Len = | ||
support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr); | ||
|
||
// With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
if (std::holds_alternative<uint64_t>(SymbolID)) | ||
return std::get<uint64_t>(SymbolID); | ||
|
||
return StrToIndexMap.find(std::get<StringRef>(SymbolID))->second; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
||
Error DataAccessProfData::setDataAccessProfile( | ||
SymbolID SymbolID, uint64_t AccessCount, | ||
const llvm::SmallVector<DataLocation> &Locations) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
||
/// Returns a iterable StringRef for strings in the order they are added. | ||
auto getStrings() const { | ||
ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated change used new MapVector::getArrayRef
as suggested, and this line is simplified away.
llvm::BumpPtrAllocator Allocator; | ||
llvm::UniqueStringSaver Saver; | ||
|
||
// `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops, SymbolToRecordIndex
refers to deleted code. Removed it.
Originally I had
DenseMap<SymbolID, uint64_t> SymbolToRecordIndex;
SmallVector<DataAccessProfRecord> Records;
as opposed to
MapVector<SymbolID, DataAccessProfRecord>
, in order to get a ArrayRef
of the records.
Later I realized that it's simpler to use the latter and add MapVector::getArrayRef
. Strictly speaking MapVector::getArrayRef
returns ArrayRef<key, value>
rather than ArrayRef<value>
, but the implementation is overall simpler to have MapVector
.
@@ -357,6 +357,12 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName); | |||
/// the duplicated profile variables for Comdat functions. | |||
bool needsComdatForCounter(const GlobalObject &GV, const Module &M); | |||
|
|||
/// \c NameStrings is a string composed of one of more possibly encoded | |||
/// sub-strings. The substrings are separated by 0 or more zero bytes. This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done by mentioning the substrings are separated by \01
, which is returned by getInstrProfNameSeparator
.
return Error::success(); | ||
} | ||
StringRef SymbolName = std::get<StringRef>(SymbolID); | ||
if (SymbolName.empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a static
helper function getCanonicalName
in this cpp file.
public: | ||
// SymbolID is either a string representing symbol name, or a uint64_t | ||
// representing the content hash of a string literal. | ||
using SymbolID = std::variant<StringRef, uint64_t>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also fixed a potential use-after-free bug in L69 with std::tie(Key, RecordID) = *saveStringToMap(StrToIndexMap, Saver, *CanonicalName);
in the updated change. Basically Key
should use the owned copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with a couple of small comments below.
if (std::holds_alternative<StringRef>(SymbolID)) { | ||
StringRef Name = std::get<StringRef>(SymbolID); | ||
assert(!Name.empty() && "Empty symbol name"); | ||
Key = InstrProfSymtab::getCanonicalName(Name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call the new getCanonicalName helper defined above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This getProfileRecord
suppresses name canonicalization error rather than further propagating it, mainly to make the return value type simple (i.e., no need to handle Expected<Ptr>
type) for the common no-error path.
Added assertion to make the code intention clearer.
: SymbolID(SymbolID), AccessCount(AccessCount), | ||
IsStringLiteral(IsStringLiteral) {} | ||
|
||
// Represents a data symbol. The semantic comes in two forms: a symbol index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understood that, but my question was more about why in practice some would be string literals and some would be hashes. Might be useful to note this in a comment.
…'s underlying vector (#138726) SetVector currently has a [similar method](https://github.com/llvm/llvm-project/blob/c956ed06dc1c1b340d0c589c472c438b9220b36d/llvm/include/llvm/ADT/SetVector.h#L90), and llvm/llvm-project#138170 has a use case to get an ArrayRef of MapVector's underlying vector.
…ng vector (llvm#138726) SetVector currently has a [similar method](https://github.com/llvm/llvm-project/blob/c956ed06dc1c1b340d0c589c472c438b9220b36d/llvm/include/llvm/ADT/SetVector.h#L90), and llvm#138170 has a use case to get an ArrayRef of MapVector's underlying vector.
Context: For https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744#p-336543-background-3, we propose to profile memory loads and stores via hardware events, symbolize the addresses of binary static data sections and feed the profile back into compiler for data partitioning.
This change adds the profile format for static data layout, and the classes to operate on it.
The profile and its format
.str.<N>[.llvm.<hash>]
.In the profile's serialization format, strings are concatenated together and compressed. Individual records stores the index.
A separate PR will connect this class to InstrProfReader/Writer via MemProfReader/Writer.