[StaticDataLayout][PGO] Add profile format for static data layout, and the classes to operate on the profiles. #138170

Merged
merged 9 commits into from
May 16, 2025
214 changes: 214 additions & 0 deletions llvm/include/llvm/ProfileData/DataAccessProf.h
@@ -0,0 +1,214 @@
//===- DataAccessProf.h - Data access profile format support ----*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file contains support to construct and use data access profiles.
//
// For the original RFC of this pass please see
// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
#define LLVM_PROFILEDATA_DATAACCESSPROF_H_

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseMapInfoVariant.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ProfileData/InstrProf.h"
#include "llvm/Support/Allocator.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/StringSaver.h"

#include <cstdint>
#include <optional>
#include <variant>

namespace llvm {

namespace data_access_prof {

/// The location of data in the source code. Used by profile lookup API.
struct SourceLocation {
SourceLocation(StringRef FileNameRef, uint32_t Line)
: FileName(FileNameRef.str()), Line(Line) {}
/// The filename where the data is located.
std::string FileName;
/// The line number in the source code.
uint32_t Line;
};

namespace internal {
Contributor

Can the contents of the internal namespace (i.e. the ref variants) be moved to the .cpp file?

Contributor Author

I tried using a forward declaration in the internal namespace, but this gives compile errors, which indicate that a forward declaration doesn't work with class template instantiation.

Moving SourceLocationRef by itself gives the error below, and moving DataAccessProfRecordRef along with SourceLocationRef caused similar static assertion errors for class DataAccessProfData.

/../../../../include/c++/14/type_traits:1364:21: error: static assertion failed due to requirement 'std::__is_complete_or_unbounded(std::__type_identity<llvm::data_access_prof::internal::SourceLocationRef>{})': template argument must be a complete class or an unbounded array
 1364 |       static_assert(std::__is_complete_or_unbounded(__type_identity<_Tp>{}),
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llvm/include/llvm/ADT/SmallVector.h:327:36: note: in instantiation of template class 'std::is_trivially_move_constructible<llvm::data_access_prof::internal::SourceLocationRef>' requested here
  327 |                              (std::is_trivially_move_constructible<T>::value) &&
      |                                    ^
llvm/include/llvm/ADT/SmallVector.h:573:32: note: in instantiation of default argument for 'SmallVectorTemplateBase<llvm::data_access_prof::internal::SourceLocationRef>' required here
  573 | class SmallVectorImpl : public SmallVectorTemplateBase<T> {
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~
llvm/include/llvm/ADT/SmallVector.h:1195:43: note: in instantiation of template class 'llvm::SmallVectorImpl<llvm::data_access_prof::internal::SourceLocationRef>' requested here
 1195 | class LLVM_GSL_OWNER SmallVector : public SmallVectorImpl<T>,
      |                                           ^
llvm/include/llvm/ProfileData/DataAccessProf.h:82:43: note: in instantiation of template class 'llvm::SmallVector<llvm::data_access_prof::internal::SourceLocationRef, 0>' requested here
   82 |   llvm::SmallVector<SourceLocationRef, 0> Locations;
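
The trace above shows `SmallVector` evaluating `std::is_trivially_move_constructible<T>` while it is instantiated as a by-value member, and that trait needs a complete `T`. Below is a minimal standalone sketch of the same constraint, using only the standard library. `LocationRef`, `HoldsPointer`, and `Record` are hypothetical names chosen for illustration and are not part of the patch.

```cpp
#include <type_traits>
#include <vector>

// Hypothetical stand-in for internal::SourceLocationRef.
struct LocationRef; // forward declaration: the type is incomplete here

// Pointers and references to an incomplete type are allowed.
struct HoldsPointer {
  const LocationRef *Loc = nullptr;
};

// A trait query such as std::is_trivially_move_constructible<LocationRef>
// would be rejected at this point, because the type is still incomplete.

struct LocationRef {
  const char *FileName;
  unsigned Line;
};

// With the full definition visible, the trait can be evaluated and a
// by-value container member is fine.
static_assert(std::is_trivially_move_constructible<LocationRef>::value,
              "LocationRef is a trivially movable aggregate");

struct Record {
  std::vector<LocationRef> Locations;
};
```

This is consistent with keeping the definitions in the header: any class that stores the ref types by value in a container member needs their full definitions where that member is declared.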


// Conceptually similar to SourceLocation except that FileNames are StringRef of
// which strings are owned by `DataAccessProfData`. Used by `DataAccessProfData`
// to represent data locations internally.
struct SourceLocationRef {
// The filename where the data is located.
StringRef FileName;
// The line number in the source code.
uint32_t Line;
};

// The data access profiles for a symbol. Used by `DataAccessProfData`
// to represent records internally.
struct DataAccessProfRecordRef {
DataAccessProfRecordRef(uint64_t SymbolID, uint64_t AccessCount,
bool IsStringLiteral)
: SymbolID(SymbolID), AccessCount(AccessCount),
IsStringLiteral(IsStringLiteral) {}

// Represents a data symbol. The semantic comes in two forms: a symbol index
Contributor

When would the different forms be used?

Contributor Author

The semantics of this field depend on the IsStringLiteral field below. For a string literal, IsStringLiteral is true and SymbolID is a hash; otherwise, IsStringLiteral is false and SymbolID represents an index.

Contributor

I understood that, but my question was more about why in practice some would be string literals and some would be hashes. Might be useful to note this in a comment.

Contributor Author

> I understood that, but my question was more about why in practice some would be string literals and some would be hashes.

This makes sense. Added comment at L55 to explain why two forms are used.

// for the symbol name if `IsStringLiteral` is false, or the hash of the
// string content if `IsStringLiteral` is true. For most symbolizable static
// data, the mangled symbol names remain stable relative to the source code
// and are therefore used to identify symbols across binary releases. String
// literals have unstable name patterns like `.str.N[.llvm.hash]`, so we use
// the content hash instead. This is a required field.
uint64_t SymbolID;
Contributor

It's a little confusing that SymbolID is a different thing (a type) in the following class. Suggest making these different.

Contributor Author

Indeed. I renamed the type std::variant<StringRef, uint64_t> to SymbolHandle.


// The access count of symbol. Required.
uint64_t AccessCount;

// True iff this is a record for string literal (symbols with name pattern
// `.str.*` in the symbol table). Required.
bool IsStringLiteral;

// The locations of data in the source code. Optional.
llvm::SmallVector<SourceLocationRef, 0> Locations;
};
} // namespace internal

// SymbolHandleRef is either a string representing the symbol name, if the
// symbol has a stable mangled name relative to source code, or a uint64_t
// representing the content hash of a string literal (with unstable name
// patterns like `.str.N[.llvm.hash]`). The StringRef is owned by the class's
// saver object.
using SymbolHandleRef = std::variant<StringRef, uint64_t>;

// The semantics are the same as `SymbolHandleRef` above. The strings are owned.
using SymbolHandle = std::variant<std::string, uint64_t>;

/// The data access profiles for a symbol.
struct DataAccessProfRecord {
public:
DataAccessProfRecord(SymbolHandleRef SymHandleRef,
ArrayRef<internal::SourceLocationRef> LocRefs) {
if (std::holds_alternative<StringRef>(SymHandleRef))
SymHandle = std::get<StringRef>(SymHandleRef).str();
else
SymHandle = std::get<uint64_t>(SymHandleRef);

for (auto Loc : LocRefs)
Locations.push_back(SourceLocation(Loc.FileName, Loc.Line));
}
SymbolHandle SymHandle;

// The locations of data in the source code. Optional.
SmallVector<SourceLocation> Locations;
};
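
The two aliases and the constructor above follow a common pattern: a non-owning variant used for lookups, and an owning variant handed back to callers. A minimal sketch of that pattern without LLVM is shown below. It assumes `std::string_view` as a stand-in for `StringRef`, and `HandleRef`, `Handle`, `toOwned`, and `describe` are hypothetical names rather than identifiers from the patch.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <variant>

// Stand-ins for SymbolHandleRef / SymbolHandle; std::string_view plays the
// role of llvm::StringRef so the sketch builds without LLVM.
using HandleRef = std::variant<std::string_view, std::uint64_t>; // non-owning
using Handle = std::variant<std::string, std::uint64_t>;         // owning

// Hypothetical helper: copy a non-owning handle into an owning one, as the
// DataAccessProfRecord constructor does for its symbol handle.
Handle toOwned(const HandleRef &Ref) {
  if (const auto *Name = std::get_if<std::string_view>(&Ref))
    return std::string(*Name);
  return std::get<std::uint64_t>(Ref);
}

// Hypothetical helper: tell the two forms apart when reporting a record.
std::string describe(const Handle &H) {
  if (const auto *Name = std::get_if<std::string>(&H))
    return "symbol " + *Name;
  return "string literal hash " + std::to_string(std::get<std::uint64_t>(H));
}
```

Because the owning handle holds its own `std::string`, it stays valid after the storage behind the non-owning view changes or goes away.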

/// Encapsulates the data access profile data and the methods to operate on
/// it. This class provides profile look-up, serialization and
/// deserialization.
class DataAccessProfData {
public:
// Use MapVector to keep input order of strings for serialization and
// deserialization.
using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;

DataAccessProfData() : Saver(Allocator) {}

/// Serialize profile data to the output stream.
/// Storage layout:
/// - Serialized strings.
/// - The encoded hashes.
/// - Records.
Error serialize(ProfOStream &OS) const;

/// Deserialize this class from the given buffer.
Error deserialize(const unsigned char *&Ptr);

/// Returns a profile record for \p SymID, or std::nullopt if there
/// isn't a record. Internally, this function will canonicalize the symbol
/// name before the lookup.
std::optional<DataAccessProfRecord>
getProfileRecord(const SymbolHandleRef SymID) const;

/// Returns true if \p SymID is seen in profiled binaries and cold.
bool isKnownColdSymbol(const SymbolHandleRef SymID) const;

/// Methods to set symbolized data access profile. Returns error if
/// duplicated symbol names or content hashes are seen. The user of this
/// class should aggregate counters that correspond to the same symbol name
/// or with the same string literal hash before calling 'set*' methods.
Error setDataAccessProfile(SymbolHandleRef SymbolID, uint64_t AccessCount);
/// Similar to the method above, for records with \p Locations representing
/// the `filename:line` where this symbol shows up. Note that because the
/// linker merges identical symbols (e.g., unnamed_addr string literals), one
/// symbol is likely to have multiple locations.
Error setDataAccessProfile(SymbolHandleRef SymbolID, uint64_t AccessCount,
ArrayRef<SourceLocation> Locations);
/// Add a symbol that's seen in the profiled binary without samples.
Error addKnownSymbolWithoutSamples(SymbolHandleRef SymbolID);
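
The comment on the `set*` methods asks callers to aggregate counters per symbol first, since a duplicate key is reported as an error. A rough sketch of that calling pattern follows. `ToyProfileStore` and `aggregate` are hypothetical names, and the store returns `bool` instead of `llvm::Error` so the example builds without LLVM.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a store that, like the `set*` methods described
// above, rejects a key it has already seen.
class ToyProfileStore {
public:
  // Returns false on a duplicate key (the real API returns an llvm::Error).
  bool set(const std::string &Symbol, std::uint64_t AccessCount) {
    return Counts.emplace(Symbol, AccessCount).second;
  }

  // Returns the stored count, or 0 if the symbol is unknown.
  std::uint64_t count(const std::string &Symbol) const {
    auto It = Counts.find(Symbol);
    return It == Counts.end() ? 0 : It->second;
  }

private:
  std::map<std::string, std::uint64_t> Counts;
};

// Sum raw samples per symbol so that each symbol is set exactly once.
std::map<std::string, std::uint64_t>
aggregate(const std::vector<std::pair<std::string, std::uint64_t>> &Samples) {
  std::map<std::string, std::uint64_t> Totals;
  for (const auto &[Symbol, Count] : Samples)
    Totals[Symbol] += Count;
  return Totals;
}
```

A caller would run `aggregate` over the raw samples and then call `set` once per entry; a second `set` for the same symbol is rejected and leaves the stored count unchanged.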

/// The following methods return array references for various internal data
/// structures.
ArrayRef<StringToIndexMap::value_type> getStrToIndexMapRef() const {
return StrToIndexMap.getArrayRef();
}
ArrayRef<
MapVector<SymbolHandleRef, internal::DataAccessProfRecordRef>::value_type>
getRecords() const {
return Records.getArrayRef();
}
ArrayRef<StringRef> getKnownColdSymbols() const {
return KnownColdSymbols.getArrayRef();
}
ArrayRef<uint64_t> getKnownColdHashes() const {
return KnownColdHashes.getArrayRef();
}

private:
/// Serialize the symbol strings into the output stream.
Error serializeSymbolsAndFilenames(ProfOStream &OS) const;

/// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
/// start of the next payload.
Error deserializeSymbolsAndFilenames(const unsigned char *&Ptr,
const uint64_t NumSampledSymbols,
const uint64_t NumColdKnownSymbols);

/// Decode the records and increment \p Ptr to the start of the next
/// payload.
Error deserializeRecords(const unsigned char *&Ptr);

/// A helper function to compute a storage index for \p SymbolID.
uint64_t getEncodedIndex(const SymbolHandleRef SymbolID) const;

// Keeps owned copies of the input strings.
// NOTE: Keep `Saver` initialized before other class members that reference
// its string copies and destructed after they are destructed.
llvm::BumpPtrAllocator Allocator;
llvm::UniqueStringSaver Saver;

// `Records` stores the records.
MapVector<SymbolHandleRef, internal::DataAccessProfRecordRef> Records;

StringToIndexMap StrToIndexMap;
llvm::SetVector<uint64_t> KnownColdHashes;
llvm::SetVector<StringRef> KnownColdSymbols;
};

} // namespace data_access_prof
} // namespace llvm

#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
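
The NOTE on `Allocator` and `Saver` relies on a C++ rule: non-static members are constructed in declaration order and destroyed in reverse. A small sketch that makes the order observable is shown below. `Tracer`, `Holder`, and `eventLog` are hypothetical names used only for this illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Records construction and destruction events in the order they happen.
std::vector<std::string> &eventLog() {
  static std::vector<std::string> Log;
  return Log;
}

struct Tracer {
  explicit Tracer(std::string N) : Name(std::move(N)) {
    eventLog().push_back("ctor " + Name);
  }
  ~Tracer() { eventLog().push_back("dtor " + Name); }
  std::string Name;
};

// Mirrors the layout in the header: the member that owns storage is declared
// before the members that refer to it.
struct Holder {
  Holder() : Storage("storage"), User("user") {}
  Tracer Storage; // declared first: constructed first, destroyed last
  Tracer User;    // declared second: constructed second, destroyed first
};
```

Creating and destroying a `Holder` logs `ctor storage`, `ctor user`, `dtor user`, `dtor storage`, which is why members holding references into the saver's strings should be declared after it.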
17 changes: 12 additions & 5 deletions llvm/include/llvm/ProfileData/InstrProf.h
@@ -357,6 +357,13 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
/// the duplicated profile variables for Comdat functions.
bool needsComdatForCounter(const GlobalObject &GV, const Module &M);

/// \c NameStrings is a string composed of one or more possibly encoded
/// sub-strings. The substrings are separated by `\01` (returned by
/// InstrProf.h:getInstrProfNameSeparator). This method decodes the string and
/// calls `NameCallback` for each substring.
Error readAndDecodeStrings(StringRef NameStrings,
std::function<Error(StringRef)> NameCallback);
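
The callback shape described above can be sketched without LLVM as follows. This is a simplified illustration, not the implementation in the patch: it only performs the `\01` split and does not decode encoded sub-strings, it returns `bool` rather than `llvm::Error`, and `forEachName` and `collectNames` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Calls Callback for each '\001'-separated substring of Names. Stops early
// and returns false as soon as the callback reports failure.
bool forEachName(std::string_view Names,
                 const std::function<bool(std::string_view)> &Callback) {
  constexpr char Separator = '\001';
  if (Names.empty())
    return true;
  while (true) {
    const std::size_t End = Names.find(Separator);
    // substr(0, npos) yields the whole remaining string for the last piece.
    if (!Callback(Names.substr(0, End)))
      return false;
    if (End == std::string_view::npos)
      return true;
    Names.remove_prefix(End + 1);
  }
}

// Usage example: collect every substring into a vector.
std::vector<std::string> collectNames(std::string_view Names) {
  std::vector<std::string> Out;
  forEachName(Names, [&Out](std::string_view Name) {
    Out.emplace_back(Name);
    return true;
  });
  return Out;
}
```

Passing the per-name action as a callback lets the same walk serve different consumers, and a failing callback can stop the walk early.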

/// An enum describing the attributes of an instrumented profile.
enum class InstrProfKind {
Unknown = 0x0,
@@ -493,6 +500,11 @@ class InstrProfSymtab {
public:
using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;

// Returns the canonical name of the given PGOName. In a canonical name, all
// suffixes that begin with "." except ".__uniq." are stripped.
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
static StringRef getCanonicalName(StringRef PGOName);
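
One way to read the rule in the comment above is sketched below with `std::string_view`. This is an interpretation of the comment, not the LLVM implementation: `canonicalName` is a hypothetical name, and returning a name that starts with `.` unchanged is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string_view>

// Strips everything from the first '.' that is not part of a ".__uniq."
// marker. The result is a view into the input, so no copy is made.
std::string_view canonicalName(std::string_view Name) {
  constexpr std::string_view Uniq = ".__uniq.";
  std::size_t From = 0;
  // If ".__uniq." is present, start looking for the next '.' after it.
  if (std::size_t Pos = Name.find(Uniq); Pos != std::string_view::npos)
    From = Pos + Uniq.size();
  const std::size_t Dot = Name.find('.', From);
  // Assumption: no further '.' or a leading '.' means nothing is stripped.
  if (Dot == std::string_view::npos || Dot == 0)
    return Name;
  return Name.substr(0, Dot);
}
```

Under this reading, `foo.llvm.123` maps to `foo`, while `foo.__uniq.456.llvm.789` keeps its `.__uniq.456` part and only loses `.llvm.789`.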

private:
using AddrIntervalMap =
IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
@@ -528,11 +540,6 @@ class InstrProfSymtab {

static StringRef getExternalSymbol() { return "** External Symbol **"; }

// Returns the canonial name of the given PGOName. In a canonical name, all
// suffixes that begins with "." except ".__uniq." are stripped.
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
static StringRef getCanonicalName(StringRef PGOName);

// Add the function into the symbol table, by creating the following
// map entries:
// name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
1 change: 1 addition & 0 deletions llvm/lib/ProfileData/CMakeLists.txt
@@ -1,4 +1,5 @@
add_llvm_component_library(LLVMProfileData
DataAccessProf.cpp
GCOV.cpp
IndexedMemProfData.cpp
InstrProf.cpp