
Commit f3f2832

[StaticDataLayout][PGO] Add profile format for static data layout, and the classes to operate on the profiles. (#138170)
Context: For https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744#p-336543-background-3, we propose to profile memory loads and stores via hardware events, symbolize the addresses of binary static data sections, and feed the profile back into the compiler for data partitioning.

This change adds the profile format for static data layout, and the classes to operate on it.

The profile and its format

1. Conceptually, a piece of data (call it a symbol) is represented by its symbol name or its content hash. The former applies to the majority of data, whose mangled names remain relatively stable across binary releases, and the latter applies to string literals (with name patterns like `.str.<N>[.llvm.<hash>]`).
   - The symbols with samples are hot data. The number of hot symbols is small relative to all symbols. For each hot symbol, the profile tracks its sampled counts and locations. Sampled counts come from hardware events, and locations come from debug information in the profiled binary. The symbols without samples are cold data. The number of such cold symbols is large. For each cold symbol, the profile tracks its representation (the name or content hash).
   - Based on a preliminary study, debug information coverage for data symbols is partial and best-effort. In the LLVM IR, global variables with source code correspondence may or may not have debug information. Therefore the location information is optional in the profiles.
2. The profile-and-compile cycle is similar to SamplePGO. Profiles are sampled from production binaries and used in the next binary releases. Known cold symbols and new hot symbols can both have zero sampled counts, so the profile records known cold symbols to tell the two apart in the next compile.

In the profile's serialization format, strings are concatenated together and compressed. Individual records store the index.

A separate PR will connect this class to InstrProfReader/Writer via MemProfReader/Writer.

---------

Co-authored-by: Kazu Hirata <kazu@google.com>
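To illustrate why the profile records known cold symbols, here is a minimal standalone sketch of the three-way distinction the commit message describes: sampled (hot), seen without samples (known cold), and never seen. It uses only the C++ standard library. `Handle`, `Hotness`, and `ToyProfile` are hypothetical names for illustration and are not part of this change, and the `Count > 0` check is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <variant>

// Hypothetical stand-in for a symbol handle: a symbol name, or the content
// hash of a string literal.
using Handle = std::variant<std::string, uint64_t>;

enum class Hotness { Hot, KnownCold, Unseen };

class ToyProfile {
public:
  // Records a sampled (hot) symbol. Returns false on a duplicate handle.
  bool addSampled(const Handle &H, uint64_t Count) {
    assert(Count > 0 && "assumption: a sampled symbol has at least one sample");
    if (Cold.count(H))
      return false;
    return Sampled.emplace(H, Count).second;
  }

  // Records a symbol seen in the profiled binary without samples.
  bool addKnownCold(const Handle &H) {
    if (Sampled.count(H))
      return false;
    return Cold.insert(H).second;
  }

  // A symbol absent from both sets was not in the profiled binary at all,
  // so the next compile cannot assume it is cold.
  Hotness classify(const Handle &H) const {
    if (Sampled.count(H))
      return Hotness::Hot;
    if (Cold.count(H))
      return Hotness::KnownCold;
    return Hotness::Unseen;
  }

private:
  std::map<Handle, uint64_t> Sampled;
  std::set<Handle> Cold;
};
```

Without the `Cold` set, a symbol introduced after profiling would be indistinguishable from one that was profiled and never accessed.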
1 parent 97ad399 commit f3f2832

7 files changed

+677
-11
lines changed

@@ -0,0 +1,214 @@
1+
//===- DataAccessProf.h - Data access profile format support ---------*- C++
2+
//-*-===//
3+
//
4+
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
5+
// See https://llvm.org/LICENSE.txt for license information.
6+
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
7+
//
8+
//===----------------------------------------------------------------------===//
9+
//
10+
// This file contains support to construct and use data access profiles.
11+
//
12+
// For the original RFC of this pass please see
13+
// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
14+
//
15+
//===----------------------------------------------------------------------===//
16+
17+
#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
18+
#define LLVM_PROFILEDATA_DATAACCESSPROF_H_
19+
20+
#include "llvm/ADT/DenseMap.h"
21+
#include "llvm/ADT/DenseMapInfoVariant.h"
22+
#include "llvm/ADT/MapVector.h"
23+
#include "llvm/ADT/STLExtras.h"
24+
#include "llvm/ADT/SetVector.h"
25+
#include "llvm/ADT/SmallVector.h"
26+
#include "llvm/ADT/StringRef.h"
27+
#include "llvm/ProfileData/InstrProf.h"
28+
#include "llvm/Support/Allocator.h"
29+
#include "llvm/Support/Error.h"
30+
#include "llvm/Support/StringSaver.h"
31+
32+
#include <cstdint>
33+
#include <optional>
34+
#include <variant>
35+
36+
namespace llvm {
37+
38+
namespace data_access_prof {
39+
40+
/// The location of data in the source code. Used by profile lookup API.
41+
struct SourceLocation {
42+
SourceLocation(StringRef FileNameRef, uint32_t Line)
43+
: FileName(FileNameRef.str()), Line(Line) {}
44+
/// The filename where the data is located.
45+
std::string FileName;
46+
/// The line number in the source code.
47+
uint32_t Line;
48+
};
49+
50+
namespace internal {
51+
52+
// Conceptually similar to SourceLocation, except that FileName is a StringRef
53+
// whose underlying string is owned by `DataAccessProfData`. Used by
54+
// `DataAccessProfData` to represent data locations internally.
55+
struct SourceLocationRef {
56+
// The filename where the data is located.
57+
StringRef FileName;
58+
// The line number in the source code.
59+
uint32_t Line;
60+
};
61+
62+
// The data access profiles for a symbol. Used by `DataAccessProfData`
63+
// to represent records internally.
64+
struct DataAccessProfRecordRef {
65+
DataAccessProfRecordRef(uint64_t SymbolID, uint64_t AccessCount,
66+
bool IsStringLiteral)
67+
: SymbolID(SymbolID), AccessCount(AccessCount),
68+
IsStringLiteral(IsStringLiteral) {}
69+
70+
// Represents a data symbol. The semantics come in two forms: a symbol index
71+
// for symbol name if `IsStringLiteral` is false, or the hash of a string
72+
// content if `IsStringLiteral` is true. For most of the symbolizable static
73+
// data, the mangled symbol names remain stable relative to the source code
74+
// and are therefore used to identify symbols across binary releases. String
75+
// literals have unstable name patterns like `.str.N[.llvm.hash]`, so we use
76+
// the content hash instead. This is a required field.
77+
uint64_t SymbolID;
78+
79+
// The access count of the symbol. Required.
80+
uint64_t AccessCount;
81+
82+
// True iff this is a record for a string literal (symbols with name pattern
83+
// `.str.*` in the symbol table). Required.
84+
bool IsStringLiteral;
85+
86+
// The locations of data in the source code. Optional.
87+
llvm::SmallVector<SourceLocationRef, 0> Locations;
88+
};
89+
} // namespace internal
90+
91+
// SymbolID is either a string representing symbol name if the symbol has
92+
// stable mangled name relative to source code, or a uint64_t representing the
93+
// content hash of a string literal (with unstable name patterns like
94+
// `.str.N[.llvm.hash]`). The StringRef is owned by the class's saver object.
95+
using SymbolHandleRef = std::variant<StringRef, uint64_t>;
96+
97+
// The semantics are the same as `SymbolHandleRef` above. The strings are owned.
98+
using SymbolHandle = std::variant<std::string, uint64_t>;
99+
100+
/// The data access profiles for a symbol.
101+
struct DataAccessProfRecord {
102+
public:
103+
DataAccessProfRecord(SymbolHandleRef SymHandleRef,
104+
ArrayRef<internal::SourceLocationRef> LocRefs) {
105+
if (std::holds_alternative<StringRef>(SymHandleRef)) {
106+
SymHandle = std::get<StringRef>(SymHandleRef).str();
107+
} else
108+
SymHandle = std::get<uint64_t>(SymHandleRef);
109+
110+
for (auto Loc : LocRefs)
111+
Locations.push_back(SourceLocation(Loc.FileName, Loc.Line));
112+
}
113+
SymbolHandle SymHandle;
114+
115+
// The locations of data in the source code. Optional.
116+
SmallVector<SourceLocation> Locations;
117+
};
118+
119+
/// Encapsulates the data access profile data and the methods to operate on
120+
/// it. This class provides profile look-up, serialization and
121+
/// deserialization.
122+
class DataAccessProfData {
123+
public:
124+
// Use MapVector to keep input order of strings for serialization and
125+
// deserialization.
126+
using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;
127+
128+
DataAccessProfData() : Saver(Allocator) {}
129+
130+
/// Serialize profile data to the output stream.
131+
/// Storage layout:
132+
/// - Serialized strings.
133+
/// - The encoded hashes.
134+
/// - Records.
135+
Error serialize(ProfOStream &OS) const;
136+
137+
/// Deserialize this class from the given buffer.
138+
Error deserialize(const unsigned char *&Ptr);
139+
140+
/// Returns a profile record for \p SymID, or std::nullopt if there
141+
/// isn't a record. Internally, this function will canonicalize the symbol
142+
/// name before the lookup.
143+
std::optional<DataAccessProfRecord>
144+
getProfileRecord(const SymbolHandleRef SymID) const;
145+
146+
/// Returns true if \p SymID is seen in profiled binaries and cold.
147+
bool isKnownColdSymbol(const SymbolHandleRef SymID) const;
148+
149+
/// Methods to set symbolized data access profile. Returns error if
150+
/// duplicated symbol names or content hashes are seen. The user of this
151+
/// class should aggregate counters that correspond to the same symbol name
152+
/// or with the same string literal hash before calling 'set*' methods.
153+
Error setDataAccessProfile(SymbolHandleRef SymbolID, uint64_t AccessCount);
154+
/// Similar to the method above, for records with \p Locations representing
155+
/// the `filename:line` where this symbol shows up. Note that because the
156+
/// linker merges identical symbols (e.g., unnamed_addr string literals), one
157+
/// symbol is likely to have multiple locations.
158+
Error setDataAccessProfile(SymbolHandleRef SymbolID, uint64_t AccessCount,
159+
ArrayRef<SourceLocation> Locations);
160+
/// Add a symbol that's seen in the profiled binary without samples.
161+
Error addKnownSymbolWithoutSamples(SymbolHandleRef SymbolID);
162+
163+
/// The following methods return array reference for various internal data
164+
/// structures.
165+
ArrayRef<StringToIndexMap::value_type> getStrToIndexMapRef() const {
166+
return StrToIndexMap.getArrayRef();
167+
}
168+
ArrayRef<
169+
MapVector<SymbolHandleRef, internal::DataAccessProfRecordRef>::value_type>
170+
getRecords() const {
171+
return Records.getArrayRef();
172+
}
173+
ArrayRef<StringRef> getKnownColdSymbols() const {
174+
return KnownColdSymbols.getArrayRef();
175+
}
176+
ArrayRef<uint64_t> getKnownColdHashes() const {
177+
return KnownColdHashes.getArrayRef();
178+
}
179+
180+
private:
181+
/// Serialize the symbol strings into the output stream.
182+
Error serializeSymbolsAndFilenames(ProfOStream &OS) const;
183+
184+
/// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
185+
/// start of the next payload.
186+
Error deserializeSymbolsAndFilenames(const unsigned char *&Ptr,
187+
const uint64_t NumSampledSymbols,
188+
const uint64_t NumColdKnownSymbols);
189+
190+
/// Decode the records and increment \p Ptr to the start of the next
191+
/// payload.
192+
Error deserializeRecords(const unsigned char *&Ptr);
193+
194+
/// A helper function to compute a storage index for \p SymbolID.
195+
uint64_t getEncodedIndex(const SymbolHandleRef SymbolID) const;
196+
197+
// Keeps owned copies of the input strings.
198+
// NOTE: Keep `Saver` initialized before other class members that reference
199+
// its string copies and destructed after they are destructed.
200+
llvm::BumpPtrAllocator Allocator;
201+
llvm::UniqueStringSaver Saver;
202+
203+
// `Records` stores the records.
204+
MapVector<SymbolHandleRef, internal::DataAccessProfRecordRef> Records;
205+
206+
StringToIndexMap StrToIndexMap;
207+
llvm::SetVector<uint64_t> KnownColdHashes;
208+
llvm::SetVector<StringRef> KnownColdSymbols;
209+
};
210+
211+
} // namespace data_access_prof
212+
} // namespace llvm
213+
214+
#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
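
The header above keeps strings in an insertion-ordered `StringToIndexMap` "to keep input order of strings for serialization and deserialization", and the commit message says individual records store the index. The following standalone sketch shows that idea with the standard library only. `ToyStringTable` is a hypothetical name, and the sketch makes no claim about how `DataAccessProfData` implements it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical insertion-ordered string table: each distinct string gets the
// index of its first insertion, so a reader that replays the strings in the
// same order reconstructs the same indices.
class ToyStringTable {
public:
  // Returns the index of S, inserting it at the end if it is new.
  uint64_t intern(const std::string &S) {
    auto [It, Inserted] = Index.try_emplace(S, Strings.size());
    if (Inserted)
      Strings.push_back(S);
    return It->second;
  }

  // Looks up a string by the index stored in a record.
  const std::string &at(uint64_t I) const {
    assert(I < Strings.size() && "index out of range");
    return Strings[I];
  }

  // Strings in insertion order, i.e. the order a writer would emit them.
  const std::vector<std::string> &inOrder() const { return Strings; }

private:
  std::vector<std::string> Strings;
  std::unordered_map<std::string, uint64_t> Index;
};
```

A record can then hold a small integer in place of a repeated symbol name or file name, and the string payload is written once.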

llvm/include/llvm/ProfileData/InstrProf.h

+12 -5
@@ -357,6 +357,13 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
357 357
/// the duplicated profile variables for Comdat functions.
358 358
bool needsComdatForCounter(const GlobalObject &GV, const Module &M);
359 359

360+
/// \c NameStrings is a string composed of one or more possibly encoded
361+
/// sub-strings. The substrings are separated by `\01` (returned by
362+
/// InstrProf.h:getInstrProfNameSeparator). This method decodes the string and
363+
/// calls `NameCallback` for each substring.
364+
Error readAndDecodeStrings(StringRef NameStrings,
365+
std::function<Error(StringRef)> NameCallback);
366+
360 367
/// An enum describing the attributes of an instrumented profile.
361 368
enum class InstrProfKind {
362 369
Unknown = 0x0,
@@ -493,6 +500,11 @@ class InstrProfSymtab {
493 500
public:
494 501
using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;
495 502

503+
// Returns the canonical name of the given PGOName. In a canonical name, all
504+
// suffixes that begin with "." except ".__uniq." are stripped.
505+
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
506+
static StringRef getCanonicalName(StringRef PGOName);
507+
496 508
private:
497 509
using AddrIntervalMap =
498 510
IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
@@ -528,11 +540,6 @@ class InstrProfSymtab {
528 540

529 541
static StringRef getExternalSymbol() { return "** External Symbol **"; }
530 542

531-
// Returns the canonial name of the given PGOName. In a canonical name, all
532-
// suffixes that begins with "." except ".__uniq." are stripped.
533-
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
534-
static StringRef getCanonicalName(StringRef PGOName);
535-
536 543
// Add the function into the symbol table, by creating the following
537 544
// map entries:
538 545
// name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
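
As a rough illustration of the two helpers this diff touches, the sketch below splits a `\01`-separated name string and calls a callback per substring, and strips `.`-prefixed suffixes while keeping `.__uniq.`. It is a standalone approximation written from the comments above, not the LLVM implementation: it handles only plain (not encoded or compressed) strings, uses `bool` where LLVM uses `Error`, and `forEachName` and `canonicalName` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Calls Callback for each '\01'-separated substring of Names. Stops and
// returns false as soon as a callback returns false. An empty input yields a
// single empty substring.
bool forEachName(std::string_view Names,
                 const std::function<bool(std::string_view)> &Callback) {
  assert(Callback && "callback must be callable");
  constexpr char Sep = '\01';
  while (true) {
    std::size_t Pos = Names.find(Sep);
    if (!Callback(Names.substr(0, Pos)))
      return false;
    if (Pos == std::string_view::npos)
      return true;
    Names.remove_prefix(Pos + 1);
  }
}

// Strips everything from the first '.' onward, except that a ".__uniq."
// component (and the token right after it) is kept. A name that starts with
// '.' is returned unchanged; that is a choice made in this sketch.
std::string_view canonicalName(std::string_view Name) {
  constexpr std::string_view Uniq = ".__uniq.";
  std::size_t Start = 0;
  if (std::size_t U = Name.find(Uniq); U != std::string_view::npos)
    Start = U + Uniq.size();
  std::size_t Dot = Name.find('.', Start);
  if (Dot == std::string_view::npos || Dot == 0)
    return Name;
  return Name.substr(0, Dot);
}
```

Under these assumptions, `canonicalName("foo.llvm.123")` yields `foo` and `canonicalName("foo.__uniq.456.llvm.789")` yields `foo.__uniq.456`.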

llvm/lib/ProfileData/CMakeLists.txt

+1
@@ -1,4 +1,5 @@
1 1
add_llvm_component_library(LLVMProfileData
2+
DataAccessProf.cpp
2 3
GCOV.cpp
3 4
IndexedMemProfData.cpp
4 5
InstrProf.cpp
