Conversation
|
It would be helpful if you could put custom labels on the |
|
the FreeBSD check failure is unrelated |
rdementi
left a comment
There was a problem hiding this comment.
great start! Your TODO list makes sense to me. Thank you!
| bool PCM::CoreLocalMemoryBWMetricAvailable() const | ||
| { | ||
| if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata | ||
| //if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata |
There was a problem hiding this comment.
did you remove these checks for testing purposes?
| bool PCM::CoreRemoteMemoryBWMetricAvailable() const | ||
| { | ||
| if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata | ||
| //if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata |
fmuyassarov
left a comment
There was a problem hiding this comment.
I only checked the commit that adds the Helm chart and it looks good to me. The only thing I wondered is do you really expect users to modify all the available values in the values.yaml. To me some looked not very necessary but I can't judge since I'm not familiar with PCM.
LGTM
@jcpunk Thanks for this comment, that is very helpfull indeed - having that there is no longer need to hack prometheus-operator chart to disable podMonitorSelector - I added comments in README/values file about this (check this changes in commit for details (7f2c707#diff-618d3b78482c88190c469bb01731f774bb931bcdc14db7b8980691f5745ba08aR151-R152) or this documentation added in values pcm/deployment/pcm/values.yaml Lines 87 to 91 in 7f2c707 Anyway help for pointing this. |
That is the trade of between flexibility and complexity that I'm finding hard to balance. I see two options here:
and try to handle all possible (both proper and inproper combination) - but now rewrite this using go/helm template language - not very maintanable nor readable - I don't really want chart to be validator of possible PCM envs and I'm terified of hardcoding all proper possible combinations - in other words - if you don't like defaults (or other example value files) - then you're responsible for validating that pcm-sensor-server binary will run - but finding right combination is your job in your case. (there is also another option - e.g. allow to pass *any enviornment/options to pcm - but that is discouraged security practice) I the end, I decided to just allow use of this pcm chart to pass those "all PCM suppored" environment variables directly - and tried to cover possible value of combinations as different values files. One more comment, I agree that value file is alread quite big, but not yet as scary as the I see in prometheus node exporter official chart values.yaml - it is 500 lines vs my so far about 100 (but I miss some feature though RBACs, vertical scaling) which I'm using as example of good practicies :) |
a30b342 to
8edb9dc
Compare
old comments: sys/pci/mcfg mounts are unnessesary for indirect method fix old wrong defaults in README fix formatting possible fix for issue with resctrl remove hacks to handle /pcm/resctrl and unessesary out-of-date files update License to use the same as pcm itself update README, remove out-of-date info links do values formatting + links do values update README an values comments update README address jcfunk comments: interval and extra labels for PodMonitor + refactor readme fix typos readme: reminder about removing msr kernel module after rebasing: point to correct default pcm image from intel organization Refactoring: - explicit values file for privileged direct method, - hide (into docs directory) "unprivileged" direct method (and fixes), - remove unnessesary mounts (mcfg, /dev/cpu/dev/mem for privileged access), - add instructions to collection methods, - fixes (extra builder) for build local development image, - silent mode - move collection methods to the top fix values files for direct privileged method New: support for PERFMON capability, silent mode and some extra env debug variables VPA: v1 - first version of vertical pod autoscaler Grafana dashboard: instructions rename resctrlHostMount to resctrlMount fix dashboard rate interval pcm-sensor-server: add new metrics DRAM Local percantage Fix dockerbuild by using separate Dockerfile + build in dockerignore improve dockerfile.debug extra env PCM_NO_MAIN_EXCEPTION_HANDLER
|
|
||
| if (pcm->localMemoryRequestRatioMetricAvailable()) | ||
| printCounter( "DRAM Local Percentage", getLocalMemoryRequestRatio( before, after ) ); | ||
|
|
There was a problem hiding this comment.
I think the addition of a new metric should be split out into a separate merge request or at least a separate commit.
|
|
||
| if (pcm->localMemoryRequestRatioMetricAvailable()) | ||
| printCounter( "DRAM Local Percentage", getLocalMemoryRequestRatio( before, after ) ); | ||
|
|
|
Hello, I think that I have successfully made all steps before 6) Deploy PCM helm chart in paragraph Validation on local kind cluster |
|
|
||
| - kubectl/kind/helm/jq binaries available in PATH, | ||
| - docker service up and running. | ||
| - full set of metrics available only bare-metal instance or Cloud .metal instance. |
There was a problem hiding this comment.
I think you could add information on how to check it or enabled it.
| helm install ... --set nfd=true --set podMonitor=true --set verticalPodAutoscaler.enabled=true | ||
| ``` | ||
|
|
||
| ### Requirements |
There was a problem hiding this comment.
I think you could add information on how to check it or enable all requirements.
|
I was trying to run e2e test Please, add information about how to add this env Also, the file |

Initial version of Helm chart for deploying PCM
Features include:
TODO:
Must have:
/pcmprefix ends up with non-critical warningperf list | grep -i energyreturnspower/energy-pkg/andpower/energy-ram/possible development to PCM required - are available through perf subsystem perf_even_open syscall and enumerating/discovering /sys/devices/power/events PMU type=9 or through power cap framework similary to node_exporter/rapl collector, reading data directly from sysfs like thiscat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_ujTesting:
Follow up tasks (in another PRs ideas):