-
Notifications
You must be signed in to change notification settings - Fork 265
feature: terminate on page fault for xe2 #878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
jackm-intel
wants to merge
1
commit into
intel:master
Choose a base branch
from
jackm-intel:xe2-pagefault-termination
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
+559
−23
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Enable FAULT_MODE explicitly at the call site when scratch pages are disabled. Previously, getFlagsForVmCreate() implicitly enabled fault mode when disableScratch was true. This change makes the policy decision explicit in createDrmVirtualMemory(): - disableScratch requires page fault handling because without scratch pages, unmapped GPU addresses need fault detection to report errors - Moving this logic to the call site makes the relationship clear rather than hiding it inside the ioctl helper Signed-off-by: Jack Myers <jack.myers@intel.com>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Feb 6, 2026
Add initial declarations for the drm_xe_vm_get_property ioctl. v2: - Expand kernel docs for drm_xe_vm_get_property (Jianxun) v3: - Remove address type external definitions (Jianxun) - Add fault type to xe_drm_fault struct (Jianxun) v4: - Remove engine class and instance (Ivan) v5: - Add declares for fault type, access type, and fault level (Matt Brost, Ivan) v6: - Fix inconsistent use of whitespace in defines v7: - Rebase and refactor (jcavitt) v8: - Rebase (jcavitt) uAPI: intel/compute-runtime#878 Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Ivan Briano <ivan.briano@intel.com> Cc: Zhang Jianxun <jianxun.zhang@intel.com> Cc: Ivan Briano <ivan.briano@intel.com> Cc: Matthew Brost <matthew.brost@intel.com>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Feb 6, 2026
Add additional information to each VM so they can report up to the first 50 seen faults. Only pagefaults are saved this way currently, though in the future, all faults should be tracked by the VM for future reporting. Additionally, of the pagefaults reported, only failed pagefaults are saved this way, as successful pagefaults should recover silently and not need to be reported to userspace. v2: - Free vm after use (Shuicheng) - Compress pf copy logic (Shuicheng) - Update fault_unsuccessful before storing (Shuicheng) - Fix old struct name in comments (Shuicheng) - Keep first 50 pagefaults instead of last 50 (Jianxun) v3: - Avoid unnecessary execution by checking MAX_PFS earlier (jcavitt) - Fix double-locking error (jcavitt) - Assert kmemdump is successful (Shuicheng) v4: - Rename xe_vm.pfs to xe_vm.faults (jcavitt) - Store fault data and not pagefault in xe_vm faults list (jcavitt) - Store address, address type, and address precision per fault (jcavitt) - Store engine class and instance data per fault (Jianxun) - Add and fix kernel docs (Michal W) - Properly handle kzalloc error (Michal W) - s/MAX_PFS/MAX_FAULTS_SAVED_PER_VM (Michal W) - Store fault level per fault (Micahl M) v5: - Store fault and access type instead of address type (Jianxun) v6: - Store pagefaults in non-fault-mode VMs as well (Jianxun) v7: - Fix kernel docs and comments (Michal W) v8: - Fix double-locking issue (Jianxun) v9: - Do not report faults from reserved engines (Jianxun) v10: - Remove engine class and instance (Ivan) v11: - Perform kzalloc outside of lock (Auld) v12: - Fix xe_vm_fault_entry kernel docs (Shuicheng) v13: - Rebase and refactor (jcavitt) v14: - Correctly ignore fault mode in save_pagefault_to_vm (jcavitt) v15: - s/save_pagefault_to_vm/xe_pagefault_save_to_vm (Matt Brost) - Use guard instead of spin_lock/unlock (Matt Brost) - GT was added to xe_pagefault struct. Use xe_gt_hw_engine instead of creating a new helper function (Matt Brost) v16: - Set address precision programmatically (Matt Brost) uAPI: intel/compute-runtime#878 Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Jianxun Zhang <jianxun.zhang@intel.com> Cc: Michal Wajdeczko <Michal.Wajdeczko@intel.com> Cc: Michal Mzorek <michal.mzorek@intel.com> Cc: Ivan Briano <ivan.briano@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com>
intel-lab-lkp
pushed a commit
to intel-lab-lkp/linux
that referenced
this pull request
Feb 6, 2026
Add support for userspace to request a list of observed faults from a specified VM. v2: - Only allow querying of failed pagefaults (Matt Brost) v3: - Remove unnecessary size parameter from helper function, as it is a property of the arguments. (jcavitt) - Remove unnecessary copy_from_user (Jainxun) - Set address_precision to 1 (Jainxun) - Report max size instead of dynamic size for memory allocation purposes. Total memory usage is reported separately. v4: - Return int from xe_vm_get_property_size (Shuicheng) - Fix memory leak (Shuicheng) - Remove unnecessary size variable (jcavitt) v5: - Rename ioctl to xe_vm_get_faults_ioctl (jcavitt) - Update fill_property_pfs to eliminate need for kzalloc (Jianxun) v6: - Repair and move fill_faults break condition (Dan Carpenter) - Free vm after use (jcavitt) - Combine assertions (jcavitt) - Expand size check in xe_vm_get_faults_ioctl (jcavitt) - Remove return mask from fill_faults, as return is already -EFAULT or 0 (jcavitt) v7: - Revert back to using xe_vm_get_property_ioctl - Apply better copy_to_user logic (jcavitt) v8: - Fix and clean up error value handling in ioctl (jcavitt) - Reapply return mask for fill_faults (jcavitt) v9: - Future-proof size logic for zero-size properties (jcavitt) - Add access and fault types (Jianxun) - Remove address type (Jianxun) v10: - Remove unnecessary switch case logic (Raag) - Compress size get, size validation, and property fill functions into a single helper function (jcavitt) - Assert valid size (jcavitt) v11: - Remove unnecessary else condition - Correct backwards helper function size logic (jcavitt) v12: - Use size_t instead of int (Raag) v13: - Remove engine class and instance (Ivan) v14: - Map access type, fault type, and fault level to user macros (Matt Brost, Ivan) v15: - Remove unnecessary size assertion (jcavitt) v16: - Nit fixes (Matt Brost) v17: - Rebase and refactor (jcavitt) v18: - Do not copy_to_user in critical section (Matt Brost) - Assert args->size is multiple of sizeof(struct xe_vm_fault) (Matt Brost) v19: - Remove unnecessary memset (Matt Brost) uAPI: intel/compute-runtime#878 Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Cc: Jainxun Zhang <jianxun.zhang@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Raag Jadav <raag.jadav@intel.com> Cc: Ivan Briano <ivan.briano@intel.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Enable FAULT_MODE explicitly at the call site when scratch pages are disabled. Previously, getFlagsForVmCreate() implicitly enabled fault mode when disableScratch was true. This change makes the policy decision explicit in createDrmVirtualMemory():