feature: terminate on page fault for xe2 #878

jackm-intel · 2025-12-30T18:35:03Z

Enable FAULT_MODE explicitly at the call site when scratch pages are disabled. Previously, getFlagsForVmCreate() implicitly enabled fault mode when disableScratch was true. This change makes the policy decision explicit in createDrmVirtualMemory():

- disableScratch requires page fault handling because, without scratch pages, accesses to unmapped GPU addresses can only be reported as errors through fault detection.
- Moving this logic to the call site makes the relationship clear rather than hiding it inside the ioctl helper.
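The split described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the flag constants are stand-ins for the `DRM_XE_VM_CREATE_FLAG_*` definitions in the kernel's xe_drm.h, and `flagsForVmCreate()` and `vmFlagsAtCallSite()` are hypothetical simplifications of `getFlagsForVmCreate()` and `createDrmVirtualMemory()`.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in flag bits for illustration only; real code should use the
// DRM_XE_VM_CREATE_FLAG_* definitions from the kernel's xe_drm.h.
constexpr uint32_t kFlagScratchPage = 1u << 0;
constexpr uint32_t kFlagFaultMode = 1u << 2;

// Hypothetical helper: translates the caller's request into flag bits
// and makes no policy decision of its own.
uint32_t flagsForVmCreate(bool disableScratch, bool enableFaultMode) {
    uint32_t flags = 0;
    if (!disableScratch) {
        flags |= kFlagScratchPage;
    }
    if (enableFaultMode) {
        flags |= kFlagFaultMode;
    }
    return flags;
}

// Hypothetical call site: the policy "no scratch pages implies fault mode"
// is stated here, where the VM is created.
uint32_t vmFlagsAtCallSite(bool disableScratch, bool pageFaultRequested) {
    const bool enableFaultMode = pageFaultRequested || disableScratch;
    return flagsForVmCreate(disableScratch, enableFaultMode);
}
```

With this shape, the helper called directly with `disableScratch = true` and `enableFaultMode = false` returns no fault-mode bit, so the dependency is visible only where the caller chooses it.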

Signed-off-by: Jack Myers <jack.myers@intel.com>

Add initial declarations for the drm_xe_vm_get_property ioctl.

v2:
- Expand kernel docs for drm_xe_vm_get_property (Jianxun)

v3:
- Remove address type external definitions (Jianxun)
- Add fault type to xe_drm_fault struct (Jianxun)

v4:
- Remove engine class and instance (Ivan)

v5:
- Add declares for fault type, access type, and fault level (Matt Brost, Ivan)

v6:
- Fix inconsistent use of whitespace in defines

v7:
- Rebase and refactor (jcavitt)

v8:
- Rebase (jcavitt)

uAPI: intel/compute-runtime#878
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com>
Acked-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Ivan Briano <ivan.briano@intel.com>
Cc: Zhang Jianxun <jianxun.zhang@intel.com>
Cc: Ivan Briano <ivan.briano@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>

Add additional information to each VM so they can report up to the first 50 seen faults. Only pagefaults are saved this way currently, though in the future, all faults should be tracked by the VM for future reporting. Additionally, of the pagefaults reported, only failed pagefaults are saved this way, as successful pagefaults should recover silently and not need to be reported to userspace.

v2:
- Free vm after use (Shuicheng)
- Compress pf copy logic (Shuicheng)
- Update fault_unsuccessful before storing (Shuicheng)
- Fix old struct name in comments (Shuicheng)
- Keep first 50 pagefaults instead of last 50 (Jianxun)

v3:
- Avoid unnecessary execution by checking MAX_PFS earlier (jcavitt)
- Fix double-locking error (jcavitt)
- Assert kmemdump is successful (Shuicheng)

v4:
- Rename xe_vm.pfs to xe_vm.faults (jcavitt)
- Store fault data and not pagefault in xe_vm faults list (jcavitt)
- Store address, address type, and address precision per fault (jcavitt)
- Store engine class and instance data per fault (Jianxun)
- Add and fix kernel docs (Michal W)
- Properly handle kzalloc error (Michal W)
- s/MAX_PFS/MAX_FAULTS_SAVED_PER_VM (Michal W)
- Store fault level per fault (Michal M)

v5:
- Store fault and access type instead of address type (Jianxun)

v6:
- Store pagefaults in non-fault-mode VMs as well (Jianxun)

v7:
- Fix kernel docs and comments (Michal W)

v8:
- Fix double-locking issue (Jianxun)

v9:
- Do not report faults from reserved engines (Jianxun)

v10:
- Remove engine class and instance (Ivan)

v11:
- Perform kzalloc outside of lock (Auld)

v12:
- Fix xe_vm_fault_entry kernel docs (Shuicheng)

v13:
- Rebase and refactor (jcavitt)

v14:
- Correctly ignore fault mode in save_pagefault_to_vm (jcavitt)

v15:
- s/save_pagefault_to_vm/xe_pagefault_save_to_vm (Matt Brost)
- Use guard instead of spin_lock/unlock (Matt Brost)
- GT was added to xe_pagefault struct. Use xe_gt_hw_engine instead of creating a new helper function (Matt Brost)

v16:
- Set address precision programmatically (Matt Brost)

uAPI: intel/compute-runtime#878
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Cc: Shuicheng Lin <shuicheng.lin@intel.com>
Cc: Jianxun Zhang <jianxun.zhang@intel.com>
Cc: Michal Wajdeczko <Michal.Wajdeczko@intel.com>
Cc: Michal Mzorek <michal.mzorek@intel.com>
Cc: Ivan Briano <ivan.briano@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
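The "keep the first 50 failed faults" policy from this commit can be illustrated with a small userspace analogue. This is only a sketch under assumptions: `FaultEntry`, `VmFaultLog`, and their members are hypothetical names, the field set is loosely based on the changelog (address, address precision, access type, fault type, fault level), and a `std::mutex` stands in for the kernel's locking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical per-fault record; field names are illustrative only.
struct FaultEntry {
    uint64_t address;
    uint32_t addressPrecision;
    uint8_t accessType;
    uint8_t faultType;
    uint8_t faultLevel;
};

// Hypothetical per-VM fault log that keeps the first N failed faults.
class VmFaultLog {
public:
    static constexpr std::size_t kMaxFaultsSaved = 50;

    // Returns true if the fault was stored. Successful (recovered) faults
    // are never stored, and once the log is full later faults are dropped,
    // so the earliest faults are the ones preserved.
    bool save(const FaultEntry& entry, bool faultSucceeded) {
        if (faultSucceeded) {
            return false;
        }
        std::lock_guard<std::mutex> guard(lock_);
        if (entries_.size() >= kMaxFaultsSaved) {
            return false;
        }
        entries_.push_back(entry);
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(lock_);
        return entries_.size();
    }

    FaultEntry at(std::size_t index) const {
        std::lock_guard<std::mutex> guard(lock_);
        return entries_.at(index);
    }

private:
    mutable std::mutex lock_;
    std::vector<FaultEntry> entries_;
};
```

Saving 60 failed faults into such a log leaves it with 50 entries, and entry 0 is still the first fault that was reported.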

Add support for userspace to request a list of observed faults from a specified VM.

v2:
- Only allow querying of failed pagefaults (Matt Brost)

v3:
- Remove unnecessary size parameter from helper function, as it is a property of the arguments. (jcavitt)
- Remove unnecessary copy_from_user (Jianxun)
- Set address_precision to 1 (Jianxun)
- Report max size instead of dynamic size for memory allocation purposes. Total memory usage is reported separately.

v4:
- Return int from xe_vm_get_property_size (Shuicheng)
- Fix memory leak (Shuicheng)
- Remove unnecessary size variable (jcavitt)

v5:
- Rename ioctl to xe_vm_get_faults_ioctl (jcavitt)
- Update fill_property_pfs to eliminate need for kzalloc (Jianxun)

v6:
- Repair and move fill_faults break condition (Dan Carpenter)
- Free vm after use (jcavitt)
- Combine assertions (jcavitt)
- Expand size check in xe_vm_get_faults_ioctl (jcavitt)
- Remove return mask from fill_faults, as return is already -EFAULT or 0 (jcavitt)

v7:
- Revert back to using xe_vm_get_property_ioctl
- Apply better copy_to_user logic (jcavitt)

v8:
- Fix and clean up error value handling in ioctl (jcavitt)
- Reapply return mask for fill_faults (jcavitt)

v9:
- Future-proof size logic for zero-size properties (jcavitt)
- Add access and fault types (Jianxun)
- Remove address type (Jianxun)

v10:
- Remove unnecessary switch case logic (Raag)
- Compress size get, size validation, and property fill functions into a single helper function (jcavitt)
- Assert valid size (jcavitt)

v11:
- Remove unnecessary else condition
- Correct backwards helper function size logic (jcavitt)

v12:
- Use size_t instead of int (Raag)

v13:
- Remove engine class and instance (Ivan)

v14:
- Map access type, fault type, and fault level to user macros (Matt Brost, Ivan)

v15:
- Remove unnecessary size assertion (jcavitt)

v16:
- Nit fixes (Matt Brost)

v17:
- Rebase and refactor (jcavitt)

v18:
- Do not copy_to_user in critical section (Matt Brost)
- Assert args->size is multiple of sizeof(struct xe_vm_fault) (Matt Brost)

v19:
- Remove unnecessary memset (Matt Brost)

uAPI: intel/compute-runtime#878
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Cc: Jianxun Zhang <jianxun.zhang@intel.com>
Cc: Shuicheng Lin <shuicheng.lin@intel.com>
Cc: Raag Jadav <raag.jadav@intel.com>
Cc: Ivan Briano <ivan.briano@intel.com>
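Several changelog items (report the maximum size, require the size to be a multiple of the entry size, do not copy to the user inside the critical section) suggest a query pattern along the following lines. This is a userspace sketch under assumptions, not the kernel implementation: `FaultRecord`, `Vm`, and `queryFaults()` are hypothetical names, the "size of zero means report the required size" convention is assumed here rather than taken from the uAPI, and `memcpy` stands in for `copy_to_user`.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical fixed-size record returned to the caller.
struct FaultRecord {
    uint64_t address;
    uint32_t precision;
    uint32_t type;
};

constexpr std::size_t kMaxFaults = 50;

// Hypothetical VM state: a lock protecting the saved fault list.
struct Vm {
    std::mutex lock;
    std::vector<FaultRecord> faults;
};

// Returns a negative errno on bad input, otherwise the number of records
// copied. If size is 0, only the maximum buffer size is reported.
int queryFaults(Vm& vm, std::size_t& size, void* buffer) {
    const std::size_t maxSize = kMaxFaults * sizeof(FaultRecord);
    if (size == 0) {
        size = maxSize; // report the maximum, not the current usage
        return 0;
    }
    if (size > maxSize || size % sizeof(FaultRecord) != 0 || buffer == nullptr) {
        return -EINVAL;
    }

    // Snapshot under the lock, then copy out after releasing it.
    std::vector<FaultRecord> snapshot;
    {
        std::lock_guard<std::mutex> guard(vm.lock);
        const std::size_t count =
            std::min(vm.faults.size(), size / sizeof(FaultRecord));
        snapshot.assign(vm.faults.begin(),
                        vm.faults.begin() + static_cast<std::ptrdiff_t>(count));
    }
    if (!snapshot.empty()) {
        std::memcpy(buffer, snapshot.data(),
                    snapshot.size() * sizeof(FaultRecord));
    }
    return static_cast<int>(snapshot.size());
}
```

A caller would first invoke `queryFaults()` with `size = 0` to learn how large a buffer to allocate, then call it again with that buffer to retrieve the records.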
