
Conversation

@yhnsu (Collaborator) commented Jan 16, 2026

Description

Add a reward manager for RL training.

Copilot AI left a comment

Pull request overview

This PR introduces a reward manager system for reinforcement learning training, enabling modular reward computation through configurable reward functors. The reward manager follows the same design pattern as the existing observation manager and event manager.

Changes:

  • Added RewardManager class to orchestrate reward computation with support for multiple weighted reward terms
  • Implemented 11 reusable reward functions covering distance-based rewards, penalties, and success bonuses
  • Integrated reward manager into EmbodiedEnv and BaseEnv for automatic reward computation
  • Refactored PushCubeEnv to use the reward manager instead of manual reward calculation
  • Added randomize_target_pose function for virtual goal poses without physical scene objects
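A minimal sketch of what one configured term might look like, pieced together from the RewardCfg fields mentioned in the review comments below (func, mode, weight, params); the constructor shape and anything beyond those field names are assumptions:

from embodichain.lab.gym.envs.managers import RewardCfg

# Hypothetical example: one weighted reward term; "func" names a functor in rewards.py.
reaching = RewardCfg(
    func="reaching_behind_object_reward",
    mode="add",
    weight=1.0,  # top-level weight applied when the manager sums terms
    params={
        "object_cfg": {"uid": "cube"},
        "distance_scale": 5.0,
    },
)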

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

File Description
embodichain/lab/gym/envs/managers/reward_manager.py New reward manager class for orchestrating reward computation
embodichain/lab/gym/envs/managers/rewards.py New module with 11 reward functor implementations
embodichain/lab/gym/envs/managers/cfg.py Added RewardCfg configuration class
embodichain/lab/gym/envs/managers/__init__.py Exported RewardCfg and RewardManager
embodichain/lab/gym/envs/embodied_env.py Integrated reward manager initialization and reset
embodichain/lab/gym/envs/base_env.py Added _extend_reward hook in get_reward method
embodichain/lab/gym/utils/gym_utils.py Added reward parsing logic in load_gym_cfg
embodichain/lab/gym/envs/tasks/rl/push_cube.py Refactored to use reward manager, removed manual reward code
embodichain/lab/gym/envs/managers/randomization/spatial.py Added randomize_target_pose function
embodichain/lab/gym/envs/managers/observations.py Added get_robot_ee_pose and target_position observation functions
configs/agents/rl/push_cube/gym_config.json Updated to use reward manager configuration
configs/agents/rl/push_cube/train_config.json Changed eval_freq from 2 to 200


Comment on lines +68 to +73
"end_effector_cfg": {"uid": "Manipulator", "body_ids": "ee_link"},
"object_cfg": {"uid": "cube"},
"goal_cfg": {"uid": "goal_sphere"},
"behind_offset": 0.015,
"height_offset": 0.015,
"distance_scale": 5.0

Copilot AI Jan 16, 2026


The reward function references the 'goal_sphere' object, which has been removed from the 'background' array (line 171). The reaching_behind_object_reward function will fail at runtime when it tries to access this non-existent object. Either restore the goal_sphere object in the scene or update the reward function to use the virtual goal_pose instead.

Suggested change
  "end_effector_cfg": {"uid": "Manipulator", "body_ids": "ee_link"},
  "object_cfg": {"uid": "cube"},
- "goal_cfg": {"uid": "goal_sphere"},
+ "target_pose_key": "goal_pose",
  "behind_offset": 0.015,
  "height_offset": 0.015,
  "distance_scale": 5.0

Args:
env: The environment instance.
obs: The observation dictionary.
robot_uid: The uid of the robot. If None, uses env.robot.

Copilot AI Jan 16, 2026


The docstring mentions a 'robot_uid' parameter but this parameter doesn't exist in the function signature. The function always uses env.robot. Remove this incorrect documentation line.

Suggested change
- robot_uid: The uid of the robot. If None, uses env.robot.

table.align["Name"] = "l"
for index, name in enumerate(self._mode_functor_names[mode]):
functor_cfg = self._mode_functor_cfgs[mode][index]
weight = functor_cfg.params.get("weight", 1.0)

Copilot AI Jan 16, 2026


The weight is being retrieved from params instead of from the RewardCfg.weight attribute. This means the weight configured at the top level of RewardCfg (line 168) is ignored. It should use functor_cfg.weight to correctly display and apply the configured weight.

Suggested change
- weight = functor_cfg.params.get("weight", 1.0)
+ weight = getattr(functor_cfg, "weight", 1.0)

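To make the fix concrete, a hedged sketch of how the manager might both display and apply the top-level weight; the loop variables and env attributes are illustrative, not from the PR:

# Illustrative only: accumulate weighted terms using RewardCfg.weight.
total = torch.zeros(env.num_envs, device=env.device)
for functor, functor_cfg in zip(functors, functor_cfgs):
    term = functor(env, **functor_cfg.params)            # shape: (num_envs,)
    total += getattr(functor_cfg, "weight", 1.0) * term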
) -> torch.Tensor:
"""Reward based on distance between two entities."""
# get source entity position
source_obj = env.sim[source_entity_cfg.uid]

Copilot AI Jan 16, 2026


The code uses direct dictionary access env.sim[uid] which will raise a KeyError if the entity doesn't exist. For consistency with the observation functions (see get_rigid_object_pose in observations.py), consider using env.sim.get_rigid_object() which provides safer access with None checking.

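A small sketch of the safer pattern this comment suggests, assuming env.sim.get_rigid_object returns None for unknown uids as in observations.py:

# Illustrative only: fail with a clear message instead of a bare KeyError.
source_obj = env.sim.get_rigid_object(source_entity_cfg.uid)
if source_obj is None:
    raise ValueError(f"Reward term references unknown entity '{source_entity_cfg.uid}'.")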
info["rewards"]["action_penalty"] = action_penalty
info["rewards"]["success_bonus"] = success_bonus
return reward
# Fallback: no virtual goal set

Copilot AI Jan 16, 2026


The fallback to a zero goal position could mask configuration errors. If goal_poses is not set, it likely means the randomize_target_pose event is missing from the configuration. Consider logging a warning when falling back to zeros to help debug such misconfigurations.

Suggested change
  # Fallback: no virtual goal set
+ logger.warning(
+     "PushCubeEnv: _goal_poses is not set; falling back to zero goal positions. "
+     "This may indicate that the randomize_target_pose event is missing from the configuration."
+ )

current_dist = torch.norm(source_pos - target_pos, dim=-1)

# initialize previous distance on first call
prev_dist_key = f"_prev_dist_{source_entity_cfg.uid}_{target_pose_key}"

Copilot AI Jan 16, 2026


The state key uses string concatenation which could create long attribute names. While functional, consider using a dictionary-based state management approach (e.g., env._reward_states[key]) for better organization when multiple incremental rewards are used.

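A possible shape for that dictionary-based approach; the _reward_states attribute is illustrative, not part of the PR:

# Illustrative only: one dict keyed by (uid, pose key) instead of dynamic attributes.
if not hasattr(env, "_reward_states"):
    env._reward_states = {}
key = (source_entity_cfg.uid, target_pose_key)
prev_dist = env._reward_states.get(key, current_dist)
env._reward_states[key] = current_dist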
raise_if_not_found=True,
)

from embodichain.lab.gym.envs.managers import RewardCfg

Copilot AI Jan 16, 2026


The import statement is placed inside the loop iteration. Move this import to the top of the file with other imports to follow Python best practices and avoid repeated import overhead during loop iterations.

Comment on lines +355 to +357
# Add individual reward terms to info for logging
for term_name, term_value in reward_info.items():
info[f"reward/{term_name}"] = term_value

Copilot AI Jan 16, 2026


Individual reward terms are added directly to the info dict with a 'reward/' prefix. Consider organizing them under info['rewards'] = reward_info for consistency with how the removed manual implementation organized reward components (see line 170 in the diff), making it easier for trainers to identify and log reward components.

Suggested change
- # Add individual reward terms to info for logging
- for term_name, term_value in reward_info.items():
-     info[f"reward/{term_name}"] = term_value
+ # Organize individual reward terms under info["rewards"] for logging
+ existing_rewards = info.get("rewards")
+ if isinstance(existing_rewards, dict):
+     existing_rewards.update(reward_info)
+ else:
+     info["rewards"] = reward_info

return torch.stack(masks, dim=-1)


def get_robot_ee_pose(
Contributor


Use get_robot_eef_pose

target_value: float = 0.0,
scale: float = 1.0,
) -> torch.Tensor:
"""Reward based on observation values."""
Contributor


Add docs for this function
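A possible docstring sketch, assuming (as the existing one-liner and the target_value/scale parameters suggest) that the term scores how close an observation value sits to a target; the exact formula is not visible in this diff:

"""Reward based on observation values.

Compares a configured observation against target_value and returns a
scaled per-env reward (exact formula per the implementation).

Args:
    target_value: Desired value of the observation.
    scale: Multiplier applied to the resulting reward.

Returns:
    Tensor of shape (num_envs,) with per-env rewards.
"""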

from embodichain.lab.gym.envs import EmbodiedEnv


def reward_from_obs(
Contributor


I can't understand the meaning of this reward function.

exponential: bool = False,
sigma: float = 1.0,
) -> torch.Tensor:
"""Reward based on distance between two entities."""
Contributor


Add full docs for this function
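A possible docstring sketch, using only what the signature shows (exponential, sigma) and hedging on the formula, which is not visible in this diff:

"""Reward based on distance between two entities.

Dense shaping term: smaller distance yields larger reward. With
exponential=True, the distance is mapped through an exponential kernel
controlled by sigma (e.g. exp(-d / sigma) is a common choice); otherwise
the raw (negative) distance is typically used.

Args:
    exponential: If True, map the distance through an exponential kernel.
    sigma: Length scale of the exponential kernel.

Returns:
    Tensor of shape (num_envs,) with per-env rewards.
"""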

robot_uid: str = "robot",
joint_ids: slice | list[int] = slice(None),
) -> torch.Tensor:
"""Penalize large joint velocities."""
Contributor


Add full docs for this function
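A possible docstring sketch; the squared-norm formulation is a common convention, not confirmed by this diff:

"""Penalize large joint velocities.

Typically the negative (squared) norm of the selected joint velocities,
encouraging smooth motion.

Args:
    robot_uid: The uid of the robot whose joints are penalized.
    joint_ids: Slice or list of indices selecting which joints to include.

Returns:
    Tensor of shape (num_envs,) with per-env penalty values.
"""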

"reaching_reward": {
"func": "reaching_behind_object_reward",
"mode": "add",
"name": "reaching",
Contributor


I think the 'name' arg is duplicated, since we already have the reward functor key ('reaching_reward').
