Conversation

@KiritoHugh

For example, with ReluLLaMA-7B on an NVIDIA GeForce RTX 2080 Ti (11264 MiB): `ffn_up`, `ffn_gate`, and `ffn_down_t` are all `[4096, 11008]`, so a neuron should be `[4096, 1]`, not `[1, 11008]`.
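
As a rough illustration of that orientation (a sketch assuming ggml's layout, where the first shape dimension is contiguous in memory; `neuron_weights` and its parameters are hypothetical names, not PowerInfer's actual API):

```c
#include <stdint.h>

/* Sketch only, assuming ggml-style layout: for a shape [n_embd, n_ff] =
 * [4096, 11008], the first dimension is contiguous, so neuron i's weights
 * are one contiguous run of n_embd values -- a [4096, 1] column vector,
 * not a [1, 11008] row. */
const float *neuron_weights(const float *ffn_up, int64_t n_embd, int64_t i) {
    return ffn_up + i * n_embd; /* start of neuron i's 4096 weights */
}
```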

Running `env CUDA_VISIBLE_DEVICES=0 ./build/bin/main -m ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"` gives:

- before revising:
  `slice_size=22016`
  `vram_bytes_per_slice=99072`
  `vram_allocatable_bytes=4212178944`
  `neuron_cap=170064`

- after revising:
  `slice_size=8192`
  `vram_bytes_per_slice=24576`
  `vram_allocatable_bytes=4212178944`
  `neuron_cap=171394`
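
For context, here is a minimal sketch of how the corrected numbers fit together, assuming `slice_size` counts two `n_embd`-length vectors per neuron, each slice spans the three FFN tensors, and `neuron_cap` is the integer quotient of the VRAM budget; these formulas are reverse-engineered from the log values above, not taken from PowerInfer's code:

```c
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    const int64_t n_embd = 4096;                         /* ReluLLaMA-7B hidden size  */
    const int64_t slice_size = 2 * n_embd;               /* 8192, as logged (assumed) */
    const int64_t vram_bytes_per_slice = 3 * slice_size; /* 24576: up, gate, down_t   */
    const int64_t vram_allocatable_bytes = 4212178944LL; /* from the log              */

    /* 4212178944 / 24576 = 171394, matching neuron_cap after the fix */
    const int64_t neuron_cap = vram_allocatable_bytes / vram_bytes_per_slice;
    printf("neuron_cap=%" PRId64 "\n", neuron_cap);
    return 0;
}
```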
