
QuantFactory/st-vicuna-v1.3-5.5b-ppl-GGUF

This is a quantized version of nota-ai/st-vicuna-v1.3-5.5b-ppl, created using llama.cpp.
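As a quick start, here is a minimal, hedged sketch of downloading one quantization from this repository and running it with llama-cpp-python. The exact .gguf filename is an assumption; check the repository's file list for the quantization you want.

```python
# Minimal sketch: fetch one GGUF file from this repo and run it with
# llama-cpp-python. The filename below is a hypothetical example; check the
# repository's file list for the exact name.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="QuantFactory/st-vicuna-v1.3-5.5b-ppl-GGUF",
    filename="st-vicuna-v1.3-5.5b-ppl.Q4_K_M.gguf",  # hypothetical filename
)

llm = Llama(model_path=model_path, n_ctx=2048)
out = llm("USER: What is depth pruning?\nASSISTANT:", max_tokens=128)
print(out["choices"][0]["text"])
```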

Model Description

Shortened LLaMA Model Card

Shortened LLaMA is a depth-pruned version of LLaMA models and their variants, intended for efficient text generation.

Compression Method

After identifying unimportant Transformer blocks, we perform one-shot pruning and light LoRA-based retraining.
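To make the two-step recipe concrete, here is a hedged sketch of one-shot depth pruning followed by LoRA-based retraining. The pruned block indices and LoRA hyperparameters are illustrative assumptions, not the authors' actual values.

```python
# Hedged sketch of one-shot depth pruning followed by light LoRA retraining.
# Pruned indices and LoRA hyperparameters are illustrative, not the paper's.
import torch
from torch import nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16
)

# 1) One-shot pruning: drop the Transformer blocks judged unimportant.
prune_indices = {21, 22, 23, 24, 25, 26}  # hypothetical block indices
model.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in prune_indices
)
model.config.num_hidden_layers = len(model.model.layers)

# 2) Light retraining: attach LoRA adapters and fine-tune only those.
lora = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```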

Method figure: see the original nota-ai model repository.

Model Links

| Source Model | Pruning Ratio | Pruning Criterion | HF Models Link |
| --- | --- | --- | --- |
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |
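The PPL criterion in the table can be sketched as follows: ablate one Transformer block at a time and measure the perplexity of the reduced model on calibration text; blocks whose removal barely moves perplexity are the pruning candidates. This is a hedged, illustrative implementation, not the authors' code.

```python
# Hedged sketch of a PPL-style block-importance score: remove one block at a
# time and record calibration perplexity without it. Blocks whose removal
# changes perplexity the least are pruned first.
import math
import torch
from torch import nn

@torch.no_grad()
def perplexity(model, input_ids):
    loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())

@torch.no_grad()
def ppl_block_scores(model, input_ids):
    layers = model.model.layers  # keep the full list to restore later
    scores = {}
    for i in range(len(layers)):
        model.model.layers = nn.ModuleList(
            layer for j, layer in enumerate(layers) if j != i
        )
        scores[i] = perplexity(model, input_ids)  # PPL with block i removed
    model.model.layers = layers  # restore the original model
    return scores  # lower score => block i is less important
```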

Zero-shot Performance & Efficiency Results

  • Evaluated with EleutherAI/lm-evaluation-harness, version 3326c54.
  • Results figure: see the original nota-ai model repository.
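For reproduction, a hedged sketch of an equivalent harness call is below. The evaluator interface shown matches older (v0.3-era) releases such as the pinned commit and differs from the current CLI; the task list is an illustrative subset, not the paper's full suite.

```python
# Hedged sketch of a zero-shot run with lm-evaluation-harness (older API).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=nota-ai/st-vicuna-v1.3-5.5b-ppl",
    tasks=["arc_easy", "hellaswag", "piqa"],  # illustrative subset
    num_fewshot=0,  # zero-shot
)
print(results["results"])
```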

License

  • All rights related to this repository and the compressed models are reserved by Nota Inc.
  • The intended use is strictly limited to research and non-commercial projects.

Model Acknowledgments

Original Model Citation

```bibtex
@article{kim2024shortened,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}

@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}
```
GGUF Details

  • Model size: 5.52B params
  • Architecture: llama
  • Available quantization levels: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit
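For context, this is roughly how GGUF quantizations like these are produced with llama.cpp. The script and binary names follow recent llama.cpp releases and may differ in the exact checkout QuantFactory used; the file paths are illustrative.

```python
# Hedged sketch of the usual llama.cpp quantization workflow, driven from
# Python via subprocess. Names follow recent llama.cpp releases and may
# differ across versions; paths are illustrative.
import subprocess

# 1) Convert a local checkout of the HF model to a float16 GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "path/to/st-vicuna-v1.3-5.5b-ppl",
     "--outfile", "st-vicuna-5.5b-f16.gguf"],
    check=True,
)

# 2) Quantize to each target bit-width, e.g. 4-bit Q4_K_M.
subprocess.run(
    ["./llama-quantize", "st-vicuna-5.5b-f16.gguf",
     "st-vicuna-5.5b-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```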

