Self-frankenmerge support? #7012
Replies: 3 comments 2 replies
-
I came here to ask the same, after seeing this reddit thread and qrios' comment there. Since this kind of layer duplication seems to help in some cases, it would be a good improvement to have.
-
There was initial work on this started here: #5741
-
I might be wrong about this, but a quick glance at #5741 suggests it is something different: it's about merging (potentially self-merging) models, producing a bigger model (in terms of disk space and VRAM). What I was thinking of is keeping the original model's size on disk and in RAM, but using some sort of metadata to evaluate it with repeated layers. Say we have a model with 5 layers, and we define the mapping 1-4,2-5 (= 8 layers in total). The model gets loaded into RAM and still takes the space of only 5 layers, yet they are evaluated in the order 1-2-3-4-2-3-4-5. The whole point is to save the precious (V)RAM required for inference.
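As a minimal sketch of the idea above (nothing here exists in llama.cpp; the function name and range syntax are hypothetical), a mapping string like "1-4,2-5" could be expanded into the actual layer evaluation order without touching the stored weights:

```python
def expand_layer_map(mapping: str) -> list[int]:
    """Hypothetical helper: expand a range-style mapping like "1-4,2-5"
    into the layer evaluation order [1, 2, 3, 4, 2, 3, 4, 5]."""
    order = []
    for part in mapping.split(","):
        start, end = (int(x) for x in part.split("-"))
        order.extend(range(start, end + 1))
    return order

print(expand_layer_map("1-4,2-5"))  # [1, 2, 3, 4, 2, 3, 4, 5]
```

The weights for layers 1-5 would be loaded once; the expanded list only controls which layer is applied at each step.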
-
I've noticed that some people seem to be getting good results by interleaving models with themselves, effectively duplicating layers. As far as I understand, these are actually the same weights, so there is no new information there, yet such a frankenmerge still takes more (V)RAM than strictly necessary. Would it make sense to implement this inside the ggml lib? I'm thinking of something like a command-line parameter, or perhaps additional metadata in the GGUF file, containing a list like [1,2,3,4,5,3,4,5,6,7,5,6,7,8,9,10]. This example defines a 16-layer model, but in memory it would only take the space of 10 distinct layers (which the inference loop would reference indirectly through the list). Obviously performance would be on par with a 16-layer model, but it would still be usable where only a 10-layer model fits. What do you think?
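To illustrate the memory argument in the proposal above (a toy sketch only; `apply_layer`, `weights`, and `forward` are stand-ins, not real ggml/llama.cpp APIs), a forward pass driven by an explicit layer-order list performs 16 evaluation steps while storing only the 10 distinct weight sets:

```python
# The proposed per-model metadata: 16 evaluation steps over 10 distinct layers.
layer_order = [1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 5, 6, 7, 8, 9, 10]

# One weight set per *distinct* layer index; repeated indices share storage.
weights = {i: f"layer_{i}_weights" for i in sorted(set(layer_order))}

def apply_layer(hidden, w):
    # Stand-in for evaluating one real transformer block.
    return hidden + 1

def forward(hidden, order, weights):
    # Each step looks the weights up by index, so duplicated entries in
    # `order` reuse the same tensors instead of duplicated copies.
    for idx in order:
        hidden = apply_layer(hidden, weights[idx])
    return hidden

print(len(layer_order), "evaluation steps using", len(weights), "distinct layers")
# 16 evaluation steps using 10 distinct layers
```

Compute cost scales with the length of `layer_order`, while (V)RAM scales with the number of distinct indices, which is exactly the trade-off described above.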