
About the idea to further enhance the performance. #78

Open
lucasjinreal opened this issue Apr 19, 2024 · 4 comments

Comments


lucasjinreal commented Apr 19, 2024

Hi, I have run experiments applying the Mini-Gemini architecture to the Qwen series of models, and it performs well.

However, the performance is still not strong enough compared to some SOTA small models such as MiniCPM-V 2 and LLaVA-UHD, which use very large input resolutions together with image-slicing techniques.

So I am wondering how we can push the boundary of Mini-Gemini further and make it great again.

The baseline I currently get with Qwen-7B is roughly the same as Gemma-7B's on MMMU, which is not very satisfying.

Here are some ideas for further improvement that I have in mind:

  1. Increase the input resolution of CLIP-ViT, since the final visual token count is determined by it (see the token-count sketch after this list). I tried enlarging 336 -> 448, which yields 1024 visual tokens, but surprisingly the result got worse;
  2. Replace the MLP projector with a resampler. I tried this, but the loss did not converge (a rough sketch of the kind of resampler I mean is at the end of this comment).
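
For reference, here is the quick token-count arithmetic behind point 1 (this assumes the CLIP ViT-L/14 vision tower, i.e. patch size 14; adjust if a different backbone is used):

```python
# Visual token count as a function of input resolution for a ViT with
# patch size 14 (assumption: CLIP ViT-L/14 as the vision tower).
def visual_tokens(resolution: int, patch: int = 14) -> int:
    side = resolution // patch
    return side * side

print(visual_tokens(336))  # 576 tokens at the default resolution
print(visual_tokens(448))  # 1024 tokens after enlarging the input
```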

So here is what I want to discuss: how exactly should we make further improvements?

I am hoping for your discussion and insights; please point me in the right direction.
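
For clarity, this is a rough sketch of the kind of resampler I mean in point 2: a Perceiver-style module with learnable queries cross-attending to the vision features (the hyperparameters and the output projection are illustrative, not my exact setup):

```python
import torch
import torch.nn as nn

# Minimal Perceiver-style resampler sketch: a fixed number of learnable
# queries cross-attend to the vision-tower features, so the LLM always sees
# num_queries visual tokens regardless of the input resolution.
class Resampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_q = nn.LayerNorm(dim)
        self.ln_kv = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)  # would map to the LLM hidden size in practice

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, dim) features from the vision tower
        b = vis_feats.size(0)
        q = self.ln_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.ln_kv(vis_feats)
        out, _ = self.attn(q, kv, kv)  # (batch, num_queries, dim)
        return self.proj(out)

# Example: 1024 patch tokens in, 256 visual tokens out.
feats = torch.randn(2, 1024, 1024)
print(Resampler()(feats).shape)  # torch.Size([2, 256, 1024])
```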


DePengW commented Apr 19, 2024

I also switched the LLM to Qwen1.5, and the performance improved somewhat.

  1. On the encoder side, I replaced it with DeepSeek-VL's hybrid encoder, which also gave a small gain.
  2. On the data side, the ALLaVA data itself is somewhat dirty; I cleaned a batch of it, manually translated a Chinese version, and added some InternLM-XComposer data, which also improved results a bit.

It feels like our directions are very similar; if you're interested, leave your contact information so we can discuss.


OpenJarvisAI commented Apr 20, 2024

ALLaVA already has a Chinese version. What do you mean by DeepSeek-VL's hybrid? Mini-Gemini is already a hybrid architecture.
The InternLM-XComposer data could be even dirtier, and the ShareGPT4V dataset should already be included.


DePengW commented Apr 20, 2024

There is a Chinese version of ALLaVA, but both the Chinese and English versions are dirty. In the Chinese version there are many cases of image-text mismatch, misaligned translation, and translation hallucination. For example, if you grep "宁静湖畔" (tranquil lakeside) in ALLaVA-CN, the matches are very likely image-text mismatches. So it is necessary to clean both ALLaVA-EN and ALLaVA-CN, and adding the cleaned ALLaVA-CN also improves the metrics.
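
A minimal sketch of the kind of phrase-based filter I mean (the file path, field names, and suspect-phrase list are hypothetical; the real ALLaVA-CN release may use a different schema):

```python
import json

# Phrases that, in my experience, usually signal image-text mismatch.
SUSPECT_PHRASES = ["宁静湖畔"]

def is_suspect(sample: dict) -> bool:
    # Concatenate all conversation turns and look for suspect phrases.
    text = " ".join(turn.get("value", "") for turn in sample.get("conversations", []))
    return any(phrase in text for phrase in SUSPECT_PHRASES)

with open("allava_cn.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

clean = [s for s in samples if not is_suspect(s)]
print(f"kept {len(clean)} / {len(samples)} samples")

with open("allava_cn.clean.json", "w", encoding="utf-8") as f:
    json.dump(clean, f, ensure_ascii=False, indent=2)
```

In practice the flagged samples still need a manual pass, since not every match is a genuine mismatch.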

Mini-Gemini does have a hybrid structure, but in my experiments DeepSeek-VL's hybrid encoder was slightly better.

By InternLM-XComposer data I specifically mean the SFT-phase data, such as A-OKVQA, OKVQA, and LVIS.


OpenJarvisAI commented Apr 21, 2024

How did you clean the ALLaVA data and manually translate the Chinese version? Would you share the data afterwards? That would be very nice. Also, has InternLM-XComposer open-sourced its SFT data?
