
About the idea to further enhance the performance. #78

Open
lucasjinreal opened this issue Apr 19, 2024 · 4 comments

Comments


lucasjinreal commented Apr 19, 2024

Hi, I have run experiments applying the Mini-Gemini architecture to the Qwen series of models, and it performs well.

However, the performance is still not strong enough compared to some SOTA small models such as MiniCPM-V 2 and LLaVA-UHD, which use very large input resolutions together with image-slicing techniques.

So I am wondering how we can push the boundary of Mini-Gemini further and make it great again.

The baseline I currently get with Qwen-7B is roughly the same as Gemma-7B's on MMMU, which is not very satisfying.

Here are some ideas for further improvement that I have in mind:

  1. Increase the input resolution of CLIP-ViT, since the final visual token count is determined by it (see the token-count sketch after this list). I tried enlarging 336 -> 448, which yields 1024 visual tokens, but surprisingly the result got worse;
  2. Replace the MLP projector with a resampler. I tried this, but the loss did not converge (a rough sketch of the kind of resampler I mean is at the end of this comment).
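
For reference, here is the quick token-count arithmetic behind point 1 (this assumes the CLIP ViT-L/14 vision tower, i.e. patch size 14; adjust if a different backbone is used):

```python
# Visual token count as a function of input resolution for a ViT with
# patch size 14 (assumption: CLIP ViT-L/14 as the vision tower).
def visual_tokens(resolution: int, patch: int = 14) -> int:
    side = resolution // patch
    return side * side

print(visual_tokens(336))  # 576 tokens at the default resolution
print(visual_tokens(448))  # 1024 tokens after enlarging the input
```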

So here is what I want to discuss: how exactly should we make further improvements?

I am hoping for your discussion and insights; please point me in the right direction.
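
For clarity, this is a rough sketch of the kind of resampler I mean in point 2: a Perceiver-style module with learnable queries cross-attending to the vision features (the hyperparameters and the output projection are illustrative, not my exact setup):

```python
import torch
import torch.nn as nn

# Minimal Perceiver-style resampler sketch: a fixed number of learnable
# queries cross-attend to the vision-tower features, so the LLM always sees
# num_queries visual tokens regardless of the input resolution.
class Resampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_q = nn.LayerNorm(dim)
        self.ln_kv = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)  # would map to the LLM hidden size in practice

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, dim) features from the vision tower
        b = vis_feats.size(0)
        q = self.ln_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.ln_kv(vis_feats)
        out, _ = self.attn(q, kv, kv)  # (batch, num_queries, dim)
        return self.proj(out)

# Example: 1024 patch tokens in, 256 visual tokens out.
feats = torch.randn(2, 1024, 1024)
print(Resampler()(feats).shape)  # torch.Size([2, 256, 1024])
```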


DePengW commented Apr 19, 2024

I also switched the LLM to Qwen1.5, and the performance improved somewhat.

  1. On the encoder side, I replaced it with DeepSeek-VL's hybrid encoder, which also gave a small gain.
  2. On the data side, the ALLaVA data itself is somewhat dirty; I cleaned a batch of it, manually translated a Chinese version, and added some InternLM-XComposer data, which also improved results a bit.

It feels like our directions are very similar; if you're interested, leave your contact information so we can discuss.


OpenJarvisAI commented Apr 20, 2024

ALLaVA already has a Chinese version. What do you mean by DeepSeek-VL's hybrid? Mini-Gemini is already a hybrid architecture.
The InternLM-XComposer data could be even dirtier, and the ShareGPT4V dataset should already be included.


DePengW commented Apr 20, 2024

There is a Chinese version of ALLaVA, but both the Chinese and English versions are dirty. In the Chinese version there are many cases of image-text mismatch, misaligned translation, and translation hallucination. For example, if you grep "宁静湖畔" (tranquil lakeside) in ALLaVA-CN, the matches are very likely image-text mismatches. So it is necessary to clean both ALLaVA-EN and ALLaVA-CN, and adding the cleaned ALLaVA-CN also improves the metrics.
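
A minimal sketch of the kind of phrase-based filter I mean (the file path, field names, and suspect-phrase list are hypothetical; the real ALLaVA-CN release may use a different schema):

```python
import json

# Phrases that, in my experience, usually signal image-text mismatch.
SUSPECT_PHRASES = ["宁静湖畔"]

def is_suspect(sample: dict) -> bool:
    # Concatenate all conversation turns and look for suspect phrases.
    text = " ".join(turn.get("value", "") for turn in sample.get("conversations", []))
    return any(phrase in text for phrase in SUSPECT_PHRASES)

with open("allava_cn.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

clean = [s for s in samples if not is_suspect(s)]
print(f"kept {len(clean)} / {len(samples)} samples")

with open("allava_cn.clean.json", "w", encoding="utf-8") as f:
    json.dump(clean, f, ensure_ascii=False, indent=2)
```

In practice the flagged samples still need a manual pass, since not every match is a genuine mismatch.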

Mini-Gemini does have a hybrid structure, but in my experiments DeepSeek-VL's hybrid encoder was slightly better.

By InternLM-XComposer data I specifically mean the SFT-phase data, such as A-OKVQA, OKVQA, and LVIS.


OpenJarvisAI commented Apr 21, 2024

How did you clean the ALLaVA data and manually translate the Chinese version? Would you share the data afterwards? That would be very nice. Also, has InternLM-XComposer open-sourced its SFT data?
