Multimodal + reasoning era: text-image-audio-video inputs and stronger reasoning models.
Model architecture families
| Family | Architecture | Typical use |
| --- | --- | --- |
| BERT-like | Encoder-only | Classification, NER, retrieval |
| GPT-like | Decoder-only | Generation, chat, coding |
| T5/BART-like | Encoder-decoder | Translation, summarization |
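One way to see the difference between these families is through their self-attention masks. The following is a minimal sketch, not any library's implementation: the function names `attention_mask` and `cross_attention_mask` are hypothetical, and a mask entry of 1 is assumed to mean "position i may attend to position j". Encoder-only models see the whole input in both directions, decoder-only models see only the current and earlier positions, and encoder-decoder models combine a bidirectional encoder with a causal decoder that also attends to every encoder position.

```python
def attention_mask(kind, n):
    """Return an n x n mask; mask[i][j] == 1 means position i may attend to j.

    kind is "encoder" (bidirectional) or "decoder" (causal). These labels are
    illustrative, not taken from any particular library.
    """
    if kind == "encoder":
        # Every token sees every other token (BERT-like).
        return [[1] * n for _ in range(n)]
    if kind == "decoder":
        # Token i sees only tokens 0..i (GPT-like).
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown kind: {kind!r}")


def cross_attention_mask(target_len, source_len):
    """Decoder-to-encoder mask in an encoder-decoder model (T5/BART-like).

    Each target position may attend to every source position.
    """
    return [[1] * source_len for _ in range(target_len)]


if __name__ == "__main__":
    for kind in ("encoder", "decoder"):
        print(kind)
        for row in attention_mask(kind, 4):
            print(" ", row)
```

The full view of the input suits encoder-only models to classification and retrieval, while the causal mask lets decoder-only models generate text one token at a time.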
Why model outputs differ across vendors
Even when models share similar transformer foundations, their output quality differs because of differences in training-data quality, alignment and other post-training methods, the inference stack, and tool integration.
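As a small illustration of the inference-stack factor alone, the sketch below applies two different sampling temperatures to the same logits. The logit values and the function name `softmax_with_temperature` are made up for the example. Temperature is only one of many serving-side settings, so this shows the general effect and does not describe any specific vendor's configuration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits, identical for both settings.
logits = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(logits, temperature=0.5)  # more deterministic
flat = softmax_with_temperature(logits, temperature=1.5)   # more varied

if __name__ == "__main__":
    print("T=0.5:", [round(p, 3) for p in sharp])
    print("T=1.5:", [round(p, 3) for p in flat])
```

With the lower temperature, more probability mass goes to the top-scoring token. Two deployments of the same underlying weights can therefore produce noticeably different text.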