I haven't extensively tested v3 large or the new turbo, but hallucinations have been a known problem in previous versions of the weights. Not passing the previous text as context helps a lot, so I suspect some creative rules, like 'don't provide previous text unless the last word ended 0.2s or less before the end of the cut', would help further. The other issue I have seen, in all versions, is a tendency to treat pauses in conversation as the end of speech. I built a custom decoder that re-transcribed the blank areas, and it recovered a lot of missing words and improved accuracy considerably (at the cost of speed, of course). My hunch, with no proof, is that training used a lot of utterances with few pauses and almost no blank audio. That being said, I am glad it is out there. No ASR is perfect (the whole problem is formulated wrong, since conversations overlap and so the text isn't linear), and Whisper does a great job and has spurred a lot of great innovation in the space.
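
For anyone who wants to try the two ideas above, here is a rough sketch of how they might look in Python. It is not my actual decoder and it does not call Whisper at all: it only works on word-level timestamps you already have. The `Word` type, the function names, and the 1.0s `min_blank` threshold are all made up for illustration; only the 0.2s figure comes from the rule I described.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical word-level result: text plus start/end times in seconds."""
    text: str
    start: float
    end: float


def should_pass_previous_text(words: List[Word], cut_end: float,
                              max_gap: float = 0.2) -> bool:
    """Pass the previous text as context only if the last word ended
    within max_gap seconds of the end of the cut."""
    if not words:
        return False
    last_end = max(w.end for w in words)
    return (cut_end - last_end) <= max_gap


def find_blank_spans(words: List[Word], audio_start: float, audio_end: float,
                     min_blank: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start, end) spans with no transcribed words that are at least
    min_blank seconds long. These are candidates for a second transcription
    pass. min_blank is an assumed threshold, not a tuned value."""
    spans = []
    cursor = audio_start
    for w in sorted(words, key=lambda w: w.start):
        if w.start - cursor >= min_blank:
            spans.append((cursor, w.start))
        # Advance past this word; max() handles overlapping words.
        cursor = max(cursor, w.end)
    if audio_end - cursor >= min_blank:
        spans.append((cursor, audio_end))
    return spans


if __name__ == "__main__":
    words = [Word("hello", 0.5, 0.9), Word("world", 1.0, 1.4),
             Word("again", 5.0, 5.4)]
    print(should_pass_previous_text(words, cut_end=30.0))
    print(find_blank_spans(words, audio_start=0.0, audio_end=30.0))
```

Each span returned by `find_blank_spans` would then be cut out of the audio and run through the model again on its own, which is where the extra time goes.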