Well, there's no such thing as a proper way (every situation is different) but I suspect there are several things going on here.
First off, mixing on headphones--even the best ones--isn't a great idea and often results in a mix that sounds fine on the headphones but sounds very different on speakers. There are lots of papers out there explaining why this is (it's to do with the proximity of the headphone drivers to your ear and the lack of air between the two) but the end result is that you need to A) listen to your mix on a variety of playback systems and B) probably give more weight to your studio monitors (depending what they are) than what you hear in the 'phones.
Second, there are a few tricks you can use:
One that I play with a lot is to add some subtle EQ to the music track in the frequency range that contains most of the voice. Maybe try a 2-3dB cut between 200Hz and 2kHz--but do it by ear, not by numbers. This cuts a subtle space in the music for the voice without making the music sound particularly quieter.
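If it helps to see roughly what a cut like that looks like on paper, here is a minimal sketch. It assumes a single peaking filter built from the widely used "Audio EQ Cookbook" biquad formulas, centred on the geometric mean of 200Hz and 2kHz, with a 48kHz sample rate and a deliberately wide Q of 0.35. Those numbers and the function names are my own illustrative choices, not anything from a particular plugin, and the real settings still need to be found by ear.

```python
import cmath
import math


def peaking_eq(f0, gain_db, q, fs=48000.0):
    """Return normalised biquad coefficients (b, a) for a peaking EQ."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    # Normalise so that a[0] == 1.
    return [x / a[0] for x in b], [x / a[0] for x in a]


def gain_db_at(freq, b, a, fs=48000.0):
    """Magnitude response of the biquad at one frequency, in dB."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq / fs)  # this is z^-1
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20.0 * math.log10(abs(num / den))


# Assumed settings: a 3dB cut centred between 200Hz and 2kHz, wide Q.
CENTRE_HZ = math.sqrt(200.0 * 2000.0)
b, a = peaking_eq(CENTRE_HZ, gain_db=-3.0, q=0.35)

for f in (50, 200, 632, 2000, 8000):
    print(f"{f:>5} Hz: {gain_db_at(f, b, a):+.2f} dB")
```

The printout shows the cut being deepest around the centre of the voice range and fading towards nothing at the extremes, which is why the music doesn't sound much quieter overall.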
Also, in a situation like this, compression is your friend. Apply a bit more comp to the voices than you might normally want and it'll help keep them standing out compared to the music. Again, exactly what settings to use will need to be decided by ear. (Though I have to say that, listening to your mix, you've already got what might be sufficient compression.)
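For anyone wondering why compression helps the voice stay on top, here is a rough sketch of the static curve of a compressor, ignoring attack and release entirely. The threshold of -18dB and ratio of 3:1 are made-up illustrative values and `compress_db` is just a name I've picked, so treat it as an assumption-laden toy rather than how any particular compressor works.

```python
def compress_db(level_db, threshold_db=-18.0, ratio=3.0, makeup_db=0.0):
    """Static compressor curve: map an input level (dB) to an output level (dB)."""
    if level_db > threshold_db:
        # Above the threshold, every `ratio` dB in gives only 1 dB more out.
        level_db = threshold_db + (level_db - threshold_db) / ratio
    return level_db + makeup_db


# Hypothetical quiet and loud moments in a voice track.
quiet, loud = -30.0, -6.0
print("range before:", loud - quiet, "dB")
print("range after: ", compress_db(loud) - compress_db(quiet), "dB")
```

The gap between the quiet and loud moments shrinks, so once you add a little make-up gain the quiet words sit closer to the loud ones and are less likely to disappear under the music.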
Last, just play with very minor tweaks in the mix as it goes through. A small boost to the music (or the voices) as appropriate can make a big difference. I'll often use volume envelopes to make word-by-word (or even syllable-by-syllable) tweaks.
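A volume envelope is really just a list of time/gain breakpoints that the editor interpolates between. Here is a minimal sketch of that idea, assuming straight-line interpolation in dB and breakpoint times that strictly increase. The function names and the example breakpoints are hypothetical, and your editor may well interpolate differently.

```python
def envelope_db(t, points):
    """Interpolate a gain (dB) at time t from sorted (time_s, gain_db) breakpoints."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, g0), (t1, g1) in zip(points, points[1:]):
        if t <= t1:
            return g0 + (g1 - g0) * (t - t0) / (t1 - t0)
    return points[-1][1]


def apply_envelope(samples, points, fs):
    """Scale each sample by the envelope gain at that sample's time."""
    out = []
    for n, x in enumerate(samples):
        gain = 10.0 ** (envelope_db(n / fs, points) / 20.0)
        out.append(x * gain)
    return out


# Hypothetical automation: lift one weak word by 2dB between 1.0s and 1.4s,
# with short ramps either side so the change isn't heard as a jump.
word_lift = [(0.9, 0.0), (1.0, 2.0), (1.4, 2.0), (1.5, 0.0)]
print(envelope_db(1.2, word_lift))
```

The short ramps on either side are the important bit: a couple of dB brought in over a tenth of a second is usually inaudible as a move, whereas an instant step often isn't.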
Hope this helps.
Okay, with my acoustician's hat on:
Headphone mixing - especially when it comes to voices in the centre of a stereo field - has always been a disaster area, and you can almost always spot when a mix has been made to play on headphones, because the vocal is invariably too quiet.
I suppose that the easiest way to explain this is to consider what happens to the central voice with loudspeakers. That voice is essentially a virtual image; there isn't a loudspeaker there to support it. So it's been created out of the off-axis responses of your loudspeakers, and is in a space that's quite a distance from your ears, compared to where it would be on headphones. So, the level of this voice depends as much on the angle of your speakers as anything, and whilst this varies considerably between different setups, it's still considerably different to the relationship that headphones have with your ears!
Traditionally, it's been recommended that you sit in an equilateral triangle with your speakers pointing towards you, which means an included angle of 120 degrees between them. Despite this information being repeated all over the place since the 1950s, there's no physical basis for it at all, and most people don't have setups like that anyway. And I have to say that this really isn't good positioning for establishing a central image - that angle is too great, and you are relying on an extremely good off-axis response to achieve any level at all there.
In this day and age, what you really need is a monitoring compromise that will let you create a mix that sounds not so bad on both headphones and loudspeakers, and there are a couple of things you can do to improve the situation considerably, and get a generally better result. And FWIW, it's what I do in this situation...
The first is to alter the angle of your monitors so that the included angle is 90 degrees (a right angle) and sit so that both of them are pointing directly at your ears. This gets you a lot closer to the monitors, admittedly, but is far more realistic as far as a compromise mix is concerned. If you do your whole mix like this, you'll find that it's a lot easier to position things in it too. And don't put anything like soft furnishings between them either - that will definitely make things worse.
The second thing, which you should always do with a mix like this to finally establish vocal levels, is to listen to it really quietly. No, really quietly! Almost at vanishing point. What you should hear is the whole mix, but if anything is standing out (like the vocal), it will become obvious like this in a way that it simply won't when it's louder. You want it to be there, certainly - but it shouldn't be either missing or standing out too much.
Best metaphor I can think of is a portrait photograph. Make sure the music doesn't steal the focus from the voice. Volume, intensity, spectral footprint -- the voice needs to "punch" through the music at all times.
In your specific case: Work with the sibilant range (3-5 kHz) and either cut the music or boost the VO until you hear every S, K and T sound. Apply dynamic compression until you don't need to strain your hearing in the quiet parts. Don't be afraid to duck the music slightly if need be -- although roughly the same can be achieved with a multiband compressor on the master bus.
(In addition to what's already been said, of course)
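For the ducking idea mentioned above, here is a rough sketch of what a sidechain ducker does under the hood: follow the level of the voice, and pull the music down a few dB whenever the voice is present. The threshold, the 4dB duck depth and the attack/release times are all assumed values for illustration, and `duck_gains` is a hypothetical helper, not a function from any real DAW or library.

```python
import math


def duck_gains(voice, fs, threshold=0.05, duck_db=-4.0,
               attack_s=0.05, release_s=0.4):
    """Return one gain value per sample for the music, driven by the voice level."""
    duck_lin = 10.0 ** (duck_db / 20.0)
    attack = math.exp(-1.0 / (attack_s * fs))
    release = math.exp(-1.0 / (release_s * fs))
    env_decay = math.exp(-1.0 / (0.05 * fs))  # assumed 50ms envelope decay
    env = 0.0
    gain = 1.0
    gains = []
    for x in voice:
        # Simple peak follower so zero crossings don't count as silence.
        env = max(abs(x), env * env_decay)
        target = duck_lin if env > threshold else 1.0
        # Move quickly towards the ducked level, slowly back to unity.
        coeff = attack if target < gain else release
        gain = coeff * gain + (1.0 - coeff) * target
        gains.append(gain)
    return gains


# Usage: multiply each music sample by the matching gain.
# music_out = [m * g for m, g in zip(music, duck_gains(voice, fs=48000))]
```

The slow release is what keeps it from sounding like pumping: the music creeps back up between phrases rather than snapping back after every word.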