When you feed MFCCs into a neural net to get phonemes, does the network output the phonemes themselves or the transitions between them?

I have looked at a lot of material on speech recognition, and many details like this are left out. Is there an article somewhere that clarifies them?

