VCD Roundtable: Episode 46 – The Importance of Health Checks

February 13, 2025 / 25:37/S2024 E31

Hello and welcome to episode

number 46 of the VCD Roundtable.

We are still in Dubai at the comdivision kickoff and in

today's episode, which is episode number 46,

we are going to cover health checks.

Sascha? Yes, hello and welcome from Dubai.

So we will cover health checks.

We will speak a little bit about what's important on a

health check and why every service provider should get a

health check or should do a health check of their complete

environment every year at least.

All right, let's do health checks.

Starting with Sascha.

All right.

You still have a pulse.

Not that bad.

Hello, welcome from Dubai.

Hi guys.

Today with us is Abdel.

Yeah.

So I'm Abdel.

I am a new employee with comdivision.

Yeah.

comdivision.

I'm a senior consultant, focused on NSX and VCF and

this is my first Roundtable with the

guys here and hopefully we have fun.

Of course we will.

Yeah.

That's the plan.

Okay, Sascha.

So you started the intro into health checks.

So what's important?

Why is it important for CSPs to actually have a health

check, letting us sneak into their infrastructure and just

putting our finger into all these holes and saying,

"oh, there might be an option to do

something better or even different."

So what we saw in our experience with all of the service

providers, where we have written the architecture, where we

supported questions (also in the normal support

if they are not fast enough with the Broadcom support

or the older VMware support),

we figured out that most likely most of the service

providers' problems are the same ones or coming from the

same direction, same with the enterprise stuff.

So there are a lot of points where service providers do not

have the correct architecture, have maybe updated their

environment but not changed the architecture stuff.

Because if I go to a new release, I also need to take care

that all of the required changes

will be done for the new architecture.

And because of that, it's important to

take a look at the infrastructure.

Take a look at what is running as expected.

What maybe needs to be changed.

All right, guys.

And now, sorry, Sascha, pardon the interruption.

And now we just demo that it is live,

because Aya just arrived. Aya, we're already live.

Maybe a few words to introduce yourself.

Sure.

Where's the camera?

The camera is just in front of you.

So, hi, everyone.

Hi, everyone watching.

My name is Aya.

I'm a senior consultant at comdivision.

I have been working with VMware by Broadcom infrastructure

for more than three-four years now.

And now we proved that the Roundtable is actually live!

Again, pardon the interruption, Sascha.

We are discussing health checks today, and Sascha just began

to explain why health checks are important,

pointing out what an architecture should be, could be, and

what is achievable, and sticking with an architecture

and just think before you implement.

Yes.

And so we did a lot of health checks in the past.

We are currently reworking our

health checks, working on the problems.

So taking a look at all of these support tickets from

the last year and really

taking a look at how we can prevent it,

how we can see the problems before the problems really

become a problem for your environment,

for your customers as a service provider.

So I like the approach and the thinking you have shared

with us and also we take a very

standardized approach, but I think it's important to point

out that for us, a health check is

not just to run a few scripts, take the

results from the scripts, add them into a

document in an

automated way, and just come up with

some automated documentation pointing out some things.

So a big part of a health check needs to be

or is a manual approach, just really looking

into the details of that specific infrastructure to find

out what's important and what could

be done differently or maybe better

with that specific infrastructure.

I think when we look at all of this, this is becoming more

important also now that service

providers start to prepare for their transition either into

VCF 5.2 or in the near future into

VCF 9 because from that perspective, customers or service

providers need to prepare and need

to be sure that they understand what needs to change

potentially in their infrastructure

before they can actually move into VCF. So

especially now, a health check is becoming

more important than it was potentially in the past. But as

you said Matthias, I think it's very

important to say that while we can collect a bunch of data

with scripts, it's always important

that a health check also involves like a manual procedure

where we go really over systems because

you can only test to a certain degree with scripts but a

perfect example I always use is

looking into the advanced settings in vSphere is always a

good idea, because no matter what you do

with the scripts, you need to ask for a very specific

setting and sometimes customers might

have a setting from a specific vendor. And I have seen

sometimes settings which have been around for

multiple years, including things which

could actually horribly go wrong, because

if you're sitting in a health check -- just to give a very

quick example -- and you identify that they

changed the isolation addresses for HA and then you ask,

"which are these systems?" and

then the answer is, "we have not used these IP addresses

for the last three years,"

which goes perfectly well as long as all your

infrastructure is working well.

But if you would actually have a major

outage that would be a lot of fun then.
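
To make the isolation address example concrete, here is a minimal sketch of how such a check could be prepared before the manual discussion. It assumes the cluster's HA advanced options have already been exported into a plain dictionary, and that the caller supplies a reachability check (for example a ping). The function name `stale_isolation_addresses`, the sample addresses, and the fake reachability check are hypothetical and used for illustration only. The `das.isolationaddress` prefix is the naming pattern of the vSphere HA advanced options.

```python
from typing import Callable, Dict, List

# vSphere HA isolation addresses are set via das.isolationaddress0, das.isolationaddress1, ...
ISOLATION_PREFIX = "das.isolationaddress"


def stale_isolation_addresses(
    advanced_options: Dict[str, str],
    is_reachable: Callable[[str], bool],
) -> List[str]:
    """Return the configured HA isolation addresses that no longer respond."""
    stale = []
    for key, address in sorted(advanced_options.items()):
        # Only look at the numbered isolation address options, ignore everything else.
        if key.lower().startswith(ISOLATION_PREFIX) and not is_reachable(address):
            stale.append(address)
    return stale


if __name__ == "__main__":
    # Hypothetical export of a cluster's HA advanced options (documentation-range IPs).
    options = {
        "das.usedefaultisolationaddress": "false",
        "das.isolationaddress0": "192.0.2.1",
        "das.isolationaddress1": "192.0.2.2",
    }
    # Stand-in for a real ping: pretend only 192.0.2.1 still exists on the network.
    alive = {"192.0.2.1"}
    print(stale_isolation_addresses(options, lambda ip: ip in alive))
```

A script like this only finds addresses that do not answer. Whether the remaining addresses are still the right ones for the current network design is exactly the kind of question that needs the conversation described here.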

I totally agree with everything you're saying. I think a

health check is very important nowadays.

It's not like something optional. I guess it should be like

a must, every customer

especially service providers should do a health check. It's

not like a monthly basis, because it

could be like you know a bit hard for them, but at least on

a quarterly basis, because customers --

depending you know on their size -- might have lots

of employees that are

using VMware infrastructure. So the thing is that they might

make changes and another team member

cannot know what these changes are.
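
As a rough illustration of how undocumented changes could be surfaced, the following sketch compares a documented baseline of advanced settings with the values currently found on a host. It assumes both sides are available as simple name/value dictionaries. The function name `settings_drift` and all values are made up for the example, and the setting names are only used as sample keys.

```python
from typing import Dict, List, Tuple


def settings_drift(
    documented: Dict[str, str],
    current: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """List (setting, documented_value, current_value) for every difference."""
    drift = []
    # Walk over the union so that settings missing on either side are reported too.
    for name in sorted(set(documented) | set(current)):
        before = documented.get(name, "<not documented>")
        after = current.get(name, "<not set>")
        if before != after:
            drift.append((name, before, after))
    return drift


if __name__ == "__main__":
    # Illustrative values only, not vendor recommendations.
    documented = {"Disk.QFullSampleSize": "32", "NFS.MaxQueueDepth": "64"}
    current = {
        "Disk.QFullSampleSize": "0",
        "NFS.MaxQueueDepth": "64",
        "Disk.QFullThreshold": "8",
    }
    for name, before, after in settings_drift(documented, current):
        print(f"{name}: documented={before} current={after}")
```

Each reported difference is a starting point for a question to the team, not a finding in itself: the change may be perfectly justified and simply never written down.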

So the thing is doing a health

check you know covering each point and not through scripts,

being customized for

each customer. I think this is a must and it would really

find things that you did not

even know existed or are applied in your

infrastructure. So I think it is definitely a must

and super helpful thing that we all need to do and

encourage you know our customers to do.

Adding to that, we have done many implementations where

you're just adding

some advanced settings, which make perfect sense regarding

a specialized storage system. So

every vendor provides, "please apply setting ABC to

better fit the solution provided by us"

and then you might change even just the storage vendor or

come up with a new cluster and the

cluster has a different storage solution and you just apply

the predefined values without thinking,

because that is the standard with your customer but with the

new storage solution it might just cause

issues. I agree because I think that

everyone can run scripts, right? It's not

tied to a certain company or a certain

service provider. Anyone can run scripts, but

if you run the health checks with an architect or a

consultant it actually turns from a manual point running...

I mean if you run through manual points across all

the configuration it actually turns

into a discussion with the service provider and that turns

into a better discussion on how can we

enhance the environment, how can we enhance the

infrastructure, what are the missing parts here,

what are the missing parts there and then we can come up

with a complete or I would say near a

perfect infrastructure for the service provider to provide

their customers. It's even though you

have that setting, as you mentioned, you started discussion

and then the customer ideally comes

up with, "we made a decision based on the business

requirement and solution we're using"

that's the documentation, that's just the justification: why

this setting has been applied. Then you

come up like, "from an architectural perspective it's now

a trade-off, because that is not working

any longer because it's off"; then you can start over like,

"I need to come up with a new decision

and justify why we're changing it in a specified way and

document it again and then we'll have it

saved for the future." Exactly, because that requirement

might have been valid three years ago but it's

not valid as of now. An architecture document

and documentation is a living

creature, which changes over and over and over but if you do

an architecture just once, and then

rely on it and you change something in the back end, like

changing a vendor or whatever, you need to

also reconsider changing the architecture in order to

satisfy. Additionally, with all the changes

with licensing going from the old

license model to VCF, so maybe a lot of

design decisions were taken with the old license model in

the background so maybe we have a lot

more options now how to solve problems with all

licenses more or less included -- especially

with Aria Operations and Aria Logs, so we can give

much more information about how you can solve

your problems much better than you are currently doing it.

Especially from a pricing

perspective, license price perspective, because in the past

you didn't care about certain values

and metrics, but nowadays with the whole VCF licensing game,

a few parameters just changed like

the cores: now we charge per core and not per memory,

so that's a different ball game.
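
To show why core counts now matter more than memory in a design decision, here is a small sketch that counts cores across a set of hosts. The `Host` type, the function `billable_cores`, and the sample hosts are hypothetical. The default floor of 16 cores per socket is an assumption used for illustration and should be verified against the licensing terms that currently apply to you.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Host:
    name: str
    cpu_sockets: int
    cores_per_socket: int
    memory_gb: int  # kept only to show that memory does not drive the count


def billable_cores(hosts: Iterable[Host], min_cores_per_socket: int = 16) -> int:
    """Sum cores over all sockets, applying an assumed per-socket minimum."""
    return sum(
        h.cpu_sockets * max(h.cores_per_socket, min_cores_per_socket)
        for h in hosts
    )


if __name__ == "__main__":
    hosts = [
        Host("esx01", cpu_sockets=2, cores_per_socket=8, memory_gb=1024),
        Host("esx02", cpu_sockets=2, cores_per_socket=32, memory_gb=512),
    ]
    # esx01 counts as 2 x 16 because of the assumed floor, esx02 as 2 x 32.
    print(billable_cores(hosts))  # 96
```

In this made-up example the host with twice the memory contributes fewer cores, which is the shift in perspective described above.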

Yeah also, the point was... sorry, yeah go on.

Today we're a big

group! Everyone wants to share.

Everybody wants to be on the microphone. I think when you

look at it in so many other areas it's

like what Aya said, it's so common that we get things

checked whether it's we get our cars into

maintenance, if we go into a physician to actually get

ourselves checked out. If you go into a hospital

you don't want to go into an MRI that has actually been

there for five years and no one knows

if it runs often. Everybody potentially looks at the MRI

but does everybody look at the IT

infrastructure behind the MRI? But, honestly speaking coming

up with medical devices and how

often they update the infrastructure behind an MRI might

not be the best example. The checkups

of the devices are still not that often. It's not about

the software updates itself, because in many

cases those are totally isolated systems so that is not a

problem. If you look into a lot of

aviation systems you will find Microsoft Windows

versions which are outdated for a very

long time period, but they are still in that specific use

case perfectly fine because it's

running in an isolated system. When you take Boeing airplanes

most of these onboard systems are

still using in some cases floppy disks to

update flight management. Did you again

create the save icon? Something like that. I think when

we look at this applying it to both

service providers and enterprises, this should become far

more a normal procedure on a regular

basis, because you're basically building your complete

company. If your infrastructure implodes

then this is a bad thing and as much as we do this,

sometimes already for security we should do this

for basic infrastructure services as well and it can be a

pretty straightforward process if

everything is fine and if you do this on a regular basis

like what Aya said, it's not a multi-week

project. This can be done in hours or maybe a few days

depending on the size of your infrastructure

and the complexity, amount of products, etc., but this does not

need to be a very harmful process and in

most cases we don't need to even touch anything and change

anything actively in the infrastructure.

Even if we talk about CSPs and

enterprises you mentioned, a health check for a CSP

is different compared to an enterprise because they run the

infrastructure differently, it is a

different use case it needs a different

configuration it needs to have a different

look at how that infrastructure is configured and used.

Maybe it has fewer products as well.

Maybe or differently configured because of that. That is an

important point. It's like if you're a

service provider out there shopping for a health check or a

validation this is what we have seen

in the past where people were just

applying. There was, in the old days, these

vSphere Optimization Assessments, which some people just

used as a health check. Some of the outcomes

were actually contradictory to the Cloud Director settings.

So, if people were to follow that

blindly, because it was just a script, that could have

completely broken Cloud Director, because

some of the systems might request and say, "ah you

have three levels of resource

pools; that is not necessarily the best idea. You have your

NSX instances in a resource pool

that is maybe not the best idea." Whereas reality is, in a

CSP environment with Cloud Director

there are good reasons why we do these things so just the

script is not necessarily a good

idea and it needs to be specialized for the use case.

Exactly so that's why

it's very handy to have a consultant or an architect doing

that whole health check and not

just rely on some scripts. Yeah you need to have a very

long list. So I had this discussion

with a lot of service providers where we discussed why we

are doing health checks and many of our

secondary White Labels (also bigger ones) asked us to do a

health check because they said,

"we have a normal vSphere environment experience, but not

more. We currently have no information

about what's coming with/in VCF, what are the

requirements to move to VCF" -- that's also

a readiness assessment health check -- and all of that

combined means you need really a partner

who has the experience in Cloud Director,

in vSphere, and in this combination, because as

Yves mentioned before, there are

some normal health checks in the market but

that's not directly covering the needs and the requirements

of Cloud Director. I think the key

thing here that we're focusing on doing with our customers

is to let them benefit from the product

that they have and all its aspects. So what we're

trying to do is maybe they only know

vSphere but we're trying to train them on the whole stack --

NSX, the whole VCF stack,

Aria everything, so we're not just offering them here like a

health check solely it's just as well

we're going to train them or at least deliver

our knowledge to them so they know how to

use this full stack. Let me introduce something

additional with... oh but if we

introduce it, if we produce like Operations Manager or VCF

Operations and you use it proactively

you could have seen an issue earlier on and react

proactively instead of having customers complaining

about performance. Exactly, exactly. Or just optimize your

infrastructure overall using

VCF Operations you can actually optimize the infrastructure

so like you said just mitigate

any error that might come up or basically just, I don't know

how to say it, but balance your resources

to a point that you feel like as a service provider that you

are getting your ROI the correct way.

Monitor an SLA for example. Yeah. Because

I've seen service providers reporting back

to their customers that we are offering an SLA and they

even have no idea if they fulfill

the reported or the agreed SLA. A health check is also

introducing additional monitoring options

to validate if the offering that the customer signed is

still true and that they fulfill everything.
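
The arithmetic behind validating an agreed SLA is simple enough to sketch. The example below assumes availability is measured as uptime over a reporting period in minutes. The function names and the 99.9% target are illustrative assumptions, not figures from any specific contract.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Availability over a reporting period, as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def meets_sla(period_minutes: float, downtime_minutes: float, agreed_percent: float) -> bool:
    """True if the measured availability reaches the agreed SLA level."""
    return availability_percent(period_minutes, downtime_minutes) >= agreed_percent


if __name__ == "__main__":
    month = 30 * 24 * 60  # a 30-day reporting period: 43,200 minutes
    # At 99.9%, the allowed downtime in this period is 43.2 minutes.
    print(meets_sla(month, downtime_minutes=30, agreed_percent=99.9))  # True
    print(meets_sla(month, downtime_minutes=50, agreed_percent=99.9))  # False
```

The hard part in practice is not the formula but having trustworthy downtime numbers, which is where the additional monitoring mentioned here comes in.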

So, from a legal perspective, am I able to

actually provide what the customers have signed

for. That is one of the areas; the other area, I think is

also always to look closely into the

fact, how this can help you from an

internal perspective. Not only from a management

perspective, but using a health check to potentially get

some protection and getting an idea if

everything runs as it should be, it sometimes can also be

helpful for the admin themselves to request:

"can we actually get someone in for a health check so that

we can confirm to management that

we are truly operating on best practices, that we

learn with the health check." As Aya discussed

how we should configure things or that we learn how

to utilize Aria Operations, like what

Abdel said, so there are so many different aspects you can

take out of it, and my experience has

always been that sometimes for the admins, they feel a bit

held back in the beginning.

Especially when it is ordered more or less by management,

because people feel like, "oh,

yeah, someone is coming and actually checking if I'm

really doing my job." I would always argue:

don't see these health checks as a job guarantee or

anything else. They are

really there to actually check the infrastructure and fix

issues potentially, because let's face it

the difference is you are running your

infrastructure on a daily basis so you have

a focus, you are completely isolated on that specific

infrastructure. You might see maybe

a customer infrastructure from time to time, but you more or

less have one approach: what you always

have been doing. And the big advantage is, if

people from our team are actually doing the

health check, we see maybe in a week five-ten different

cloud provider infrastructures; we

design every quarter X amount of new Cloud Director

infrastructures. We constantly need

to keep ourselves updated with what is the current best

practice, because this is something we have

seen very very often. It's like settings that were

best practice two or three years ago

are no longer the best practice, because people have figured

out they are not working. A few years

back we were telling people, "no, no, don't put a

hard disk or put a disk device into your

server as a boot device. A USB stick is the perfect solution."

We all learned that that was not the

best idea and so this had to be turned around again.

There are so many of these examples

over the last 10-15 years and this does not only apply to

VMware, this applies to

Microsoft, to nearly every vendor. Best practices change and

also keep in mind not every best

practice applies to every infrastructure. So the only

constant in life we have is change

and sometimes an external view into an

infrastructure might just enlighten people, like "oh I

completely overlooked a certain configuration. Yes, thanks

for mentioning it." A health

check does not mean that we are coming to a CSP and just

start finger pointing: "oh that's wrong,

that's wrong, that's whatever." It's also about like,

"oh, that's a really nice configuration" or

"oh that's an interesting configuration could you please

explain why is it configured that way?"

And then you get the

explanation, like "oh that makes perfect

sense. Thank you!" That's also a validation if the

documentation is properly done, because if someone

can explain this is the justification why we have made that

decision, it shows that you have a

proper well maintained documentation. Good. If you want to

do a health check... if you don't have it included...

I mean, some of our service providers

have it included in their service

pack, so we do it with them at least on a yearly basis, some

have it via the White Label program

as a startup fee, but if you're not in any of

these programs, I mean clearly the best

solution is to become part of our service pack program, get

the support from us and all of these

things, and then you get your yearly health check and many

other things like Challenge Days, Architecture Think Tanks

and stuff like that included potentially. But if you say you

want to just actually do a health check, Sascha?

Just reach out to us, drop us a message, and we will

schedule a meeting and we can

send you a quote for a health check. It depends on

how many data centers, how much

infrastructure, what products are in use, etc.,

but we have a fixed fee engagement,

so that also means we give

you one price fixed fee and then you have

the real outcome for your environment. So, an

infrastructure is like a human being, right. The longer it

lives, the older it gets, the more often

it needs a checkup. So you run to the doctor every week?

I should. All of that being said, I think

that's a good approach and maybe what we do in a few months

time, not too far, is we take,

as we are just revamping the new health checks out there, we

do a bit of a sum up story and actually

consolidate the funniest things we have found in the

last six months in health checks and

make an episode about that. See, Sascha, you

already have a topic set.

So, that being said, any final words

to say, Matthias? Yes. It was great talking

about health checks. Thanks to Aya and Abdel, showing off

in their first VCD Roundtable.

We proved that it's live. Sascha, I really need a

check up and a health check, please

take care of me. Absolutely. We'll take care of you, don't

worry. Alright now, I need to rock. Bye!

Abdel, final words?

I'm really looking forward to our CSPs to come in

and have this discussion -- I'd rather say

and come up with the best solution moving forward for them

and their customers -- of course. Good, so as

we are going to chase up Matthias now, I would say thank you

all for listening into episode number 46 on

health checks. This was live from Dubai. Next week we are...

no, we are going to be in two weeks back in our

home offices I think... Let's take a look...

We will see. Sascha and I have been on

the road for maybe 4 weeks again. Hope to see you

again soon. If you have any topic ideas,

if you have anything like that, feel free to drop them in

the comments section below and hope

to see and hear from you soon, goodbye! Bye!

Creators and Guests

Host

Yves Sandfort

Yves Sandfort - VMware cloud and infrastructure architect and evangelist, CEO comdivision group. VCDX-CMA, VCIX-CMA, VCIX-DCV, vExpert, Nutanix NTC, pilot
