VCD Roundtable: Episode 46 – The Importance of Health Checks
Hello and welcome to episode
number 46 of the VCD Roundtable.
We are still in Dubai at the comdivision kickoff and in
today's episode, which is episode number 46,
we are going to cover health checks.
Sascha? Yes, hello and welcome from Dubai.
So we will cover health checks.
We will speak a little bit about what's important on a
health check and why every service provider should get a
health check or should do a health check of their complete
environment every year at least.
All right, let's do health checks.
Starting with Sascha.
All right.
You still have a pulse.
Not that bad.
Hello, welcome from Dubai.
Hi guys.
Today with us is Abdel.
Yeah.
So I'm Abdel.
I am a new employee with comdivision.
Yeah.
comdivision.
I'm a senior consultant, focused on NSX and VCF and
this is my first Roundtable with the
guys here and hopefully we have fun.
Of course we will.
Yeah.
That's the plan.
That's the plan.
Okay, Sascha.
So you started the intro into health checks.
So what's important?
Why is it important for CSPs to actually have a health
check, letting us sneak into their infrastructure and just
putting our finger into all these holes and saying,
"oh, there might be an option to do
something better or even different."
So what we saw in our experience with all of the service
providers, where we have written the architecture, where we
supported questions (also in the normal support
if they are not fast enough with the Broadcom support
or the older VMware support),
we figured out that most likely most of the service
providers' problems are the same ones or are coming from the
same direction, same with the enterprise stuff.
So there are a lot of points where service providers do not
have the correct architecture, have maybe updated their
environment but not changed the architecture stuff.
Because if I go to a new release, I also need to take care
that all of the required changes
will be done for the new architecture.
And because of that, it's important to
take a look at the infrastructure.
Take a look at what is running as expected.
What maybe needs to be changed.
All right, guys.
And now, sorry, Sascha, pardon the interruption.
And now we just demo that it is live,
because Aya just arrived. Aya, we're already live.
Maybe a few words to introduce yourself.
Sure.
Where's the camera?
The camera is just in front of you.
So, hi, everyone.
Hi, everyone watching.
My name is Aya.
I'm a senior consultant at comdivision.
I have been working with VMware by Broadcom infrastructure
for more than three to four years now.
And now we proved that the Roundtable is actually live!
Again, pardon the interruption, Sascha.
We are discussing health checks today, and Sascha just began
to explain why health checks are important,
pointing out what an architecture should be, could be, and
what is achievable, and sticking with an architecture
and just think before you implement.
Yes.
And so we did a lot of health checks in the past.
We are currently reworking our
health checks, working on the problems.
So taking a look at all of these support tickets from
the last year and really
taking a look at how we can prevent it,
how we can see the problems before the problems really
become a problem for your environment,
for your customers as a service provider.
So I like the approach and the thinking you have shared
with us and also we take a very
standardized approach, but I think it's important to point
out that for us, a health check is
not just running a few scripts, taking the
results from the scripts, adding them to a
document in an automated way, and coming up with
some automated documentation pointing out some things.
So a big part of a health check needs to be
or is a manual approach, just really looking
into the details of that specific infrastructure to find
out what's important and what could
be done differently or maybe better
with that specific infrastructure.
I think when we look at all of this, this is becoming more
important also now that service
providers start to prepare for their transition either into
VCF 5.2 or in the near future into
VCF 9 because from that perspective, customers or service
providers need to prepare and need
to be sure that they understand what needs to change
potentially in their infrastructure
before they can actually move into VCF. So
especially now, a health check is becoming
more important than it was potentially in the past. But as
you said Matthias, I think it's very
important to say that while we can collect a bunch of data
with scripts, it's always important
that a health check also involves a manual procedure
where we really go over the systems, because
you can only test to a certain degree with scripts. A
perfect example I always use:
looking into the advanced settings in vSphere is always a
good idea, because no matter what you do
with the scripts, you need to ask for a very specific
setting and sometimes customers might
have a setting from a specific vendor. And I have seen
sometimes settings which have been around for
multiple years, including things which
could actually horribly go wrong, because
if you're sitting in a health check -- just to give a very
quick example -- and you identify that they
changed the isolation addresses for HA and then you ask,
"which are these systems?" and
then the answer is, "we have not used these IP addresses
for the last three years,"
which goes perfectly well as long as all your
infrastructure is working well.
But if you actually had a major
outage, that would be a lot of fun.
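The isolation-address example can be sketched as a small check. This is only an illustration, assuming the HA advanced options have already been exported into a plain dictionary and that a list of addresses known to be in use is available; the function name and the sample data are hypothetical. It only flags candidates and does not replace asking the customer what these systems are.

```python
from ipaddress import ip_address


def stale_isolation_addresses(advanced_options, known_live_addresses):
    """Return HA isolation address options whose IP is not in the live inventory.

    advanced_options: dict of HA advanced option name -> value (assumed export).
    known_live_addresses: iterable of IP strings known to be in use (assumed).
    """
    live = {ip_address(addr) for addr in known_live_addresses}
    stale = {}
    for key, value in advanced_options.items():
        # vSphere HA uses das.isolationaddress0, das.isolationaddress1, ...
        if key.lower().startswith("das.isolationaddress"):
            if ip_address(value) not in live:
                stale[key] = value
    return stale


# Hypothetical sample data for illustration only.
options = {
    "das.isolationaddress0": "10.0.0.1",
    "das.isolationaddress1": "10.0.9.254",
    "das.usedefaultisolationaddress": "false",
}
inventory = ["10.0.0.1", "10.0.0.2"]

print(stale_isolation_addresses(options, inventory))
```

Anything this returns is a starting point for the conversation in the health check, not a verdict.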
I totally agree with everything you're saying. I think a
health check is very important nowadays.
It's not something optional. I guess it should be
a must: every customer,
especially service providers, should do a health check. Not
on a monthly basis, because that
could be, you know, a bit hard for them, but at least on
a quarterly basis, because customers --
depending you know on their size -- might have lots
of employees that are
using VMware infrastructure. So the thing is that they might
make changes and another team member
might not know what these changes are.
So the thing is doing a health
check, you know, covering each point and not through scripts,
customized for
each customer. I think this is a must and it would really
find things that you did not
even know existed or are applied in your
infrastructure. So I think it is definitely a must
and a super helpful thing that we all need to do and
encourage you know our customers to do.
Adding to that, we have done many
implementations where you're just adding
some advanced settings, which make perfect sense regarding
a specialized storage system. So
every vendor says, "please apply setting ABC to
better fit the solution provided by us"
and then you might change even just the storage vendor or
come up with a new cluster and the
cluster has a different storage solution and you just apply
the predefined values without thinking,
because that is the standard with your customer but with the
new storage solution it might just cause
issues. I agree because I think that
everyone can run scripts, right? It's not
tied to a certain company or a certain
service provider. Anyone can run scripts, but
if you run the health checks with an architect or a
consultant it actually turns from a manual point running...
I mean if you run through manual points across all
the configuration it actually turns
into a discussion with the service provider and that turns
into a better discussion on how can we
enhance the environment, how can we enhance the
infrastructure, what are the missing parts here,
what are the missing parts there and then we can come up
with a complete or, I would say, a
near-perfect infrastructure for the service provider to provide
to their customers. Even though you
have that setting, as you mentioned, you start a discussion
and then the customer ideally comes
up with, "we have made a decision based on the business
requirement and solution we're using."
That's the documentation, that's the justification for why
this setting has been applied. Then you
come up like, "from an architectural perspective it's now
a trade-off, because that is not working
any longer because it's off"; then you can start over like,
"I need to come up with a new decision
and justify why we're changing it in a specific way and
document it again and then we'll have it
saved for the future." Exactly, because that requirement
might have been valid three years ago but it's
not valid as of now. An architecture document
and documentation is a living
creature, which changes over and over and over but if you do
an architecture just once, and then
rely on it and you change something in the back end, like
changing a vendor or whatever, you need to
also reconsider changing the architecture in order to
satisfy. Additionally, with all the changes
with licensing going from the old
license model to VCF, so maybe a lot of
design decisions were taken with the old license model in
the background so maybe we have a lot
more options now how to solve problems with all
licenses more or less included -- especially
with Aria Operations and Aria Logs, so we can give
much more information about how you can solve
your problems much better than you are currently doing it.
Especially from a pricing
perspective, license price perspective, because in the past
you didn't care about certain values
and metrics, but nowadays with the whole VCF licensing game,
a few parameters just changed like
the cores: now we charge per core and not per memory,
so that's a different ball game.
Yeah also, the point was... sorry, yeah go on.
Today we're a big
group! Everyone wants to share.
Everybody wants to be on the microphone. I think when you
look at it in so many other areas it's
like what Aya said, it's so common that we get things
checked, whether we get our cars into
maintenance or go to a physician to actually get
ourselves checked out. If you go into a hospital
you don't want to go into an MRI that has actually been
there for five years and no one knows
if it runs often. Everybody potentially looks at the MRI
but does everybody look at the IT
infrastructure behind the MRI? But, honestly speaking, coming
up with medical devices and how
often they update the infrastructure behind an MRI might
not be the best example. The checkups
of the devices are still not that often. It's not about
the software updates itself, because in many
cases those are totally isolated systems so that is not a
problem. If you look into a lot of
aviation systems you will find Microsoft Windows
versions which are outdated for a very
long time period, but they are still in that specific use
case perfectly fine because it's
running in an isolated system. When you take Boeing airplanes,
most of these onboard systems are
still using in some cases floppy disks to
update flight management. Did you again
create the save icon? Something like that. I think when
we look at this applying it to both
service providers and enterprises, this should become far
more a normal procedure on a regular
basis, because you're basically building your complete
company on it. If your infrastructure implodes
then this is a bad thing and as much as we do this,
sometimes already for security we should do this
for basic infrastructure services as well and it can be a
pretty straightforward process if
everything is fine and if you do this on a regular basis
like what Aya said, it's not a multi-week
project. This can be done in hours or maybe a few days
depending on the size of your infrastructure
and the complexity, number of products, etc., but this does not
need to be a very harmful process and in
most cases we don't need to even touch anything and change
anything actively in the infrastructure.
Even if we talk about CSPs and
enterprises you mentioned, a health check for a CSP
is different compared to an enterprise because they run the
infrastructure differently, it is a
different use case it needs a different
configuration it needs to have a different
look at how that infrastructure is configured and used.
Maybe it has fewer products as well.
Maybe or differently configured because of that. That is an
important point. It's like if you're a
service provider out there shopping for a health check or a
validation this is what we have seen
in the past where people were just
applying. There was, in the old days, these
vSphere Optimization Assessments, which some people just
used as a health check. Some of the outcomes
were actually contradictory to the Cloud Director settings.
So, if people were to follow that
blindly, because it was just a script, that could have
completely broken Cloud Director, because
some of the systems might request and say, "ah you
have three levels of resource
pools; that is not necessarily the best idea. You have your
NSX instances in a resource pool
that is maybe not the best idea." Whereas reality is, in a
CSP environment with Cloud Director
there are good reasons why we do these things so just the
script is not necessarily a good
idea and it needs to be specialized for the use case.
Exactly so that's why
it's very handy to have a consultant or an architect doing
that whole health check and not
just rely on some scripts. Yeah you need to have a very
long list. So I had this discussion
with a lot of service providers where we discussed why we
are doing health checks and many of our
secondary White Labels (also bigger ones) asked us to do a
health check because they said,
"we have normal vSphere environment experience, but not
more. We currently have no information
about what's coming with VCF, what the
requirements are to move to VCF" -- that's also
a readiness assessment health check -- and all of that
combined means you really need a partner
who has the experience in Cloud Director,
in vSphere, and in this combination, because as
Yves mentioned before, there are
some normal health checks in the market but
those do not directly cover the needs and the requirements
of Cloud Director. I think the key
thing here that we're focusing on doing with our customers
is to let them benefit from the product
that they have and all its aspects. So what we're
trying to do is maybe they only know
vSphere but we're trying to train them on the whole stack --
NSX, the whole VCF stack,
Aria, everything. So we're not just offering them a
health check alone; on top of that,
we're going to train them or at least deliver
our knowledge to them so they know how to
use this full stack. Let me introduce something
additional with... oh but if we
introduce it, if we introduce something like Operations Manager or VCF
Operations and you use it proactively,
you could have seen an issue earlier on and reacted
proactively instead of having customers complaining
about performance. Exactly, exactly. Or just optimize your
infrastructure overall using
VCF Operations you can actually optimize the infrastructure
so like you said just mitigate
any error that might come up or basically just, I don't know
how to say it, but balance your resources
to a point that you feel like as a service provider that you
are getting your ROI the correct way.
Monitor an SLA for example. Yeah. Because
I've seen service providers reporting back
to their customers that they are offering an SLA, and they
have no idea whether they even fulfill
the reported or the agreed SLA. A health check is also
introducing additional monitoring options
to validate if the offering that the customer signed is
still true and that they fulfill everything.
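As a rough illustration of validating an agreed SLA, the arithmetic can be sketched like this. It is a minimal sketch that assumes the downtime minutes for a reporting period have already been collected by whatever monitoring is in place; the function names and the numbers are hypothetical.

```python
def availability_percent(period_minutes, downtime_minutes):
    """Availability over a reporting period, as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def sla_met(period_minutes, downtime_minutes, agreed_percent):
    """True if the measured availability reaches the agreed SLA value."""
    return availability_percent(period_minutes, downtime_minutes) >= agreed_percent


# Hypothetical numbers: a 30-day month and a 99.9 % availability SLA,
# which leaves 43.2 minutes of allowed downtime.
month_minutes = 30 * 24 * 60

print(sla_met(month_minutes, 40, 99.9))
print(sla_met(month_minutes, 50, 99.9))
```

With these numbers, 40 minutes of downtime is still within the SLA and 50 minutes is not. What actually counts as downtime is a matter of the signed contract, not of the script.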
So, from a legal perspective, am I able to
actually provide what the customers have signed
for. That is one of the areas; the other area, I think is
also always to look closely into the
fact, how this can help you from an
internal perspective. Not only from a management
perspective, but using a health check to potentially get
some protection and getting an idea if
everything runs as it should be, it sometimes can also be
helpful for the admin themselves to request:
"can we actually get someone in for a health check so that
we can confirm to management that
we are truly operating on best practices, that we
learn with the health check." As Aya discussed
how we should configure things or that we learn how
to utilize Aria Operations, like what
Abdel said, so there are so many different aspects you can
take out of it, and my experience has
always been that sometimes for the admins, they feel a bit
held back in the beginning.
Especially when it is ordered more or less by management,
because people feel like, "oh,
yeah, someone is coming and actually checking if I'm
really doing my job." I would always argue:
don't see these health checks as a job guarantee or
anything else. They are
really there to actually check the infrastructure and fix
issues potentially, because let's face it
the difference is you are running your
infrastructure on a daily basis so you have
a focus, you are completely isolated on that specific
infrastructure. You might see maybe
a customer infrastructure from time to time, but you more or
less have one approach: what you always
have been doing. And the big advantage is, if
people from our team are actually doing the
health check, we see maybe in a week five to ten different
cloud provider infrastructures; we
design every quarter X amount of new Cloud Director
infrastructures. We constantly need
to keep ourselves updated with what is the current best
practice, because this is something we have
seen very very often. It's like settings that were
best practice two or three years ago
are no longer the best practice, because people have figured
out they are not working. A few years
back we were telling people, "no, no, don't put a
hard disk or a disk device into your
server as a boot device. A USB stick is the perfect solution."
We all learned that that was not the
best idea and so this had to be turned around again.
There are so many of these examples
over the last 10-15 years and this does not only apply to
VMware, this applies to
Microsoft, to nearly every vendor. Best practices change and
also keep in mind not every best
practice applies to every infrastructure. So the only
constant in life we have is change
and sometimes an external view into an
infrastructure might just enlighten people, like "oh I
completely overlooked a certain configuration. Yes, thanks
for mentioning it." A health
check does not mean that we are coming to a CSP and just
start finger pointing: "oh that's wrong,
that's wrong, that's whatever." It's also about like,
"oh, that's a really nice configuration" or
"oh, that's an interesting configuration, could you please
explain why it is configured that way?"
And then you get the
explanation, like "oh that makes perfect
sense. Thank you!" That's also a validation if the
documentation is properly done, because if someone
can explain this is the justification why we have made that
decision, it shows that you have a
proper well maintained documentation. Good. If you want to
do a health check... if you don't have it included...
I mean, some of our service providers
have it included in their service
pack, so we do it with them at least on a yearly basis, some
have it via the White Label program
as a startup fee, but if you're not in any of
these programs, I mean clearly the best
solution is to become part of our service pack program, get
the support from us and all of these
things, and then you get your yearly health check and many
other things like Challenge Days, Architecture Think Tanks
and stuff like that included potentially. But if you say you
want to just actually do a health check, Sascha?
Just reach out to us, drop us a message, and we will
schedule a meeting and we can
send you a quote for a health check. It depends on
how many data centers, how much
infrastructure, what products are in use, etc.,
but we have a fixed fee engagement,
so that also means we give
you one price fixed fee and then you have
the real outcome for your environment. So, an
infrastructure is like a human being, right. The longer it
lives, the older it gets, the more often
it needs a checkup. So you run to the doctor every week?
I should. All of that being said, I think
that's a good approach and maybe what we do in a few months
time, not too far, is we take,
as we are just revamping the new health checks out there, we
do a bit of a sum up story and actually
consolidate the funniest things we have found in the
last six months of health checks and
make an episode about that. See, Sascha, you
already have a topic set.
So, that being said, any final words
to say, Matthias? Yes. It was great talking
about health checks. Thanks to Aya and Abdel, showing off
in their first VCD Roundtable.
We proved that it's live. Sascha, I really need a
check up and a health check, please
take care of me. Absolutely. We'll take care of you, don't
worry. Alright now, I need to rock. Bye!
Abdel, final words?
I'm really looking forward to our CSPs to come in
and have this discussion -- I'd rather say
and come up with the best solution moving forward for them
and their customers -- of course. Good, so as
we are going to chase up Matthias now, I would say thank you
all for listening into episode number 46 on
health checks. This was live from Dubai. Next week we are...
no, we are going to be in two weeks back in our
home offices I think... Let's take a look...
We will see. Sascha and I have been on
the road for maybe 4 weeks again. Hope to see you
again soon. If you have any topic ideas,
if you have anything like that, feel free to drop them in
the comments section below and hope
to see and hear from you soon, goodbye! Bye!