Resume

SRI KRISHNA G

Employment History

Autodesk - Principal Engineer from November 2020 to July 2021.

Autodesk has software products and services for the architecture, engineering, construction, media, education, and entertainment industries. The company has moved to subscription-based licensing for its products, which changed the way the internal IT team manages them and provides availability to customers. As a lead engineer in the Observability team, I was responsible for:

  • Designing and maintaining the alert, metric, and log pipelines.

  • Providing the application and operations teams with insights and visibility through Grafana dashboards built on sources such as CloudWatch and Dynatrace.

  • Liaising with counterparts to create self-serve tools for the Observability pipelines, making application onboarding user-friendly.

Visa - Lead Software Engineer from August 2019 to November 2020.

Visa is a payments company that strives to be the best way to pay and be paid. To do this, Visa needs a platform to validate transactions and to build products using analytics from the data that is generated. The Data Platform team at Visa builds the tooling around the platform and supports its systems. My responsibilities were:

  • Bringing a DevOps culture into a finance-focused organization to reduce deployment and test times for developers.

  • Streamlining the lower (non-production) environments to give developers a consistent testing space. This gave teams a simpler view of deployments so that they could address issues effectively.

Zeotap - DevOps Engineer III from November 2018 to August 2019.

Zeotap provides data to the digital marketing ecosystem for better mobile targeting and insights. The company's data has proven six times more accurate than current market benchmarks, solving a fundamental quality problem in the industry. As a lead DevOps Engineer, I was responsible for:

  • Identifying and reducing operationally heavy tasks by automating infrastructure provisioning with Terraform, application setup with Ansible, and event-driven tasks with StackStorm. This reduced error-prone manual work and produced a platform that can be reproduced consistently any number of times.

  • Setting up an observability stack covering infrastructure-level metrics and monitoring, the status of deployments from build systems like CircleCI, the success rate of deployments to the production environment, and application exception tracking, ensuring better utilization and reduced costs.

  • Defining and creating CI/CD pipelines for both infrastructure and application workloads. This gave developers a faster way to ship code through the various testing stages and eventually to production.

Ola Cabs - Lead DevOps Engineer from January 2016 to November 2018.

Ola is the most popular cab booking service in India. It is present in over 100 cities and handles more than a million bookings per day. My responsibilities were:

  • Serving as a key member of the team that successfully migrated Ola Money services from AWS to Azure. I identified the bottlenecks that arise when moving between cloud providers and provided acceptable solutions.

  • Provisioning cloud-agnostic infrastructure with Terraform, and building a self-service portal where users can get resources without worrying about cloud-specific overhead.

  • Setting up and maintaining DNS infrastructure that handles close to a million record requests without depending on cloud provider services.

FireEye - Senior Engineer, Virtualization from August 2015 to January 2016.

FireEye is an enterprise cybersecurity company that provides products and services to protect against advanced cyber threats and is well known for protecting against zero-day vulnerabilities. My responsibilities were:

  • Planning and implementing an OpenStack cloud as an alternative to the more expensive VMware environment.

  • Setting up, upgrading, and provisioning vCloud environments at various geographical locations.

Cisco - Senior Software Engineer from September 2010 to August 2015.

The business unit dealt with satellite and IP broadcast systems and developed both the server-side and client-side services required for its proprietary CA (Conditional Access / DRM) systems. My scope of work was:

  • Ensuring technical excellence in all things Linux, VMware, and storage for the development environment to meet the functional and business requirements.

  • Setting up and maintaining development environments and CI pipelines that were branched into customer products.

  • Hardware capacity planning, provisioning, and maintenance.

Osprey Soft - Systems Administrator from December 2007 to April 2010.

Osprey Soft was a startup working on mobile game development and 3D animation for television and advertising.

I provided day-to-day support for small office networks, and conceptualized and implemented automated OS installations and upgrades using a PXE setup.

Technical Skills

  • Programming Languages: Python, Shell scripting (bash)

  • Administration: Linux-based systems, Docker, Mesos Marathon clusters, VMware (vCenter, VCS), OpenStack private cloud, and Kubernetes.

  • Cloud computing:
    • Present proficiency - Amazon Web Services, Azure

    • Past proficiency - VMware, OpenStack

  • Infrastructure as code and configuration management
    • Present proficiency - Ansible, Chef, Terraform, Serverless

    • Past proficiency - Puppet

  • Monitoring: Prometheus, Sysdig, Telegraf, Sentry for exception handling

  • Tools: Kafka, Elasticsearch, Graylog, cAdvisor

  • Web framework: Django

Projects

  • Serverless pipeline HA stack on AWS - Autodesk

Cloud-based tools offered by providers are region-specific and offer limited HA across regions; this setup makes the pipeline available across regions. I used an API Gateway to handle incoming requests, which are pushed to a Kinesis stream and then consumed by Step Functions and Lambda functions as the project requires. Calculated health checks configured on Route 53 switch the active load balancer to a secondary region to maintain availability. The setup has an availability of about 99.9%, which is in line with the project's SLA, while keeping MTTR as low as possible.

  • Event-driven infrastructure - Zeotap/Visa

Event-driven systems run an action in response to an event in the workflow. In a physical datacenter, StackStorm, an event-driven system, provides a way to define these workflows. It offers a mechanism to integrate otherwise disjoint systems such as provisioning, monitoring, and build systems. The setup was used to provision Kafka and Hadoop clusters. Based on the scale requirements (as feedback from the monitoring system), a trigger can be set to provision new nodes with the required configuration so that the cluster scales on demand without any manual intervention. The triggers and workflow chains are saved in the StackStorm library, where they can be reused to create new workflows. With this system in place, a user can request any cluster service, which is then deployed and managed according to company standards and security requirements.

  • Mesos/Marathon/Chronos cluster - Ola Cabs

I was involved in the setup, automation, and design of the Mesos/Marathon/Chronos cluster for production-grade container orchestration. This serves as the backbone powering most microservices at Ola. I migrated 500+ applications to a container-based environment and provided dashboards that give insights into various systems. The cluster was initially set up on AWS EC2 machines and was later split to span Azure as well.

  • On-demand agent nodes - Ola Cabs

The development/testing team needs more capacity than normal to run load tests. As these are short-term requirements, AWS Spot Fleet groups were chosen. A new fleet group is provisioned with Terraform and added to the existing clusters. The Dev/QA team uses the new group tag for deployments and can terminate the group when they choose; otherwise, the group is deleted according to the pre-set expiry schedule. This reduced the cost of the additional infrastructure by 50%.

  • Chef server upgrades - Ola Cabs

The Chef cookbooks used for provisioning various servers within the organization were initially used with AWS OpsWorks, a chef-zero based offering. As adoption of multi-cloud environments began, we had to update the cookbooks to work with both the Chef server and the existing OpsWorks flow. The cookbooks also had to be updated for the Chef DSL resource changes made between Chef 11 and Chef 13.

  • Hulk - server consolidation - Cisco

A centralized resource cluster with tiered storage backends (HP FC SAN, NetApp NAS, and Gluster-based NAS), with VMware vCenter providing the compute resources. The storage design is based on varied workloads such as virtualization, media streaming, and backups. This setup provided a way to look at utilization and provide availability at every resource level. The clusters across regions were later connected to form an organization-wide setup in which pre-built machines can be quickly replicated across regions.

  • P2V and V2V migrations - Cisco

Individual systems running on standalone, out-of-support hardware were migrated to a centralized infrastructure. As the systems had been set up in a semi-manual way, their configuration had to be kept intact, along with the databases running on a few of them. More than 200 machines were moved to VMs via Physical-to-Virtual (P2V) and Virtual-to-Virtual (V2V) migrations.

  • CI flow for server and client releases - Cisco

The client software versions were dependent on the server component versions. The CI system runs test cases against a given set of client and server versions. To facilitate CI and regression testing, the 20+ customer-specific server components for about 10 different clients needed to run in parallel. To support this, an OpenStack cluster was set up; each customer's infrastructure configuration is deployed with Heat templates, and application provisioning is done with Puppet manifests. Once this is deployed, the clients are set up with the compatible versions and tested.

  • iThrottle - Cisco

iThrottle is a web-based app that runs on the HTTP servers that host HLS/IP-based streaming media. A user can set per-client traffic rules such as packet drop, throttle, reorder, or a combination of these. This helped in testing mobile clients such as the iPad.

  • Videoscape Express - Cisco

Videoscape Express is a managed, virtual machine-based setup targeted at cable operators with small and medium customer bases. It provides the server-side conditional access system for the clients. The setup was hosted centrally, and each operator was onboarded as a tenant on the system. VMware vCenter provided the computing power, with HP SAN handling the storage requirements.

Education And Certifications

  • Bachelor of Technology in Mechanical Engineering from GEC - 2005

  • VMware ICM (Install, Configure, and Manage) vCenter.

  • Red Hat Certified Engineer.

  • Puppet Fundamentals organized by Puppet Labs.