MySQL DBA (Database Administrator) Opening at AOL

August 31st, 2009

Full details: http://bit.ly/19Ea7I

Contact: Carl.Coppadge at corp.aol.com or AIM: carlcoppadge11

Location: Dulles, VA

MySQL Database Administrator

AOL’s People Networks Operations, which operates AIM and ICQ, has an opening for a Sr. MySQL Database Administrator. This position would be responsible for managing large, highly scaled production MySQL databases as well as pre-production (QA/Dev/Staging) environments.

Key Responsibilities:

  • All aspects of MySQL deployment, operation, and design to ensure high reliability and performance
  • Collaborate with developers and architects to create high-performance, cost-effective designs
  • Participate in an on-call rotation as part of a 24×7 operations team to resolve urgent production issues
  • Ensure data integrity through good process and a keen eye for detecting errors and misuses of data

Desired Skills for this Position:

  • Knowledge of database architecture concepts as well as MySQL-specific implementations
  • Significant experience with MySQL in a high-volume production environment
  • Experience with open source ETL packages and methods for large scale data migration
  • Diverse technical background with awareness of concepts in networking, Linux, and storage
  • Bachelor-level degree in Engineering, Computing, or Sciences, or equivalent experience
  • Replication: managing replication delay, scaling replication throughput, and designing resilient systems
  • Configuration: tuning InnoDB, managing memory usage, and tuning file systems for maximum throughput
  • Reliability: backups under load, failover strategies, and recovery from replication issues
  • MySQL: familiarity with Percona and Google patches

Our perfect candidate can manage many rapidly changing projects while maintaining professionalism and poise. Our environment is both highly exciting and highly demanding. Those who thrive here adopt a “work smarter, not harder” attitude.

How NOT to Inform Your Customers of an Outage

December 8th, 2008

There are a number of different ways to inform your customers of an outage. I’ve previously discussed how 365main and Amazon Web Services did this fairly well in the past. Unfortunately, Limelight Networks customers are hearing about issues with their CDN via GigaOM.

Read the rest of this entry »

Complexity and the 4 a.m. test

September 14th, 2008

 

With most technology, it’s a given that there’s almost always More Than One Way To Do It (unless you worship Python). There are always those situations where choices must be made, and different people use different yardsticks to decide. Some try to minimize “cost,” either up-front development cost or long-term engineering cost. The smarter ones have recognized the concept of “Technology Debt” as addressed by several observers. As a leader in Operations, however, I tend to subscribe to my own rule: the 4 a.m. rule.

Read the rest of this entry »

The Art of the Post-Mortem

July 26th, 2008

I’ve mentioned in the past that the failure of complex systems is an inevitable fact of nature. The corresponding act of human inquisition into the reasons for that failure is equally inevitable. Where I work, and at almost every other large installation I’ve seen or been part of, the lessons from these inquisitions are shared for educational reasons. The name for this differs from company to company: some call it an RFO (reason for outage) or an After-Action Report, but for whatever reason the name for this at AOL is a Post-Mortem.

Read the rest of this entry »

Velocity and Structure08

June 21st, 2008

A whole lot of conferences are happening this week, and I’ll be attending two of them. On Monday and Tuesday I’ll be at O’Reilly’s Velocity conference, where I’ll be moderating a panel entitled “Everything You Ever Wanted to Know about CDNs (but were afraid to ask).” I’m hoping that will be fun, and there ought to be a lot of other interesting people I’d like to see while there as well, including two other very smart folks from AOL (Mandi Walls and Eric Goldsmith). I’ve been thinking about this as “Web 2.0 Expo without all that boring UI and Business Stuff”.


Velocity, the Web Performance and Operations Conference 2008

The second event I’ll be at will be GigaOM’s Structure 08. Cloud computing is really leveling the playing field, giving small start-ups access to world-class operational assets… which to me only underscores the importance of having brilliant Ops folks to run those systems. I’m eager to see what sort of discussions emerge.

If you happen to be at either, give me a buzz in the comments, and I’ll try to catch up with you.

Really Big Data Centers for Lease

October 21st, 2007

This past Friday, DuPont Fabros Technology (DFT) raised $640 million in an IPO. DFT is a Real Estate Investment Trust (REIT) which specializes in large-scale commercial data centers. More to the point, they specialize in the sort of facilities which are desired by the largest technology companies. I’ve mentioned before that building and operating facilities is often desirable for larger players, but when it isn’t, they increasingly turn to DFT.

Read the rest of this entry »

Be nice

October 16th, 2007

It’s been well-reported that AOL made cuts today. While I wasn’t among those affected, naturally with any event this large, quite a few people I knew and worked with were among those impacted.

Read the rest of this entry »

Things Fall Apart, Datacenter Edition

August 2nd, 2007

The relentless pursuit by Operations staff of 100% uptime has always struck me as more than just a job: it is a battle against the relentless forces of nature. Everything ultimately breaks down (systems, buildings, even people), and attempting to maintain 100% availability is the Ops equivalent of trying to cheat death. Sooner or later, despite our best efforts, our number will be up. Most recently in the news, self-proclaimed World’s Finest Data Center operator 365 Main suffered an approximately 45-minute power outage at their San Francisco facility. Much to their credit, and unlike most of their competitors, 365 Main has been extremely open about their investigation. I’ll examine this a bit today, as it’s a rare public glimpse into what goes on inside a large data center facility.

Read the rest of this entry »

PRESENTATION: Geographic Distribution for Global Web Application Performance

April 17th, 2007

As promised, the presentation from Geographic Distribution for Global Web Application Performance. This was presented today at Web 2.0 Expo.

Geographic Distribution for Global Web Application Performance

April 16th, 2007

I’m pleased to announce that on Tuesday, April 17th, I’ll be presenting a brief discussion of Geographic Distribution at Web 2.0 Expo in San Francisco. As the web matures, performance has become a tremendous issue, especially when deploying an application for a global audience. One important way to improve performance is the geographic distribution of application delivery. Join me at 8:30am tomorrow in 2018, or check out my slides, which will be posted shortly after the discussion.