
Internships Project

Published Sep 15, 2024
Updated Jan 29, 2026
6 minute read
📈 From ranking #7 in commits on last year's repository to building a 5.2M+ interaction processing engine that revolutionized how students find internships.

The Beginning

In May 2024, I had already established myself as a significant contributor to the internship tracking ecosystem. Having made the #7 highest number of commits to the previous year's Simplify repository, I had developed a deep understanding of the challenges and opportunities in the space.

While I was actively contributing to Simplify, Ouckah had started his own internships repository with a vision for something different. When I saw the potential for this newer platform to become the definitive internships resource for students, I decided to reach out:

"Afternoon boss, Happy to help as a contributor and handle the approval/duplication labelings for the internships that come in, had #7 highest internship commits with last year's simplify one 🙂"

The response from Ouckah was immediate and enthusiastic:

"holy crap 😭 i would love to have you as a contributor ill send an invite now 🫡"

What started as an offer to help with basic approval and duplication labeling would soon evolve into something much bigger.

The Handover

As my contributions grew and the original maintainer recognized my commitment and technical capabilities, the inevitable happened. Around the 1,000 star milestone, the repository was officially handed over to me.

I inherited more than just code. I inherited a community of thousands of students relying on this platform for their career opportunities. The responsibility was both exciting and daunting. The existing system, while functional, was relatively simple. I could see the potential to transform it into something truly revolutionary.

The Challenge

The manual process of handling internship submissions was becoming unsustainable. Every day brought dozens of new submissions, edits, and closures through GitHub issues, each of which had to be reviewed, validated, and merged by hand.

With thousands of students depending on timely updates and the repository growing exponentially, I knew automation wasn't just an optimization. It was a necessity.

⚡ The Challenge: Transform manual processes into intelligent automation while maintaining data quality and community trust.

Building the Engine

The most significant technical challenge was building a complete URL shortening and analytics infrastructure to handle millions of job interactions efficiently.

URL Service

I built a dedicated microservice architecture using TypeScript and Express to handle URL generation and redirects:

import * as crypto from "crypto";
import { Controller, Post, BodyParam } from "routing-controllers";
import { Container } from "typedi";
import { XataService } from "./xata.service"; // adjust to your module layout

@Controller()
export class CreateController {
    public xata = Container.get(XataService);

    @Post('/create')
    async createUrl(@BodyParam("url") url: string) {
        // 10 random bytes -> 20-char hex id; collision odds are negligible at this scale
        const id = crypto.randomBytes(10).toString("hex");
        const newUrl = await this.xata.createUrl(id, url);
        return { data: newUrl, message: 'redirect created successfully' };
    }
}

The service generates cryptographically secure short URLs using Node.js crypto module, storing them in a Xata database for lightning-fast lookups.
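For intuition, the same id scheme can be sketched in Python (a hypothetical standalone helper, not part of the service): ten random bytes, hex-encoded, yield a 20-character id whose collision probability is negligible at millions of links.

```python
import secrets

def make_short_id(num_bytes: int = 10) -> str:
    """Generate a URL-safe short id: num_bytes random bytes, hex-encoded."""
    return secrets.token_hex(num_bytes)

sid = make_short_id()  # e.g. "3f9a1c0de4b57a88c210" (20 hex chars)
```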

Redirects

The redirect engine handles millions of clicks with minimal latency:

import { Controller, Get, Param, Res } from "routing-controllers";
import { Response } from "express";
import { Container } from "typedi";
import { XataService } from "./xata.service"; // adjust to your module layout

@Controller()
export class RedirectController {
    public xata = Container.get(XataService);

    @Get('/to/:urlId')
    async redirectTo(@Param("urlId") urlId: string, @Res() response: Response) {
        const urlTarget = await this.xata.fetchUrlTarget(urlId);
        response.redirect(urlTarget);
        return response;
    }
}

Each redirect is tracked, analyzed, and logged, providing comprehensive analytics on job application patterns.

Form Processing

The GitHub repository automation handles form submissions through intelligent parsing:

import re
from datetime import datetime

# Form field mapping - the foundation of automation
LINES = {
    "url": 1,
    "company_name": 3,
    "title": 5,
    "locations": 7,
    "season": 9,
    "sponsorship": 11,
    "active": 13,
    "email": 15,
    "email_is_edit": 17
}

# CLOSE_LINES is the analogous field mapping for the close-listing form
# (definition omitted here for brevity)

def getData(body, is_edit, is_close, username):
    lines = [text.strip("# ") for text in re.split('[\n\r]+', body)]
    data = {"date_updated": int(datetime.now().timestamp())}

    # Handle different form types with specialized logic
    if is_close:
        # Process internship closure requests
        if "no response" not in lines[CLOSE_LINES["company_name"]].lower():
            data["company_name"] = lines[CLOSE_LINES["company_name"]].strip()
        if "no response" not in lines[CLOSE_LINES["role_title"]].lower():
            data["role_title"] = lines[CLOSE_LINES["role_title"]].strip()
    # ... (new-listing and edit branches omitted)

This system could intelligently parse form submissions, handling edge cases like "no response" fields, multiple location formats, and various sponsorship options.
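The parsing idea is easy to see in miniature. Below is a standalone sketch with a hypothetical sample issue body (field names follow the `LINES` mapping above): each "### Field" header line is followed by its value, so values live at fixed odd indices after splitting on newlines.

```python
import re

# Hypothetical GitHub issue-form body: header line, then value line
body = (
    "### URL\nhttps://example.com/job/123\n"
    "### Company Name\nAcme Corp\n"
    "### Title\nSoftware Engineering Intern\n"
)

# Values sit at fixed odd indices once the body is split into lines
LINES = {"url": 1, "company_name": 3, "title": 5}

lines = [text.strip("# ") for text in re.split('[\n\r]+', body)]
data = {key: lines[idx].strip() for key, idx in LINES.items()}
# data == {"url": "https://example.com/job/123",
#          "company_name": "Acme Corp",
#          "title": "Software Engineering Intern"}
```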

Database

The Xata service layer provides robust data management with built-in analytics:

import { Service } from "typedi";
import { NotFoundError } from "routing-controllers";
import { getXataClient } from "./xata"; // generated Xata client

@Service()
export class XataService {
    public async fetchUrlTarget(urlId: string): Promise<string> {
        const response = await getXataClient().db.links
            .filter({ url_id: urlId }).getFirst();
        if (response?.url_target === undefined)
            throw new NotFoundError(`url does not exist.`);

        // Parse once to validate the target and log where traffic is heading
        const url = new URL(response.url_target);
        console.log('Host:', url.host);
        console.log('Pathname:', url.pathname);
        console.log('Search Params:', url.searchParams);
        return url.toString();
    }

    public async createUrl(urlId: string, urlTarget: string): Promise<any> {
        return getXataClient().db.links.create({
            url_id: urlId,
            url_target: urlTarget
        });
    }
}

Integration

The Python automation scripts integrate the URL shortening service with GitHub repository management:

def add_https_to_url(url):
    if not url.startswith(("http://", "https://")):
        url = f"https://{url}"
    return url
 
# UTM parameter cleaning to prevent duplicates
utm = data["url"].find("?utm_source")
if utm == -1:
    utm = data["url"].find("&utm_source")
if utm != -1:
    data["url"] = data["url"][:utm]
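Put together, the normalization steps above can be wrapped in a single helper (a standalone sketch mirroring the two snippets, not the exact production function):

```python
def clean_url(url: str) -> str:
    """Normalize a submitted link: ensure a scheme, then drop any
    utm_source tail so the canonical URL can serve as a dedup key."""
    if not url.startswith(("http://", "https://")):
        url = f"https://{url}"
    utm = url.find("?utm_source")
    if utm == -1:
        utm = url.find("&utm_source")
    if utm != -1:
        url = url[:utm]
    return url

clean_url("example.com/job?utm_source=x")
# -> "https://example.com/job"
clean_url("https://a.com/j?id=1&utm_source=gh")
# -> "https://a.com/j?id=1"
```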

This two-tier architecture separates concerns: the TypeScript service handles high-performance redirects while Python scripts manage repository automation and data processing.

Architecture Insight: The microservice approach allows independent scaling of URL redirection (high-frequency, low-latency) and repository management (lower-frequency, data-intensive) operations.

Data Architecture

Duplicate Detection

Building a robust duplicate detection system was critical. The algorithm needed to identify potential duplicates while allowing for legitimate multiple positions at the same company:

if listing_to_update := next(
    (item for item in listings if item["url"] == data["url"]), None
):
    if new_internship:
        util.fail("This internship is already in our list. See CONTRIBUTING.md for how to edit a listing")
    for key, value in data.items():
        listing_to_update[key] = value

The system uses URL-based matching as the primary key, ensuring that the same posting can't be submitted multiple times while allowing companies to have multiple different positions.
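The upsert behavior can be isolated into a small standalone sketch (hypothetical helper; the real code calls `util.fail`, replaced here by a plain exception for illustration):

```python
def upsert(listings, data, new_internship):
    """URL is the primary key: a new submission with a known URL is
    rejected, while an edit updates the existing listing in place."""
    match = next((item for item in listings if item["url"] == data["url"]), None)
    if match:
        if new_internship:
            raise ValueError("This internship is already in our list")
        match.update(data)
    else:
        listings.append(data)
    return listings
```

Same company, different URL? Both listings are kept, so multiple positions at one company coexist naturally.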

Listing Management

The closing mechanism became particularly sophisticated, handling various edge cases:

# Find matching listings for closure
candidates = []
for item in listings:
    if (item["company_name"].lower() == company_name.lower() and 
        item["title"].lower() == role_title.lower()):
        candidates.append(item)
 
# If URL provided, filter by URL for precision
if job_url and candidates:
    url_matches = [item for item in candidates if item["url"] == job_url]
    if url_matches:
        candidates = url_matches
 
if not candidates:
    util.fail(f"No internship found matching company '{company_name}' and role '{role_title}'")
elif len(candidates) > 1:
    util.fail(f"Multiple internships found. Please provide the job URL to specify which one to close.")

This intelligent matching system prevents accidental closures while making the process seamless for legitimate requests.
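The matching logic condenses into one function (a standalone sketch with a hypothetical name; errors are raised instead of calling `util.fail`):

```python
def find_closure_target(listings, company, role, job_url=None):
    """Return the single listing matching company+role, using the URL
    to disambiguate when several positions share the same title."""
    candidates = [l for l in listings
                  if l["company_name"].lower() == company.lower()
                  and l["title"].lower() == role.lower()]
    if job_url:
        url_matches = [l for l in candidates if l["url"] == job_url]
        if url_matches:
            candidates = url_matches
    if len(candidates) != 1:
        raise LookupError(f"{len(candidates)} matches; need exactly one")
    return candidates[0]
```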

README Generation

Table Creation

One of the most complex challenges was automatically generating the README tables that thousands of students view daily. The generator walks the sorted listings, groups consecutive postings from the same company, and emits a clean markdown table:

from datetime import datetime

def create_md_table(listings):
    table = ""
    table += "| Company | Role | Location | Application/Link | Date Posted |\n"
    table += "| ------- | ---- | -------- | ---------------- | ----------- |\n"

    curr_company_key = None
    curr_date = None
    for listing in listings:
        position = listing["title"]
        location = getLocations(listing)
        link = getLink(listing)
        # Display format is illustrative
        date_posted = datetime.fromtimestamp(listing["date_posted"]).strftime("%b %d")

        # Smart company grouping with ↳ symbol: consecutive postings from
        # the same company on the same day collapse under an arrow
        company_key = listing["company_name"].lower()
        if curr_company_key == company_key and curr_date == date_posted:
            company = "↳"
        else:
            company = listing["company_name"]
            curr_company_key = company_key
            curr_date = date_posted

        table += f"| {company} | {position} | {location} | {link} | {date_posted} |\n"
    return table
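The grouping rule itself fits in a few lines. Here is a standalone sketch (hypothetical helper; field names follow the snippet above) that shows only the ↳ substitution:

```python
def group_companies(listings):
    """Replace the company name with ↳ when a listing repeats the
    previous listing's company and posting date."""
    rows = []
    prev = None
    for listing in listings:
        key = (listing["company_name"].lower(), listing["date_posted"])
        company = "↳" if key == prev else listing["company_name"]
        prev = key
        rows.append((company, listing["title"]))
    return rows
```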

Locations

Handling location data required special formatting logic to maintain readability:

def getLocations(listing):
    locations = "</br>".join(listing["locations"])
    if len(listing["locations"]) <= 3:
        return locations
    num = str(len(listing["locations"])) + " locations"
    return f'<details><summary>**{num}**</summary>{locations}</details>'

This creates expandable location lists for companies with many offices while keeping the table clean for those with fewer locations.
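The cutoff behavior is easy to verify with a standalone version of the formatter (mirroring `getLocations` above, but taking a plain list):

```python
def format_locations(locations):
    """Short lists render inline; longer ones collapse behind a
    <details> toggle showing only the location count."""
    joined = "</br>".join(locations)
    if len(locations) <= 3:
        return joined
    num = f"{len(locations)} locations"
    return f"<details><summary>**{num}**</summary>{joined}</details>"

format_locations(["NYC", "SF"])
# -> "NYC</br>SF"
format_locations(["NYC", "SF", "Austin", "Seattle"])
# -> collapsed: "<details><summary>**4 locations**</summary>..."
```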

Analytics Infrastructure

UTM Tracking

To track the effectiveness of our platform, I implemented comprehensive UTM tracking:

def getLink(listing):
    if not listing["active"]:
        return "🔒"
    link = listing["url"]
    # Add tracking parameters to every link
    if "?" not in link:
        link += "?utm_source=github-vansh-ouckah"
    else:
        link += "&utm_source=github-vansh-ouckah"
    
    return f'<a href="{link}"><img src="{APPLY_BUTTON}" width="118" alt="Apply"></a>'

This tracking system now processes over 5.2 million job interactions, providing unprecedented insights into how students discover and apply to internships.

📊 Analytics Impact: 5.2M+ interactions tracked | 50% storage reduction | Zero downtime deployment

CI Pipeline

Validation

Every submission triggers a comprehensive validation pipeline:

def checkSchema(listings):
    props = ["source", "company_name", "id", "title", "active",
             "date_updated", "is_visible", "date_posted", "url",
             "locations", "season", "company_url", "sponsorship"]
    for listing in listings:
        for prop in props:
            if prop not in listing:
                fail(f"ERROR: Schema check FAILED - object with id "
                     f"{listing.get('id', '<unknown>')} does not contain prop '{prop}'")

The system ensures data integrity while automatically generating commit messages and updating the repository.
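A CI-friendly variant of the same check can collect every violation instead of failing on the first one (a standalone sketch with a trimmed field list, not the production pipeline):

```python
REQUIRED = ["id", "company_name", "title", "url", "locations", "active"]

def check_schema(listings):
    """Return a list of human-readable errors for missing fields,
    so a single CI run reports every problem at once."""
    errors = []
    for listing in listings:
        for prop in REQUIRED:
            if prop not in listing:
                errors.append(f"{listing.get('id', '?')}: missing '{prop}'")
    return errors
```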

Results at Scale

Processing Volume

| Metric | Value | Impact |
| ------ | ----- | ------ |
| Job Interactions | 5.2M+ | Real-time analytics and insights |
| Monthly Submissions | Thousands | Fully automated processing |
| Storage Optimization | 50% reduction | Columnar compression |
| Manual Intervention | Minimal | Working towards full automation |

Community Impact

What started as a manual process has become a fully automated platform serving:

Students: Thousands discover internships daily through our streamlined interface

Companies: Hundreds post opportunities with instant visibility

Contributors: Active community maintains data quality through GitHub

Ecosystem: Real-time updates accessible to everyone, everywhere

Technical Architecture

Core Technologies

URL Shortening Service: TypeScript and Express with decorator-based controllers, Node.js crypto for id generation, and a Xata database for storage and lookups.

Repository Automation: Python scripts for form parsing, duplicate detection, and README generation, driven by GitHub issues and a CI validation pipeline.

Design Principles

The system was built with several key principles:

  1. Automation First - Minimize manual intervention
  2. Data Integrity - Comprehensive validation at every step
  3. User Experience - Fast, reliable, accessible interface
  4. Scalability - Handle exponential growth gracefully
  5. Community-Driven - Empower contributors while maintaining quality

Future Vision

The internships platform has evolved from a simple tracking tool into a comprehensive career discovery engine. With the foundation now in place, there is plenty of room left to grow.

Visit Internships Board →