Internships Project
📈 From contributing the #7 highest number of commits to building a 5.2M+ interaction processing engine that revolutionized how students find internships.

The Beginning
In May 2024, I had already established myself as a significant contributor to the internship tracking ecosystem. Having made the #7 highest number of commits to the previous year's Simplify repository, I had developed a deep understanding of the challenges and opportunities in the space.
While I was actively contributing to Simplify, Ouckah had started his own internships repository with a vision for something different. When I saw the potential for this newer platform to become the definitive internships resource for students, I decided to reach out:
"Afternoon boss, Happy to help as a contributor and handle the approval/duplication labelings for the internships that come in, had #7 highest internship commits with last year's simplify one 🙂"
The response from Ouckah was immediate and enthusiastic:
"holy crap 😭 i would love to have you as a contributor ill send an invite now 🫡"
What started as an offer to help with basic approval and duplication labeling would soon evolve into something much bigger.
The Handover
As my contributions grew and the original maintainer recognized my commitment and technical capabilities, the inevitable happened. Around the 1,000 star milestone, the repository was officially handed over to me.
I inherited more than just code. I inherited a community of thousands of students relying on this platform for their career opportunities. The responsibility was both exciting and daunting. The existing system, while functional, was relatively simple. I could see the potential to transform it into something truly revolutionary.
The Challenge
The manual process of handling internship submissions was becoming unsustainable. Every day brought dozens of new submissions, edits, and closures through GitHub issues. The existing workflow required:
- Manual parsing of GitHub issue forms
- Hand-coding each entry into the JSON database
- Manual duplicate detection
- Time-consuming README regeneration
- Inconsistent data validation
With thousands of students depending on timely updates and the repository growing exponentially, I knew automation wasn't just an optimization. It was a necessity.
⚡ The Challenge: Transform manual processes into intelligent automation while maintaining data quality and community trust.

Building the Engine
The most significant technical challenge was building a complete URL shortening and analytics infrastructure to handle millions of job interactions efficiently.
URL Service
I built a dedicated microservice in TypeScript and Express, using routing-controllers decorators, to handle URL generation and redirects:

```typescript
@Controller()
export class CreateController {
  public xata = Container.get(XataService);

  @Post('/create')
  async createUrl(@BodyParam("url") url: string) {
    // 10 random bytes -> a 20-character hex short id
    const id = crypto.randomBytes(10).toString("hex");
    const newUrl = await this.xata.createUrl(id, url);
    return { data: newUrl, message: 'redirect created successfully' };
  }
}
```

The service generates cryptographically secure short URLs using the Node.js crypto module, storing them in a Xata database for lightning-fast lookups.
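For reference, the same id scheme can be sketched in Python with the standard `secrets` module (a hypothetical equivalent for illustration, not part of the actual service): 10 random bytes encode to a 20-character hex string, roughly 80 bits of entropy.

```python
import secrets

def generate_short_id(num_bytes: int = 10) -> str:
    """Return a cryptographically secure hex id (2 hex chars per byte)."""
    # Mirrors crypto.randomBytes(10).toString("hex") from the TypeScript service
    return secrets.token_hex(num_bytes)

short_id = generate_short_id()
print(len(short_id))  # 20
```

With 2^80 possible ids, random collisions are negligible at this scale, which is why no uniqueness retry loop appears in the create path.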
Redirects
The redirect engine handles millions of clicks with minimal latency:
```typescript
@Controller()
export class RedirectController {
  @Get('/to/:urlId')
  async redirectTo(@Param("urlId") urlId: string, @Res() response: any) {
    const urlTarget = await this.xata.fetchUrlTarget(urlId);
    response.redirect(urlTarget);
    return response;
  }
}
```

Each redirect is tracked, analyzed, and logged, providing comprehensive analytics on job application patterns.
Form Processing
The GitHub repository automation handles form submissions through intelligent parsing:
```python
# Form field mapping - the foundation of automation
LINES = {
    "url": 1,
    "company_name": 3,
    "title": 5,
    "locations": 7,
    "season": 9,
    "sponsorship": 11,
    "active": 13,
    "email": 15,
    "email_is_edit": 17,
}

def getData(body, is_edit, is_close, username):
    lines = [text.strip("# ") for text in re.split("[\n\r]+", body)]
    data = {"date_updated": int(datetime.now().timestamp())}

    # Handle different form types with specialized logic
    if is_close:
        # Process internship closure requests
        # (CLOSE_LINES is the closure form's field mapping, analogous to LINES)
        if "no response" not in lines[CLOSE_LINES["company_name"]].lower():
            data["company_name"] = lines[CLOSE_LINES["company_name"]].strip()
        if "no response" not in lines[CLOSE_LINES["role_title"]].lower():
            data["role_title"] = lines[CLOSE_LINES["role_title"]].strip()
```

This system intelligently parses form submissions, handling edge cases like "no response" fields, multiple location formats, and varied sponsorship options.
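As one illustration of the "multiple location formats" case, here is a hedged sketch of how a free-form locations field might be normalized. The accepted separators (pipes and newlines) are assumptions for this example, not the repository's actual parser:

```python
import re

def parse_locations(raw: str) -> list[str]:
    """Split a free-form locations string on pipes or newlines and trim whitespace."""
    parts = re.split(r"[|\n]+", raw)
    return [p.strip() for p in parts if p.strip()]

print(parse_locations("New York, NY | Seattle, WA"))
# ['New York, NY', 'Seattle, WA']
```

Keeping commas out of the separator set matters here, since "Seattle, WA" legitimately contains one.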
Database
The Xata service layer provides robust data management with built-in analytics:
```typescript
@Service()
export class XataService {
  public async fetchUrlTarget(urlId: string): Promise<any> {
    const response = await getXataClient().db.links
      .filter({ url_id: urlId }).getFirst();
    if (response?.url_target === undefined)
      throw new NotFoundError(`url does not exist.`);

    const url = new URL(response.url_target);
    console.log('Host:', url.host);
    console.log('Pathname:', url.pathname);
    console.log('Search Params:', url.searchParams);
    return url;
  }

  public async createUrl(urlId: string, urlTarget: string): Promise<any> {
    return getXataClient().db.links.create({
      url_id: urlId,
      url_target: urlTarget
    });
  }
}
```

Integration
The Python automation scripts integrate the URL shortening service with GitHub repository management:
```python
def add_https_to_url(url):
    if not url.startswith(("http://", "https://")):
        url = f"https://{url}"
    return url

# UTM parameter cleaning to prevent duplicates
utm = data["url"].find("?utm_source")
if utm == -1:
    utm = data["url"].find("&utm_source")
if utm != -1:
    data["url"] = data["url"][:utm]
```

This two-tier architecture separates concerns: the TypeScript service handles high-performance redirects while Python scripts manage repository automation and data processing.
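Note that the truncation above drops everything after the first `utm_source`, including unrelated parameters that follow it. A more surgical alternative can be sketched with Python's standard library (an illustration, not the code the repository ships):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_utm(url: str) -> str:
    """Remove only utm_* query parameters, preserving everything else."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if not k.lower().startswith("utm_")]
    return urlunsplit((scheme, netloc, path, urlencode(kept), fragment))

print(strip_utm("https://jobs.example.com/apply?id=42&utm_source=github&ref=a"))
# https://jobs.example.com/apply?id=42&ref=a
```

This matters when an employer's ATS requires a job id or referral token that happens to sit after the tracking parameter in the query string.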
Architecture Insight: The microservice approach allows independent scaling of URL redirection (high-frequency, low-latency) and repository management (lower-frequency, data-intensive) operations.
Data Architecture
Duplicate Detection
Building a robust duplicate detection system was critical. The algorithm needed to identify potential duplicates while allowing for legitimate multiple positions at the same company:
```python
if listing_to_update := next(
    (item for item in listings if item["url"] == data["url"]), None
):
    if new_internship:
        util.fail("This internship is already in our list. See CONTRIBUTING.md for how to edit a listing")
    for key, value in data.items():
        listing_to_update[key] = value
```

The system uses URL-based matching as the primary key, ensuring that the same posting can't be submitted multiple times while allowing companies to have multiple different positions.
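A possible hardening step, sketched here as a hypothetical extension (the check above is exact string equality on the URL): normalize URLs before comparing so that trivial variants like trailing slashes, host casing, or http-vs-https don't slip past duplicate detection.

```python
from urllib.parse import urlsplit

def normalize_url(url: str) -> str:
    """Reduce a URL to a comparison key: lowercase host + path without trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    return f"{host}{path}?{parts.query}" if parts.query else f"{host}{path}"

# Trivial variants of the same posting collapse to one key
assert normalize_url("https://Example.com/jobs/123/") == normalize_url("http://example.com/jobs/123")
```

The trade-off is deciding which differences are "trivial": query parameters sometimes distinguish genuinely different postings, so this sketch keeps them.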
Listing Management
The closing mechanism became particularly sophisticated, handling various edge cases:
```python
# Find matching listings for closure
candidates = []
for item in listings:
    if (item["company_name"].lower() == company_name.lower() and
            item["title"].lower() == role_title.lower()):
        candidates.append(item)

# If URL provided, filter by URL for precision
if job_url and candidates:
    url_matches = [item for item in candidates if item["url"] == job_url]
    if url_matches:
        candidates = url_matches

if not candidates:
    util.fail(f"No internship found matching company '{company_name}' and role '{role_title}'")
elif len(candidates) > 1:
    util.fail("Multiple internships found. Please provide the job URL to specify which one to close.")
```

This intelligent matching system prevents accidental closures while making the process seamless for legitimate requests.
README Generation
Table Creation
One of the most complex challenges was automatically generating the README tables that thousands of students view daily. The system needed to:
- Group listings by company intelligently
- Handle location display for multiple offices
- Show sponsorship information clearly
- Maintain chronological ordering
- Generate proper markdown formatting
```python
def create_md_table(listings):
    table = ""
    table += "| Company | Role | Location | Application/Link | Date Posted |\n"
    table += "| ------- | ---- | -------- | ---------------- | ----------- |\n"
    curr_company_key = None
    curr_date = None
    for listing in listings:
        company = listing["company_name"]
        position = listing["title"]
        location = getLocations(listing)  # defined below
        link = getLink(listing)           # defined below
        date_posted = listing["date_posted"]  # formatted earlier in the real script
        # Smart company grouping with ↳ symbol
        company_key = listing["company_name"].lower()
        if curr_company_key == company_key and curr_date == date_posted:
            company = "↳"
        else:
            curr_company_key = company_key
            curr_date = date_posted
        table += f"| {company} | {position} | {location} | {link} | {date_posted} |\n"
    return table
```

Locations
Handling location data required special formatting logic to maintain readability:
```python
def getLocations(listing):
    locations = "</br>".join(listing["locations"])
    if len(listing["locations"]) <= 3:
        return locations
    num = str(len(listing["locations"])) + " locations"
    return f'<details><summary>**{num}**</summary>{locations}</details>'
```

This creates expandable location lists for companies with many offices while keeping the table clean for those with fewer locations.
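To make the threshold behavior concrete, here is a standalone copy of the same logic that takes a plain list argument (illustrative only; the real helper reads the listing dict):

```python
def get_locations(locations: list[str]) -> str:
    """Join up to 3 locations inline; collapse longer lists into a <details> block."""
    joined = "</br>".join(locations)
    if len(locations) <= 3:
        return joined
    return f"<details><summary>**{len(locations)} locations**</summary>{joined}</details>"

print(get_locations(["NYC", "Seattle"]))
# NYC</br>Seattle
```

Four or more offices render as a single clickable summary row, so a company with a dozen locations occupies the same table height as one with two.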
Analytics Infrastructure
UTM Tracking
To track the effectiveness of our platform, I implemented comprehensive UTM tracking:
```python
def getLink(listing):
    if not listing["active"]:
        return "🔒"
    link = listing["url"]
    # Add tracking parameters to every link
    if "?" not in link:
        link += "?utm_source=github-vansh-ouckah"
    else:
        link += "&utm_source=github-vansh-ouckah"
    return f'<a href="{link}"><img src="{APPLY_BUTTON}" width="118" alt="Apply"></a>'
```

This tracking system now processes over 5.2 million job interactions, providing unprecedented insights into how students discover and apply to internships.
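The `?` check above assumes the URL carries no fragment; a standard-library alternative that appends the tracking parameter safely in either case can be sketched as follows (an illustration, not the deployed code):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_utm_source(url: str, source: str = "github-vansh-ouckah") -> str:
    """Append utm_source while preserving existing query params and fragments."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    params = parse_qsl(query, keep_blank_values=True)
    params.append(("utm_source", source))
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))

print(add_utm_source("https://jobs.example.com/apply?id=7"))
# https://jobs.example.com/apply?id=7&utm_source=github-vansh-ouckah
```

Round-tripping through `urlsplit`/`urlunsplit` also keeps a `#section` fragment after the query string, where the naive concatenation would place the parameter after the fragment and lose it.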
📊 Analytics Impact: 5.2M+ interactions tracked | 50% storage reduction | Zero downtime deployment

CI Pipeline
Validation
Every submission triggers a comprehensive validation pipeline:
```python
def checkSchema(listings):
    props = ["source", "company_name", "id", "title", "active",
             "date_updated", "is_visible", "date_posted", "url",
             "locations", "season", "company_url", "sponsorship"]
    for listing in listings:
        for prop in props:
            if prop not in listing:
                fail("ERROR: Schema check FAILED - object with id " +
                     listing["id"] + " does not contain prop '" + prop + "'")
```

The system ensures data integrity while automatically generating commit messages and updating the repository.
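Presence checks catch missing keys but not wrong value types. A hypothetical extension is sketched below; the expected types are inferred from the snippets in this write-up, not from an authoritative schema:

```python
# Expected types per field (assumptions inferred from the listing snippets above)
EXPECTED_TYPES = {
    "company_name": str, "title": str, "url": str,
    "locations": list, "active": bool, "date_updated": int,
}

def check_types(listing: dict) -> list[str]:
    """Return a list of human-readable type errors for fields present in the listing."""
    errors = []
    for prop, expected in EXPECTED_TYPES.items():
        if prop in listing and not isinstance(listing[prop], expected):
            errors.append(f"{prop}: expected {expected.__name__}")
    return errors

print(check_types({"company_name": "Acme", "active": "yes"}))
# ['active: expected bool']
```

Reporting all type errors at once, rather than failing on the first, gives submitters a single round-trip to fix a malformed entry.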
Results at Scale
Processing Volume
| Metric | Value | Impact |
|---|---|---|
| Job Interactions | 5.2M+ | Real-time analytics and insights |
| Monthly Submissions | Thousands | Fully automated processing |
| Storage Optimization | 50% reduction | Columnar compression |
| Manual Intervention | Minimal | Working towards full automation |
Community Impact
What started as a manual process has become a fully automated platform serving:
- Students: Thousands discover internships daily through our streamlined interface
- Companies: Hundreds post opportunities with instant visibility
- Contributors: An active community maintains data quality through GitHub
- Ecosystem: Real-time updates accessible to everyone, everywhere
Technical Architecture
Core Technologies
URL Shortening Service:
- TypeScript - Type-safe backend development
- Express.js - High-performance web framework
- Routing Controllers - Decorator-based API development
- Xata Database - Serverless, scalable data storage
- Docker - Containerized deployment
Repository Automation:
- Python - Backend automation and data processing
- GitHub Actions - CI/CD pipeline and automation triggers
- JSON - Lightweight, fast data storage
- Markdown - Dynamic README generation
- Regular Expressions - Advanced text parsing and validation
Design Principles
The system was built with several key principles:
- Automation First - Minimize manual intervention
- Data Integrity - Comprehensive validation at every step
- User Experience - Fast, reliable, accessible interface
- Scalability - Handle exponential growth gracefully
- Community-Driven - Empower contributors while maintaining quality
Future Vision
The internships platform has evolved from a simple tracking tool to a comprehensive career discovery engine. With the foundation now in place, the platform continues to innovate in areas like:
- Advanced filtering and search capabilities
- Integration with application tracking systems
- Real-time notification systems
- Enhanced analytics and insights
- Mobile-optimized experiences